The calculations were done on the Corpus of Historical American English (COHA) from Brigham Young University. You can see from their content list that the corpus is heavy on books in the 19th century, with newspapers gradually taking up more and more of it. The overall amount of fiction is relatively stable, but this trend analysis is quite sensitive to the corpus composition in each decade.
Until the 1920s, every popular word comes from books, usually a character name. For example, in the 1870s, there were at least five major books by different authors with a major character (whose name, therefore, got repeated a lot) named Elsie. It appears (at first glance, anyway) that character names had a bandwagon effect, much like baby names do 100 years later.
Also present are deliberately misspelled words like "uv" for "of" and "ter" for "to" (as in "Ah oughts ter uv dun somethin"). This was a style of satirical writing at the time, not all of it racist, but certainly some of it.
In the 20th century, presidents' names dominate, except for "planes" during World War II and, surprisingly to me, EPA (for Environmental Protection Agency) in the 1990s. EPA beat out "Clinton" because that name kept being used throughout the next decade, and "Bush" because that name has a common meaning as well.
I've used a chromatography peak technique (popular in analytical chemistry and astronomy) to analyze non-hard-science data before. Here's a quick visual of how it works:
| word | trendiness | peak year | height (% popularity at peak year) | width at 50% |
|------|------------|-----------|------------------------------------|--------------|
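The "width at 50%" column is what chromatographers call full width at half maximum (FWHM): the span of the region where the signal stays at or above half its peak value. Here's a minimal sketch of the idea on a synthetic Gaussian peak (the function name and data are mine, purely for illustration):

```python
import math

def fwhm(xs, ys):
    """Return the width of the region where ys >= half of the maximum."""
    half = max(ys) / 2
    above = [x for x, y in zip(xs, ys) if y >= half]
    return max(above) - min(above)

xs = [i / 10 for i in range(-50, 51)]    # x from -5 to 5 in steps of 0.1
ys = [math.exp(-x * x / 2) for x in xs]  # unit Gaussian peak
print(fwhm(xs, ys))  # ~2.2 on this 0.1-spaced grid (true Gaussian FWHM is ~2.355)
```

On a coarse grid the measured width snaps to the sampling interval, which is exactly why COHA's decade resolution quantizes the peak widths below.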
A few observations: most of the words have a peak width of 10 years (the minimum, since COHA's resolution is at the decade level). Notable exceptions are Roosevelt (FDR was president during two decades, one wartime) and Soviet (a 40-year peak, which means the peak height had to be quite high for it to make the list). Some words of note: failed presidential candidates Dewey, Stevenson and Goldwater; Newt (but not Gingrich); Hitler and Nazi; Watergate; Ronald (the only presidential first name on the list).
The code used is on my GitHub, but here's the gist of it (no pun intended):
- The COHA 1-gram corpus is restricted, but I have an academic licence. Thanks to BYU for that. On my GitHub, I have summary data, but not the dataset itself.
- COHA is arranged in decades; I assigned each word the year in the middle of the decade (e.g. 1970s, which covers 1970-1979, was assigned 1975).
- For each word, I interpolated by simple mean the popularity for years ending in "0". For example, if a word was at 0.0024% in 1975 and 0.0026% in 1985, I assigned 0.0025% in 1980. This was so peak widths could be calculated for words that appeared in only one decade (otherwise they would have a peak width of zero, and hence infinite 'trendiness').
- Instead of interpolating further to calculate peak widths (which would be entering overfitting territory), I used a simple Boolean test to calculate the start and end of each peak. The first time a point at a five-year interval exceeded 50% maximum peak height, the counter started, and then the first time it sank below 50%, it stopped. This means if a word was bimodal (two peaks in different years) with a point below 50% of maximum between the two peaks, only the larger peak was counted. This was not a common occurrence, and it ensured words only ever appeared once each.
- "Trendiness" was calculated as the peak height (in % of corpus during that year) divided by the peak width (in years, always a multiple of five for reasons explained in the previous step).
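The steps above can be sketched end to end. This is an illustrative reconstruction, not the actual code from the repo: the function names and toy numbers are mine, and I implement the "only the larger peak counts" rule by walking outward from the single highest point. It assumes each word's popularity arrives as a dict of mid-decade values, with zero-popularity decades included:

```python
def interpolate(midpoints):
    """midpoints: dict {mid-decade year (e.g. 1975): % of corpus}, zeros included.
    Fills years ending in 0 with the simple mean of the neighbouring midpoints."""
    filled = dict(midpoints)
    years = sorted(midpoints)
    for a, b in zip(years, years[1:]):
        if b - a == 10:  # adjacent decades
            filled[(a + b) // 2] = (midpoints[a] + midpoints[b]) / 2
    return filled

def peak_stats(series):
    """series: dict {year at 5-year steps: popularity}.
    Walks outward from the largest point, so a bimodal word only
    contributes its larger peak. Returns (peak_year, height, width)."""
    years = sorted(series)
    values = [series[y] for y in years]
    peak_idx = values.index(max(values))
    threshold = 0.5 * values[peak_idx]
    start = end = peak_idx
    while start > 0 and values[start - 1] >= threshold:
        start -= 1
    while end < len(values) - 1 and values[end + 1] >= threshold:
        end += 1
    return years[peak_idx], values[peak_idx], years[end] - years[start]

def trendiness(series):
    _, height, width = peak_stats(interpolate(series))
    return height / width

# A word that appears essentially only in the 1970s decade: the interpolated
# half-height points at 1970 and 1980 give it the minimum 10-year width.
pops = {1955: 0.0, 1965: 0.0, 1975: 0.0024, 1985: 0.0, 1995: 0.0}
print(peak_stats(interpolate(pops)))  # (1975, 0.0024, 10)
```

Note how the minimum 10-year width falls out of the interpolation step: a one-decade word gets exactly-half-height neighbours at the surrounding years ending in "0", and an at-or-above-50% test includes them.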