Monday, September 29, 2014

The trendiest words in American English for each decade of 19th & 20th c. (determined by a chemistry/astronomy technique)

While "trend" has a clear mathematical definition, "trendiness" does not; I've chosen a method that is equally sensitive to the nerdy sense of the word (rapidity of rise and fall) and the common meaning ("trendy" = "popular"). More explanation later; here's the chart:


The calculations were done on the Corpus of Historical American English (COHA) from Brigham Young University. You can see from its content list that it is heavy on books in the 19th century, with newspapers gradually taking up more and more of the corpus after that. The overall amount of fiction is relatively stable, but this trend analysis is quite sensitive to corpus composition in each decade.

Until the 1920s, every top word comes from books and is usually a character name. For example, in the 1870s, there were at least five major books by different authors with a major character named Elsie (whose name, therefore, got repeated a lot). It appears (at first glance, anyway) that character names had a bandwagon effect, much as baby names would 100 years later.

Also present are deliberately misspelled words like "uv" for "of" and "ter" for "to" (as in "Ah oughts ter uv dun somethin"). This was a style of satirical writing at the time; not all of it was racist, but some of it certainly was.

In the 20th century, presidents' names dominate, except for "planes" during World War II and, surprisingly to me, "EPA" (for Environmental Protection Agency) in the 1990s. EPA beat out "Clinton" because his name kept being used throughout the next decade, which widens the peak, and it beat out "Bush" because that name is also a common word.

I've used a chromatography peak technique (popular in analytical chemistry and astronomy) to analyze non-hard-science data before. Here's a quick visual of how it works:


Here is a list of the 100 trendiest words overall:

word          trendiness  peak year  height (% popularity at peak year)  width at 50% height (years)
reagan        0.00338     1985       0.0338                              10
nixon         0.00283     1975       0.0283                              10
uv            0.00277     1865       0.0277                              10
kennedy       0.00275     1965       0.0275                              10
eisenhower    0.00224     1955       0.0224                              10
ter           0.00169     1885       0.0169                              10
communist     0.00166     1955       0.0249                              15
planes        0.00122     1945       0.0122                              10
jimmie        0.00118     1915       0.0118                              10
coolidge      0.00111     1925       0.0111                              10
elsie         0.00107     1875       0.0107                              10
bradshaw      0.00107     1835       0.0107                              10
korea         0.00106     1955       0.0106                              10
rollo         0.00104     1855       0.0104                              10
vietnam       0.00103     1965       0.0154                              15
roosevelt     0.00103     1935       0.0205                              20
katy          0.00098     1865       0.0098                              10
graeme        0.00094     1865       0.0094                              10
eleanor       0.00093     1925       0.0093                              10
winthrop      0.00093     1855       0.0093                              10
jeff          0.00091     1955       0.0091                              10
madeleine     0.00089     1865       0.0089                              10
dave          0.00088     1915       0.0088                              10
communists    0.00088     1955       0.0132                              15
lanny         0.00086     1945       0.0086                              10
dulles        0.00084     1955       0.0084                              10
pa            0.00082     1885       0.0082                              10
amy           0.00081     1865       0.0081                              10
jimbo         0.00080     1975       0.0080                              10
isabella      0.00078     1835       0.0078                              10
kissinger     0.00078     1975       0.0078                              10
soviet        0.00077     1955       0.0307                              40
redwood       0.00076     1825       0.0076                              10
dewey         0.00076     1945       0.0076                              10
stitch        0.00075     1875       0.0075                              10
gypsy         0.00074     1865       0.0074                              10
hev           0.00073     1865       0.0073                              10
hitler        0.00072     1945       0.0108                              15
elvira        0.00071     1825       0.0071                              10
mcs           0.00071     1955       0.0071                              10
atomic        0.00071     1955       0.0106                              15
cuba          0.00069     1965       0.0069                              10
alessandro    0.00068     1885       0.0068                              10
wilford       0.00068     1865       0.0068                              10
truman        0.00068     1955       0.0102                              15
malone        0.00067     1965       0.0067                              10
magdalen      0.00067     1875       0.0067                              10
korean        0.00066     1955       0.0066                              10
rowland       0.00066     1875       0.0066                              10
stevenson     0.00065     1955       0.0065                              10
mabel         0.00065     1855       0.0097                              15
beulah        0.00065     1885       0.0065                              10
goldwater     0.00063     1965       0.0063                              10
tommy         0.00063     1935       0.0063                              10
gaulle        0.00062     1965       0.0062                              10
jessie        0.00061     1945       0.0061                              10
ramona        0.00061     1885       0.0061                              10
vasco         0.00061     1835       0.0061                              10
bunny         0.00060     1925       0.0060                              10
newt          0.00059     1865       0.0059                              10
gubb          0.00058     1915       0.0058                              10
epa           0.00058     1995       0.0058                              10
ivan          0.00058     1905       0.0058                              10
christie      0.00057     1875       0.0057                              10
madonna       0.00057     1895       0.0057                              10
banneker      0.00057     1925       0.0057                              10
hammond       0.00056     1825       0.0056                              10
viet          0.00056     1965       0.0084                              15
hed           0.00056     1865       0.0056                              10
harding       0.00055     1925       0.0055                              10
dorothy       0.00055     1905       0.0055                              10
subcommittee  0.00054     1955       0.0054                              10
elnora        0.00054     1905       0.0054                              10
teddy         0.00054     1895       0.0054                              10
id            0.00052     1995       0.0052                              10
seor          0.00052     1835       0.0052                              10
lulu          0.00052     1885       0.0052                              10
downing       0.00052     1835       0.0052                              10
lucia         0.00051     1825       0.0051                              10
montague      0.00051     1905       0.0051                              10
lemuel        0.00051     1875       0.0051                              10
wich          0.00051     1865       0.0051                              10
christy       0.00051     1895       0.0076                              15
israeli       0.00051     1975       0.0076                              15
bertha        0.00051     1865       0.0051                              10
nazi          0.00051     1945       0.0076                              15
heyward       0.00050     1825       0.0050                              10
watergate     0.00050     1975       0.0050                              10
ms            0.00050     1955       0.0050                              10
puffer        0.00050     1845       0.0050                              10
didn          0.00049     1985       0.0049                              10
purl          0.00049     1875       0.0049                              10
maroney       0.00049     1875       0.0049                              10
nunez         0.00049     1835       0.0049                              10
trina         0.00049     1925       0.0049                              10
ronald        0.00048     1985       0.0048                              10
randy         0.00048     1905       0.0048                              10
georgie       0.00048     1875       0.0048                              10
castro        0.00048     1965       0.0048                              10
lottie        0.00047     1875       0.0047                              10

A few observations: most of the words have a peak width of 10 years (the minimum, since COHA's resolution is at the decade level). Notable exceptions are Roosevelt (a 20-year peak; FDR was president during two decades, one of them wartime) and Soviet (a 40-year peak, so its peak height of 0.0307% had to be quite high for it to make the list). Some words of note: failed presidential candidates Dewey, Stevenson and Goldwater; Newt (but not Gingrich); Hitler and Nazi; Watergate; Ronald (the only presidential first name on the list).

The code used is on my GitHub, but here's the gist of it (no pun intended):
  1. The COHA 1-gram corpus is restricted, but I have an academic licence. Thanks to BYU for that. On my GitHub, I have summary data, but not the dataset itself.
  2. COHA is arranged in decades; I assigned each word the year in the middle of the decade (e.g. 1970s, which covers 1970-1979, was assigned 1975).
  3. For each word, I interpolated by simple mean the popularity for years ending in "0". For example, if a word was at 0.0024% in 1975 and 0.0026% in 1985, I assigned 0.0025% in 1980. This was so peak widths for words that appeared only in one decade could be calculated (otherwise they would have a peak width of zero, and infinite "trendiness").
  4. Instead of interpolating further to calculate peak widths (which would be entering overfitting territory), I used a simple Boolean test to find the start and end of each peak. The first time a point at a five-year interval exceeded 50% of the maximum peak height, the counter started, and the first time it sank below 50%, the counter stopped. This means that if a word was bimodal (two peaks in different years) with a point below 50% of maximum between the two peaks, only the larger peak was counted. This was not a common occurrence, and it ensured each word only ever appeared once.
  5. "Trendiness" was calculated as the peak height (in % of corpus during that year) divided by the peak width (in years, always a multiple of five for reasons explained in the previous steps).
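
Steps 2 to 5 can be sketched in a few lines of Python. This is a minimal illustration, not the code from my GitHub: the function names `five_year_series` and `trendiness` are made up for this example, and it rests on three assumptions that the steps above leave open. Decades with no data count as 0%, a point counts as part of the peak if it is at or above 50% of the maximum, and the peak is found by walking outward from the tallest point (so only the larger peak of a bimodal word is measured).

```python
def five_year_series(decade_pct):
    """Turn {decade_start: % of corpus} into a {year: %} series on a 5-year grid.

    Each decade is assigned its middle year (1970 -> 1975). Years ending in 0
    get the simple mean of the two neighbouring mid-decade values.
    Assumption: a missing neighbouring decade counts as 0%.
    """
    first, last = min(decade_pct), max(decade_pct)
    series = {}
    for decade in range(first, last + 10, 10):
        series[decade + 5] = decade_pct.get(decade, 0.0)
    for year in range(first, last + 20, 10):
        left = series.get(year - 5, 0.0)
        right = series.get(year + 5, 0.0)
        series[year] = (left + right) / 2
    return dict(sorted(series.items()))


def trendiness(decade_pct):
    """Return (peak_year, height, width, trendiness) for one word."""
    series = five_year_series(decade_pct)
    years = list(series)
    values = list(series.values())
    height = max(values)
    peak = values.index(height)
    half = height / 2

    # Walk outward from the tallest point while values stay at or above
    # half of the maximum (assumed reading of the Boolean test in step 4).
    lo = peak
    while lo > 0 and values[lo - 1] >= half:
        lo -= 1
    hi = peak
    while hi < len(values) - 1 and values[hi + 1] >= half:
        hi += 1

    width = years[hi] - years[lo]
    return years[peak], height, width, height / width


# A word that appears in a single decade gets the minimum width of 10 years.
year, height, width, score = trendiness({1960: 0.0, 1970: 0.0283, 1980: 0.0})
print(year, width)  # prints "1975 10"
```

With a height of 0.0283% and a width of 10 years, the score here works out to about 0.00283, which matches the relationship between the height, width and trendiness columns in the table above. A word whose neighbouring decade also stays above half of the maximum gets a 15-year or wider peak, and a correspondingly lower score.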
