Most decade-specific words in Billboard popular song titles, 1890-2014


Chart first, then explanation (click to enlarge):
The inspiration for this post came from my being too lazy to set my iPod to shuffle, and then noticing it played a bunch of songs in a row from the 1930s and '40s that started with the letters "in" ("In the Wee Small Hours of the Morning", "In the Still of the Night", etc.). Naturally, being a data nerd, my first thought was to quantify the phenomenon.

The data comes not from Billboard itself, but from www.bullfrogspond.com; I don't know much about the data source, but it certainly looks thorough and painstaking, and up to date. If you'd like to know a little more about my methodology (like a quick explanation of the metric "keyness"), to see the code I used, or to see the actual songs that correspond to these words, head on over to my other, nerdier blog, prooffreaderplus.

Observations about the results:
  • The 2010s seem both more vulgar ("hell" and "fuck") and more inclusive ("we" instead of the "you", "ya" and "u" of the 1990s and 2000s).
  • The 1990s and 2000s were the decades of neologisms, with "U", "Ya" and "Thang". "U" was so popular it occurred twice (but see the note on decade-binning on prooffreaderplus.)
  • Fun! Lots of the decades can be made into intelligible five-word sentences. For example: "Hell Yeah, We Die, Fuck!" (2010s), "Ya Breathe It Like U" (2000s), "You Get Up, U Thang" (1990s), "Don't Rock On Fire, Love" (1980s), "Sing, Moon, In A Swing" (1930s).
  • As anyone who listens to the radio in December knows, all the Christmas songs are oldies, and that shows in the results for the 1950s, with "Christmas" and "Red-nosed".
  • You can track genres with the keywords: "Rag" (1910s), "Blues" (1920s), "Swing" (1930s), "Boogie", "Polka" (1940s), "Mambo" (1950s), "Twist" (1960s), "Disco" (1970s), "Rock" (1970s and 1980s). After that, people realized you don't have to actually name the genre in the song title; people can figure it out by listening. (N'Sync must not have gotten that memo for 2001's "Pop".)
  • Who knew Billboard song rankings went back to the 1890s? It was a surprise to me. That fact, and the fact that there are fewer songs then, but not so few as to be negligible, influenced a lot of the choices in how I presented this data (read more here if you want). But those early decades seem to be more focused on first names ("Michael", "Reuben", "Casey") and familial relationships ("Uncle", "Mammy").
  • The first two decades -- the oldest ones compared to now -- both have the keyword "old". I blame time travel.
  • I find it interesting that there are short, common articles, adverbs, prepositions and pronouns in the list; these have a higher bar for keyness, since they're present in other decades: "When" (1900s), "A" (1930s), "In" (1930s), "On" (1980s), "Up" (1990s), "It" (2000s)
Now if you'll excuse me, I'm going to hunt through my iPod to see if there's even one song with "gems" in the title; it seems to have been popular in the 1910s.

Projections of White Christmases until the year 2100, based on a climate model

Below is a climate model projection of what areas of North America will be snow-covered on December 25 of each year between 2014 and 2100:

A few things should be pointed out:

  1. The point of this kind of climate model is not to accurately predict the weather on every single day for 87 years, even though that's what the model contains. The point is to experiment, and experimental science is built on prediction. Evaluating those predictions makes for better models down the road. I'm no climatologist, so I'll let the Oregon Climate Change Research Institute explain Why We Use Climate Models.
  2. In the map above, white is 100% snow coverage, and the white becomes more and more transparent at the fringes from 99% to 1% snow coverage, until the bare background is 0% snow coverage. The resolution of the climate model is only 0.44 degrees, so the fit isn't exact at the coastlines.
  3. The data is from the Canadian Centre for Climate Modelling and Analysis (CCCma), hosted at Environment Canada. The exact model is the Fourth Generation Canadian Regional Climate Model (CanRCM4), RCP 8.5.
  4. That global warming kinda sneaks up on you, doesn't it? It's gradual, but when it loops back down to 2014, it's pretty obvious. I imagine people in Grande Prairie, Alberta are looking forward to the end of the 21st century.
  5. Here is a post on my other, nerdier blog about how to make maps in Python based on the CCCma's NetCDF files; a bare-bones sketch follows this list. There are plenty of examples out there on plotting these files, but not with the format CCCma uses.
  6. There's also code on my GitHub, with links to nbviewer notebooks.
  7. Tools used: Python with IPython, netCDF4, Matplotlib, Basemap and PIL; Photoshop; Gfycat.
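
If you'd rather tinker than dig through the repo, here's a bare-bones sketch of the plotting step. Fair warning: the file name is hypothetical, and the variable names ('snc' for snow cover fraction, 2D 'lat'/'lon' arrays on CCCma's rotated-pole grid) are assumptions you should check against your own file's headers (e.g. with ncdump -h):

    # Minimal sketch: plot one day of snow cover from a CCCma NetCDF file.
    # Filename and variable names are assumptions; inspect your file first.
    import matplotlib.pyplot as plt
    from mpl_toolkits.basemap import Basemap
    from netCDF4 import Dataset

    nc = Dataset('snc_CanRCM4_rcp85_day.nc')  # hypothetical filename
    lats = nc.variables['lat'][:]   # 2D arrays on the rotated-pole grid
    lons = nc.variables['lon'][:]
    snow = nc.variables['snc'][0]   # snow cover fraction (%), first time step

    # A Lambert conformal view of North America
    m = Basemap(projection='lcc', resolution='l', width=8e6, height=6.5e6,
                lat_1=45, lat_2=55, lat_0=50, lon_0=-100)
    x, y = m(lons, lats)            # project the 2D coordinate arrays
    m.drawcoastlines()
    cs = m.pcolormesh(x, y, snow, cmap='Blues_r', vmin=0, vmax=100)
    plt.colorbar(cs, label='Snow cover (%)')
    plt.show()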

An Introduction to Data Visualization

An introduction to the practice of data visualization, with theory, examples, and good humor.

This is a studio rerecording of a presentation I was asked to give to McGill University graduate students from many disciplines in Montreal, Canada in November 2014.

Here's a link to the slideshare version if you'd rather read than listen to me.

I give shoutouts to Alberto Cairo, Nathan Yau and Edward Tufte, without whom I'd be much less well informed.

Visualizing word and letter frequencies in Gadsby, a novel without the letter 'e'

In 1939, Ernest Vincent Wright published the novel Gadsby (gee, I wonder where he came up with that name...), 58,124 words (by my count), none of which contain the letter 'e'.

The cover is actually more colorful than the plot.

Here are a few of the features of the English language Wright was deliberately ruling out by avoiding its most common letter:
  • "The": the most common word in English (about 5% of all words in most books);
  • The pronouns "he", "she", "we", "they", "me", "her", "them";
  • The common functional words "when", "where", "these", "those", "every";
  • Most past-tense verbs, like "walked", "went", "loved";
  • "Sleeplessness". Hey, I like that word.
The copyright to Gadsby expired because Wright's estate didn't apply for a renewal, so you can find the entire lipogram (that's the term for this kind of writing) here or here. I tried to read it, but it was just too difficult. Not entirely due to the missing letter, but because it's really, really uninteresting. You want a good lipogram, try A Void (which I couldn't find an electronic version of to analyze), a lipogrammatic English translation of a French lipogrammatic novel. How impressive is that?

As a data-centric sort of fellow, my immediate thought was to wonder how this constraint affected the word and letter frequency compared to 'normal' English. Obviously (I posited, correctly), word choice would be affected much more profoundly. On reading it, I saw there were a lot of Anglo-Saxon words and irregular verbs ("said", "had", "was", etc.). So, using Python's Natural Language Toolkit, I calculated word frequencies and compared them to the Brown corpus -- after I'd removed every word containing the letter 'e' from the latter. I used the standard technique of Log Likelihood keyness (basically, it's the confidence that a difference in frequency is 'real' instead of random) to determine the significance of word frequency differences (I've put the frequencies and comparisons using different metrics of all 3934 unique words in Gadsby in a Google Doc if you're interested):


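By the way, if you just want the flavor of the keyness calculation without visiting the other blog, here's a minimal sketch (not my exact code). It assumes a local gadsby.txt, and downloads the Brown corpus through NLTK on first run:

    # Log Likelihood (Dunning's G2) keyness of Gadsby's words versus the
    # Brown corpus with every 'e'-containing word removed.
    import math
    import re
    from collections import Counter

    import nltk
    from nltk.corpus import brown

    nltk.download('brown', quiet=True)

    gadsby_words = re.findall(r"[a-z]+", open('gadsby.txt').read().lower())
    brown_words = [w.lower() for w in brown.words()
                   if w.isalpha() and 'e' not in w.lower()]

    gadsby_counts, brown_counts = Counter(gadsby_words), Counter(brown_words)
    n_g, n_b = len(gadsby_words), len(brown_words)

    def keyness(word):
        """Confidence that a difference in frequency is real, not random."""
        a, b = gadsby_counts[word], brown_counts[word]
        e_g = n_g * (a + b) / (n_g + n_b)  # expected count in Gadsby
        e_b = n_b * (a + b) / (n_g + n_b)  # expected count in reduced Brown
        ll = 0.0
        if a:
            ll += a * math.log(a / e_g)
        if b:
            ll += b * math.log(b / e_b)
        return 2 * ll

    print(keyness('big'), keyness('of'))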
The overrepresented words contain character names, but also "big" (as a replacement for "large"? "enormous"?) and "folks" ("people"?). The underrepresented words are the ones I found interesting, however: "of", "to" and "in" are very common words in English, and to have their usage reduced that much implies that even though they do not contain the letter "e", they are used in tandem with words containing "e" -- such as "the". So I analyzed how often each word has a neighboring "e"-word in Brown, and made a quasi-volcano plot (the area of the circles is the frequency in Gadsby):


You can see there's a palpable tendency for the over- and underrepresented words to be adjacent to "e"-containing words, whereas in that mishmash in the middle (words that have comparable frequencies in Gadsby and Brown), the probability of e-adjacency is far more spread out.
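For the curious, the e-adjacency number is just a neighbor scan; here's a sketch (an illustration, not my exact code) that reuses the Brown import from the snippet above:

    # For each e-less word, the share of its occurrences in the full Brown
    # corpus that sit next to a word containing 'e'.
    from collections import defaultdict

    tokens = [w.lower() for w in brown.words() if w.isalpha()]
    total, adjacent = defaultdict(int), defaultdict(int)
    for i, w in enumerate(tokens):
        if 'e' in w:
            continue
        total[w] += 1
        neighbors = tokens[max(i - 1, 0):i] + tokens[i + 1:i + 2]
        if any('e' in n for n in neighbors):
            adjacent[w] += 1

    e_adjacency = {w: adjacent[w] / total[w] for w in total}
    print(e_adjacency['of'])   # roughly 0.86; more on "of" below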

How much of this is simply due to the word "the"? Here's a volcano-ish plot restricting the analysis to the frequency of "the"-adjacency in the full Brown corpus:


We basically see the same pattern, but lower down the graph because we're using a more restrictive metric. The spike in the top middle shows that words that are often "the"-adjacent in Brown are, unsurprisingly, rare in Gadsby.

Letter frequencies

Well, that was fun. Next item: what happens to letter frequencies (again, here's a Google Doc)? Let's compare Gadsby to Brown-without-e-words:


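Under the hood, this comparison is just character counting; a quick sketch, reusing gadsby_words and brown_words from the earlier snippet:

    # Relative frequency of each letter across a list of words.
    from collections import Counter

    def letter_freqs(words):
        counts = Counter(c for w in words for c in w)
        total = sum(counts.values())
        return {c: counts[c] / total for c in sorted(counts)}

    gadsby_letters = letter_freqs(gadsby_words)
    brown_letters = letter_freqs(brown_words)  # Brown minus 'e'-words
    print(gadsby_letters['u'], brown_letters['u'])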
That was unexpected (by me, anyway). The only vowel to be more frequent in Gadsby is the relatively little-used "u"! The other vowels are all less used in Gadsby ("e", of course, sits at zero in both corpora, since both exclude it). It appears that the slack must be taken up by moderate-frequency consonants. Let's have a look at the log-likelihoods:


I would never have predicted "g" and "f" to be the biggest winner and loser, respectively! Here are the top 10 g-containing words:

  Rank  Word    Gadsby freq.  Brown freq.*
    26  gadsby      364             0
    32  big         297            32
    59  young       187            35
    74  good        129            73
    84  got         113            44
    86  long        108            68
    87  girls       106            13
    90  go          104            57
    92  girl        100            20
    94  right        99            56

* Brown without 'e'-containing words,
normalized to same length as Gadsby.

Of course, there's the main character, Mr. Gadsby himself (interesting aside: Wright never calls him that, because even though "Mr" contains no "e", it's short for a word that does, and that would be cheating). Now let's see the "f"-containing words in the Brown corpus:

 Rank  Word    Frequency
  2    of        36406
  9    f         12431
  14   for        9485
  39   from       4370
  64   if         2199
  88   first      1359
 102   after      1070
 107   before     1011
 144   life        709
 157   off         637

Interestingly, most of these words do not contain "e", but they are often "e"-adjacent; "of", for example, is preceded or followed by an "e"-containing word 85.7% of the time in Brown.

Okay, let ~~me~~ us try this: I ~~hope~~ am hoping that you ~~enjoyed~~ ~~liked~~ did cotton to this blog post; it was an intriguing ~~subject~~ topic to look at, for its linguistic traits. Gadsby is a fun study!

Whew! That was exactly as hard as it looked!

Methodology: here is a GitHub repo of the analysis, and nbviewer docs of the word and letter frequency analyses. Tools: Python with IPython, NLTK, Pandas, Matplotlib, Seaborn and Plot.ly; Microsoft Excel; Adobe Photoshop.

The Most Decade-Specific Words of the Past Two Centuries

Click to see a zoomable standalone image:

This is from an analysis of Brigham Young University's Corpus of Historical American English, sort of a way-better-curated and easier-to-search version of Google Ngram Viewer. It covers a selected corpus of English from different genres and sources from 1810 to 2009.

Of course, the analysis is biased towards words at the beginning or end of the date range. We haven't stopped using the top word, 'soviet' (and we probably never will); as the decades pass, its frequency-per-decade metric will decline and decline, barring an unexpected return of the USSR. 'Soviet' also gets a boost because it's both a common and a proper noun, and I only used words that appeared in the Moby Scrabble list, which excludes proper nouns. I decided to let this word stand, since its usage is totally different from that of the top proper nouns that were excluded, which you can see in my GitHub repo here.

Almost all of these words are modern ones, showing that the English vocabulary has been more in flux in modern times (the results are normalized per decade, so the terms do indeed take up a higher percentage of all words in the corpus from that decade). There are only five words that were not used in the first decade of the 21st century, and some of them are common-and-proper like 'soviet'. They are also the only words used in six or fewer decades.

Words that were used in 16 decades or more were omitted; they were mostly uninteresting words like articles, prepositions, etc. that would have been removed by a common stoplist anyway.


Why Vox.com's 'First World War' ngrams article is kinda sloppy, in one chart.

This morning (Sept. 29, 2014), Dylan Matthews at Vox.com wrote an article based on a tweet by Jared Keller of MicNews that purports to show what Keller calls "the exact moment 'The Great War' became 'World War I'", based on the following Google Ngrams search:

Matthews speculates further than Keller, claiming the following:
What's intriguing is that references to World War I began increasing even before World War II began in Europe. The big growth obviously came as the war began and after its conclusion, but this suggests that in at least some texts, "World War II" was used in the same ominous, premonitory way that "World War III" is today.
There are problems with the methodology and conclusions by both parties, as almost always happens when people use Google Ngram Viewer without understanding how it works and what its limitations are (which Culturomics -- yes, that's really the name of the organization in charge of Google Ngram Viewer -- does not at all go out of its way to acknowledge).

First of all, the title (and the tweet) mentions 'World War I', but the search is for 'first world war'; these aren't the same phrase, and that's what first set my alarm bells ringing. Also, the default setting in Google Ngram Viewer is a smoothing of 3, which means, for example, that the result for 1936 is a combination of the results from 1933 to 1939. So seeing the graph start to rise in 1936 is pretty much indicative of nothing. Also, you see the '(All)' next to each search term? That means the search is agglomerating every combination of upper- and lowercase results.
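
To make the smoothing point concrete, here's a toy illustration (my arithmetic, not Google's actual code) of why a series that only jumps in 1939 already looks like it's rising in 1936 when smoothing is set to 3:

    # Each year's value becomes the mean of itself and the s years on
    # either side (a centered moving average).
    def smooth(values, s=3):
        return [sum(values[max(i - s, 0):i + s + 1]) /
                len(values[max(i - s, 0):i + s + 1])
                for i in range(len(values))]

    series = [0, 0, 0, 0, 0, 0, 10, 12, 14]  # 1933-1941, made-up frequencies
    print(smooth(series))  # nonzero from the 1936 position onward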

Let's have a look at unsmoothed, case-sensitive searches for the most common ways to refer to both wars ('WWI' and 'WWII' are much less common than these, BTW):

Firstly, the 'exact moment' that Keller referred to is a little more complicated when the results are case-sensitive; you can see that the big hump at the beginning of what he implies is 'First World War' is actually 'first World War' -- a description, not a name, if 'first' isn't capitalized.

Secondly, virtually all of these terms start being used in 1939 or later; pretty much the opposite of Matthews's conclusion. There's a little bit of 1930-1938 activity, but it pales in comparison to the mentions these terms get in 1893.

Wait, what?

This is something people who use Google Ngram Viewer regularly are very familiar with. It's based on automatic processing of Google Books, which is based on automatic processing of library books. Lots of the dates of books are just plain wrong, lots of books are given the date of their first edition even though the scan is of a later edition with an introduction written with vocabulary from decades after the fact, and there are lots of OCR errors (that last one doesn't seem to be relevant in this case).

What are these 1893 books that use these terms? Luckily, Google Ngram Viewer links directly to Google Books search, so we can find out easily -- as, really, should anyone who uses this tool, as a basic 'sanity check' on their conclusions (click to enlarge):


Obviously, none of the sentences with the search terms were written in 1893. And that small level of pre-1939 activity doesn't pass this sanity check either; we can't be sure those passages were written contemporaneously with their listed dates. (The original graphs started at 1900, BTW, so this bit of counterfactual information was not visible.)

Just for fun, let's check Matthews's statement about how 'World War III' is used today (along with 'Third World War'):



It appears that the big rise in these terms came during World War II -- which kind of makes sense: we'd just started calling them I and II, so III was an obvious point of speculation. The high point was in the 1950s, but it's calmed down a lot since then (the graph stops at 2000 because data from 2000 to 2008, where the database stops, are profoundly changed by the advent of digital media and can't easily be compared to the corpus before this date).

The trendiest words in American English for each decade of 19th & 20th c. (determined by a chemistry/astronomy technique)

While "trend" has a clear mathematical definition, "trendiness" does not; I've chosen a method that is equally sensitive to the nerdy sense of the word (rapidity of rise/fall) AND the common meaning ("trendy" = "popular"). More explanation later; here's the chart (click to enlarge):


The calculations were done on the Corpus of Historical American English from Brigham Young University. You can see from their content list that they are heavy on books in the 19th century, then gradually newspapers take up more and more of the corpus. The overall amount of fiction is relatively stable, but this trend analysis is quite sensitive to corpus composition in each decade.

Until the 1920s, every popular word comes from books, usually a character name. For example, in the 1870s, there were at least five major books by different authors with a major character (whose name, therefore, got repeated a lot) named Elsie. It appears (at first glance, anyway) that character names had a bandwagon effect, much like baby names would 100 years later.

Also present are deliberately misspelled words like "uv" for "of" and "ter" for "to" (like "Ah oughts ter uv dun somethin"). This was a style of satirical writing at the time, not all of it racist, but certainly some of it.

In the 20th century, Presidents' names dominate, except for "planes" during World War II and, surprisingly to me, "EPA" (for Environmental Protection Agency) in the 1990s. "EPA" beat out "Clinton" because his name kept being used throughout the next decade, and "Bush" because that name has a common meaning as well.

I've used a chromatography peak technique (popular in analytical chemistry and astronomy) to analyze non-hard-science data before; here's a quick visual of how it works:


Here is a list of the 100 trendiest words overall:

word          trendiness  peak year  height*  width†
reagan        0.00338     1985       0.0338   10
nixon         0.00283     1975       0.0283   10
uv            0.00277     1865       0.0277   10
kennedy       0.00275     1965       0.0275   10
eisenhower    0.00224     1955       0.0224   10
ter           0.00169     1885       0.0169   10
communist     0.00166     1955       0.0249   15
planes        0.00122     1945       0.0122   10
jimmie        0.00118     1915       0.0118   10
coolidge      0.00111     1925       0.0111   10
elsie         0.00107     1875       0.0107   10
bradshaw      0.00107     1835       0.0107   10
korea         0.00106     1955       0.0106   10
rollo         0.00104     1855       0.0104   10
vietnam       0.00103     1965       0.0154   15
roosevelt     0.00103     1935       0.0205   20
katy          0.00098     1865       0.0098   10
graeme        0.00094     1865       0.0094   10
eleanor       0.00093     1925       0.0093   10
winthrop      0.00093     1855       0.0093   10
jeff          0.00091     1955       0.0091   10
madeleine     0.00089     1865       0.0089   10
dave          0.00088     1915       0.0088   10
communists    0.00088     1955       0.0132   15
lanny         0.00086     1945       0.0086   10
dulles        0.00084     1955       0.0084   10
pa            0.00082     1885       0.0082   10
amy           0.00081     1865       0.0081   10
jimbo         0.00080     1975       0.0080   10
isabella      0.00078     1835       0.0078   10
kissinger     0.00078     1975       0.0078   10
soviet        0.00077     1955       0.0307   40
redwood       0.00076     1825       0.0076   10
dewey         0.00076     1945       0.0076   10
stitch        0.00075     1875       0.0075   10
gypsy         0.00074     1865       0.0074   10
hev           0.00073     1865       0.0073   10
hitler        0.00072     1945       0.0108   15
elvira        0.00071     1825       0.0071   10
mcs           0.00071     1955       0.0071   10
atomic        0.00071     1955       0.0106   15
cuba          0.00069     1965       0.0069   10
alessandro    0.00068     1885       0.0068   10
wilford       0.00068     1865       0.0068   10
truman        0.00068     1955       0.0102   15
malone        0.00067     1965       0.0067   10
magdalen      0.00067     1875       0.0067   10
korean        0.00066     1955       0.0066   10
rowland       0.00066     1875       0.0066   10
stevenson     0.00065     1955       0.0065   10
mabel         0.00065     1855       0.0097   15
beulah        0.00065     1885       0.0065   10
goldwater     0.00063     1965       0.0063   10
tommy         0.00063     1935       0.0063   10
gaulle        0.00062     1965       0.0062   10
jessie        0.00061     1945       0.0061   10
ramona        0.00061     1885       0.0061   10
vasco         0.00061     1835       0.0061   10
bunny         0.00060     1925       0.0060   10
newt          0.00059     1865       0.0059   10
gubb          0.00058     1915       0.0058   10
epa           0.00058     1995       0.0058   10
ivan          0.00058     1905       0.0058   10
christie      0.00057     1875       0.0057   10
madonna       0.00057     1895       0.0057   10
banneker      0.00057     1925       0.0057   10
hammond       0.00056     1825       0.0056   10
viet          0.00056     1965       0.0084   15
hed           0.00056     1865       0.0056   10
harding       0.00055     1925       0.0055   10
dorothy       0.00055     1905       0.0055   10
subcommittee  0.00054     1955       0.0054   10
elnora        0.00054     1905       0.0054   10
teddy         0.00054     1895       0.0054   10
id            0.00052     1995       0.0052   10
seor          0.00052     1835       0.0052   10
lulu          0.00052     1885       0.0052   10
downing       0.00052     1835       0.0052   10
lucia         0.00051     1825       0.0051   10
montague      0.00051     1905       0.0051   10
lemuel        0.00051     1875       0.0051   10
wich          0.00051     1865       0.0051   10
christy       0.00051     1895       0.0076   15
israeli       0.00051     1975       0.0076   15
bertha        0.00051     1865       0.0051   10
nazi          0.00051     1945       0.0076   15
heyward       0.00050     1825       0.0050   10
watergate     0.00050     1975       0.0050   10
ms            0.00050     1955       0.0050   10
puffer        0.00050     1845       0.0050   10
didn          0.00049     1985       0.0049   10
purl          0.00049     1875       0.0049   10
maroney       0.00049     1875       0.0049   10
nunez         0.00049     1835       0.0049   10
trina         0.00049     1925       0.0049   10
ronald        0.00048     1985       0.0048   10
randy         0.00048     1905       0.0048   10
georgie       0.00048     1875       0.0048   10
castro        0.00048     1965       0.0048   10
lottie        0.00047     1875       0.0047   10

* height: % popularity at peak year
† width: measured at 50% of peak height, in years

A few observations: most of the words have a peak width of 10 years (the minimum, since COHA's resolution is at the decade level). Notable exceptions are Roosevelt (FDR was president during two decades, one of them wartime) and Soviet (a 40-year peak, which means the peak height had to be quite high for it to make the list). Some words of note: failed presidential candidates Dewey, Stevenson and Goldwater; Newt (but not Gingrich); Hitler and Nazi; Watergate; Ronald (the only presidential first name on the list).

The code used is on my GitHub, but here's the gist of it (no pun intended); a simplified sketch of steps 3 to 5 follows the list:
  1. The COHA 1-gram corpus is restricted, but I have an academic licence. Thanks to BYU for that. On my GitHub, I have summary data, but not the dataset itself.
  2. COHA is arranged in decades; I assigned each word the year in the middle of the decade (e.g. 1970s, which covers 1970-1979, was assigned 1975).
  3. For each word, I interpolated by simple mean the popularity for years ending in "0". For example, if a word was at 0.0024% in 1975 and 0.0026% in 1985, I assigned it 0.0025% in 1980. This was so peak widths for words that appeared in only one decade could be calculated (otherwise they would have a peak width of zero, and hence infinite 'trendiness').
  4. Instead of interpolating further to calculate peak widths (which would be entering overfitting territory), I used a simple Boolean test to calculate the start and end of each peak. The first time a point at a five-year interval exceeded 50% maximum peak height, the counter started, and then the first time it sank below 50%, it stopped. This means if a word was bimodal (two peaks in different years) with a point below 50% of maximum between the two peaks, only the larger peak was counted. This was not a common occurrence, and it ensured words only ever appeared once each.
  5. "Trendiness" was calculated as the peak height (in % of the corpus during that year) divided by the peak width (in years, always a multiple of five for reasons explained in the previous step).
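
And here's the promised sketch of steps 3 through 5, with illustrative names rather than my actual code (it also glosses over the bimodal-peak detail from step 4). 'popularity' maps a word's decade-midpoint years to its percentage of the corpus:

    # Trendiness = peak height / peak width at 50% of maximum height.
    def trendiness(popularity):
        years = range(1815, 2010, 5)   # decade midpoints plus in-between years

        def value(y):                  # step 3: simple-mean interpolation
            if y in popularity:
                return popularity[y]
            return (popularity.get(y - 5, 0.0) + popularity.get(y + 5, 0.0)) / 2

        pops = [value(y) for y in years]
        peak = max(pops)
        half = peak / 2
        # step 4: the counter starts the first time a point exceeds 50% of
        # peak height, and stops the first time a point sinks below it
        start = next(i for i, p in enumerate(pops) if p > half)
        stop = next((i for i in range(start + 1, len(pops)) if pops[i] < half),
                    len(pops) - 1)
        width = (stop - start) * 5     # years, always a multiple of five
        return peak / width            # step 5

    # A word seen only in the 1970s, at 0.0024% of the corpus:
    print(trendiness({1975: 0.0024}))  # peak 0.0024, width 10 -> 0.00024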