Why Vox.com's 'First World War' ngrams article is kinda sloppy, in one chart.

This morning (Sept. 29, 2014), Dylan Matthews at Vox.com wrote an article based on a tweet by Jared Keller of MicNews that purports to show what Keller calls "The exact moment 'The Great War' became 'World War I'," based on the following Google Ngrams search:



Matthews speculates further than Keller, claiming the following:
What's intriguing is that references to World War I began increasing even before World War II began in Europe. The big growth obviously came as the war began and after its conclusion, but this suggests that in at least some texts, "World War II" was used in the same ominous, premonitory way that "World War III" is today.
There are problems with the methodology and conclusions by both parties, as almost always happens when people use Google Ngram Viewer without understanding how it works and what its limitations are (which Culturomics -- yes, that's really the name of the organization in charge of Google Ngram Viewer -- does not at all go out of its way to acknowledge).

First of all, the title (and the tweet) mentions 'World War I', and the search is for 'first world war'; these aren't the same phrase, and that's what first set my alarm bell ringing. Also, the default setting in Google Ngram Viewer is a smoothing of 3, which means, for example, that the result shown for 1936 is an average of the results from 1933 to 1939. So seeing the graph start to rise in 1936 is pretty much indicative of nothing. Also, you see the '(All)' next to each search term? That means the search is agglomerating every combination of upper- and lowercase results.
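To see why smoothing makes the 1936 rise meaningless, here's a minimal sketch of a centered moving average with a smoothing of 3. The yearly numbers are made up for illustration, and truncating the window at the edges of the series is my assumption about how the viewer handles the ends:

```python
def smooth(values, smoothing=3):
    """Average each point with up to `smoothing` neighbours on either side."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - smoothing)                  # truncate the window at the start
        hi = min(len(values), i + smoothing + 1)    # ...and at the end
        window = values[lo:hi]
        out.append(sum(window) / len(window))
    return out


# Hypothetical raw frequencies: nothing at all until a spike in 1939.
years = list(range(1933, 1940))
raw = [0, 0, 0, 0, 0, 0, 7]
smoothed = smooth(raw, smoothing=3)

for year, before, after in zip(years, raw, smoothed):
    print(year, before, round(after, 2))
```

The raw value for 1936 is zero, but the smoothed value isn't, because the 1939 spike falls inside its window.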

Let's have a look at unsmoothed, case-sensitive searches for the most common ways to refer to both wars ('WWI' and 'WWII' are much less common than these, BTW):

Firstly, the 'exact moment' that Keller referred to is a little more complicated when the results are case-sensitive; you can see the big hump at the beginning of what he implies is 'First World War' is actually 'first World War' -- a description, not a name, if 'first' isn't capitalized.

Secondly, virtually all of these terms start being used in 1939 or later; pretty much the opposite of Matthews's conclusion. There's a little bit of 1930-1938 activity, but it pales in comparison to the mentions these terms get in 1893.

Wait, what?

This is something people who use Google Ngram Viewer regularly are very familiar with. It's based on automatic processing of Google Books, which is based on automatic processing of library books. Lots of the dates of books are just plain wrong, lots of books are given the date of their first edition even though the scan is of a later edition with an introduction written with vocabulary from decades after the fact, and there are lots of OCR errors (that last one doesn't seem to be relevant in this case).

What are these 1893 books that use these terms? Luckily, Google Ngram Viewer links directly to Google Books search, so we can find out easily -- as, really, should anyone who uses this tool, as a basic 'sanity check' on their conclusions (click to enlarge):


Obviously, none of the sentences with the search terms were written in 1893. The small level of pre-1939 activity is even smaller than this spurious 1893 blip, so we can't be sure any of it was written contemporaneously with the listed date. (The original graphs started at 1900, BTW, so this bit of contradictory information was not visible.)

Just for fun, let's check Matthews's statement about how 'World War III' is used today (along with 'Third World War'):



It appears that the big rise in these terms was during World War II -- which kind of makes sense: once people had started calling them I and II, III was an obvious point of speculation. The high point was in the 1950s, but it has calmed down a lot by 'today' (the graph stops at 2000 because data from 2000 to 2008, where the database stops, are profoundly changed by the advent of digital media and can't easily be compared to the corpus before this date).

The trendiest words in American English for each decade of 19th & 20th c. (determined by a chemistry/astronomy technique)

While "trend" has a clear mathematical definition, "trendiness" does not; I've chosen a method that is equally sensitive to the nerdy sense of the word (rapidity of rise/fall) AND the common meaning ("trendy" = "popular"). More explanation later; here's the chart (click to enlarge):


The calculations were done on the Corpus of Historical American English from Brigham Young University. You can see from their content list that they are heavy on books in the 19th century, then gradually newspapers take up more and more of the corpus. The overall amount of fiction is relatively stable, but this trend analysis is quite sensitive to corpus composition in each decade.

Until the 1920s, every popular word comes from books, usually a character name. For example, in the 1870s, there were at least five major books by different authors with a major character (whose name, therefore, got repeated a lot) named Elsie. It appears (at first glance, anyway) that character names had a bandwagon effect, much like baby names do 100 years later.

Also present are deliberately misspelled words like "uv" for "of" and "ter" for "to" (like "Ah oughts ter uv dun somethin"). This was a style of satirical writing at the time, not all of it racist, but certainly some of it.

In the 20th century, presidents' names dominate, except for "planes" during World War II and, surprisingly to me, EPA (for Environmental Protection Agency) in the 1990s. The reason it beat out "Clinton" is that his name kept being used throughout the next decade (which widens the peak), and it beat "Bush" because that name has a common meaning as well.

I've used a chromatography peak technique (popular in analytical chemistry and astronomy) to analyze non-hard-science data before; here's a quick visual of how it works:


Here is a list of the 100 trendiest words overall:

word | trendiness | peak year | height (% popularity at peak year) | width at 50% height (years)
reagan | 0.00338 | 1985 | 0.0338 | 10
nixon | 0.00283 | 1975 | 0.0283 | 10
uv | 0.00277 | 1865 | 0.0277 | 10
kennedy | 0.00275 | 1965 | 0.0275 | 10
eisenhower | 0.00224 | 1955 | 0.0224 | 10
ter | 0.00169 | 1885 | 0.0169 | 10
communist | 0.00166 | 1955 | 0.0249 | 15
planes | 0.00122 | 1945 | 0.0122 | 10
jimmie | 0.00118 | 1915 | 0.0118 | 10
coolidge | 0.00111 | 1925 | 0.0111 | 10
elsie | 0.00107 | 1875 | 0.0107 | 10
bradshaw | 0.00107 | 1835 | 0.0107 | 10
korea | 0.00106 | 1955 | 0.0106 | 10
rollo | 0.00104 | 1855 | 0.0104 | 10
vietnam | 0.00103 | 1965 | 0.0154 | 15
roosevelt | 0.00103 | 1935 | 0.0205 | 20
katy | 0.00098 | 1865 | 0.0098 | 10
graeme | 0.00094 | 1865 | 0.0094 | 10
eleanor | 0.00093 | 1925 | 0.0093 | 10
winthrop | 0.00093 | 1855 | 0.0093 | 10
jeff | 0.00091 | 1955 | 0.0091 | 10
madeleine | 0.00089 | 1865 | 0.0089 | 10
dave | 0.00088 | 1915 | 0.0088 | 10
communists | 0.00088 | 1955 | 0.0132 | 15
lanny | 0.00086 | 1945 | 0.0086 | 10
dulles | 0.00084 | 1955 | 0.0084 | 10
pa | 0.00082 | 1885 | 0.0082 | 10
amy | 0.00081 | 1865 | 0.0081 | 10
jimbo | 0.00080 | 1975 | 0.0080 | 10
isabella | 0.00078 | 1835 | 0.0078 | 10
kissinger | 0.00078 | 1975 | 0.0078 | 10
soviet | 0.00077 | 1955 | 0.0307 | 40
redwood | 0.00076 | 1825 | 0.0076 | 10
dewey | 0.00076 | 1945 | 0.0076 | 10
stitch | 0.00075 | 1875 | 0.0075 | 10
gypsy | 0.00074 | 1865 | 0.0074 | 10
hev | 0.00073 | 1865 | 0.0073 | 10
hitler | 0.00072 | 1945 | 0.0108 | 15
elvira | 0.00071 | 1825 | 0.0071 | 10
mcs | 0.00071 | 1955 | 0.0071 | 10
atomic | 0.00071 | 1955 | 0.0106 | 15
cuba | 0.00069 | 1965 | 0.0069 | 10
alessandro | 0.00068 | 1885 | 0.0068 | 10
wilford | 0.00068 | 1865 | 0.0068 | 10
truman | 0.00068 | 1955 | 0.0102 | 15
malone | 0.00067 | 1965 | 0.0067 | 10
magdalen | 0.00067 | 1875 | 0.0067 | 10
korean | 0.00066 | 1955 | 0.0066 | 10
rowland | 0.00066 | 1875 | 0.0066 | 10
stevenson | 0.00065 | 1955 | 0.0065 | 10
mabel | 0.00065 | 1855 | 0.0097 | 15
beulah | 0.00065 | 1885 | 0.0065 | 10
goldwater | 0.00063 | 1965 | 0.0063 | 10
tommy | 0.00063 | 1935 | 0.0063 | 10
gaulle | 0.00062 | 1965 | 0.0062 | 10
jessie | 0.00061 | 1945 | 0.0061 | 10
ramona | 0.00061 | 1885 | 0.0061 | 10
vasco | 0.00061 | 1835 | 0.0061 | 10
bunny | 0.00060 | 1925 | 0.0060 | 10
newt | 0.00059 | 1865 | 0.0059 | 10
gubb | 0.00058 | 1915 | 0.0058 | 10
epa | 0.00058 | 1995 | 0.0058 | 10
ivan | 0.00058 | 1905 | 0.0058 | 10
christie | 0.00057 | 1875 | 0.0057 | 10
madonna | 0.00057 | 1895 | 0.0057 | 10
banneker | 0.00057 | 1925 | 0.0057 | 10
hammond | 0.00056 | 1825 | 0.0056 | 10
viet | 0.00056 | 1965 | 0.0084 | 15
hed | 0.00056 | 1865 | 0.0056 | 10
harding | 0.00055 | 1925 | 0.0055 | 10
dorothy | 0.00055 | 1905 | 0.0055 | 10
subcommittee | 0.00054 | 1955 | 0.0054 | 10
elnora | 0.00054 | 1905 | 0.0054 | 10
teddy | 0.00054 | 1895 | 0.0054 | 10
id | 0.00052 | 1995 | 0.0052 | 10
seor | 0.00052 | 1835 | 0.0052 | 10
lulu | 0.00052 | 1885 | 0.0052 | 10
downing | 0.00052 | 1835 | 0.0052 | 10
lucia | 0.00051 | 1825 | 0.0051 | 10
montague | 0.00051 | 1905 | 0.0051 | 10
lemuel | 0.00051 | 1875 | 0.0051 | 10
wich | 0.00051 | 1865 | 0.0051 | 10
christy | 0.00051 | 1895 | 0.0076 | 15
israeli | 0.00051 | 1975 | 0.0076 | 15
bertha | 0.00051 | 1865 | 0.0051 | 10
nazi | 0.00051 | 1945 | 0.0076 | 15
heyward | 0.00050 | 1825 | 0.0050 | 10
watergate | 0.00050 | 1975 | 0.0050 | 10
ms | 0.00050 | 1955 | 0.0050 | 10
puffer | 0.00050 | 1845 | 0.0050 | 10
didn | 0.00049 | 1985 | 0.0049 | 10
purl | 0.00049 | 1875 | 0.0049 | 10
maroney | 0.00049 | 1875 | 0.0049 | 10
nunez | 0.00049 | 1835 | 0.0049 | 10
trina | 0.00049 | 1925 | 0.0049 | 10
ronald | 0.00048 | 1985 | 0.0048 | 10
randy | 0.00048 | 1905 | 0.0048 | 10
georgie | 0.00048 | 1875 | 0.0048 | 10
castro | 0.00048 | 1965 | 0.0048 | 10
lottie | 0.00047 | 1875 | 0.0047 | 10

A few observations: most of the words have a peak width of 10 years (the minimum, since COHA's resolution is at the decade level). Notable exceptions are Roosevelt (FDR was president during two decades, one wartime) and Soviet (a 40-year peak, which means the peak height had to be quite high for it to make the list). Some words of note: failed presidential candidates Dewey, Stevenson and Goldwater; Newt (but not Gingrich); Hitler and Nazi; Watergate; Ronald (the only presidential first name on the list).

The code used is on my GitHub, but here's the gist of it (no pun intended):
  1. The COHA 1-gram corpus is restricted, but I have an academic licence. Thanks to BYU for that. On my GitHub, I have summary data, but not the dataset itself.
  2. COHA is arranged in decades; I assigned each word the year in the middle of the decade (e.g. 1970s, which covers 1970-1979, was assigned 1975).
  3. For each word, I interpolated by simple mean the popularity for years ending in "0". For example, if a word was at 0.0024% in 1975 and 0.0026% in 1985, I assigned 0.0025% in 1980. This was so peak widths for words that appeared only in one decade could be calculated (otherwise they would have a peak width of zero, and have infinite 'trendiness').
  4. Instead of interpolating further to calculate peak widths (which would be entering overfitting territory), I used a simple Boolean test to calculate the start and end of each peak. The first time a point at a five-year interval exceeded 50% maximum peak height, the counter started, and then the first time it sank below 50%, it stopped. This means if a word was bimodal (two peaks in different years) with a point below 50% of maximum between the two peaks, only the larger peak was counted. This was not a common occurrence, and it ensured words only ever appeared once each.
  5. "Trendiness" was calculated by the peak height (in % of corpus during that year) divided by peak width (in years, always a multiple of five for reasons explained in the previous step).
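The five steps above can be sketched roughly as follows. This is not the code from my GitHub, just a minimal illustration: the function names are made up for this example, missing decades are assumed to count as zero, the half-height test is assumed to be "at or above 50%", and the width is measured outward from the tallest point.

```python
def five_year_series(decade_pct):
    """Turn {mid-decade year: % of corpus} into a series on a five-year grid."""
    mids = sorted(decade_pct)
    lo, hi = mids[0] - 10, mids[-1] + 10
    # Assumption: decades with no data (plus one pad decade each side) count as zero.
    full = {y: decade_pct.get(y, 0.0) for y in range(lo, hi + 1, 10)}
    series = dict(full)
    for y in range(lo, hi, 10):
        # Years ending in "0" get the simple mean of the two neighbouring decades.
        series[y + 5] = (full[y] + full[y + 10]) / 2
    return dict(sorted(series.items()))


def trendiness(decade_pct):
    """Return (trendiness, peak year, peak height, width at 50% height)."""
    series = five_year_series(decade_pct)
    years, values = list(series), list(series.values())
    peak = max(range(len(values)), key=values.__getitem__)
    height = values[peak]
    half = height / 2
    left = right = peak
    # Walk outward from the peak while points stay at or above half height.
    while left > 0 and values[left - 1] >= half:
        left -= 1
    while right < len(values) - 1 and values[right + 1] >= half:
        right += 1
    width = years[right] - years[left]
    return height / width, years[peak], height, width


# A word seen in a single decade (the 'reagan' row of the table).
one_decade = trendiness({1985: 0.0338})
# A wider peak; the 1945 and 1965 values are hypothetical.
wider = trendiness({1945: 0.005, 1955: 0.0249, 1965: 0.013})
print(one_decade)
print(wider)
```

A word that appears in only one decade gets a width of 10 years this way, which is why 10 is the minimum in the table.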

Mythological names are on the rise; Pokémon, not so much.

 (Click to enlarge graphs)

As you can see, mythological boys' names were pretty negligible until the mid-1990s, after which they've had quite an explosion, with boys named Phoenix, Odin and Ares leading the pack. Mythological names used to be more common for girls than for boys, but they've increased as well, and the composition of the names has changed dramatically. In 1940, Minerva and Vesta were the most popular (a virgin Roman goddess of wisdom and war and a virgin Roman goddess of the hearth ... I'll let you draw your own conclusions from this). Now it's Athena and Isis, unfortunately for those who watch the news from Iraq these days. (Note: an earlier version of the girls' graph omitted the name Athena; thanks to reader John for noticing it.)

Categorizing baby names is not straightforward; there's a judgment call involved. You can cast a broad net and accept names like Amon, which is coincidentally both an Egyptian god's name and a Hebrew name, but that pretty much makes the list meaningless because it ends up dominated by such names. Or you can try to judge through semi-quantitative methods whether a reasonable person in American society would think of a name as mythological.

So I limited the mythologies to Greek, Norse, Egyptian and Roman, because they're the most well-known mythologies in this culture. I had to pass on Celtic, because so many of its names are both mythological and common (like Brigid or Dylan). I used nameberry.com's database of name origins to see whether names were mythological in origin or shared the name with other, more popular, traditions.

The graphs start at 1940 because even though the Social Security Administration publishes them back to 1880, the data is extremely unrepresentative in the early years.

The list of names I started with, and the ones I eliminated at each step, are in my GitHub repo, along with an IPython notebook of the code I used to analyze the data and make the graphs.


Gotta catch 'em all... okay, a few of 'em.



Were people named after Pokémon? Obviously, the reverse happens, since there's a Pokémon named Casey. These graphs are more jagged because the y axis is less than 10% that of the mythological names charts, and some names are rising above and dipping below the dataset's minimum of five babies in a given year.

It appears that only the boys' name Yadon appeared post-Pokémon; rather surprisingly, Lizardo appears once in 1970 before reappearing once in 2010. This dataset is riddled with errors, however, especially before digital data entry, so it's quite possibly apocryphal.

Girls named Eevee, Amaura, Kimon and Kameil only appeared post-Pokémon; Abra has been around since the mid-'50s and enjoyed a brief surge around the time the Steve Miller Band's "Abracadabra" was playing on the radio. I remember those days; the lyric "I wanna reach out and grab ya" was pretty racy for Top 40 back then.

How often does a given letter follow another in English?

The following chart is an interactive heat map of the probability that, given the letter on the vertical axis in an English word, the next letter will be the letter on the horizontal axis.
Conditional probabilities can give you a headache; that's why the Monty Hall problem is so difficult. The best way to grasp it is by example. Look at the darkest point, for QU. This shows that, GIVEN Q, the next letter is U 98.7% of the time. Similarly, the dark spot on the bottom left shows that GIVEN Y, the most probable event is that there is NO letter following it (signified by "_"), i.e. it's at the end of a word.

The second graph below shows the reverse probabilities, e.g. given U, what's the probability that a Q precedes it, and given we're at the end of a word, what's the probability that the last letter is Y. Trust me, I know, it requires a little mental agility, I've been working on this for a week and I still get mixed up. If you can understand why, in the top graph, the horizontal rows add up to 100% but the vertical columns don't, you've totally got it.

A while back, I posted a series of charts about letter positions in English words. Nathan Yau of FlowingData was kind enough to write about it, and he suggested I look at letter proximity.

The source data is COHA, the Corpus of Historical American English; each word was analyzed and weighted by its frequency (so the "th" in "the" influenced the probability of H following T way more than the "th" in "theremin").
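Here's a minimal sketch of the frequency-weighted calculation, in both directions. The word frequencies are a made-up toy list, not COHA numbers, and the function names are just for this example; the repo code may be organized quite differently.

```python
from collections import Counter, defaultdict

END = "_"  # marks "no letter follows", i.e. the end of a word


def transition_counts(word_freqs):
    """Count letter pairs, weighting each pair by its word's frequency."""
    counts = defaultdict(Counter)
    for word, freq in word_freqs.items():
        letters = word.lower() + END
        for current, following in zip(letters, letters[1:]):
            counts[current][following] += freq
    return counts


def following_probs(counts):
    """P(next letter | current letter): normalise each row."""
    return {
        cur: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for cur, row in counts.items()
    }


def preceding_probs(counts):
    """P(previous letter | current letter): normalise each column."""
    column_totals = Counter()
    for row in counts.values():
        column_totals.update(row)
    result = defaultdict(dict)
    for cur, row in counts.items():
        for nxt, n in row.items():
            result[nxt][cur] = n / column_totals[nxt]
    return dict(result)


# Hypothetical toy frequencies for illustration only.
toy = {"the": 1000, "theremin": 1, "quit": 20, "they": 50}
counts = transition_counts(toy)
forward = following_probs(counts)
backward = preceding_probs(counts)

print(forward["t"])   # mostly 'h', because 'the' dominates the weighting
print(backward["u"])  # in this toy list, 'u' is always preceded by 'q'
```

Each row of `forward` sums to 100% because something (a letter or the end of the word) always comes next; the columns have no reason to.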

Here's a GitHub repo with the code used to produce the data; after experimentation, both Plotly and Bokeh had serious drawbacks when it came to presenting heatmaps of this sort (which will presumably be addressed by later releases), so I went with Tableau Public, which took about 20 minutes, tops. Note that with this app, you can click things and hide things and have all kinds of fun.

Here's the graph of the probabilities of letters preceding, not following, one another. There are also static graphic versions at the very end. Enjoy!


Static versions (Click to enlarge):

   

Buck naked to butt naked, arms to anus, 19th century iPhones and other Google Ngram oddities

I've posted a couple times about the Google Books Ngrams Viewer before.
This data set is a rich vein for data mining. Plus it's almost completely uncurated, so it's a good target for data spelunking (that's my own idiolect for testing the boundaries of a dataset, to see what false conclusions it can appear to support). However, it's slow going because the metadata alone is really, really enormous (I've only got a fraction of it, and it's more than 3 terabytes). But as I peruse, I've come across some items of interest, totally non-systematically:

1. Butt Naked appears to be taking the place of Buck Naked
The etymology of the phrase "buck naked" is shrouded in mystery; some even think it's a Bowdlerization, and "butt naked" was the original term. But it's clear that in this corpus, anyway, "butt naked" is becoming more and more popular. I hypothesize that it's an example of elision (the k sound followed by the n sound is difficult to say, whereas the t becomes a glottal stop and rolls right off the... er, glottis.) Plus it does make a certain semantic sense: if you're naked, one can see your butt, no?
Speaking of naked (my searches for this word seem to have influenced my Google AdWords profile, so I'm getting much racier suggestions online), this was a little surprising to me:

Does this mean we're getting more prudish, and less willing to discuss the absence of clothes? Probably not. The Google Books corpus is heavily weighted with 'Library Bias'; it reflects the contents of books it was able to scan in the mid-2000s. I believe a higher proportion of 19th-century books in the corpus are biblical or scientific compared to later books, and use the word less ashamedly.

2. OCR sometimes misreads 'arms' as 'anus'

I didn't come up with this observation of the fallible nature of Optical Character Recognition, but I haven't seen any Ngrams of it. This story got wide media coverage in May 2014, when someone noticed some old romance novels in Google Books contained phrases like this (click to enlarge):

Most of these Google Books examples are difficult to find individually in Google Ngrams viewer (but they're there, you just have to dig), because the exact search phrase has to appear more than 40 times in a year to be listed in their metadata. I first became aware of this risible phenomenon in 2009 thanks to this blog post, but it didn't get much traction at the time. So it goes.

3. 19th century iPhone?

I'm reasonably certain U.S. President Martin Van Buren didn't have an iPhone in the 1830s. Anachronisms in this data set sometimes come from documents being assigned the wrong publication year, but that bias usually works in the opposite direction: A book written in 1848 is reprinted in 1964, so it shows up in the database as a later year. In general, it's been my experience that modern terms in the past come from OCR errors (sometimes, as in this case, errors in word boundaries; there's a species of snake and a character in the Aeneid called Tisiphone that sometimes is rendered "tis iphone")... and then there are errors that are much less understandable, such as the following:
The OCR thinks "There" is "iPhone", with proper trademark capitalization? I posit there's a non-random error responsible for that.

I'll leave you with a few other 19th-century anachronisms (click to enlarge):




Sex ratio is the clearest indicator of bias in the baby names dataset


I've written before about how the U.S. Social Security baby names dataset, despite being trotted out by plenty of commercial websites aimed at parents, needs to be taken with a grain of salt, and a whole shaker of salt before the 1930s. This is just about the clearest graphical demonstration I've come up with.

It's impossible to quantify race ratio for the dataset, but since Social Security covered only certain occupations at first, and those excluded most of the occupations that were available for black men and women (for example, day labor and domestic work), it's safe to say the database is severely unbalanced in that regard as well.

Despite having an extensive work history in biology, I never knew that more male babies are born than female babies, a universal phenomenon across the world (exacerbated by sex-selective abortions in some regions, unfortunately).
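For anyone who wants to reproduce the ratio, here's a minimal sketch. It assumes the national SSA files, which have one file per year with lines like `Mary,F,7065`; the helper names are made up for this example and the sample counts are invented, not real SSA figures.

```python
import csv
from collections import defaultdict


def read_ssa_year(path, year):
    """Yield (year, sex, count) from one national file with name,sex,count lines."""
    with open(path, newline="") as fh:
        for name, sex, count in csv.reader(fh):
            yield year, sex, int(count)


def sex_ratio_by_year(rows):
    """Return {year: males per 100 females} from (year, sex, count) rows."""
    totals = defaultdict(lambda: {"M": 0, "F": 0})
    for year, sex, count in rows:
        totals[year][sex] += int(count)
    return {
        year: 100 * t["M"] / t["F"]
        for year, t in sorted(totals.items())
        if t["F"]  # skip years with no recorded girls to avoid dividing by zero
    }


# Invented counts for illustration only.
sample = [
    (1900, "F", 300), (1900, "M", 150),
    (2000, "F", 1000), (2000, "M", 1050),
]
ratios = sex_ratio_by_year(sample)
print(ratios)
```

A ratio far from roughly 105 boys per 100 girls in a given year is a sign that the records for that year are not a representative sample of births.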

I've updated my previous Tableau Public storyboard on the limitations of the Social Security dataset to include this tidbit.

Full movies in 60 seconds: 70 animated gifs of films

Here are 70 animated gifs I've made over the past year, condensing full movies into about 60 seconds. I'll just list the first batch below, then I'll talk a little about the process, then I'll list the rest. Please note that these are about 10 MB each, so they can take a while to load.
Star Wars
    Star Wars Episode I: The Phantom Menace (1999) animated full movie gif
    Star Wars Episode II: Attack of the Clones (2002) animated full movie gif
    Star Wars Episode III: Revenge of the Sith (2005) animated full movie gif
    Star Wars (1977) animated full movie gif
    The Empire Strikes Back (1980) animated full movie gif
    Return of the Jedi (1983) animated full movie gif
    All Star Wars movies in 60 seconds animated full movie gif
    Star Wars Holiday Special (1978) animated full movie gif

First obvious question: why? I did these for the Reddit forum /r/FullMovieGifs. Why do (some) people like them? Because it gives a nice feeling to be quickly reminded of a favourite movie, and also it can reveal things not obvious in the full movie. For example, Paul Thomas Anderson likes really long takes where the camera slowly, slowly zooms in; this is way more evident when you watch the gif of Magnolia, below.

This was not just a matter of taking one frame every 10-15 seconds; I did my best to choose specific frames in order to give the best possible summary. I tried to linger on the most important and/or memorable scenes, and go quickly through ones that were less important, were a wasted frame (such as most establishing shots, where they show the building exterior before they show the characters in a room in a building; it's a good visual shorthand but useless in a gif), or were disorienting (long dialogue scenes that flip quickly back and forth between two people's faces are way more kinetic and disorienting in gif form, so they had to be cut down and the frames carefully chosen).

Here's an example of the frame rate of Star Wars, Ep. IV (which I call Star Wars, 1977 above, 'cause I'm old and stubborn). You can see it's not very uniform, and if you click to enlarge you'll see I lingered on the most iconic scenes.


The Lord of the Rings
    The Lord of the Rings: The Fellowship of the Ring (2001) animated full movie gif
    The Lord of the Rings: The Two Towers (2002) animated full movie gif
    The Lord of the Rings: The Return of the King (2003) animated full movie gif
Star Trek
    Star Trek II: The Wrath of Khan (1982) animated full movie gif
    Star Trek Into Darkness (2013) animated full movie gif
    Star Trek (2009) animated full movie gif
    Star Trek (2009) [lens flares only] animated full movie gif
TV Specials
    A Charlie Brown Christmas (1965) animated full movie gif
    Dr. Seuss' How the Grinch Stole Christmas! (1966) animated full movie gif
    Rudolph the Red-Nosed Reindeer (1964) animated full movie gif
Sci-Fi
    A Trip to the Moon (1902) animated full movie gif
    2001: A Space Odyssey (1968) animated full movie gif
    Planet of the Apes (1968) animated full movie gif
    Alien (1979) animated full movie gif
    Blade Runner (Final Cut) (1982) animated full movie gif
    The Terminator (1984) animated full movie gif
    Terminator 2: Judgment Day (1991) animated full movie gif
    Contact (1997) animated full movie gif
    Predator (1987) animated full movie gif
    Robocop (1987) animated full movie gif
    28 Days Later (2002) animated full movie gif
    The Fountain (2006) animated full movie gif
    Inception (2010) animated full movie gif
Cornetto Trilogy
    Shaun of the Dead (2004) animated full movie gif
    Hot Fuzz (2007) animated full movie gif
    The World's End (2013) animated full movie gif
Anime
    Summer Wars (2009) animated full movie gif
Comedy
    Monty Python and the Holy Grail (1975) animated full movie gif
    The Rocky Horror Picture Show (1975) animated full movie gif
    Say Anything... (1989) animated full movie gif
    Ferris Bueller's Day Off (1986) animated full movie gif
    The Princess Bride (1987) animated full movie gif
    Groundhog Day (1993) animated full movie gif
    Fargo (1996) animated full movie gif
Drama/Action
    Citizen Kane (1941) animated full movie gif
    Casablanca (1942) animated full movie gif
    Rashomon (1950) animated full movie gif
    The Godfather, Part II (1974) animated full movie gif
    Barry Lyndon (1975) animated full movie gif
    Jaws (1975) animated full movie gif
    One Flew Over the Cuckoo's Nest (1975) animated full movie gif
    Apocalypse Now (1979) animated full movie gif
    Raiders of the Lost Ark (1981) animated full movie gif
    Scarface (1983) animated full movie gif
    The Silence of the Lambs (1991) animated full movie gif
    Reservoir Dogs (1992) animated full movie gif
    Schindler's List (1993) animated full movie gif
    The Shawshank Redemption (1994) animated full movie gif
    The Usual Suspects (1994) animated full movie gif
    Trainspotting (1996) animated full movie gif
    The Big Lebowski (1998) animated full movie gif
    Magnolia (1999) animated full movie gif
    Memento (2000) animated full movie gif
    Memento (2000) in chronological order animated full movie gif
    Ocean's Eleven (2001) animated full movie gif
    Y Tu Mama Tambien (2001) [NSFW] animated full movie gif
    Eternal Sunshine of the Spotless Mind (2004) animated full movie gif
    The Prestige (2006) animated full movie gif
    300 (2006) animated full movie gif
    No Country for Old Men (2007) animated full movie gif
    There Will Be Blood (2007) animated full movie gif
    Scott Pilgrim vs. the World (2010) animated full movie gif