Monday, September 29, 2014

Why's 'First World War' ngrams article is kinda sloppy, in one chart.

This morning (Sept. 29, 2014), Dylan Matthews wrote an article based on a tweet by Jared Keller of MicNews that purports to show what Keller calls "The exact moment 'The Great War' became 'World War I'", based on the following Google Ngrams search:

Matthews speculates further than Keller, claiming the following:
What's intriguing is that references to World War I began increasing even before World War II began in Europe. The big growth obviously came as the war began and after its conclusion, but this suggests that in at least some texts, "World War II" was used in the same ominous, premonitory way that "World War III" is today.
There are problems with the methodology and conclusions by both parties, as almost always happens when people use Google Ngram Viewer without understanding how it works and what its limitations are (which Culturomics -- yes, that's really the name of the organization in charge of Google Ngram Viewer -- does not at all go out of its way to acknowledge).

First of all, the title (and the tweet) mentions 'World War I', and the search is for 'first world war'; these aren't the same phrase, and that's what first set my alarm bell ringing. Also, the default setting in Google Ngram Viewer is a smoothing of 3, which means, for example, that the result shown for 1936 is an average of the results from 1933 to 1939. So seeing the graph start to rise in 1936 is pretty much indicative of nothing. Also, you see the '(All)' next to each search term? That means the search is agglomerating every combination of upper- and lowercase results.
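To see why a smoothing of 3 makes a 1936 rise meaningless, here is a minimal sketch of a centered moving average. The function name `smooth`, the shrinking window at the edges, and the frequency numbers are all assumptions made up for illustration; they are not real Ngram data or Google's actual code.

```python
def smooth(values, smoothing):
    """Centered moving average: each point becomes the mean of itself
    plus `smoothing` neighbours on either side (assumption: the window
    simply shrinks at the ends of the series)."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - smoothing)
        hi = min(len(values), i + smoothing + 1)
        window = values[lo:hi]
        out.append(sum(window) / len(window))
    return out

# Made-up series: the term is never used before 1939, then appears.
years = list(range(1930, 1943))
raw = [0.0 if y < 1939 else 7.0 for y in years]

smoothed = smooth(raw, 3)

# First year with any visible activity, before and after smoothing.
first_nonzero_raw = next(y for y, v in zip(years, raw) if v > 0)
first_nonzero_smoothed = next(y for y, v in zip(years, smoothed) if v > 0)

print(first_nonzero_raw, first_nonzero_smoothed)
```

In this toy series the raw data first moves in 1939, but the smoothed curve starts climbing in 1936, because the 1936 point already averages in the 1939 value.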

Let's have a look at unsmoothed, case-sensitive searches for the most common ways to refer to both wars ('WWI' and 'WWII' are much less common than these, BTW):

Firstly, the 'exact moment' that Keller referred to is a little more complicated when the results are case-sensitive; you can see the big hump at the beginning of what he implies is 'First World War' is actually 'first World War' -- a description, not a name, if 'first' isn't capitalized.

Secondly, virtually all of these terms start being used in 1939 or later; pretty much the opposite of Matthews's conclusion. There's a little bit of 1930-1938 activity, but it pales in comparison to the mentions these terms get in 1893.

Wait, what?

This is something people who use Google Ngram Viewer regularly are very familiar with. It's based on automatic processing of Google Books, which is based on automatic processing of library books. Lots of the dates of books are just plain wrong, lots of books are given the date of their first edition even though the scan is of a later edition with an introduction written with vocabulary from decades after the fact, and there are lots of OCR errors (that last one doesn't seem to be relevant in this case).

What are these 1893 books that use these terms? Luckily, Google Ngram Viewer links directly to Google Books search, so we can find out easily -- as, really, should anyone who uses this tool, as a basic 'sanity check' on their conclusions:

Obviously, none of the sentences with the search terms were written in 1893. The small level of pre-1939 activity doesn't even rise above this level of known-bad activity, so we can't be sure any of it was written contemporaneously with the listed date. (The original graphs started at 1900, BTW, so this bit of counterfactual information was not visible.)
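The sanity check above can be sketched as a simple comparison: treat the frequency in a year you know is misdated as a noise floor, and only trust years that rise above it. The function name `exceeds_noise_floor` and every number below are hypothetical, invented for illustration; they are not values read off the actual graph.

```python
def exceeds_noise_floor(freq_by_year, known_bad_years, candidate_years):
    """Return {year: True/False} for whether each candidate year's
    frequency is strictly above the highest frequency seen in years
    whose hits are known to be misdated."""
    floor = max(freq_by_year.get(y, 0.0) for y in known_bad_years)
    return {y: freq_by_year.get(y, 0.0) > floor for y in candidate_years}

# Made-up relative frequencies (not real Ngram values).
freq = {
    1893: 5e-9,   # hits here are known to come from misdated books
    1932: 1e-9,
    1936: 2e-9,
    1939: 2e-7,
}

verdict = exceeds_noise_floor(freq, known_bad_years=[1893],
                              candidate_years=[1932, 1936, 1939])
print(verdict)
```

With these invented numbers, 1932 and 1936 stay below the 1893 floor and only 1939 clears it, which is the shape of the argument being made here.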

Just for fun, let's check Matthews's statement about how 'World War III' is used today (along with 'Third World War'):

It appears that the big rise in these terms came during World War II -- which kind of makes sense: once we've started calling them I and II, III is an obvious point of speculation. The high point was in the 1950s, but usage has calmed down a lot by 'today' (the graph stops at 2000 because data from 2000 to 2008, where the database stops, are profoundly changed by the advent of digital media and can't easily be compared to the corpus before that date).

