Monday, April 14, 2014

Google ngrams of 'google' and 'ngram'

I like to test data tools and data sets with "edge cases", a fancy term for using them in ways they were not designed to be used (which is, by the way, the definition of hacking). It's informative to see how far things will bend before they break -- and the good thing with data is it's easy to un-break.

Rare occurrences make good edge cases; so do recursive cases, i.e. running a data tool on itself. We looked briefly at the Google Ngram Viewer a couple of weeks ago; what happens if we look up the Google Ngrams of the words "Google" and "Ngram"? (By the way, I like to call this kind of approach 'selfremetacursironiferentiality'. I'm sure it will catch on one day so I look like less of a dork when I say it.)

Of course, the frequency of the word "google" after the company was incorporated in September 1998 is predictable: it becomes a very common word (and is even adopted into that hallowed club, The Verb, where Xerox briefly rested and from which Kleenex was inexplicably barred). The only interesting thing about its rise from 2001 to 2008 (where the data set ends) is that it's pretty linear; I would have guessed either positive or negative curvature, but don't forget this is the word's appearance in published, printed matter, not in conversation.


Let's have a look at "google" and "ngram" (both case-insensitive) from 1880 to 2000, before the rise of Google and with a vertical axis about fifty-fold lower so we can see the edge cases (in my experience, the more jagged a line is*, the more interesting it is**).

That's a lot of use of the word "google" before the company we all know and... well, know... existed. Using Google Books, the mystery is easy to solve: there was a newspaper comic strip character named Barney Google, and a lot of anthologies were published over the years. Not unusually, the technical term "ngram" lags far behind a term used in pop culture; however, it is surprising that around the dawn of the 20th century a term used in computational linguistics would turn up. Again, Google Books solves the mystery: this is an artifact of a lot of directories of names from around this time being poorly scanned; the name "Ingram" is being recorded as "I, ngram" (which sounds like a terrible book title).
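To make the "Ingram" artifact concrete, here is a minimal sketch of how one split name turns into a phantom word. It assumes a naive case-insensitive tokenizer (runs of letters), which is my simplification and not how Google actually builds its corpus; the two sample lines are invented for illustration and are not taken from any real scan.

```python
import re
from collections import Counter


def count_unigrams(text):
    """Count case-insensitive 1-grams, treating each run of letters as one token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


# Hypothetical directory lines, invented for illustration only.
clean_scan = "Ingram, John. Ingram, Mary."
poor_scan = "I, ngram, John. I, ngram, Mary."  # "Ingram" split by a bad scan

clean_counts = count_unigrams(clean_scan)
poor_counts = count_unigrams(poor_scan)

print("clean:", clean_counts["ingram"], "ingram,", clean_counts["ngram"], "ngram")
print("poor: ", poor_counts["ingram"], "ingram,", poor_counts["ngram"], "ngram")
```

In the clean text "ngram" never appears; in the badly scanned text every "Ingram" becomes one "i" and one "ngram", which is all it takes to put a computational-linguistics term in a 1902 directory.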


The moral of this story, as with all data sets too huge to be curated by humans (and, coincidentally, every other Aesop's fable): things are not always what they seem, so we'll be sure to dig a little before drawing conclusions, especially in edge cases. The next time someone brings up over the water cooler how ngrams were being studied in 1902, you can nod to yourself knowingly.

* of course, sometimes that means it's just noise, but I find noise interesting too
** that's what she said.
