I became unexpectedly unemployed yesterday, and since I don't believe in long mourning periods (or poverty), I started my job search right away and came across this infographic. Let's be fair: there are far worse infographics out there. But my version of human nature somehow gets more perturbed by almost-competence than by abject failure; I suppose, knowing nothing about the creator, that in my head I'm blaming them for not trying hard enough. Well, if the creator happens to come across this, I totally don't want to hurt your feelings (much); you just need a little more practice, as do we all.


I like to test data tools and data sets with "edge cases", a fancy term for using them in ways they were not designed to be used (which is, by the way, the definition of hacking). It's informative to see how far things will bend before they break -- and the nice thing about data is that it's easy to un-break.

Rare occurrences make good edge cases; so do recursive cases, i.e. running a data tool on itself. We looked briefly at the Google Ngram Viewer a couple of weeks ago; what happens if we look up the Google Ngrams of the words "Google" and "Ngram" themselves? (By the way, I like to call this kind of approach 'selfremetacursironiferentiality'. I'm sure it will catch on one day so I look like less of a dork when I say it.)

Of course, the frequency of the word "google" after the company was incorporated in September 1998 is predictable: it becomes a very common word (and is even adopted into that hallowed club, The Verb, where Xerox briefly rested and from which Kleenex was inexplicably barred). The only interesting thing about its rise from 2001 to 2008 (where the data set ends) is that it's pretty linear; I would have intuited either positive or negative curvature, but don't forget this is the word's appearance in published, printed matter, not in conversation.


Let's have a look at "google" and "ngram" (both case-insensitive) from 1880 to 2000, before the rise of Google and with a vertical axis about fifty-fold lower, so we can see the edge cases (in my experience, the more jagged a line is*, the more interesting it is**).

That's a lot of use of the word "google" before the company we all know and... well, know... existed. Using Google Books, the mystery is easy to solve: there was a newspaper comic strip character named Barney Google, and a lot of anthologies were published over the years. It's not unusual for a technical term like "ngram" to lag far behind a term used in pop culture; what is surprising is that a term from computational linguistics turns up at all around the dawn of the 20th century. Again, Google Books solves the mystery: this is an artifact of a lot of name directories from around that time being poorly scanned; the name "Ingram" is being recorded as "I, ngram" (which sounds like a terrible book title).


The moral of this story, as with all data sets too huge to be curated by humans (and, coincidentally, every other Aesop's fable): things are not always what they seem, so we should dig a little before drawing conclusions, especially in edge cases. The next time someone at the water cooler brings up how ngrams were being studied in 1902, you can nod to yourself knowingly.

* of course, sometimes that means it's just noise, but I find noise interesting too
** that's what she said.

I had a different webcomic planned for this week, but the shiny orange "Publish" button is more tempting to press when I'm half-asleep than the "Save" button, so here goes!

For those who aren't up on their 1972 celebrity semi-scandals, here's what the comic is referring to. Edmund J. Mittlebaum is entirely invented; the identity of Carly's spurned love has more incompatible theories than the Kennedy assassination. Taylor Swift claims she knows who it is, which makes sense, because if there's one thing Taylor is tight-lipped about, it's ex-boyfriends.

It's been noted before that one of the most striking trends when analyzing American baby names is the rise in popularity of boys' names ending with the letter 'n' over the past few decades. What I haven't seen is a visualization that truly demonstrates the scale of this phenomenon, and for good reason: it's difficult to show trends over time in 26 variables. So I made this animated GIF of bar graphs; pay attention to the 'n' after the mid-70s.
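If you want to try the tabulation behind it yourself, here's a minimal sketch, assuming you've downloaded the Social Security Administration's baby name files (one yobYYYY.txt per year, with lines like "Michael,M,66000"); the file paths, year range, and frame-by-frame plotting are placeholders for however you prefer to assemble the GIF:

```python
import csv
import string
from collections import Counter

import matplotlib.pyplot as plt

def last_letter_shares(path, sex="M"):
    """Fraction of babies of the given sex whose name ends in each letter, for one year's SSA file."""
    totals = Counter()
    with open(path) as f:
        for row in csv.reader(f):
            if len(row) != 3:
                continue  # skip any blank or malformed lines
            name, s, count = row
            if s == sex:
                totals[name[-1].lower()] += int(count)
    grand_total = float(sum(totals.values()))
    return {letter: totals[letter] / grand_total for letter in string.ascii_lowercase}

letters = list(string.ascii_lowercase)
for year in range(1880, 2013):  # adjust to whatever years you have on disk
    shares = last_letter_shares("names/yob%d.txt" % year)  # path is a placeholder
    plt.figure(figsize=(8, 4))
    plt.bar(range(26), [shares[l] for l in letters])
    plt.xticks(range(26), letters)
    plt.ylim(0, 0.4)
    plt.title("Last letter of boys' names, %d" % year)
    plt.savefig("frame_%04d.png" % year)  # stitch the frames into a GIF with e.g. ImageMagick
    plt.close()
```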


I was also interested in the trends for each letter; in the GIF above, there's a rise and a fall of names ending in "d" (although the rise ends in the mid-1930s, which I've already explained is problematic due to the way the data was collected). So here's a grid of every letter; the scales are not the same ("n" is far more popular than "q", for example), so I've shaded each one so that darker green corresponds to higher popularity, letting the overall trend of each letter be seen.
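The grid itself is nothing fancy: 26 small multiples, each on its own vertical scale, tinted darker for letters that are more popular overall. A sketch of that plotting step, assuming a shares_by_year dict of the form built above (here the "darkness" is just an alpha value, a stand-in for whatever shading scheme you prefer):

```python
import string
import matplotlib.pyplot as plt

def plot_letter_grid(shares_by_year, out_path="letter_grid.png"):
    """shares_by_year: {year: {letter: fraction}}; draws one small panel per final letter."""
    years = sorted(shares_by_year)
    letters = string.ascii_lowercase
    # Overall popularity of each letter, used only to choose how dark its green is.
    overall = {l: sum(shares_by_year[y][l] for y in years) / len(years) for l in letters}
    darkest = max(overall.values())

    fig, axes = plt.subplots(4, 7, figsize=(14, 8))
    panels = axes.flatten()
    for ax, letter in zip(panels, letters):
        series = [shares_by_year[y][letter] for y in years]
        alpha = 0.2 + 0.8 * overall[letter] / darkest  # darker green = more popular overall
        ax.fill_between(years, series, color=(0.0, 0.5, 0.0, alpha))
        ax.set_title(letter)
        ax.set_xticks([])
        ax.set_yticks([])  # every panel has its own vertical scale
    for ax in panels[len(letters):]:
        ax.axis("off")  # the 4x7 grid has two spare panels
    fig.savefig(out_path)
```

Here's the grid itself: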


There's still more that can be done with this data; only since 2011 have as many as four of the top ten boys' names ended in 'n', so evidently this is a phenomenon that has carried through more than the top tier of popularity; it would be interesting to see the contributions of different names. I also wonder what some of the peaks and valleys for other letters represent, and of course one could always do the same analysis on the last letters of girls' names (let me guess: lots of "a"s), the first letters of either sex, and even middle letters or multi-letter patterns. More to come, unless some other shiny data bauble catches my eye first...

Let the games begin.






EDIT: My previous attempt, shown below, had a meteor that looked more like a thermometer. No wonder the dinosaur was concerned.

If you're unfamiliar with the term or concept of ngrams in general or Google Ngram Viewer in particular, a look at it in action is the best explanation:



This shows how often the words "overrated" and "underrated" appear in Google Books from 1800 to 2008 -- sort of. There are a few caveats, which Google is upfront about (although I wish they'd post, on the main page of the Ngram Viewer, a précis of the database's shortcomings and the main erroneous conclusions that can be drawn from them). I'll get into the unique problems of computerized curation of a dataset so huge it comprises 6% of all the books in existence (so they claim; it depends how you count them, but it's a defensible number).

So, as the title says, what the heck is an ngram? Well, what you see above are 1grams. If I look up phrases, they become 2grams (or bigrams), 3grams (trigrams), 4grams, 5grams (not to be confused with pentagrams). Some fascinating things can be revealed by searching for multiword units; we'll look at them in later blog posts.
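In code, an ngram is nothing more than a sliding window of n consecutive words. A toy illustration (my own, not anything from Google):

```python
def ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "not to be confused with pentagrams".split()
print(ngrams(sentence, 1))  # 1grams: ('not',), ('to',), ('be',), ...
print(ngrams(sentence, 2))  # 2grams (bigrams): ('not', 'to'), ('to', 'be'), ...
```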

You have to be careful what conclusions you draw: from the above graph, could you say people were more pessimistic in 1850? No, we haven't run the proper controls: for instance, are there synonyms for "overrated" that took over in 1900? Are there certain kinds of books overrepresented in the database that are more likely to use these terms? Google published a paper with some interesting results (such as the effects of Nazi censorship), but they had the resources to run verifiable control experiments.

Still, it's an interesting database, and one I find myself turning to a lot. Just as there are those who pore through Google Street View to find oddities like people wearing horse head costumes, I do the same with the Google Ngram Viewer. I don't like Google's presentation, though, so I wrote a script to automatically import results into Python and create prettier graphs (which use per million instead of percent, so you don't have all those leading zeroes, for one).
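The script itself isn't worth reproducing here, but the plotting half boils down to something like the sketch below; the fetch_ngram_fractions helper in the comment is purely hypothetical, a stand-in for however you actually get the yearly fractions out of the viewer:

```python
import matplotlib.pyplot as plt

def plot_per_million(series_by_word, start_year):
    """series_by_word maps each word or phrase to a list of yearly fractions.
    Plots occurrences per million words instead of raw fractions."""
    for word, fractions in series_by_word.items():
        years = range(start_year, start_year + len(fractions))
        plt.plot(years, [f * 1e6 for f in fractions], label=word)
    plt.xlabel("year")
    plt.ylabel("occurrences per million words")
    plt.legend()
    plt.show()

# Hypothetical usage -- fetch_ngram_fractions() stands in for however you
# actually scrape or copy the data out of the Ngram Viewer (it is not a real API call):
# plot_per_million(fetch_ngram_fractions("onto the"), start_year=1800)
```

Here's one of the resulting graphs: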


That's a dramatic rise for "onto the". What could it possibly mean? Well, I'll tell you... later.