Monday, March 31, 2014

The Nteresting Nnovation of Google Ngrams

If you're unfamiliar with the term or concept of ngrams in general or Google Ngram Viewer in particular, a look at it in action is the best explanation:



This shows how often the words "overrated" and "underrated" appear in Google Books from 1800 to 2008 -- sort of. There are a few caveats, which Google is upfront about (although I wish they'd post, on the main page of the Ngram Viewer, a précis of the database's shortcomings and the main erroneous conclusions that can be drawn from it). I'll get into the unique problems of computerized curation of a dataset so huge it comprises 6% of all the books in existence (so they claim; it depends on how you count them, but it's a defensible number).

So, as the title says, what the heck is an ngram? Well, what you see above are 1grams: single words. If I look up phrases, they become 2grams (or bigrams), 3grams (trigrams), 4grams, and 5grams (not to be confused with pentagrams). Some fascinating things can be revealed by searching for multiword units; we'll look at them in later blog posts.
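For a concrete picture, here's a minimal sketch of pulling ngrams out of a piece of text. It assumes words are simply whatever is separated by whitespace; Google's own tokenization handles punctuation and other details differently, and the function name `ngrams` and the sample sentence are mine, purely for illustration.

```python
from collections import Counter


def ngrams(text, n):
    """Return every run of n consecutive words in text, as tuples, in order.

    Assumption: words are split on whitespace and lowercased. This is a
    simplification, not how Google Books tokenizes its corpus.
    """
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


# A made-up sample sentence, just to have something to count.
sample = "the cat sat on the mat and the cat slept"

unigram_counts = Counter(ngrams(sample, 1))  # 1grams
bigram_counts = Counter(ngrams(sample, 2))   # 2grams (bigrams)

print(unigram_counts.most_common(3))
print(bigram_counts.most_common(3))
```

The Ngram Viewer does essentially this counting across a whole library, then divides by the total number of ngrams published in each year to get a frequency.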

You have to be careful what conclusions you draw: from the above graph, could you say people were more pessimistic in 1850? No, we haven't run the proper controls: for instance, are there synonyms for "overrated" that took over in 1900? Are there certain kinds of books overrepresented in the database that are more likely to use these terms? Google published a paper with some interesting results (such as the effects of Nazi censorship), but they had the resources to run verifiable control experiments.

Still, it's an interesting database, and one I find myself turning to a lot. Just as there are those who pore over Google Street View to find oddities like people wearing horse head costumes, I do the same with Google Ngram Viewer. I don't like Google's presentation, though, so I wrote a script to automatically import results into Python and create prettier graphs (that use per million instead of per cent so you don't have all those leading zeroes, for one):


That's a dramatic rise for "onto the". What could it possibly mean? Well, I'll tell you... later.
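As for the per-million rescaling mentioned above, the arithmetic is simple: one per cent is 10,000 per million, so you multiply by 10,000. Here is a minimal sketch of that one step, not the actual script; the function name and the numbers in `series` are placeholders I made up for illustration, not real Ngram Viewer results.

```python
def percent_to_per_million(percent):
    """Convert a frequency in per cent to occurrences per million ngrams.

    percent / 100 gives the raw fraction; times 1,000,000 gives per million.
    Combined, that is a multiplication by 10,000.
    """
    return percent * 10_000


# Placeholder values only (year -> frequency in per cent), NOT real data.
series = {1800: 0.00012, 1900: 0.00045, 2000: 0.0011}

per_million = {year: percent_to_per_million(p) for year, p in series.items()}

for year, value in sorted(per_million.items()):
    # Round for display so floating-point noise doesn't clutter the output.
    print(year, round(value, 2))
```

A value like 0.00045% is hard to read on an axis; the same number as 4.5 per million is not.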

4 comments:

  1. Cool stuff! Can the Ngrams be embedded directly to your site instead of the images? I wonder if they offer an API...

  2. I'm doubly glad you commented, because it appears I made an error with the new CSS and comments are not appearing on the site! I really suck at CSS...

    Anyways, to answer your question, on the actual Google Ngram Viewer result page, there is a little link on the upper right called "Embed"; click it and it gives you HTML code you can just copy onto a web page. I went with an image because I consider Google's presentation suboptimal, especially when you try to show more than two words or phrases on the same page; even in the top example, I like putting the words near the lines, not off to the right. Your mileage, of course, may vary.

  3. I'm sorry the comments aren't working in my new redesign. I'm working on it!

    1. And now they're working! It was rather useless of me to write the above comment before I fixed the comments. Ah well.


