Rare occurrences make good edge cases; so do recursive cases, i.e. running a data tool on itself. We looked briefly at the Google Ngram Viewer a couple of weeks ago; what happens if we look up the Google Ngrams of the words "Google" and "Ngram" themselves? (By the way, I like to call this kind of approach 'selfremetacursironiferentiality'. I'm sure it will catch on one day so I look like less of a dork when I say it.)
Of course, the frequency of the word "google" after the company was incorporated in September 1998 is predictable: it becomes a very common word (and is even adopted into that hallowed club, The Verb, where Xerox briefly rested and from which Kleenex was inexplicably barred). The only interesting thing about its rise from 2001 to 2008 (where the data set ends) is that it's pretty linear; I would have intuited either positive or negative curvature, but don't forget this is the word's appearance in published, printed matter, not in conversation.
Let's have a look at "google" and "ngram" (both case-insensitive) from 1880 to 2000, before the rise of Google and with a vertical axis about fifty-fold lower so we can see the edge cases (in my experience, the more jagged a line is*, the more interesting it is**).
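If you'd like to reproduce that query yourself, here's a minimal sketch of building the Ngram Viewer URL. It assumes the viewer's unofficial JSON endpoint and the parameter names that appear in the URLs the viewer itself generates (`content`, `year_start`, `year_end`, `corpus`, `smoothing`, `case_insensitive`); none of these come from the text above, so treat them as assumptions that may drift as Google changes things.

```python
from urllib.parse import urlencode

# Assumption: the Ngram Viewer's unofficial JSON endpoint and these
# parameter names, as seen in the URLs the viewer itself generates.
BASE = "https://books.google.com/ngrams/json"

def ngram_url(phrases, year_start, year_end,
              case_insensitive=True, corpus="en-2019", smoothing=0):
    """Build a query URL for the given list of phrases."""
    params = {
        "content": ",".join(phrases),   # comma-separated query terms
        "year_start": year_start,
        "year_end": year_end,
        "corpus": corpus,               # hypothetical default corpus name
        "smoothing": smoothing,         # 0 = raw, jagged (interesting) lines
    }
    if case_insensitive:
        params["case_insensitive"] = "true"
    return BASE + "?" + urlencode(params)

# The query from the text: "google" and "ngram", case-insensitive, 1880-2000.
url = ngram_url(["google", "ngram"], 1880, 2000)
print(url)
```

Fetching that URL (with any HTTP client) should return JSON frequency series rather than a rendered chart, which is handy if you want to replot with your own, fifty-fold-lower vertical axis.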
The moral of this story, as with all data sets too huge to be curated by humans (and, coincidentally, every other Aesop's fable): things are not always what they seem, so be sure to dig a little before drawing conclusions, especially in edge cases. The next time someone claims over the water cooler that ngrams were being studied in 1902, you can nod to yourself knowingly.
* of course, sometimes that means it's just noise, but I find noise interesting too
** that's what she said.