Tuesday, September 2, 2014

Buck naked to butt naked, arms to anus, 19th century iPhones and other Google Ngram oddities

I've posted a couple of times before about the Google Books Ngram Viewer.
This data set is a rich vein for data mining. Plus it's almost completely uncurated, so it's a good target for data spelunking (that's my own idiolect for testing the boundaries of a dataset, to see what false conclusions it can appear to support). However, it's slow going because the metadata alone is really, really enormous (I've only got a fraction of it, and it's more than 3 terabytes). But as I peruse, I've come across some items of interest, totally non-systematically:

1. Butt Naked appears to be taking the place of Buck Naked
The etymology of the phrase "buck naked" is shrouded in mystery; some even think it's a Bowdlerization, and "butt naked" was the original term. But it's clear that in this corpus, anyway, "butt naked" is becoming more and more popular. I hypothesize that it's an example of elision (the k sound followed by the n sound is difficult to say, whereas the t becomes a glottal stop and rolls right off the... er, glottis). Plus it does make a certain semantic sense: if you're naked, people can see your butt, no?
Speaking of naked (my searches for this word seem to have influenced my Google AdWords profile, so I'm getting much racier suggestions online), this was a little surprising to me:

Does this mean we're getting more prudish, and less willing to discuss the absence of clothes? Probably not. The Google Books corpus is heavily weighted with 'Library Bias': it reflects the contents of the books Google was able to scan in the mid-2000s. I believe a higher proportion of the 19th-century books in the corpus are biblical or scientific compared to later books, and those use the word with less embarrassment.
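A comparison like buck naked versus butt naked can also be run offline against the downloadable n-gram files instead of the viewer. Here's a minimal sketch, assuming the tab-separated layout of the 2012 release (ngram, year, match_count, volume_count). The sample rows are made up for illustration, not real counts, and `yearly_share` is just a name I picked.

```python
from collections import defaultdict

# Made-up sample rows in the assumed layout:
# ngram <TAB> year <TAB> match_count <TAB> volume_count
SAMPLE = (
    "buck naked\t1980\t30\t25\n"
    "butt naked\t1980\t10\t9\n"
    "buck naked\t2000\t40\t33\n"
    "butt naked\t2000\t60\t48\n"
)

def yearly_share(lines, phrase_a, phrase_b):
    """Return {year: share of phrase_b among all phrase_a + phrase_b matches}."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        ngram, year, match_count, _volume_count = line.split("\t")
        key = ngram.lower()  # fold "Butt naked" and "butt naked" together
        if key in (phrase_a, phrase_b):
            counts[int(year)][key] += int(match_count)
    shares = {}
    for year, by_phrase in sorted(counts.items()):
        total = by_phrase[phrase_a] + by_phrase[phrase_b]
        if total:
            shares[year] = by_phrase[phrase_b] / total
    return shares

shares = yearly_share(SAMPLE.splitlines(), "buck naked", "butt naked")
for year, share in shares.items():
    print(year, f"{share:.2f}")
```

Working with the share between the two phrases, rather than raw counts, sidesteps the need to normalize by how many words were scanned for each year. It does nothing about Library Bias, though.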

2. OCR sometimes misreads 'arms' as 'anus'

I didn't come up with this observation of the fallible nature of Optical Character Recognition, but I haven't seen any Ngrams of it. This story got wide media coverage in May 2014, when someone noticed some old romance novels in Google Books contained phrases like this:

Most of these Google Books examples are difficult to find individually in the Google Ngram Viewer (they're there, you just have to dig), because an exact phrase has to appear in at least 40 books to be included in the published n-gram data. I first became aware of this risible phenomenon in 2009 thanks to this blog post, but it didn't get much traction at the time. So it goes.

3. 19th century iPhone?

I'm reasonably certain U.S. President Martin Van Buren didn't have an iPhone in the 1830s. Anachronisms in this data set sometimes come from documents being assigned the wrong publication year, but that bias usually works in the opposite direction: a book written in 1848 is reprinted in 1964, so it shows up in the database under the later year. In general, it's been my experience that modern terms in the past come from OCR errors. Sometimes, as in this case, they are errors in word boundaries: there's a species of snake and a character in the Aeneid called Tisiphone that is sometimes rendered "tis iphone". And then there are errors that are much less understandable, such as the following:
The OCR thinks "There" is "iPhone", with proper trademark capitalization? I posit there's a non-random error responsible for that.
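The word-boundary case, at least, is easy to hunt for mechanically: look for adjacent tokens that form a known word when rejoined. Here's a minimal sketch, assuming a tiny hand-made vocabulary (`KNOWN_WORDS`) and a hypothetical helper I've called `find_split_words`. A real check would need a full dictionary, and it would not catch the "There" to "iPhone" kind of misreading.

```python
import string

KNOWN_WORDS = {"tisiphone"}  # assumption: tiny illustrative vocabulary

def find_split_words(text, vocabulary=KNOWN_WORDS):
    """Return (left, right, joined) for adjacent tokens that rejoin into a known word."""
    # Lowercase, strip surrounding punctuation, and drop tokens that end up empty.
    tokens = [t.strip(string.punctuation) for t in text.lower().split()]
    tokens = [t for t in tokens if t]
    hits = []
    for left, right in zip(tokens, tokens[1:]):
        joined = left + right
        if joined in vocabulary:
            hits.append((left, right, joined))
    return hits

hits = find_split_words("The fury Tis iphone rose from below.")
print(hits)
```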

I'll leave you with a few other 19th-century anachronisms:




1 comment:

  1. I laughed way too much at this. Please share more perfectly ginormous findings!
