Monday, March 31, 2014

The Nteresting Nnovation of Google Ngrams

If you're unfamiliar with the term or concept of ngrams in general or Google Ngram Viewer in particular, a look at it in action is the best explanation:

This shows how often the words "overrated" and "underrated" appear in Google Books from 1800 to 2008 -- sort of. There are a few caveats, which Google is upfront about (although I wish they'd post a précis of the shortcomings of the database and the main erroneous conclusions that can be drawn from them on the main page of the Ngram Viewer). I'll get into the unique problems of computerized curation of a dataset so huge it comprises 6% of all the books in existence (so they claim; it depends how you count them, but it's a defensible number).

So as the title says, what the heck is an ngram? Well, what you see above are 1grams. If I look up phrases, they become 2grams (or bigrams), 3grams (trigrams), 4grams, 5grams (not to be confused with pentagrams). Some fascinating things can be revealed by searching for multiword units; we'll look at them in later blog posts.
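To make the idea concrete, here is a minimal Python sketch that pulls ngrams out of a snippet of text. The function name `ngrams` and the naive whitespace tokenization are my own assumptions for illustration; Google's actual tokenization rules are more involved and are not reproduced here.

```python
def ngrams(text, n):
    """Return the n-word tuples found in text, in order of appearance."""
    # Naive tokenization: lowercase and split on whitespace (an assumption,
    # not how Google Books tokenizes its corpus).
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "the ones who are mad to live"
print(ngrams(sentence, 1))  # 1grams: one tuple per word
print(ngrams(sentence, 2))  # 2grams: every pair of adjacent words
```

Counting how often each tuple occurs per year, divided by the total number of ngrams that year, gives the kind of relative frequency the viewer plots.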

You have to be careful what conclusions you draw: from the above graph, could you say people were more pessimistic in 1850? No, we haven't run the proper controls: for instance, are there synonyms for "overrated" that took over in 1900? Are there certain kinds of books overrepresented in the database that are more likely to use these terms? Google published a paper with some interesting results (such as the effects of Nazi censorship), but they had the resources to have verifiable control experiments.

Still, it's an interesting database, and one I find myself turning to a lot. Just as there are those who pore through Google Street View to find oddities like people wearing horse head costumes, I do the same with Google Ngram Viewer. I don't like Google's presentation, though, so I wrote a script to automatically import results into Python and create prettier graphs (that use per million instead of per cent so you don't have all those leading zeroes, for one):

That's a dramatic rise for "onto the". What could it possibly mean? Well, I'll tell you... later.
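The per-million rescaling mentioned above is simple arithmetic: one per cent is 10,000 per million. A minimal sketch, using a hypothetical function name and made-up values rather than real Ngram Viewer output:

```python
def percent_to_per_million(percent):
    """Convert a frequency in per cent to occurrences per million."""
    # 1% = 1/100 = 10,000/1,000,000
    return percent * 10_000


# Made-up example frequencies, not real data.
for pct in (0.0002, 0.0015):
    print(f"{pct}% is {percent_to_per_million(pct):g} per million")
```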

Saturday, March 29, 2014

Word clouds of On the Road, The Great Gatsby and Heart of Darkness

Total words: 115,000
Title words: On(1066) the(6317) Road(189); as a phrase, "On the Road"(24) 
Famous phrase: "the mad(90) ones(12), the ones who are mad to live(115), mad to talk(88), mad to be saved(7), desirous(2) of everything(171) at the same(78) time(287)."
Onomatopoeic variants: aah(2), aaah(1), aaaaah(3), aw(2), aww(0), awww(1)
Some words that appear only once: mainstream, destination, television, speed limit, freedom
"Merge", "yield", "horsepower", "whim" and "destiny" do not appear.

Total words: 49,000
Title words: The(2401) Great(26) Gatsby(262); as a phrase, "The Great Gatsby" does not appear.
Full names: "Jay Gatsby"(10). "Nick Carraway" and "Daisy Buchanan" do not appear.
Famous phrase: beautiful(7) little(103) fool(8); "beautiful little fool" appears once.
Some words that appear once only: adventitious, caravansary, ectoplasm, fruiterer, Rockefeller, feminine
"Adultery" does not appear.

Total words: 37,000
Title words: Heart(29) of(1499) Darkness(31); as a phrase, "Heart of Darkness" (2)
All variations of "dark", "black" or "shadow": 124 (Similar frequency: "been"(123).)
Famous phrase: Mistah(1) Kurtz(122) he(597) dead(23). "Mistah Kurtz -- he dead" (1)
Famous phrase: The(2468) horror(7); as a phrase, "The horror! The horror!" (3)
Some words that appear once only: towser, fisticuffs, fecund, assegais, thoughtful, starboard, divine
Place names: Africa(1). "Congo" does not appear. River(65).
"Evil" does not appear.

It's been a few months since I made a word cloud; it was nice to return to one of the inspirations for this blog. The On the Road background is a detail from a Dutch book cover which was repurposed for a paperback of James Sallis's Drive; the font is Quid Pro Quo. The Great Gatsby is, of course, the iconic first edition design with Francis Cugat's painting, "Celestial Eyes"; the font is, appropriately, GatsbyFLF bold. The design of Heart of Darkness is based on Joseph Maclise's 1859 pocket manual Surgical Anatomy; the font is Primitive (a name, not an adjective).
The word clouds were made using the great online tool Tagxedo; I realized I forgot to give them credit for my previous word clouds, so I've gone back and edited the posts.
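For anyone who wants to produce word and phrase counts like the ones listed for each book, here is a rough Python sketch. It is an assumption about one reasonable way to do it, not a record of how the numbers above were produced; the function names are hypothetical, and different tokenization choices (hyphens, apostrophes, front matter) will shift the totals.

```python
import re
from collections import Counter

WORD_PATTERN = r"[a-z']+"  # assumption: words are runs of letters and apostrophes


def word_counts(text):
    """Count words case-insensitively."""
    return Counter(re.findall(WORD_PATTERN, text.lower()))


def phrase_count(text, phrase):
    """Count whole-word occurrences of a phrase, ignoring case and punctuation."""
    normalized = " ".join(re.findall(WORD_PATTERN, text.lower()))
    target = " ".join(re.findall(WORD_PATTERN, phrase.lower()))
    return len(re.findall(r"\b" + re.escape(target) + r"\b", normalized))


# Tiny made-up sample standing in for the full text of a novel.
sample = "The horror! The horror! He cried at the horror."
counts = word_counts(sample)
print(counts["the"], counts["horror"])
print(phrase_count(sample, "The horror"))
print(sorted(word for word, n in counts.items() if n == 1))  # words used once
```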
Here are links to some of my other word clouds:

Monday, March 24, 2014

Holy juxtapositions, Batman! Images from Christopher Nolan's Dark Knight trilogy with quotes from the Adam West series

Adam West was the Batman we deserved, but not the one we need right now. Which is a shame, so I wanted to pay homage.

Related: Pictures of Pavel Chekov with quotes by Anton Chekhov

I tried to double-check the accuracy of the quotes, but it's possible some errors may have propagated through the interwebs.

Friday, March 21, 2014

Sunday, March 16, 2014

U.S. baby names: Variations on a theme, girls named after months, and Britney

In my last two forays into the U.S. Social Security Administration baby names database, I explored the extent to which data was skewed towards adults before social security numbers were introduced in 1935, giving parents more incentive to register their children. It later occurred to me that there is an important, and verifiable, difference between the names of adults and babies: nicknames. While things have become less formal nowadays, in the '30s I'm guessing it was a rare baby who was called "Larry" from the get-go.

A stacked-area or stream graph gives a good view of the distribution of the name "William" and its major variants (it has plenty of other minor variants like "Willy" or "Willem" which were not included):

The pattern takes a little thought to interpret: there's a big dropoff in the adult nicknames "Will" and "Willie" around 1914; in 1935, they would be 21-year-olds. A greater proportion of minors would officially identify themselves as "William". "Billy" sees a big increase from 1920 to 1932; were people emulating the popular radio star Billy Jones? I suspect what's going on is more subtle -- and chilling. If you read about World War I, there seem to be a lot of Billys who crop up, notably Billy Mitchell and Billy Bishop; the Google Ngram Viewer shows a spike in mentions of "Billy" in printed matter during World War I and a fall afterwards, the opposite of what we see above.

In 1935, adults who applied for a social security number had to have survived World War I. I think what we're seeing is Billys who were too young to have perished in the war aging, getting their first job and applying for a social security number. Billy wasn't unpopular before 1920; it's just that a lot of them died without leaving a record. Note that the relative proportions between name and nicknames stay relatively constant after 1932.

Let's cheer ourselves up a bit with another name, a girl's name this time (and thus less vulnerable to WWI), and one with two variants right off the bat: Rachel. The huge explosion of popularity of the common spelling in the 1970s makes it a little difficult to see some of the less common versions, so I've added a normalized graph, as well (with all forms of 'Rachel' adding up to 100%, so it's no longer a graph of name popularity, but of variant trends).
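The normalization described here (every year's variants rescaled to add up to 100%) can be sketched in a few lines of Python. The data layout, the function name and the numbers are all made up for illustration; they are not values from the SSA database.

```python
def normalize_by_year(counts_by_variant):
    """Turn {variant: {year: count}} into per-year percentage shares."""
    years = sorted({y for series in counts_by_variant.values() for y in series})
    shares = {variant: {} for variant in counts_by_variant}
    for year in years:
        total = sum(series.get(year, 0) for series in counts_by_variant.values())
        for variant, series in counts_by_variant.items():
            # Guard against years with no recorded births at all.
            shares[variant][year] = 100 * series.get(year, 0) / total if total else 0.0
    return shares


# Made-up counts, for illustration only.
counts = {
    "Rachel": {1890: 95, 1975: 900},
    "Rachael": {1890: 5, 1975: 100},
}
print(normalize_by_year(counts))
```

Plotting these shares as a stacked area fills the whole vertical axis every year, which is why the result shows variant trends rather than overall popularity.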

Before 1900, the only variant, making up about 5% of the total, is "Rachael", and here's an interesting etymology: basically, parents liked the baroque feel of the name "Michael" and copied it. In Hebrew, the vowel before the "l" in Rachel and Michael is different; this isn't a transliteration, it's an adaptation for aesthetic reasons. The same thing starts to happen later with some girls being named "Racheal". (As an aside, I know someone named Micheal; he rues the irony that his parents wanted it to be easier to spell, but instead ensured everyone misspells his name. One time after a few too many libations I asked him if it was possible his parents were just poor spellers; he claims not.)

Once again we see a huge discontinuity in 1935 with the name Rochelle (which isn't etymologically linked to Rachel, but it's still pretty similar); my guess is this is an artifact of parents for the first time signing up young children before they had a chance to die of childhood illnesses, but that hypothesis would require further testing.

By the way, there are 79 versions of Rachel in the database (here's my first, failed attempt to graph them all); what you see here is the top 16, with all the rest lumped together as "Other", including Rchel (which is, I suspect, actually R'chel. Could be inspired by Hebrew, could be inspired by Klingon. I leave you to judge what, if any, trauma will occur to baby girls named Ratchel.)
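Keeping only the most common variants and lumping the rest together is a small operation. A sketch, assuming a plain dictionary of total counts per variant (the function name and the numbers are invented for illustration):

```python
def lump_rare_variants(totals, keep=16):
    """Keep the `keep` most common variants; sum everything else into 'Other'."""
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    lumped = dict(ranked[:keep])
    rest = sum(count for _, count in ranked[keep:])
    if rest:
        lumped["Other"] = rest
    return lumped


# Made-up totals, not real SSA counts.
totals = {"Rachel": 100, "Rachael": 10, "Racheal": 5, "Rchel": 1}
print(lump_rare_variants(totals, keep=2))
```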

Stream or stacked-area graphs are a good way of exploring certain patterns, such as girls named after months:

A few observations: (1) The only month missing is February. (2) Spider-Man's Aunt May is well named: when the comic started in the 1960s, the name was already old-fashioned. (3) Apparently there are a bunch of 30-year-olds named April whom I've never met. (4) January Jones's name is not as uncommon as I thought; statistically, at least one of them must be a good actor. (5) I would not have predicted that the first unusual month name would be "September", nor that it would have started in the 1950s. (6) Who names a girl "March"?!

I've gotten good feedback (read: Tumblr reposts) from my graphs of Heather and Sigourney, so I'll leave you with Britney:

Monday, March 3, 2014

U.S. baby names: Sigourney, Marilyn, and sex confusion

To start with a bang, I'll just leave these here (click on the images to enlarge):

Last week, I posted some visual analyses of the U.S. Social Security Administration database of baby names from 1880-2012, focusing on increasing diversity of baby names. Now I'm going to do what I've done all my life with my new toys: I'll try to break it.

Like many large data sets, especially historical ones that cannot be updated, the SSA baby names database is not perfect. Very few useful databases are completely error-free; one needs simply to understand the imperfections before drawing conclusions, or one might end up writing a misleading headline like 'The least popular American baby names (from 1880 to 1932)'.

The SSA themselves point out some problematic facets of the data, like the fact that baby names that appear fewer than five times in a year are omitted due to privacy concerns, and that Social Security numbers were introduced in 1935, so at that time everyone (mostly adults) who applied for a number got their name and birth year entered retroactively in the database, leaving out people whose occupations did not require a number or who had been born after 1880 and died before 1935.

When starting with an unfamiliar dataset, one should always do a quick seismograph (during data mining, it amuses me to use geology terms, even ones that strain the metaphor, like "spelunking", "prospecting" and even "dowsing"). Let's compare the number of births per year with statistics from the Department of Health and Human Services which go back to 1910:

This is not a surprise; we already knew births were underreported before the '30s, and now we know it's by a factor of about 4! The totals don't reach 100% because the HHS figures count live births in the U.S., to citizens and non-citizens alike, and the SSA does not report births with extremely rare names or ones that do not correspond to a Social Security number, such as non-citizens.

An easy thing to check in a database is whether everything adds up. The unique values for "sex" are "M" and "F" -- there are no "Unknowns" or missing data. This in itself is a little troubling; in a data set that's already exhibited problems, how likely is it the sex categorization is perfect? An easy check would be to take some of the most popular names from last week and see how many of them were reported with the other sex, e.g. boys named Linda or girls named Robert:

If anyone has any theories about the high error rate for Emma in 1910, I'm all ears: see the data on my other blog, prooffreaderplus. It will also show you lots of boys named Anna, Ella, Georgia, Bertha, Clara, etc.
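A check along these lines can be sketched in Python. The record layout (name, sex, year, count) and the numbers below are assumptions made up for illustration, and the function name is hypothetical; the sketch also assumes, as noted above, that "M" and "F" are the only sex values present.

```python
from collections import defaultdict


def minority_sex_share(records):
    """For each (name, year), return the fraction recorded under the rarer sex."""
    by_key = defaultdict(lambda: {"M": 0, "F": 0})
    for name, sex, year, count in records:
        by_key[(name, year)][sex] += count  # raises KeyError on any other sex value
    return {
        key: min(counts.values()) / (counts["M"] + counts["F"])
        for key, counts in by_key.items()
    }


# Made-up records, not real SSA data.
records = [
    ("Emma", "F", 1910, 950),
    ("Emma", "M", 1910, 50),
    ("Robert", "M", 1910, 1000),
]
print({sex for _, sex, _, _ in records})  # the set of unique sex values
print(minority_sex_share(records))
```

A share near zero suggests a name used almost exclusively for one sex, so a small nonzero share for a name like Linda or Robert is more likely to be a recording error than a naming trend.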

One first-rank name that surprised me was Ashley; I even mentioned it last week as a modern girls' name. Well, imagine my surprise when I found out it was a boys' name (albeit a relatively rare one) until 1960!

The next obvious culprit (besides obvious missing data, which does not appear to be significant) would be misspelled names, and that's a tricky one (how do you tell if it's deliberate?) and a subject for another blog post.

I'll leave you with one more curiosity I came across during my data spelunking; I hope their brothers weren't named Zinc.
