Last week I posted some visual analyses of the U.S. Social Security Administration database of baby names from 1880-2012, focusing on the increasing diversity of baby names. Now I'm going to do what I've done all my life with a new toy: I'll try to break it.
Like many large data sets, especially historical ones that cannot be updated, the SSA baby names database is not perfect. Very few useful databases are completely error-free; one needs simply to understand the imperfections before drawing conclusions, or one might end up writing a misleading headline like 'The least popular American baby names (from 1880 to 1932)'.
The SSA themselves point out some problematic facets of the data. Baby names that appear fewer than five times in a year are omitted due to privacy concerns. Also, Social Security numbers were introduced in 1935, so at that time everyone (mostly adults) who applied for a number had their name and birth year entered retroactively in the database. That leaves out people whose occupations did not require a number, as well as those who had been born after 1880 and died before 1935.
When starting with an unfamiliar dataset, one should always do a quick seismograph (during data mining, it amuses me to use geology terms, even ones that strain the metaphor, like "spelunking", "prospecting" and even "dowsing"). Let's compare the number of births per year with statistics from the Department of Health and Human Services which go back to 1910:
This is not a surprise; we already knew births were underreported before the '30s, and now we know it's by about a factor of 4! The totals don't reach 100% because HHS reports live births in the U.S., to citizens and non-citizens alike, while the SSA does not report births with extremely rare names or ones that do not correspond to a Social Security number, such as those of non-citizens.
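For anyone who wants to repeat this comparison, here is a minimal sketch of the arithmetic. It assumes the SSA records have already been parsed into (year, name, sex, count) tuples and that the official live-birth totals are in a dict keyed by year. The function names and all of the numbers are made up for illustration; they are not real SSA or HHS figures.

```python
from collections import defaultdict


def ssa_totals(rows):
    """Sum the name counts per year from (year, name, sex, count) rows."""
    totals = defaultdict(int)
    for year, _name, _sex, count in rows:
        totals[year] += count
    return dict(totals)


def coverage(ssa_by_year, official_by_year):
    """Return SSA total / official total for each year present in both."""
    return {
        year: ssa_by_year[year] / official_by_year[year]
        for year in sorted(ssa_by_year)
        if official_by_year.get(year)  # skip years with no official figure
    }


# Hypothetical toy data, NOT real SSA or HHS numbers.
rows = [
    (1910, "Mary", "F", 250),
    (1910, "John", "M", 250),
    (1950, "Linda", "F", 500),
    (1950, "James", "M", 400),
]
official = {1910: 2000, 1950: 1000}

ratios = coverage(ssa_totals(rows), official)
for year, ratio in ratios.items():
    print(year, f"{ratio:.0%}")
```

With these toy numbers the 1910 ratio comes out at 25%, which is what underreporting by a factor of 4 looks like.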
An easy thing to check in a database is whether everything adds up. The unique values for "sex" are "M" and "F" -- there are no "Unknowns" or missing data. This in itself is a little troubling; in a data set that's already exhibited problems, how likely is it that the sex categorization is perfect? A simple test would be to take some of the most popular names from last week and see how many of them were reported with the other sex, e.g. boys named Linda or girls named Robert:
If anyone has any theories about the high error rate for Emma in 1910, I'm all ears: see the data on my other blog, prooffreaderplus. It will also show you lots of boys named Anna, Ella, Georgia, Bertha, Clara, etc.
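The two checks above (which values the "sex" column takes, and how often a name shows up under the less common sex) can be sketched roughly as follows. As before, this assumes rows of (year, name, sex, count); the helper names and the sample counts are hypothetical, not taken from the real database.

```python
from collections import defaultdict


def sex_values(rows):
    """Return the set of distinct values found in the sex column."""
    return {sex for _year, _name, sex, _count in rows}


def minority_share(rows, name, year):
    """Share of babies with `name` in `year` recorded under the less common sex."""
    by_sex = defaultdict(int)
    for y, n, sex, count in rows:
        if y == year and n == name:
            by_sex[sex] += count
    total = sum(by_sex.values())
    if total == 0:
        return None  # name not listed for that year
    # Everything not in the majority sex counts as the minority.
    return (total - max(by_sex.values())) / total


# Hypothetical toy data, NOT real SSA counts.
rows = [
    (1910, "Emma", "F", 990),
    (1910, "Emma", "M", 10),
    (1910, "Robert", "M", 500),
]

print(sex_values(rows))
print(minority_share(rows, "Emma", 1910))
print(minority_share(rows, "Robert", 1910))
```

A high share is not proof of a data-entry error on its own, since some names really are given to both sexes, so any name this flags still needs a human look.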
One first-rank name that surprised me was Ashley; I even mentioned it last week as a modern girls' name. It turns out it was a boys' name (albeit a relatively rare one) until 1960!
The next obvious culprit (besides missing data, which does not appear to be significant) would be misspelled names. That's a tricky one (how do you tell whether a spelling is deliberate?) and a subject for another blog post.
I'll leave you with one more curiosity I came across during my data spelunking; I hope their brothers weren't named Zinc.