U.S. baby names, 1880-2012: Diversity

The United States Social Security Administration released a fascinating data set a few years ago: all of the names of newborns registered since 1880. There have already been some great analyses and visualizations of this data (there’s a list below). Still, I think there are plenty more items of interest to be found in mining this data, or if that fails, surveying, dowsing and spelunking it.

Friends’ first reaction to the news that I was looking at this data set was inevitably, “So you must see more and more unique names as the years progress?” (Usually followed by, “Do you see a decline in ‘Adolf’ after WWII?” Short answer: yes.) Name diversification is anecdotally evident: nobody named their babies Rumer, or North (or even Ashley) in 1900. Still, it’s nice to have evidence. I’m far from the first to explore this phenomenon, but I think I’ve come up with some interesting displays:

This data seems to support the view that there are more and more names out there: In the 1880s, over 8% of the population whose birth records ended up reported in the Social Security database (an important distinction, as we’ll see!) were named John or Mary; the most popular names nowadays are closer to 1% of the total, and their share has decreased rapidly since the late 1960s. Other analysts have shown that girls’ names have more of a bandwagon effect than boys’, and these graphs seem to bear that out, with higher peaks when names like Linda or Jessica become very popular for a few years, then fade into relative obscurity.
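For readers who want to reproduce the share figures, here is a minimal sketch of the calculation. It is not the code behind the graphs; it assumes the records have already been parsed into (year, name, sex, count) tuples, and the function name `top_name_share` and the toy numbers are invented for illustration.

```python
from collections import defaultdict


def top_name_share(records):
    """Map (year, sex) to (most common name, its share of recorded births).

    `records` is an iterable of (year, name, sex, count) tuples.
    Ties go to whichever name appears first in the input.
    """
    totals = defaultdict(int)  # recorded births per (year, sex)
    best = {}                  # (year, sex) -> (name, count) of the leader
    for year, name, sex, count in records:
        key = (year, sex)
        totals[key] += count
        if key not in best or count > best[key][1]:
            best[key] = (name, count)
    return {key: (name, count / totals[key])
            for key, (name, count) in best.items()}


# Toy records, invented for illustration; these are NOT real SSA counts.
toy = [
    (1880, "John", "M", 80),
    (1880, "William", "M", 70),
    (1880, "James", "M", 50),
    (2012, "Jacob", "M", 12),
    (2012, "Mason", "M", 11),
    (2012, "Ethan", "M", 10),
    (2012, "Noah", "M", 9),
    (2012, "Liam", "M", 8),
]

shares = top_name_share(toy)
for (year, sex), (name, share) in sorted(shares.items()):
    print(year, sex, name, f"{share:.1%}")
```

Note that the denominator is births recorded in the database, not all births, which is the same distinction flagged above.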

The Social Security database isn’t perfect, nor do its curators claim it to be; in particular, statistics before World War II are suspect because of missing data and the way the data was collected. This is most evident when we try to visualize diversity by looking at how many new names per birth were introduced into the records:

The huge peak in the 1910s isn’t an explosion of weird names; it’s an artifact of the database. Social Security was created in 1935 and numbers were first issued the following year, and then adults signed up and gave their birth years; what you are seeing is evidence that there are a lot of names missing before then, which only stands to reason. How many people born in Appalachia or the Old West would the Social Security Administration find records of decades after their births? Once the system kicks into gear, we see overall diversity increasing from the late 1960s onward, as expected.
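One way to compute a "new names per birth" figure is sketched below. This is an illustration under stated assumptions rather than the exact method behind the graph: a name counts as new in the first year it appears for either sex, records are (year, name, sex, count) tuples, and the function name and toy data are made up.

```python
from collections import defaultdict


def new_names_per_birth(records):
    """Map year to (names never seen in an earlier year) / (recorded births).

    `records` is an iterable of (year, name, sex, count) tuples.
    Assumption: a name is tracked regardless of sex.
    """
    births = defaultdict(int)
    names_by_year = defaultdict(set)
    for year, name, sex, count in records:
        births[year] += count
        names_by_year[year].add(name)

    seen = set()
    result = {}
    for year in sorted(births):
        fresh = names_by_year[year] - seen  # names making their first appearance
        result[year] = len(fresh) / births[year]
        seen |= names_by_year[year]
    return result


# Toy records, invented for illustration; these are NOT real SSA counts.
toy = [
    (1880, "John", "M", 5), (1880, "Mary", "F", 5),
    (1881, "John", "M", 6), (1881, "Mary", "F", 4), (1881, "Ada", "F", 10),
    (1882, "John", "M", 5), (1882, "Ada", "F", 5),
]

rates = new_names_per_birth(toy)
for year, rate in sorted(rates.items()):
    print(year, rate)
```

Every name is "new" in the first year of the data, so that year's value is inflated by construction and is best left off any plot.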

This is a fascinating data set, and in upcoming posts I’ll mine it some more. For now, though, I’ll spelunk one name: the hook of the 1989 film Heathers is that there were a lot of teenage girls with that name at that time, and the data shows that is the case; it also shows how dated the film is, because by now the name has faded into the obscurity from whence it came.
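Tracking a single name like this comes down to dividing its count by the total for that sex in each year. The sketch below assumes the same (year, name, sex, count) tuples; `name_share_by_year` is a hypothetical helper and the numbers are toy values, not the real figures for Heather.

```python
from collections import defaultdict


def name_share_by_year(records, name, sex):
    """Map year to the share of recorded births of `sex` given `name`.

    Years in which the name does not appear get a share of 0.0.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for year, rec_name, rec_sex, count in records:
        if rec_sex != sex:
            continue
        totals[year] += count
        if rec_name == name:
            hits[year] += count
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Toy records, invented for illustration; these are NOT real SSA counts.
toy = [
    (1950, "Linda", "F", 100),
    (1975, "Heather", "F", 30), (1975, "Amy", "F", 70),
    (2012, "Heather", "F", 1), (2012, "Emma", "F", 99),
    (2012, "Jacob", "M", 50),
]

heather = name_share_by_year(toy, "Heather", "F")
for year, share in heather.items():
    print(year, f"{share:.1%}")
```

A zero here only means the name is absent from the records for that year: the SSA files leave out names with fewer than five occurrences.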


  • The raw data courtesy of the United States Social Security Administration
  • The unequivocal maven of this data set is Laura Wattenberg, whose site Baby Name Wizard has lots of great information. If you want to look up a particular name, this is the place to go.
  • nametrends.net has a lot of graphs and search possibilities and multi-name sparkline collections; it’s a great tool to browse the data set for patterns.
  • FlowingData, a great blog that maybe one day I’ll pay for premium access to, has had several good posts on this data set about the most trendy names, the most unisex names and the most regional names.
  • Jezebel.com has a set of maps of most popular girls’ names by state for each year, including a fascinating animated GIF, all designed by Reuben Fischer-Baum.
  • waitbutwhy does some interesting data spelunking using Wattenberg’s aforementioned wizard.
  • Hilary Parker showed the most poisoned names in U.S. history: #1 was her own!
  • The Boston Globe has a great article about how one expectant father working at the Social Security Administration was responsible for recognizing this data set and its utility.
  • Nancy Man’s Nancy’s Baby Names has well-presented graphs of name popularity over time.
  • The book Python for Data Analysis has a section on this data set; they’re to blame for my initial interest.
  • The Python code I used is posted at my other, nerdier blog.