Every time I post about the popular U.S. Social Security Administration baby names dataset, I try to acknowledge that there are some serious problems with it. By “problems”, I mean things the average person unfamiliar with it will assume are true but which actually aren’t, especially prior to World War II. I’ve covered all of these to one degree or another in my previous baby names posts here and here and here and here and here and here, but there are always a few questions from readers, so I thought it would be nice to be able to link to something that explained all the major concerns clearly and concisely:
Tableau Public’s new Story View feature is well-suited to this kind of presentation, and I’ll add panels if and when I come across more problematic aspects of sufficient magnitude.
EDIT 2019: Welp, there seems to be something wrong with Tableau’s five-year-old code, which is probably just not maintained. The original story mode is still on their website but shows a prominent error. So instead I’m going to post ALL THE SCREENSHOTS here in one big vertical view!
I’d like to reiterate one thing: the problem isn’t in the data, it’s in how it’s often presented and understood. The Social Security Administration does not make any false claims whatsoever (although IMHO they could make their disclaimers more prominent). And some of the baby names blogs and websites make a decent effort to address these issues, or at least not to draw unsupportable conclusions from the data.
Update 2019: Back then, I did exchange a few e-mails with the guy at the SSA who created this dataset (I’m sorry, I can’t remember his name), and he agreed 100% with my findings. Many of the baby names websites that were making bank off this dataset have backed away from it, including one run by the wife of a famous statistician. Did I have anything to do with it? Probably not; the flaws in the data were there for all to see. But just maybe?