Monday, May 12, 2014

Skyrocketing of boys' names ending in 'n' is not reflected in the most popular names overall

(Click on chart to enlarge:)

In 2007, Laura Wattenberg*, arguably the foremost statistician-cum-maven of the U.S. Social Security Administration's baby name database, appears to have been the first to notice a dramatic trend: the rise of boys' names ending in "n" from about 15% in the 1950s to around 35% now.
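For readers who want to check a figure like this themselves, here is a minimal sketch of the calculation. It assumes one year of data is available as (name, sex, births) tuples; the function name and the toy records are made up for illustration and are not real SSA figures or the code behind the charts in this post.

```python
def share_ending_in(records, letter="n", sex="M"):
    """Fraction of births of the given sex whose name ends in `letter`."""
    total = 0
    matching = 0
    for name, rec_sex, births in records:
        if rec_sex != sex:
            continue
        total += births
        if name.lower().endswith(letter):
            matching += births
    return matching / total if total else 0.0


# Hypothetical toy records (name, sex, births); not real data.
toy_year = [
    ("Ethan", "M", 30),
    ("Mason", "M", 20),
    ("Noah", "M", 40),
    ("Jacob", "M", 10),
    ("Megan", "F", 50),
]

print(share_ending_in(toy_year))  # 0.5
```

Note that this weights each name by its number of births, so it measures the share of boys rather than the share of distinct names.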

A few weeks ago, I made an animated gif to visualize this phenomenon, and it became, well, probably not viral per se, but at least antibiotic-resistant:


The next obvious interesting question to me is: the trend was over 40 years old when Ms. Wattenberg noticed it, so why did it take so long? The usual answers to this question (lack of computing power, lack of published data) are certainly factors, but I think by the late '90s someone could have figured this out -- if they had been familiar enough with the data to think of it as a potential route of analysis, which only Ms. Wattenberg seems to have been. This means the trend is somehow hidden in the data, and that piques my interest. (It reminds me of Simpson's paradox, while simultaneously being totally different.)

The problem, I think, is twofold: (a) most people are exposed to baby names as a "Top 10" list, or at best a "Top 100" list, without context, and these names tell less and less of the whole story. In 1950, the Top 10 names made up 33% of births, while in 2013 they made up less than 9%, as names have become more diverse. (Also, in 1950 there were 5,700 unique names in the database, while in 2013 there were 17,800.) And (b) patterns in the last letter are less obvious to casual analysis than first-letter patterns, especially when the second-last letter varies (EthAn, JasOn, JaydEn, etc.).
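The two concentration measures mentioned above (the Top 10's share of births and the number of unique names) can be sketched as follows. This is only an illustration under assumptions: records are (name, sex, births) tuples, the helper names are hypothetical, and the toy numbers are invented. Whether the unique-name counts quoted above cover one sex or both is not stated, so the sex filter is left optional.

```python
def top_k_share(records, k=10, sex="M"):
    """Share of births of one sex accounted for by its k most popular names."""
    counts = sorted((births for _, s, births in records if s == sex), reverse=True)
    total = sum(counts)
    return sum(counts[:k]) / total if total else 0.0


def unique_names(records, sex=None):
    """Number of distinct names, optionally restricted to one sex."""
    return len({name for name, s, _ in records if sex is None or s == sex})


# Hypothetical toy records (name, sex, births); not real data.
toy_year = [
    ("Noah", "M", 50),
    ("Jacob", "M", 30),
    ("Ethan", "M", 10),
    ("Mason", "M", 5),
    ("Riley", "M", 5),
    ("Riley", "F", 20),
]

print(top_k_share(toy_year, k=2))  # 0.8
print(unique_names(toy_year))      # 5
```

Running both functions over each year's records would give the two time series being compared: a falling Top 10 share alongside a rising count of distinct names.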

Comparing how often names ending in 'n' show up in the Top 10 with their popularity overall shows that, except for brief periods, this fad was not reflected in the most popular names (i.e. the black bars are mostly below the red line):
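A rough sketch of that comparison for a single year is given here. It is an assumption about how the two quantities might be computed, not the code used for the chart: the "bar" is taken to be the fraction of the Top 10 names that end in 'n', and the "line" the share of all boys' births with such names. The function name and toy records are hypothetical.

```python
def top_k_vs_overall(records, k=10, letter="n", sex="M"):
    """Return (fraction of the top-k names ending in `letter`,
    share of all births of that sex with names ending in `letter`)."""
    rows = sorted(
        ((name, births) for name, s, births in records if s == sex),
        key=lambda row: (-row[1], row[0]),  # most births first, ties by name
    )
    total = sum(births for _, births in rows)
    if not rows or total == 0:
        return 0.0, 0.0
    top = rows[:k]
    top_fraction = sum(1 for name, _ in top if name.lower().endswith(letter)) / len(top)
    overall = sum(b for name, b in rows if name.lower().endswith(letter)) / total
    return top_fraction, overall


# Hypothetical toy records (name, sex, births); not real data.
toy_year = [
    ("Noah", "M", 40),
    ("Jacob", "M", 30),
    ("Ethan", "M", 10),
    ("Mason", "M", 10),
    ("Logan", "M", 10),
]

print(top_k_vs_overall(toy_year, k=2))  # (0.0, 0.3)
```

In the toy data the two most popular names contain no 'n' endings even though 30% of births do, which is the same kind of gap the chart shows between the bars and the line.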

This screams out for a quantile analysis to determine which level of popularity drove the rise of names ending in 'n'; you will see the result at the top of this post.

The pre-World War II distribution, with the most popular name, "John", sitting three quintiles above the second-most, "Benjamin", is interesting, but the most telling pattern is in the top three quintiles after 1950. Gains in popularity in this kind of graph will always show a "rolling" of peaks from lower quintiles to higher, but in this graph the gain is not symmetrical; most of it was led by the second quintile, and the most popular names still lag behind the overall popularity of 'n' names. In other words, U.S. high schools and colleges right now have lots of Ethans, Masons and Jaydens, but they will be most likely to have multiple Michaels and Jacobs.
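One possible way to set up a quintile breakdown like the one discussed above is sketched below. The binning rule is an assumption, since the post does not spell out how the quintiles were defined: names are ranked by births, split into five bins that each cover roughly 20% of births, and the 'n' share is computed within each bin. The function name and the toy records are hypothetical.

```python
def letter_share_by_bin(records, letter="n", sex="M", bins=5):
    """Share of births with names ending in `letter`, per popularity bin.

    Names are ranked from most to least popular and assigned to `bins`
    groups of roughly equal total births (bin 0 holds the most popular
    names). A bin that receives no names is reported as None.
    """
    rows = sorted(
        ((name, births) for name, s, births in records if s == sex),
        key=lambda row: (-row[1], row[0]),  # most births first, ties by name
    )
    total = sum(births for _, births in rows)
    if total == 0:
        return [None] * bins
    matching = [0] * bins
    in_bin = [0] * bins
    cumulative = 0
    for name, births in rows:
        # Place the name in the bin containing the midpoint of its births.
        midpoint = cumulative + births / 2
        idx = min(int(midpoint * bins / total), bins - 1)
        in_bin[idx] += births
        if name.lower().endswith(letter):
            matching[idx] += births
        cumulative += births
    return [m / b if b else None for m, b in zip(matching, in_bin)]


# Hypothetical toy records (name, sex, births); not real data.
toy_year = [
    (name, "M", 10)
    for name in ["Adam", "Ben", "Caleb", "Dylan", "Ethan",
                 "Finn", "Gus", "Hugh", "Ian", "Jack"]
]

print(letter_share_by_bin(toy_year))  # [0.5, 0.5, 1.0, 0.0, 0.5]
```

Repeating this for every year and plotting one line per bin would give a picture of which popularity tier carries the 'n' ending at any given time. Other binning rules (for example, equal numbers of names per bin) would be just as defensible and could give a different picture.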

The underlying phenomenon, I think, is a linguistic one: there just happen to be a lot of names ending in 'n' which strike a balance between conformity and individuality that many parents are looking for. Their son can have a name that doesn't make him stand out, but that isn't exactly the same as all the rest. The similarity pool simply happens to be much smaller for other popular names: the #1 name for 2013, announced three days ago, was "Noah". The rest of the "h" names are even more biblical, which is not what everyone wants. (#334 is "Messiah" -- no pressure, kid.) Number 2 is "Jacob", and after "Caleb" all the other "b" names are uncommon derivatives ("Jakob", "Kaleb", etc.). (#4852 is "Gleb", which I hope means the typists' fingers slipped from the adjacent "n" key a few times.)

Finally, I would like to post something related to my previous 'n'-names post: Dana Silver, a computer science major in Vermont, made a D3.js version of my animated histogram with a slider that allows you to choose the year you would like to see. This had been my first impulse, but my JavaScript is so rudimentary that I just didn't have the courage to tackle the problem. (I managed to get something that worked in jQuery UI, but it looked terrible.) Note that (a) her y-axis autoscales, which Edward Tufte might frown upon for changing the data-to-ink ratio every year, but it sure makes the differences between years more obvious, and (b) you can also see girls' names, which do not show a similar pattern; I suspect there are

If anyone is interested in the code I used to do the last two analyses and make the charts for boys' names ending in 'n' (along with some nice IPython notebooks that may be of interest even to non-pythonistas), check my sister blog, prooffreaderplus.

 * I so highly recommend Ms. Wattenberg's Baby Name Wizard blog that I have to carefully meter my own exposure to it so as not to bias my own results! Unlike Ms. Wattenberg, I am not a baby name expert; I just like finding hidden patterns and disguised strengths and weaknesses in data sets, and this is a good data set to play with. When Social Security released the 2013 baby names three days ago (which are incorporated into this post), I discarded the idea of analysing them after about 10 seconds because I knew she would do it better, being more intimately familiar with the history of each name. Seriously, if you're interested in the subject, visit her blog. She was mentioned in Wes McKinney's de rigueur Python for Data Analysis (O'Reilly 2012), and as far as I'm concerned there is no higher praise.

