Reference.
Every time I post about the popular U.S. Social Security Administration baby names dataset, I try to acknowledge that there are some serious problems with it -- and by "problems", I mean things the average person unfamiliar with it will assume are true, but which actually aren't, especially prior to World War II. I've covered all of these to one degree or another in my previous baby names posts here and here and here and here and here and here, but there are always a few questions from readers, so I thought it would be nice to be able to link to something that explained all the major concerns clearly and concisely:


Tableau Public's new Story View feature is well-suited to this kind of presentation, and I'll add panels if and when I come across more problematic aspects of sufficient magnitude.

I'd like to reiterate one thing: the problem isn't in the data, it's in how it's often presented and understood. The Social Security Administration does not make any false claims whatsoever (although IMHO they could make their disclaimers more prominent). And some of the baby names blogs and websites make a decent effort to address these issues, or at least not to make unsupportable conclusions based on the data.

Explanation, if you don't get out much.
A few attempts have been made to determine the trendiest baby names in the U.S. Social Security Administration database; FlowingData, for example, looked at the quickest rises and falls, and determined that Catina was the most flash-in-the-pan name; however, at its peak, it comprised only 0.0097% of girls' names. This is a perfectly legitimate analysis, but it's been in the back of my mind that to measure an admittedly ill-defined quality like "trendiness", maybe overall popularity should count as well as steepness of rise and fall.

Therefore, I turned to a technique I've used in chemometrics (I knew it would come in handy one day; it's been years since I've touched a gas chromatograph) to analyze peaks for both size and sharpness. First, the results:



For comparison, here are the much sharper peaks for the two names that had the quickest rise and fall regardless of overall popularity, Catina and Deneen*, and then those peaks in comparison to the trendiest name according to this technique, Linda:




Here's an explanation of how the "trendiness" score for this analysis was determined; peak height divided by peak width (which can be measured in various ways) is a pretty standard metric in chemistry:


With chromatographic peaks we normally use 50% peak heights, but they're of a more predictable shape. The 10% figure I chose is entirely arbitrary, but it seems to strike a good balance between allowing and disallowing names due to weird shapes and baseline noise. Nothing changes if you go down to 5% or up to 20%.

The beauty of this approach is that it is almost equally sensitive to changes in peak height or peak width, i.e. popularity of the name or length of time the name was popular.
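For the curious, here's a minimal sketch of that height-over-width score in Python. It's an illustration under assumptions, not the code from the notebook: the function name `trendiness`, the input format (a dict of year to percent of births) and the way width is measured (the run of consecutive years around the peak that stay at or above 10% of the peak height) are all my own choices for the example. Years missing from the data are treated as zero.

```python
def trendiness(series, threshold=0.10):
    """Peak height divided by peak width.

    series: dict mapping year -> share of births for one name (hypothetical format).
    threshold: fraction of the peak height at which the width is measured.
    Width here is an assumption: the number of consecutive years around the
    peak whose value is at or above threshold * peak height.
    """
    if not series:
        return 0.0
    first, last = min(series), max(series)
    # Fill years with no record as 0.0 so gaps don't shift the curve.
    values = [series.get(year, 0.0) for year in range(first, last + 1)]

    peak = max(values)
    if peak <= 0:
        return 0.0
    peak_idx = values.index(peak)
    cutoff = threshold * peak

    # Walk outwards from the peak until the curve drops below the cutoff.
    left = peak_idx
    while left > 0 and values[left - 1] >= cutoff:
        left -= 1
    right = peak_idx
    while right < len(values) - 1 and values[right + 1] >= cutoff:
        right += 1

    width_in_years = right - left + 1
    return peak / width_in_years


# Two made-up names with the same peak height but different widths.
sharp = {1950: 0.1, 1951: 1.0, 1952: 2.0, 1953: 1.0, 1954: 0.1}
broad = {1950: 0.1, 1951: 1.0, 1952: 1.5, 1953: 2.0,
         1954: 1.5, 1955: 1.0, 1956: 0.1}

print(trendiness(sharp) > trendiness(broad))
```

With these made-up numbers the narrower peak scores higher. Doubling every value in `sharp` doubles its score, because the 10% cutoff scales with the peak and the width stays the same.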

As has been remarked by many analysts, girls' names tend to rise higher and fall faster in popularity than boys' names; this analysis bears that out. There are really only two boys' names that one would consider to have a sharp peak; the other three are presidents' names or, in the case of Dewey, that of the hero of the Spanish-American War.

As always, be wary of numbers from this dataset before 1936, when Social Security numbers were first assigned.

I've put the top 100 trendy boys' and girls' names on my other, nerdier blog, prooffreaderplus.com.

Finally, here are links to my Baby Name GitHub Repo, and to an IPython notebook for this analysis.

* The names Catina and Deneen come from a soap opera and a musical act, respectively. Thanks to a reader who pointed out that missing values (nobody was named Catina before 1949) had shifted the peaks down around the year 1900; whoops, that was quite careless of me. The graph is now correct.








EDIT 2014-07-08: Sometimes the readers are just smarter than me! The original graphic (which you can see at the end of the post) had the letters in rows; I got six e-mails in the first few hours suggesting it would be a lot easier to compare across languages if they were in columns. When you're right, you're right.

My May 27 blog post on the distribution of letters in English toward the beginning, middle and end of words seemed well-received, and generated quite a few compliments, and not a few requests to do the same for other languages. One reader was even inspired to do a similar project in French.

Since I already had the code, I thought, why not? Now the only problem was getting my hands on a corpus; you can read about my adventures in this regard, as well as some more esoteric analysis of this data set, on my other, geekier blog; suffice it to say I was quite fortunate to find the Europarl Parallel Corpus, a collection of proceedings of the European Parliament with simultaneous translations in twenty languages. Since every language has the same subject matter, we're maximizing the chances that any differences we see are actually due to the language, not because of differences in the corpus.

I chose the seven languages with the most speakers in the European Parliament, plus Finnish because I thought it would be interesting to have a non-Indo-European language to compare as well.

Note that characters outside of the Basic Latin Unicode block* (accents, digraphs, etc.) are aggregated with their non-ornamented versions; this is not ideal, since they're not at all interchangeable (if you ask someone where a congrès is in Paris without the accent, they'll point you towards a seafood shop), but it's really the only way we can make the datasets comparable in this limited and hardly scholarly context.
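Here's a minimal sketch of one way to do that kind of folding in Python. It isn't necessarily how the charts were made: the function names `fold_to_basic_latin` and `letter_counts` are made up for this example, and it leans on Unicode NFKD decomposition from the standard library's `unicodedata` module. One caveat: letters with no decomposition (Polish "ł", German "ß", the "œ" ligature) pass through unchanged, so they would need an explicit mapping, and this sketch simply leaves them out of the counts.

```python
import unicodedata
from collections import Counter


def fold_to_basic_latin(text):
    """Strip combining marks so that e.g. 'è' becomes 'e'.

    Characters without a Unicode decomposition (such as 'ł') are returned as-is.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def letter_counts(words):
    """Count a-z letters across words after lowercasing and folding accents."""
    counts = Counter()
    for word in words:
        for ch in fold_to_basic_latin(word.lower()):
            if "a" <= ch <= "z":  # anything still outside a-z is skipped
                counts[ch] += 1
    return counts


print(fold_to_basic_latin("congrès"))
print(letter_counts(["È", "vero"]))
```

The first call turns "congrès" into "congres", which is exactly the seafood-shop problem described above: the accent is gone and the two words become indistinguishable.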

Note that to determine the shape of the individual graphs I followed the same methodology as last time, outlined here.

There are a lot of interesting features in the chart; I could stare at the thing for hours (and I have... well, 45 minutes, anyway). I'll just name a few here:

Vowels: Unsurprisingly, Spanish, Italian and Portuguese have the vowels "a" and "o" shifted towards the end of words, plus "e" and "i" in Italian. Non mi credi? È vero!
Foreign letters: Some languages use certain letters only in foreign or borrowed words (French, Spanish and Portuguese: k, w; German: x, y;* Italian: j, k, w, x, y; Polish: q, v, x; and Finnish: b, c, f, q, w, x, z). Look at the effect on "q": Finnish is the only language not to have it towards the beginning, because the most common Finnish words containing it in this corpus are foreign names like "Jacques", with no grammar words like "que".
D: English is alone in having the "d" most commonly* at the end of words, thanks to the past tense. Who'd have guessed? Well, anyone who thought about it, I suppose.
H: To my eye, "h" shows the most difference in distribution patterns: the most representative words are the, chaque, nicht, ha, che, senhor, tych and puhemies.
L: Because of their articles, French and Spanish's "l"s are much more front-heavy. Le phénomène, es la verdad. Similarly, due to their grammar, German and Finnish's "n"s are towards the end.
P: "P" is mostly* at the beginning of words in all eight languages; due to its source, this particular corpus has a preponderance of cognates of president, parliament and political, but even removing these words leaves the phenomenon intact. This makes intuitive sense to me (n.b. IANAL -- I am not a linguist); as a bilabial plosive, it's awkward at the end of words; consider the word "pep". The same phenomenon occurs to a lesser extent for p's voiced analogue, the letter "b". [relevant Sesame Street]
W, Y, Z: Letter frequencies aren't the point of this chart, but they're interesting in their own right. Look how much more often "w", "y" and "z" are used in Polish than in any other language (the word for "all" is "wszystkich"), and I never would have guessed "f" to be such a particularly English letter (though it's obvious in retrospect, given that among the most common words are "of" and "from").

Here's the original graphic:
* Thanks to readers for pointing out that my quickly googled source on German orthography was quite wrong, and that my phrasing seemed to be saying that only English ever had a d as the last letter of a word (which would, of course, not be la verdad). Also, since I deal with Unicode and UTF-8 difficulties every day, I forgot that some might be confused that I seemed to be claiming the Latin language had 26 letters.

Go on, do the math and verify there are 101 of them, you know you want to...
Copyright © 2012 prooffreader.com