Every time I post about the popular U.S. Social Security Administration baby names dataset, I try to acknowledge that it has some serious problems -- and by "problems", I mean things the average person unfamiliar with it will assume are true but which actually aren't, especially prior to World War II. I've covered all of these to one degree or another in my previous baby names posts here and here and here and here and here and here, but there are always a few questions from readers, so I thought it would be nice to be able to link to something that explains all the major concerns clearly and concisely:


Tableau Public's new Story View feature is well-suited to this kind of presentation, and I'll add panels if and when I come across more problematic aspects of sufficient magnitude.

I'd like to reiterate one thing: the problem isn't in the data, it's in how the data is often presented and understood. The Social Security Administration does not make any false claims whatsoever (although IMHO they could make their disclaimers more prominent), and some of the baby name blogs and websites make a decent effort to address these issues, or at least not to draw unsupportable conclusions from the data.

Explanation, if you don't get out much.
A few attempts have been made to determine the trendiest baby names in the U.S. Social Security Administration database; FlowingData, for example, looked at the quickest rises and falls and determined that Catina was the most flash-in-the-pan name; however, at its peak, Catina accounted for only 0.0097% of girls' names. That's a perfectly legitimate analysis, but it's been in the back of my mind that a measure of an admittedly ill-defined quality like "trendiness" should perhaps take overall popularity into account as well as steepness of rise and fall.

Therefore, I turned to a technique I've used in chemometrics (I knew it would come in handy one day; it's been years since I've touched a gas chromatograph) to analyze peaks for both size and sharpness. First, the results:



For comparison, here are the much sharper peaks for the two names that had the quickest rise and fall regardless of overall popularity, Catina and Deneen*, and then those peaks in comparison to the trendiest name according to this technique, Linda:




Here's an explanation of how the "trendiness" score for this analysis was determined; peak height divided by peak width (which can be measured in various ways) is a pretty standard metric in chemistry:


With chromatographic peaks we normally use the width at 50% of peak height, but those peaks have a more predictable shape. The 10% figure I chose here is entirely arbitrary, but it seems to strike a good balance between including and excluding names whose curves have weird shapes or baseline noise. Nothing in the results changes if you go down to 5% or up to 20%.

The beauty of this approach is that it is almost equally sensitive to changes in peak height or peak width, i.e. popularity of the name or length of time the name was popular.
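For the curious, here's a minimal Python sketch of how such a score could be computed from a name's yearly share of births. This is not the code from my notebook; the function name and the example numbers are purely illustrative:

import numpy as np

def trendiness(counts):
    # Trendiness = peak height divided by peak width, with the width
    # measured at 10% of the peak height (see above).
    counts = np.asarray(counts, dtype=float)
    peak = counts.max()
    if peak == 0:
        return 0.0
    above = np.where(counts >= 0.10 * peak)[0]   # years at or above 10% of peak
    width = above[-1] - above[0] + 1
    return peak / width

# A tall, narrow peak scores higher than a low, broad one:
print(trendiness([0, 0, 1, 8, 40, 9, 1, 0, 0]))      # ~13.3
print(trendiness([0, 5, 10, 15, 20, 15, 10, 5, 0]))  # ~2.9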

As many analysts have remarked, girls' names tend to rise and fall in popularity more steeply and more quickly than boys' names do, and this analysis bears that out. There are really only two boys' names whose peaks one would call sharp; the other three are presidents' names or, in the case of Dewey, that of the hero of the Spanish-American War.

As always, be wary of numbers from this dataset before 1936, when social security numbers were first assigned.

I've put the top 100 trendy boys' and girls' names on my other, nerdier blog, prooffreaderplus.com.

Finally, here are links to my Baby Name GitHub Repo, and to an IPython notebook for this analysis.

* The names Catina and Deneen come from a soap opera and a musical act, respectively. Thanks to a reader who pointed out that missing values (nobody was named Catina before 1949) had shifted the peaks down around the year 1900; whoops, that was quite careless of me. The graph is now correct.








EDIT 2014-07-08: Sometimes the readers are just smarter than me! The original graphic (which you can see at the end of the post) had the letters in rows; I got six e-mails in the first few hours suggesting it would be a lot easier to compare across languages if they were in columns. When you're right, you're right.

My May 27 blog post on the distribution of letters in English toward the beginning, middle and end of words seemed to be well received; it generated quite a few compliments and not a few requests to do the same for other languages. One reader was even inspired to do a similar project in French.

Since I already had the code, I thought, why not? The only problem was getting my hands on a corpus; you can read about my adventures in this regard, as well as some more esoteric analysis of this dataset, on my other, geekier blog. Suffice it to say I was quite fortunate to find the Europarl Parallel Corpus, a collection of proceedings of the European Parliament with simultaneous translations in twenty languages. Since every language covers the same subject matter, we maximize the chances that any differences we see are actually due to the language, not to differences in the corpus.

I chose the seven languages with the most speakers in the European Parliament, plus Finnish because I thought it would be interesting to have a non-Indo-European language to compare as well.

Note that characters outside the Basic Latin Unicode block* (accents, digraphs, etc.) are aggregated with their non-ornamented versions. This is not ideal, since they're not at all interchangeable (ask someone where a congrès is in Paris but leave off the accent, and they'll point you towards a seafood shop -- congres are conger eels), but it's really the only way to make the datasets comparable in this limited and hardly scholarly context.
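For anyone curious how that folding can be done, here's a minimal Python sketch (not the code actually used for this post); note that a few letters, such as Polish "ł" and German "ß", don't decompose this way and would need an explicit lookup:

import unicodedata

def fold_to_basic_latin(text):
    # Decompose accented characters (e.g. 'è' -> 'e' + combining grave accent)
    # and keep only the plain a-z letters and spaces that remain.
    decomposed = unicodedata.normalize('NFKD', text.lower())
    return ''.join(c for c in decomposed if 'a' <= c <= 'z' or c == ' ')

print(fold_to_basic_latin("Congrès für señor"))   # -> "congres fur senor"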

Note that to determine the shape of the individual graphs I followed the same methodology as last time, outlined here.
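For readers who don't want to click through, here is a rough sketch of the general idea rather than the exact methodology of that post: for each occurrence of a letter, record its relative position within its word, then histogram those positions (the binning choice here is mine):

from collections import defaultdict

def letter_position_profiles(words, bins=10):
    # For each letter, count its occurrences in bins of relative position
    # within a word: 0.0 = first letter, 1.0 = last letter.
    profiles = defaultdict(lambda: [0] * bins)
    for word in words:
        if len(word) < 2:
            continue   # a one-letter word has no meaningful position
        for i, letter in enumerate(word):
            rel = i / (len(word) - 1)
            profiles[letter][min(int(rel * bins), bins - 1)] += 1
    return profiles

profiles = letter_position_profiles(["the", "quick", "brown", "fox"])
print(profiles["o"])   # both "o"s fall in the middle bin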

There are a lot of interesting features in the chart; I could stare at the thing for hours (and I have... well, 45 minutes, anyway). I'll just name a few here:

Vowels: Unsurprisingly, Spanish, Italian and Portuguese have the vowels "a" and "o" shifted towards the end of words, plus "e" and "i" in Italian. Non mi credi? È vero!
Foreign letters: Some languages use certain letters only in foreign or borrowed words (French, Spanish and Portuguese: k, w; German: x, y;* Italian: j, k, w, x, y; Polish: q, v, x; Finnish: b, c, f, q, w, x, z). Look at the effect on "q": Finnish is the only language where it doesn't sit towards the beginning of words, because its "q"s in this corpus come from foreign names like "Jacques" rather than from grammar words like "que".
D: English is alone in having "d" most commonly* at the end of words, thanks to the past tense. Who'd have guessed? Well, anyone who thought about it, I suppose.
H: To my eye, "h" shows the greatest variation in distribution patterns: the most representative words are the, chaque, nicht, ha, che, senhor, tych and puhemies.
L: Because of their articles, French and Spanish "l"s are much more front-heavy. Le phénomène, es la verdad. Similarly, because of their grammar, German and Finnish "n"s sit towards the end.
P: "P" is mostly* at the beginning of words in all eight languages; because of its source, this particular corpus has a preponderance of cognates of president, parliament and political, but even removing those words leaves the phenomenon intact. This makes intuitive sense to me (n.b. IANAL -- I am not a linguist): as a bilabial plosive, "p" is awkward at the end of words; consider the word "pep". The same phenomenon occurs to a lesser extent with "p"'s voiced analogue, the letter "b". [relevant Sesame Street]
W, Y, Z: Letter frequencies aren't the point of this chart, but they're interesting in their own right. Look how much more often "w", "y" and "z" are used in Polish than in any other language (the word for "all" is "wszystkich"), and I never would have guessed "f" to be such a particularly English letter (though it's obvious in retrospect, given that among the most common English words are "of" and "from").

Here's the original graphic:
* Thanks to readers for pointing out that my quickly googled source on German orthography was quite wrong, and that my phrasing seemed to be saying that only English ever has a "d" as the last letter of a word (which would, of course, not be la verdad). Also, since I deal with Unicode and UTF-8 difficulties every day, I forgot that the phrase "Basic Latin" might confuse some readers into thinking I was claiming the Latin language had 26 letters.

Go on, do the math and verify there are 101 of them, you know you want to...

If you prefer, you can watch this as a GIF or on YouTube.

The animation above shows all earthquakes with epicenters in the bounded area and magnitudes of 5.0 or greater. The first slide says "Richter scale" because that's the term most familiar to most people, but the actual scale used is the moment magnitude scale; for these quakes it's generally within a few tenths of the 1930s-era Richter value.

The data is from IRIS (the Incorporated Research Institutions for Seismology); the maps were produced in Python with Pandas, Matplotlib and Basemap, the animation with GIMP, and the HTML5 conversion with gfycat.

The width of the circles is a compromise among the magnitude number, circle area, perceived difference in circle size, and total energy released. I chose a scale that (a) is intermediate between the extremes of size that every approach suggested, and (b) has a simple formula, so that it is, at the very least, transparent:
circle size = 20 pixels * (magnitude - 4.5) ^ 1.5
As you can see, it's arbitrary, but there is no non-arbitrary scale that is not equally misleading in certain respects, which is probably why IRIS does not vary the circle size at all in their maps, instead indicating intensity with color (which has its own perceptual issues, unfortunately: pdf).
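Purely as an illustration of that rule (this isn't the mapping code, and whether "size" here means radius, diameter or marker area is deliberately left vague), here's what it produces for a few magnitudes:

def circle_size(magnitude):
    # The scaling rule quoted above: 20 * (magnitude - 4.5) ^ 1.5
    return 20.0 * (magnitude - 4.5) ** 1.5

for m in (5.0, 6.0, 7.0, 9.0):
    print(m, round(circle_size(m), 1))   # 7.1, 36.7, 79.1, 190.9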

The main thing to note is that circle size is somewhat related to the area of the surface affected by the earthquake, but the relationship is very fuzzy; different geographical features affect how far an earthquake's effects travel (faults, for example, actually contain them in a smaller area). The most important difference in scale for the animation to show is that earthquakes of magnitude 5.0, which can be felt and are alarming but generally cause little damage in areas prepared for them, appear as small circles, while the 42 quakes of magnitude 7 or above are quite perceptually different.

Feel free to disagree and comment. As usual, I don't claim to have found the solution to a quandary, just a solution, and I'm sure there are better ideas out there.

As you watch the animation, you may want to keep an eye out for the following notable earthquakes (and you'll also notice a lot of large earthquakes that are not notable, because thankfully the damage they caused was not in proportion to their magnitude).

 •  July 1976: The Tangshan Earthquake (magnitude 7.5) on the northern Chinese coast near Korea. The deadliest earthquake of the 20th century, killing between 250,000 and 650,000 people.
 •  January 1995: The Great Hanshin Earthquake (magnitude 7.3) near Kobe in southern Japan. Caused $100 billion in damage, 2.5% of Japan's GDP at the time.
 •  September 1999: The 9/21 Earthquake (magnitude 7.5) in Taiwan (at the very bottom edge of this map).
 •  May 2008: The Great Sichuan Earthquake (magnitude 7.9) in central China. Killed about 70,000 people.
 •  March 2011: The Tōhoku earthquake (magnitude 9.0) and tsunami, the fifth largest earthquake in modern times; hundreds of huge aftershocks appear on the map all the way through December 2013.
