Tuesday, May 27, 2014

Graphing the distribution of English letters towards the beginning, middle or end of words




Some data visualizations tell you something you never knew. Others tell you things you knew, but didn't know you knew. This was the case for this visualization.

Many choices had to be made to visually present this essentially semi-quantitative data (how do you compare a 3- and a 13-letter word?). I semi-exhaustively explain everything on my other, geekier blog, prooffreaderplus, and provide the code I used; I'll just repeat the most crucial points here:

    The data is from the entire Brown corpus in the Natural Language Toolkit. It's a smaller and out-of-date corpus, but it's open source and easy to obtain. I repeated the analysis with COHA, the Corpus of Historical American English, a well-curated, proprietary data set from Brigham Young University for which I have a license, and the only differences were in rare letters like "z" or "x".
    I used a corpus rather than a dictionary so that the visualization would be weighted towards true usage. In other words, the most common word in English, "the", influences the graphs far more than, for example, "theocratic".
    The ordinate (y) scales are obviously not equal: "e" is used 100-200 times more often than "z", and while I could have fudged everything with log scales, letter frequency is not the point of the graphs. As long as I had to fudge anyway, I did so in a way that, I believe, makes it easiest to understand what the graph shows. Your mileage may, of course, vary. The color coding is a quick guide to help understanding, since letter frequency is of course relevant to the shapes you see.
    There are 15 "bins" of letter positions, as a purely qualitative comparison suggested to me this was about the ideal number to show the underlying trends without under- or overfitting. Therefore the "t" in "the" takes up positions 1 through 5, the "h" 6 through 10, etc. When letters straddle a boundary they are apportioned proportionately.
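For the curious, here is a minimal sketch of how that binning could work. It is not the code from prooffreaderplus: the function names `bin_word` and `letter_profiles` are made up for illustration, and it assumes every word is stretched over the same 15 bins and weighted by how often it occurs in the corpus.

```python
from collections import Counter

N_BINS = 15  # number of position bins, as described above


def bin_word(word, n_bins=N_BINS):
    """Spread each letter of `word` over `n_bins` equal-width position bins.

    Returns a dict mapping letter -> list of n_bins weights. Each letter
    occurrence covers n_bins / len(word) bins in total; a letter that
    straddles a bin boundary is split in proportion to the overlap.
    """
    if not word:
        return {}
    width = n_bins / len(word)
    weights = {}
    for i, letter in enumerate(word):
        start, end = i * width, (i + 1) * width
        row = weights.setdefault(letter, [0.0] * n_bins)
        for b in range(n_bins):
            # Length of the overlap between [start, end) and bin [b, b + 1)
            overlap = min(end, b + 1) - max(start, b)
            if overlap > 0:
                row[b] += overlap
    return weights


def letter_profiles(word_counts, n_bins=N_BINS):
    """Sum the binned weights over a corpus, weighting each word by its count."""
    totals = {}
    for word, count in word_counts.items():
        for letter, row in bin_word(word, n_bins).items():
            acc = totals.setdefault(letter, [0.0] * n_bins)
            for b, w in enumerate(row):
                acc[b] += w * count
    return totals


# Toy "corpus" (hypothetical counts, not Brown corpus figures)
toy_counts = Counter({"the": 3, "toe": 1})
profiles = letter_profiles(toy_counts)
```

With this scheme `bin_word("the")` puts the "t" in the first five bins, the "h" in the next five and the "e" in the last five, while a four-letter word gives each letter 3.75 bins and splits the boundary bins accordingly.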

Now then: I became curious about how letters are placed in English while doing many different, often quick, sometimes pointless, pattern analyses of letters for a wide variety of reasons. (One example: for one art project that will hopefully be posted on this blog one day, I found all the anagrams of "Hollywood", and noticed that words beginning with "w" were overrepresented.)

I've had many "oh, yeah" moments looking over the graphs. For example, words almost never begin with "x", but it's quite common as the second letter. There's a little hump near the beginning of "u" that's caused by its proximity to "q", which is most common at the beginning of a word. When you remove "q" from the dataset, the hump disappears. "F" occurs toward the extremes, especially in prepositions ("for", "from", "of", "off") but rarely just before the middle.

A final thought: the most common word in the English language is "the", which makes up about 6% of most corpuses (sorry, corpora). But according to these graphs, the most representative word is "toe".

Monday, May 26, 2014

Tuesday, May 20, 2014

The five commandments (and fifteen footnotes) of data visualization

The five[1]  commandments[2]  of data visualization[3] 

I'm nobody special in the world of data viz; I have no profound observations or innovations to add to those of the likes of Edward Tufte, Hans Rosling, Hadley Wickham or Mike Bostock; but I think I have a little common sense and boots-on-the-ground experience when it comes to the more mundane, journeyman work of making a PowerPoint slide and being proud it doesn't use Comic Sans[4]. (I use Python now; I'm never going back.) By all means, if you have the time and luxury, concern yourself with data-to-ink ratio[5]; but before that point, here are a few tips to help ensure there's any ink at all.[6]

(Comments, suggestions, dissenting opinions and, especially, corrections are very welcome, don't be shy. Yes, I'm saying don't be shy to the Internet. Stand back.)

1. A graph is like a paragraph.

A visualization should have more to say than a sentence or a sparkline, but less to say than a short story or, well, the raw data itself. A data visualization needs to strike a balance between saying too much and saying too little.[7]  A good rule of thumb I use is: if one really well-written footnote might help someone understand the graph a bit better, I've done my job right, and often I don't even end up using that footnote. If the footnote is absolutely necessary or two footnotes present themselves, or if not even a five-year-old would see any value in a footnote whatsoever, then maybe the scope of the graph is wrong.

2. Visualization is translation

The Italians have a saying: "Traduttore, traditore", roughly "Translation is treason". Creating a visualization is translating raw data into another medium, and it involves loss of information, and it involves choices, hard choices, desperate choices[8]. Think of it as describing a movie to your significant other. ("I know you don't like action movies, but it was so cool when Schwarzenegger threw this grenade..."[9]) You can't describe the entire movie, nor does your audience want you to. They trust you to make the choices necessary to get the pattern hidden in the data across with skill, clarity and integrity. Which leads me to...

3. Visualize with integrity

When you were a child, someone must have told you honesty is the best policy[10], and they were right. When in doubt, act with integrity. Actually, when not in doubt too. Actually, especially when not in doubt: if you're not wondering if you're making the right choices when you're deciding how to visually present your data, you're doing it wrong.[11] Basically everything I need to say about integrity, Fox News has said more eloquently (and just slightly less intentionally) than I could[12]:


4. The second-worst outcome is for someone not to understand your graph.

There are lots of very complicated visualizations out there. They certainly have their place, but they probably should be tackled by the elite among us.[13] But even with more modest goals in a visualization, it's quite possible for the message to get muddled in the medium[14]. Boxplots are brilliant. I absolutely love boxplots. But so few people understand them that more time gets spent explaining how they work than on trying to understand the actual data. Of course, it all depends on your audience, and your mileage may vary.

5. The worst outcome is for someone to MISunderstand your graph.

What's obvious to you as the creator might not be so to your audience. Always ask yourself what a fresh pair of eyes might understand from your graph. If necessary, go and find a fresh pair of eyes. [15]  When someone misunderstands what your visualization presents, no matter how obvious you think it is, you failed. ("You had one job!") You are intimately familiar with your data: others are not, and first erroneous impressions are hard to erase, and people get upset when they need to rearrange their brains. Don't scoff, you do it too.

6. Don't limit yourself to your first idea.

Like "This will be a list with five items on it."[16] 


Monday, May 12, 2014

Skyrocketing of boys' names ending in 'n' is not reflected in the most popular names overall

(Click on chart to enlarge:)

In 2007, arguably the foremost statistician-cum-maven concerning the U.S. Social Security Administration's baby name database, Laura Wattenberg*, appears to have been the first to notice quite a dramatic trend: the rise of boys' names ending in "n" from about 15% in the 1950s to around 35% now.

A few weeks ago, I made an animated gif to visualize this phenomenon, and it became, well, probably not viral per se, but at least antibiotic-resistant:



The next obvious interesting question to me is: the trend was over 40 years old when Ms. Wattenberg noticed it, so why did it take so long? The usual answers to this question, lack of computing power, lack of published data, are certainly factors, but I think by the late '90s someone could have figured this out -- if they had been familiar enough with the data to think of it as a potential route of analysis, which only Ms. Wattenberg seems to have been. This means the trend is somehow hidden in the data, and that piques my interest. (It reminds me of Simpson's paradox, while simultaneously being totally different.)

The problem, I think, is twofold: (a) most people are exposed to baby names as a "Top 10" list, or at best a "Top 100" list, without context, and these names tell less and less of the whole story. In 1950, the Top 10 names made up 33% of births, while in 2013 they made up less than 9%, as names have become more diverse. (Also, in 1950 there were 5,700 unique names in the database, while in 2013 there were 17,800.) And (b) patterns in the last letter are less obvious to casual analysis than first-letter patterns, especially when the second-last letter varies (EthAn, JasOn, JaydEn, etc.)
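To make those two quantities concrete, here is a rough sketch of how one might compute them for a single year. It is an illustration rather than the notebook code: the helper names are invented, and the counts in `toy_year` are made-up numbers, not Social Security data.

```python
def top_n_share(births_by_name, n=10):
    """Fraction of all births accounted for by the n most popular names."""
    total = sum(births_by_name.values())
    top = sorted(births_by_name.values(), reverse=True)[:n]
    return sum(top) / total


def ending_share(births_by_name, letter="n"):
    """Fraction of all births whose name ends in `letter` (case-insensitive)."""
    total = sum(births_by_name.values())
    hits = sum(count for name, count in births_by_name.items()
               if name.lower().endswith(letter))
    return hits / total


def ending_share_in_top(births_by_name, letter="n", n=10):
    """Fraction of the n most popular names (counting names, not births)
    that end in `letter`."""
    ranked = sorted(births_by_name, key=births_by_name.get, reverse=True)[:n]
    return sum(name.lower().endswith(letter) for name in ranked) / len(ranked)


# Hypothetical counts for one year, for illustration only
toy_year = {"Michael": 50, "Noah": 50, "Jacob": 40,
            "Ethan": 30, "Mason": 20, "Jayden": 10}

overall = ending_share(toy_year)                # share of births ending in 'n'
in_top3 = ending_share_in_top(toy_year, n=3)    # share of top-3 names ending in 'n'
top2 = top_n_share(toy_year, n=2)               # share of births in the top 2 names
```

In this toy year, names ending in 'n' account for 30% of births but none of the top three names, which is the kind of gap between the overall trend and the "Top N" list described above.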

Comparing a chart of how often names ending in 'n' show up in the Top 10 with their popularity overall shows that, except for brief periods, this fad was not reflected in the most popular names (i.e. the black bars are mostly below the red line):


This screams out for a quantile analysis to determine at what level of popularity the rise of names ending in 'n' was driven, which you will see at the top of this post.

The pre-World War II distribution of the most popular name, "John" being three quintiles above the second-most, "Benjamin", is interesting, but the most telling pattern is in the top three quintiles after 1950. Gains in popularity in this kind of graph will always show a "rolling" of peaks from lower quintiles to higher, but in this graph, this gain is not symmetrical; most of it was led by the second quintile, and the most popular names still lag behind the overall popularity of 'n' names. In other words, U.S. high schools and colleges right now have lots of Ethans, Masons and Jaydens, but they will be most likely to have multiple Michaels and Jacobs.

The underlying phenomenon, I think, is a linguistic one: there just happen to be a lot of names ending in 'n' which strike a balance between conformity and individuality that many parents are looking for. Their son can have a name that doesn't make him stand out, but that isn't exactly the same as all the rest. The similarity pool simply happens to be much smaller for other popular names: the #1 name for 2013, announced three days ago, was "Noah". The rest of the "h" names are even more biblical, which is not what everyone wants. (#334 is "Messiah" -- no pressure, kid.) Number 2 is "Jacob", and after "Caleb" all the other "b" names are uncommon derivatives ("Jakob", "Kaleb", etc.). (#4852 is "Gleb", which I hope means the typists' fingers slipped from the adjacent "n" key a few times.)


Finally, I would like to post something related to my previous 'n'-names post: Dana Silver, a computer science major in Vermont, made a D3.js version of my animated histogram with a slider that allows you to choose the year you would like to see. This had been my first impulse, but my JavaScript is so rudimentary, I just didn't have the courage to tackle the problem. (I managed to get something that worked in jQuery UI, but it looked terrible.) Note that (a) her y-axis autoscales, which Edward Tufte might frown upon for changing the data-to-ink ratio every year, but it sure makes the differences between years more obvious, and (b) you can also see girls' names, which do not show a similar pattern; I suspect there are

If anyone is interested in the code I used to do the last two analyses and make the charts for boys' names ending in 'n' (and some nice IPython notebooks that may be of interest even to non-Pythonistas), check my sister blog, prooffreaderplus.

      Footnote: 
 * I so highly recommend Ms. Wattenberg's Baby Name Wizard blog that I have to carefully meter my own exposure to it so as not to bias my own results! Unlike Ms. Wattenberg, I am not a baby name expert; I just like finding hidden patterns and disguised strengths and weaknesses in data sets, and this is a good data set to play with. When Social Security released the 2013 baby names three days ago (which are incorporated into this post), I discarded the idea of analysing them after about 10 seconds because I knew she would do it better, being more intimately familiar with the history of each name. Seriously, if you're interested in the subject, visit her blog. She was mentioned in Wes McKinney's de rigueur Python for Data Analysis (O'Reilly 2012), and as far as I'm concerned there is no higher praise.

Thursday, May 8, 2014
