Graphing Distribution of English

Some data visualizations tell you something you never knew. Others tell you things you knew, but didn’t know you knew. This visualization was the second kind.

Many choices had to be made to visually present this essentially semi-quantitative data (how do you compare a 3- and a 13-letter word?). I semi-exhaustively explain everything on my other, geekier blog, prooffreaderplus, and provide the code I used; I’ll just repeat the most crucial points here:

    The data is from the entire Brown corpus in the Natural Language Toolkit. It’s a smaller and out-of-date corpus, but it’s open source and easy to obtain. I repeated the analysis with COHA, the Corpus of Historical American English, a well-curated, proprietary data set from Brigham Young University for which I have a license, and the only differences were in rare letters like “z” or “x”.
    I used a corpus rather than a dictionary so that the visualization would be weighted towards true usage. In other words, the most common word in English, “the”, influences the graphs far more than, for example, “theocratic”.
    The ordinate (y) scales are obviously not equal: “e” is used 100-200 times more often than “z”, and while I could have fudged everything with log scales, letter frequency is not the point of the graphs. As long as I had to fudge anyway, I did so in a way that, I believe, makes it easiest to understand what the graph shows. Your mileage may, of course, vary. The color coding is a quick guide to help understanding, since letter frequency is of course relevant to the shapes you see.
    There are 15 “bins” of letter positions, as a purely qualitative comparison suggested to me this was about the ideal number to show the underlying trends without under- or overfitting. Therefore the “t” in “the” takes up positions 1 through 5, the “h” 6 through 10, etc. When letters straddle a boundary they are apportioned proportionately.
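The binning and corpus weighting in the points above can be sketched roughly as follows. This is a minimal illustration, not the code posted on prooffreaderplus: the names `letter_bin_weights` and `position_profile` are hypothetical, the word counts are a toy stand-in for a real corpus, and it assumes each word is stretched evenly across the 15 bins, so every word contributes the same total weight before being multiplied by its frequency. The original analysis may normalize differently.

```python
from collections import Counter, defaultdict

N_BINS = 15  # number of position bins, as described in the post


def letter_bin_weights(word, n_bins=N_BINS):
    """Yield (letter, bin_index, weight) for every letter/bin overlap.

    The word is stretched evenly over n_bins bins. A letter that straddles
    a bin boundary is split in proportion to the overlap.
    """
    length = len(word)
    for i, letter in enumerate(word):
        # Work in units of 1/length of a bin so all boundaries are integers.
        start, end = i * n_bins, (i + 1) * n_bins
        for b in range(start // length, n_bins):
            lo = max(start, b * length)
            hi = min(end, (b + 1) * length)
            if hi <= lo:
                break  # this letter does not reach bin b
            yield letter, b, (hi - lo) / length


def position_profile(word_counts, n_bins=N_BINS):
    """Sum frequency-weighted bin weights per letter (assumed weighting)."""
    profile = defaultdict(lambda: [0.0] * n_bins)
    for word, count in word_counts.items():
        for letter, b, weight in letter_bin_weights(word.lower(), n_bins):
            if letter.isalpha():
                profile[letter][b] += weight * count
    return dict(profile)


# Toy "corpus": counting tokens (not distinct words) is what lets "the"
# outweigh rarer words.
toy_counts = Counter("the cat sat on the mat".split())
toy_profile = position_profile(toy_counts)
print(toy_profile["t"])
```

With a three-letter word such as “the”, each letter lands in exactly five bins (indices 0 to 4 for “t”, which are positions 1 through 5 when counted from 1). With a four-letter word, each letter covers 3.75 bins, so the fourth bin is shared 0.75/0.25 between the first two letters.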

Now then: I became curious about how letters are placed in English while doing many different, often quick, sometimes pointless, pattern analyses of letters for a wide variety of reasons. (One example: for one art project that will hopefully be posted on this blog one day, I found all the anagrams of “Hollywood”, and noticed that words beginning with “w” were overrepresented.)

I’ve had many “oh, yeah” moments looking over the graphs. For example, words almost never begin with “x”, but it’s quite common as the second letter. There’s a little hump near the beginning of “u” that’s caused by its proximity to “q”, which is most common at the beginning of a word. When you remove “q” from the dataset, the hump disappears. “F” occurs toward the extremes, especially in prepositions (“for”, “from”, “of”, “off”) but rarely just before the middle.
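One way to check the “q” effect on “u” is to measure how many occurrences of “u” directly follow a “q”, then drop the “q” words and look again. The sketch below is only an illustration: the helper names and the toy counts are made up, and it assumes that “removing q from the dataset” means discarding every word that contains a “q”.

```python
def u_after_q_share(word_counts):
    """Return the fraction of "u" occurrences that directly follow a "q"."""
    after_q = total = 0
    for word, count in word_counts.items():
        w = word.lower()
        for i, ch in enumerate(w):
            if ch == "u":
                total += count
                if i > 0 and w[i - 1] == "q":
                    after_q += count
    return after_q / total if total else 0.0


def without_letter(word_counts, letter):
    """Drop every word containing the given letter (assumed meaning of 'remove')."""
    return {w: c for w, c in word_counts.items() if letter not in w.lower()}


# Toy counts, invented for illustration only.
toy_counts = {"quite": 2, "but": 3, "queen": 1, "our": 2}
print(u_after_q_share(toy_counts))
print(u_after_q_share(without_letter(toy_counts, "q")))
```

Running the filtered counts through the same position graph as before would show whether the early hump in “u” survives.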

A final thought: the most common word in the English language is “the”, which makes up about 6% of most corpuses (sorry, corpora). But according to these graphs, the most representative word is “toe”.