The second graph below shows the reverse probabilities: given a U, what's the probability that a Q precedes it? Given that we're at the end of a word, what's the probability that the last letter is Y? Trust me, I know it requires a little mental agility; I've been working on this for a week and I still get mixed up. If you can understand why, in the top graph, the horizontal rows add up to 100% but the vertical columns don't, you've totally got it.
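If the rows-versus-columns business is still fuzzy, here's a minimal Python sketch of the idea. The counts are made up purely for illustration (they are not from the real data), and the function names `forward` and `reverse` are my own labels, not anything from the repo.

```python
# Made-up weighted counts of "current letter -> next letter" pairs.
# These numbers are illustrative only, not taken from the corpus.
counts = {
    "q": {"u": 10},
    "t": {"u": 5, "h": 85},
    "s": {"u": 15, "h": 5},
}

def forward(counts):
    """P(next | current): divide each row by that row's total."""
    out = {}
    for cur, row in counts.items():
        total = sum(row.values())
        out[cur] = {nxt: n / total for nxt, n in row.items()}
    return out

def reverse(counts):
    """P(previous | current): divide each column by that column's total."""
    col_totals = {}
    for row in counts.values():
        for nxt, n in row.items():
            col_totals[nxt] = col_totals.get(nxt, 0) + n
    out = {}
    for cur, row in counts.items():
        for nxt, n in row.items():
            out.setdefault(nxt, {})[cur] = n / col_totals[nxt]
    return out

fwd = forward(counts)
rev = reverse(counts)

# Each row of the forward table is a full probability distribution...
row_sums = {cur: sum(row.values()) for cur, row in fwd.items()}
# ...but a column just collects one entry from each of several distributions.
col_sum_u = sum(row.get("u", 0) for row in fwd.values())

print(row_sums)
print(col_sum_u)
print(rev["u"])
```

Each row answers "what comes after this letter?", so it has to cover every possibility and sum to 100%. A column only gathers how often one particular letter shows up as the answer, so there's no reason for it to sum to anything in particular. The reverse table is the same counts normalized the other way.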

A while back, I posted a series of charts about letter positions in English words. Nathan Yau of FlowingData was kind enough to write about it, and he suggested I look at letter proximity.

The source data is COHA, the Corpus of Historical American English; each word was analyzed and weighted by its frequency (so the "th" in "the" influenced the probability of H following T far more than the "th" in "theremin").
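To make the weighting concrete, here's a rough Python sketch of how frequency-weighted letter pairs might be tallied. It isn't the actual code from the repo: the word frequencies are invented, and the `END` marker and the `weighted_pair_counts` function are hypothetical names I'm using for illustration.

```python
from collections import defaultdict

END = "$"  # hypothetical marker standing in for "end of word"

def weighted_pair_counts(word_freqs):
    """Tally adjacent letter pairs, adding each word's frequency per pair."""
    counts = defaultdict(lambda: defaultdict(float))
    for word, freq in word_freqs.items():
        letters = list(word.lower()) + [END]
        # Walk through consecutive pairs: (t, h), (h, e), (e, END), ...
        for a, b in zip(letters, letters[1:]):
            counts[a][b] += freq
    return counts

# Invented frequencies, just to show the effect of weighting.
word_freqs = {"the": 1000, "theremin": 1, "to": 500}

counts = weighted_pair_counts(word_freqs)

# Probability that H follows T, given these toy frequencies.
t_total = sum(counts["t"].values())
p_h_after_t = counts["t"]["h"] / t_total
print(p_h_after_t)
```

With these toy numbers, "the" contributes 1000 to the T-then-H tally while "theremin" contributes just 1, which is the whole point of weighting by frequency rather than counting each distinct word once.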

Here's a GitHub repo with the code used to produce the data. After some experimentation, I found that both Plotly and Bokeh had serious drawbacks when it came to presenting heatmaps of this sort (which will presumably be addressed by later releases), so I went with Tableau Public, which took about 20 minutes, tops. Note that with this app, you can click things and hide things and have all kinds of fun.

Here's the graph of the probabilities of letters *preceding*, not following, one another. There are also static graphic versions at the very end. Enjoy!
