The following chart is an interactive heat map of the probability that, given the letter on the vertical axis in an English word, the next letter will be the letter on the horizontal axis.
Conditional probabilities can give you a headache; that’s why the Monty Hall problem is so difficult. The best way to grasp them is by example. Look at the darkest point, for QU. This shows that, GIVEN Q, the next letter is U 98.7% of the time. Similarly, the dark spot on the bottom left shows that GIVEN Y, the most probable event is that there is NO letter following it (signified by “_”), i.e. it’s at the end of a word.
The second graph below shows the reverse probabilities: given U, what’s the probability that a Q precedes it, and given we’re at the end of a word, what’s the probability that the last letter is Y? Trust me, I know it requires a little mental agility; I’ve been working on this for a week and I still get mixed up. If you can understand why, in the top graph, the horizontal rows add up to 100% but the vertical columns don’t, you’ve totally got it.
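If the rows-versus-columns point is still fuzzy, a tiny made-up example may help. The sketch below is not the code behind the charts; the pair counts are invented purely for illustration, and the function names `forward` and `reverse` are my own. It normalizes the same table of counts two ways: by row to get "given the first letter, what comes next", and by column to get "given the second letter, what came before".

```python
# Invented (hypothetical) weighted pair counts: counts[first][second].
# These numbers are for illustration only, not taken from the corpus.
counts = {
    "q": {"u": 10},
    "s": {"u": 30, "t": 60},
    "t": {"u": 60, "h": 240},
}

def forward(counts):
    """P(second | first): divide every cell by its ROW total."""
    return {
        first: {second: n / sum(row.values()) for second, n in row.items()}
        for first, row in counts.items()
    }

def reverse(counts):
    """P(first | second): divide every cell by its COLUMN total."""
    col_totals = {}
    for row in counts.values():
        for second, n in row.items():
            col_totals[second] = col_totals.get(second, 0) + n
    return {
        first: {second: n / col_totals[second] for second, n in row.items()}
        for first, row in counts.items()
    }

fwd = forward(counts)
rev = reverse(counts)

# Each row of the forward table is a full distribution, so it sums to 1.
row_sums = {first: sum(row.values()) for first, row in fwd.items()}

# The "u" column of the forward table mixes three different distributions,
# so there is no reason for it to sum to 1.
u_column_sum = sum(row.get("u", 0) for row in fwd.values())

print(row_sums)
print(u_column_sum)
print(fwd["q"]["u"], rev["q"]["u"])
```

In this toy table, Q is always followed by U, yet only a tenth of the U's are preceded by Q, because S and T contribute far more U's. That asymmetry is the whole difference between the two graphs.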
A while back, I posted a series of charts about letter positions in English words. Nathan Yau of FlowingData was kind enough to write about it, and he suggested I look at letter proximity.
The source data is the Corpus of Historical American English (COHA); each word was analyzed and weighted by its frequency (so the “th” in “the” influenced the probability of H following T far more than the “th” in “theremin”).
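To make the weighting concrete, here is a minimal sketch of how frequency-weighted counts could be turned into next-letter probabilities. It is an illustration under my own assumptions rather than the actual pipeline: the word frequencies in `toy_freqs` are invented, the function name `transition_probabilities` is hypothetical, and I assume the end of a word is marked with "_" as in the chart.

```python
from collections import defaultdict

END = "_"  # assumed marker for "no letter follows", matching the chart

def transition_probabilities(word_freqs):
    """Return probs[current][following] = P(following | current).

    word_freqs maps each word to how often it occurs; every adjacent
    letter pair in a word is counted once per occurrence of that word.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for word, freq in word_freqs.items():
        letters = word.lower() + END
        for current, following in zip(letters, letters[1:]):
            counts[current][following] += freq

    probs = {}
    for current, row in counts.items():
        total = sum(row.values())
        probs[current] = {nxt: n / total for nxt, n in row.items()}
    return probs

# Invented frequencies, for illustration only.
toy_freqs = {"the": 1000, "theremin": 1, "quay": 5}
probs = transition_probabilities(toy_freqs)

print(probs["t"])  # what follows T
print(probs["e"])  # what follows E, dominated by the end of "the"
```

With these toy numbers, the single "theremin" barely moves the probabilities for E, while the thousand occurrences of "the" push almost all of E's probability onto the end-of-word marker.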
Here’s a GitHub repo with the code used to produce the data. After some experimentation, I found that both Plotly and Bokeh had serious drawbacks when it came to presenting heatmaps of this sort (which will presumably be addressed by later releases), so I went with Tableau Public, which took about 20 minutes, tops. Note that with this app, you can click things and hide things and have all kinds of fun.
Here’s the graph of the probabilities of letters preceding, not following, one another. There are also static graphic versions at the very end. Enjoy!