Graphing the distribution of English letters towards the beginning, middle or end of words




Some data visualizations tell you something you never knew. Others tell you things you knew, but didn't know you knew. This was the case for this visualization.

Many choices had to be made to visually present this essentially semi-quantitative data (how do you compare a 3- and a 13-letter word?). I semi-exhaustively explain everything on my other, geekier blog, prooffreaderplus, and provide the code I used; I'll just repeat the most crucial points here:

    The data is from the entire Brown corpus in the Natural Language Toolkit. It's a smaller and out-of-date corpus, but it's open source and easy to obtain. I repeated the analysis with COHA, the Corpus of Historical American English, a well-curated, proprietary data set from Brigham Young University for which I have a license, and the only differences were in rare letters like "z" or "x".
    I used a corpus rather than a dictionary so that the visualization would be weighted towards true usage. In other words, the most common word in English, "the" influences the graphs far more than, for example, "theocratic".
    The ordinate (y) scales are obviously not equal: "e" is used 100-200 times more often than "z", and while I could have fudged everything with log scales, letter frequency is not the point of the graphs. As long as I had to fudge anyway, I did so in a way that, I believe, makes it easiest to understand what the graph shows. Your mileage may, of course, vary. The color coding is a quick guide to help understanding, since letter frequency is of course relevant to the shapes you see.
    There are 15 "bins" of letter positions, as a purely qualitative comparison suggested to me this was about the ideal number to show the underlying trends without under- or overfitting. Therefore the "t" in "the" takes up positions 1 through 5, the "h" 6 through 10, etc. When letters straddle a boundary they are apportioned proportionately.
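
The binning described in the last point can be sketched roughly as follows. This is only an illustration, not the code published on prooffreaderplus: the function names letter_bin_weights and position_histograms are made up, and two details are assumptions (characters outside a-z are simply dropped, and no normalization is applied). Because the input is a list of running-text tokens rather than dictionary entries, a word like "the" contributes once per occurrence, which is the usage weighting mentioned above.

```python
from collections import defaultdict

N_BINS = 15  # number of position bins, as in the post


def letter_bin_weights(word, n_bins=N_BINS):
    """Yield (letter, bin_index, weight) triples for one word.

    Each letter covers an equal slice of the interval [0, n_bins).
    Where a slice straddles a bin boundary, its weight is split
    in proportion to the overlap with each bin.
    """
    width = n_bins / len(word)
    for i, letter in enumerate(word):
        start, end = i * width, (i + 1) * width
        b = int(start)
        while b < n_bins and b < end:
            overlap = min(end, b + 1) - max(start, b)
            if overlap > 1e-12:  # skip slivers caused by float rounding
                yield letter, b, overlap
            b += 1


def position_histograms(tokens, n_bins=N_BINS):
    """Accumulate per-letter bin weights over a list of word tokens."""
    hist = defaultdict(lambda: [0.0] * n_bins)
    for token in tokens:
        # Assumption: keep only the letters a-z, lowercased.
        word = "".join(ch for ch in token.lower() if "a" <= ch <= "z")
        if not word:
            continue
        for letter, b, weight in letter_bin_weights(word, n_bins):
            hist[letter][b] += weight
    return hist


if __name__ == "__main__":
    hist = position_histograms(["the", "of"])
    print(hist["t"])  # "t" fills the first five bins
    print(hist["o"])  # "o" fills seven bins and half of the eighth
```

With a real corpus, the token list would come from something like the Brown corpus word list; the toy call at the bottom just shows the three-letter and two-letter cases.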

Now then: I became curious about how letters are placed in English while doing many different, often quick, sometimes pointless, pattern analyses of letters for a wide variety of reasons. (One example: for one art project that will hopefully be posted on this blog one day, I found all the anagrams of "Hollywood", and noticed that words beginning with "w" were overrepresented.)

I've had many "oh, yeah" moments looking over the graphs. For example, words almost never begin with "x", but it's quite common as the second letter. There's a little hump near the beginning of "u" that's caused by its proximity to "q", which is most common at the beginning of a word. When you remove "q" from the dataset, the hump disappears. "F" occurs toward the extremes, especially in prepositions ("for", "from", "of", "off") but rarely just before the middle.

A final thought: the most common word in the English language is "the", which makes up about 6% of most corpuses (sorry, corpora). But according to these graphs, the most representative word is "toe".
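
One way to read "most representative" is: take the letter with the most total weight in the first, middle and last third of the bins. A minimal sketch of that reading is below; the function name top_letter_per_third is hypothetical, and the numbers in the toy table are invented purely to show the mechanics, not taken from the Brown corpus.

```python
def top_letter_per_third(histograms, n_bins=15):
    """Return the heaviest letter in each third of the position bins.

    `histograms` maps each letter to a list of n_bins weights.
    """
    third = n_bins // 3
    winners = []
    for k in range(3):
        lo, hi = k * third, (k + 1) * third
        totals = {
            letter: sum(bins[lo:hi]) for letter, bins in histograms.items()
        }
        winners.append(max(totals, key=totals.get))
    return "".join(winners)


# Invented weights for illustration only (not real corpus counts).
toy = {
    "t": [3] * 5 + [1] * 5 + [2] * 5,
    "o": [1] * 5 + [3] * 5 + [1] * 5,
    "e": [1] * 5 + [2] * 5 + [4] * 5,
}

if __name__ == "__main__":
    print(top_letter_per_third(toy))
```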

36 comments:

  1. Cool! Is plural taken into account? My English lexicon isn't that of a native speaker, but the tendency of the 's' to increase at the end makes me think that words in plural are included.

    I wonder what this would look like with 'k' in Spanish; it's not common. I think mostly in the beginning, whereas 'ñ' in the center.

  2. Indeed, plurals and endings like -ed and -ing are taken into account. The source data, the Brown corpus, is a collection of books, news articles and other texts, all in complete sentences with no alteration.

    It would be pretty easy to adapt the code to do the same calculations with other languages; I'd just have to check how it handles accented characters like ñ, which I assume one would want to analyze separately from "n".

  3. Love this!

    I wonder if you could use this to "predict" all the English words that haven't been invented yet. For instance, working out three-letter words, select the common first letters, the common middle letters and the common end letters, find all the combinations of those, exclude existing words, and ensure you include a vowel. You could even predict the probability of the word being used based on the combined popularity of the letters you use. (I thought I'd just discovered "noy" and "pon" using this method, but it turns out someone beat me to it.)

    Replies
    1. Unfortunately, this wouldn't provide much useful information, for a couple of reasons.

      The first is that the list of all possible English words doesn't tell us much about which ones will be adopted and used commonly. The space of all possible English words is very, very large, but the number of words that people will hear, consider appropriate for concepts they actually need to use, adopt, spread socially and so forth isn't constrained solely by the likelihood of the letters - compare "toothsome", an infrequently used word, to "delicious", a frequently used synonym, or "gourmand" to its synonym "lickerish". Loanwords often become more popular than the English words that fit our letter distribution.

      Which leads to another point - language change. Let's say you create a list of all possible English words; in this list, the word Xzybn doesn't feature, because English doesn't permit those consonant clusters. But if English borrows three or four foreign words that allow them, and those become popular words, the rules of English will change and people will consider those clusters more acceptable. In a decade, it's possible that some of the impossible English words will be possible, and possible that some of the possible words will be impossible; imagine trying to extend those predictions to the entire future of the language!

  4. J, Q & W are the most front-end-loaded letters.

  5. What is the longest word that has letters only within 1 position of its most frequently occurring location?

  6. Might it be caused by the history of the English language? In Italian, most words end in vowels because the final letters were dropped in the passage from the ancient Latin word to the medieval word in primitive Italian. The corresponding ancient Latin word determined, most of the time, which vowel appears in the modern word.
    I don't want to say this work is pointless, but I can't actually see here any reference to something like this:
    http://www.englishproject.org/resources/development-english-grammar

  7. How about the phonemic distribution, along with phonemic spelling distribution?

    Replies
    1. Corpora of phonemic data aren't as easy to come by, partly because the segmentation into phonemes varies by accent. As an example, "car" is three phonemes in a rhotic accent (such as most North American accents) and two in a non-rhotic one (such as most British accents): [kɑɹ] and [kɑː]. In extreme cases, there are even extra phonemes - Received Pronunciation has 20 vowels, but there are only 16 in General American English, while the Scots have a voiceless velar fricative - sort of a raspy, throaty "k" sound - that doesn't appear in any other dialects.

      And it gets worse - when you pronounce a word in English, you stress some syllables and don't stress others. The vowels in unstressed syllables are mostly reduced to the schwa - a sort of neutral vowel sound, the last "uh" sound in a normal pronunciation of "America". Schwa ends up being the most common vowel in spoken English, but stress patterns vary - when people are angry, they tend to stress even the unstressed words. Plus, "I" rarely reduces to schwa, the reduction is less noticeable for long vowels like u, and the speed of pronunciation and the letters on either side can influence whether a vowel is rendered as schwa or not. How do you chart that?

      Basically, we can't even agree on how many phonemes there are in English, so we don't have a good corpus of this data on which to perform analysis.

  8. I'm not entirely sure what you mean by "most representative word". But if you take those 26 distributions and try to find a "sum" that's "nearly flat" that might be what you mean. Eyeballing, it looks as if maybe "pie" would be pretty flat as well. Is that roughly what you mean for "toe"? It would be amusing to run this over a dictionary and find the "flattest" and "least flat" words. Maybe even those words that start high and end low, and are nearly nice slopes. That would be a word that starts surprisingly -- perhaps with an X -- and ends predictably. Hmm. But a flat word might still be surprising at every letter.

  9. I meant "toe" has the most common English letters in the beginning, middle and end of words, t, o and e respectively.

  10. This comment has been removed by the author.

  11. I'm a typeface designer, and this is highly useful! Would it be possible to show different results based on lowercase versus uppercase?

  12. This comment has been removed by the author.

  13. I absolutely love this project. Thank you so much for sharing it! I am a bit confused about the individual graphs and what they represent. If you were only looking at beginning, middle, and end, then there would be 3 data points horizontally for each letter, but that is not the case in your graphs. There is much more variation. What does the graph actually tell us (in more detail)? If my question is confusing, forgive me and let me know. I'm really interested in learning your process...

    Replies
    1. Basically, I made 15 "bins", so that the three letters of "and", for example, would contribute to bins 1-5, 6-10 and 11-15, whereas a two-letter word would be bins 1-7 and half of 8, then half of 8 and 9 to 15, etc. You can read more about my methodology at http://prooffreaderplus.blogspot.ca/2014/05/methodology-and-analysis-of-letter.html, and if you have any questions I'd be happy to answer them.

  14. Thank you so much for your reply, David. I am honestly going to read through your methodology. As someone interested in social linguistics, I often forget how fun the more "mathematical" side of language can be.

  15. It would be so fascinating to do this but with IPA translation. I'm sure that data is not available in sufficient amounts, but it could tell us so much about prosody, and then . Especially if compared to another language. I'm starting to geek out. Look what you've done, David Taylor!

  16. I'm glad! Although most of the response to this effort has been positive, there were a few who seemed to think there was no point to what I'd done. I know it's not earth-shattering, but I find it terribly interesting regardless. Others have suggested doing a phonetic version; I'm looking into it, and there are phonetic dictionaries and algorithms one could use (dialect-dependent, of course). Computational linguistics is fascinating, but there's a lot to learn for someone who only took three linguistics classes 20 years ago!

    Replies
    1. Philistines! Of course this data is relevant and important. Especially with an IPA analysis, this project could help bridge gaps between social linguists and deep structure theory. Do you have a mailing list or a place I could be contacted should you do another project?

    2. You're giving me the encouragement I need! I would suggest following my Twitter, Facebook or RSS; there are links on the top right of the page.

  17. Although "toe" is representative (each letter standing in as an ambassador for its position), aren't "joy" or "bod" even more so?

    Replies
    1. That's a good point; it depends on what one means by "representative", i.e. what it's "representing". For "toe" it's the most common letter overall at each of the three main positions (because the overall areas are dark red); for "joy" and "bod" it's the letters that have the most skewed distribution towards each position.

    2. Ahh... I got it. That's the weighting at play. Same reason "the" carries more weight than "thee" (as footnoted). Thanks for the clarification.

  18. Great visualization!! This was posted as the "Viz o' the Week" on our company's social media "Data Visualization" group's page.

    I just sent this link to Alena Graedon, author of "The Word Exchange", a novel set in the not-too-distant future, in which the forecasted "death of print" has become a reality, and most people rely on their electronic devices so much that they become unable to carry on a conversation without them. I thought she would enjoy this viz. I think you might enjoy reading her book. I just finished it, and it was quite thought-provoking, and a good read.

    Replies
    1. Thanks for the recommendation; I'm always looking for good books to read! When I googled Ms. Graedon, one of the results compared her favorably to Samuel R. Delany, and that pretty much sold it for me.

  19. Wonderful! The last few years I've been playing online word games, and I've noticed some of these patterns. "ng", for example, is almost always at the end of a word, and that is shown in the graph for G.

  20. Thanks for your research. It would be interesting to compare your results with those obtained from corpora of other English variants (e.g., British English).

  21. It might be interesting to compare the general English language corpus with some selected corpora. For example, since the corpus of all Shakespeare is public domain, did Shakespeare use words that were distinctively different? What about the King James Bible, compared against the general corpus or against new Bibles? I suspect there are corpora specific to individual centuries, and one could see how the language has changed in this respect.

  22. Hi,
    Just so you know, after reading your article, I did a similar graph representation of the position of letters in French (my language).
    You can see the results here (in French of course). To see the differences, I reused your graph and added my results as a transparent layer over it.
    http://sansdeconner.net/frequence-de-position-des-lettres-dans-les-mots/

    Replies
    1. That's very interesting, thank you for sharing! C'est super intéressant, merci de l'avoir partagé! Moi aussi j'ai eu l'instinct de comparer différentes langues

  23. For people who are interested in such things, there is a Carnegie-Mellon pronunciation dictionary with phonemes and stresses: http://www.speech.cs.cmu.edu/cgi-bin/cmudict , including (IIRC) alternates. One could presumably "translate" a corpus into a phonemic representation.

  24. Wouldn't "bud" be more representative than "toe"?

    Replies
    1. Never mind... I see your definition includes "most common" letters.

  25. Very interesting work, David! And thanks to Toenail for the further development in French! Could anyone kindly do the same for Italian? I dabble in theatrical improvisation and use a Keith Johnstone technique to elicit fantastic stories from an unsuspecting audience by getting them to ask me yes/no/maybe questions about a story. If the last word in the question ends with a consonant, I answer yes; if with a vowel, no; if with a Y, maybe. This works in English but is lousy in Italian. Any chance of a letter-by-letter breakdown for me to use as rules for my answers in Italian?



Copyright © 2012 prooffreader.com