EDIT 2014-07-08: Sometimes the readers are just smarter than me! The original graphic (which you can see at the end of the post) had the letters in rows; I got six e-mails in the first few hours suggesting it would be a lot easier to compare across languages if they were in columns. When you’re right, you’re right.
My May 27 blog post on the distribution of letters in English toward the beginning, middle and end of words seemed well received; it generated quite a few compliments, and not a few requests to do the same for other languages. One reader was even inspired to do a similar project in French.
Since I already had the code, I thought, why not? The only problem was getting my hands on a corpus. You can read about my adventures in this regard, as well as some more esoteric analysis of this data set, on my other, geekier blog; suffice it to say I was quite fortunate to find the Europarl Parallel Corpus, a collection of proceedings of the European Parliament with simultaneous translations in twenty languages. Since every language covers the same subject matter, any differences we see are more likely to be due to the language itself than to differences in the corpus.
I chose the seven languages with the most speakers in the European Parliament, plus Finnish because I thought it would be interesting to have a non-Indo-European language to compare as well.
Note that characters outside of the Basic Latin Unicode block* (accents, digraphs, etc.) are aggregated with their non-ornamented versions; this is not ideal, since they’re not at all interchangeable (if you ask someone where a congrès is in Paris without the accent, they’ll point you towards a seafood shop), but it’s really the only way we can make the datasets comparable in this limited and hardly scholarly context.
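For the curious, here is a minimal sketch of what that kind of aggregation can look like in Python. It is not the code behind the chart; the function name `fold_to_basic_latin` and the small `EXTRA_FOLDS` table are my own illustrative assumptions. The idea is to decompose each character with Unicode normalization, drop the combining accents, and keep only a to z. A few letters (Polish "ł", German "ß") have no decomposition at all, so they need a hand-written fallback.

```python
import unicodedata

# Hypothetical fallback table: these letters have no Unicode decomposition,
# so normalization alone would silently drop them.
EXTRA_FOLDS = {"ł": "l", "ø": "o", "đ": "d", "ß": "ss", "æ": "ae", "œ": "oe"}

def fold_to_basic_latin(word: str) -> str:
    """Lower-case a word and reduce it to the letters a-z."""
    out = []
    for ch in word.lower():
        ch = EXTRA_FOLDS.get(ch, ch)
        # NFKD splits "è" into "e" plus a combining grave accent.
        for base in unicodedata.normalize("NFKD", ch):
            # Skip combining marks and anything that is not a plain letter.
            if not unicodedata.combining(base) and "a" <= base <= "z":
                out.append(base)
    return "".join(out)
```

With this sketch, `fold_to_basic_latin("congrès")` gives `"congres"`, which is exactly the lossy merge described above.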
To determine the shape of the individual graphs, I followed the same methodology as last time, outlined here.
There are a lot of interesting features in the chart; I could stare at the thing for hours (and I have... well, 45 minutes, anyway). I’ll just name a few here:
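The earlier post has the real details. Purely as an illustration of the general idea, and assuming a simple binning scheme that may not match what I actually did, counting where each letter falls within words could be sketched like this (the function name `positional_histogram` and the default of 15 bins are made up for the example):

```python
from collections import Counter, defaultdict

def positional_histogram(words, bins=15):
    """Count, per letter, how often it falls in each relative-position bin."""
    hist = defaultdict(Counter)
    for word in words:
        n = len(word)
        for i, letter in enumerate(word):
            # Relative position in [0, 1]; a one-letter word counts as the middle.
            rel = i / (n - 1) if n > 1 else 0.5
            # Clamp so the last letter (rel == 1.0) lands in the final bin.
            b = min(int(rel * bins), bins - 1)
            hist[letter][b] += 1
    return hist
```

Each letter's counter can then be normalized and drawn as one small graph. Note that in this sketch every occurrence of a word counts once, so feeding it a whole corpus weights common words more heavily than feeding it a list of distinct words would.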
|Vowels||Unsurprisingly, Spanish, Italian and Portuguese have the vowels “a” and “o” shifted towards the end of words, plus “e” and “i” in Italian. Non mi credi? È vero!|
|Foreign letters||Some languages use certain letters only in foreign or borrowed words (French, Spanish and Portuguese: k, w; German: x, y*; Italian: j, k, w, x, y; Polish: q, v, x; and Finnish: b, c, f, q, w, x, z). Look at the effect on “q”: Finnish is the only language not to have it towards the beginning, because the most popular words in this corpus are foreign names like “Jacques”, without grammar words like “que”.|
|D||English is alone in having the “d” most commonly* at the end of words, thanks to the past tense. Who’d have guessed? Well, anyone who thought about it, I suppose.|
|H||To my eye, “h” shows the most difference in distribution patterns: the most representative words are the, chaque, nicht, ha, che, senhor, tych and puhemies.|
|L||Because of their articles, French and Spanish’s “l”s are much more front-heavy. Le phénomène, es la verdad. Similarly, due to their grammar, German and Finnish’s “n”s are towards the end.|
|P||“P” is mostly* at the beginning of words in all eight languages; due to its source, this particular corpus has a preponderance of cognates of president, parliament and political, but even removing these words leaves the phenomenon intact. This makes intuitive sense to me (n.b. IANAL — I am not a linguist); as a bilabial plosive, it’s awkward at the end of words; consider the word “pep”. The same phenomenon occurs to a lesser extent for p’s voiced analogue, the letter “b”. [relevant Sesame Street]|
|W, Y, Z||Letter frequencies aren’t the point of this chart, but they’re interesting in their own right. Look how much more often “w”, “y” and “z” are used in Polish than in any other language (the word for “all” is “wszystkich”), and I never would have guessed “f” to be such a particularly English letter (though it’s obvious in retrospect, given that among the most common words are “of” and “from”).|
Here’s the original graphic:
* Thanks to readers for pointing out that my quickly googled source on German orthography was quite wrong, and that my phrasing seemed to be saying that only English ever had a d as the last letter of a word (which would, of course, not be la verdad). Also, since I deal with Unicode and UTF-8 difficulties every day, I forgot that some might be confused that I seemed to be claiming the Latin language had 26 letters.