Monday, July 7, 2014

Comparison of letter positions in eight languages



EDIT 2014-07-08: Sometimes the readers are just smarter than me! The original graphic (which you can see at the end of the post) had the letters in rows; I got six e-mails in the first few hours suggesting it would be a lot easier to compare across languages if they were in columns. When you're right, you're right.

My May 27 blog post on the distribution of letters toward the beginning, middle and end of English words seemed well received: it generated quite a few compliments, and not a few requests to do the same for other languages. One reader was even inspired to do a similar project in French.

Since I already had the code, I thought, why not? The only problem was getting my hands on a corpus. You can read about my adventures in this regard, as well as some more esoteric analysis of this data set, on my other, geekier blog. Suffice it to say I was quite fortunate to find the Europarl Parallel Corpus, a collection of proceedings of the European Parliament with simultaneous translations in twenty languages. Since every language covers the same subject matter, we're maximizing the chances that any differences we see are actually due to the language, not to differences in the corpus.

I chose the seven languages with the most speakers in the European Parliament, plus Finnish because I thought it would be interesting to have a non-Indo-European language to compare as well.

Note that characters outside of the Basic Latin Unicode block* (accents, digraphs, etc.) are aggregated with their non-ornamented versions; this is not ideal, since they're not at all interchangeable (if you ask someone where a congrès is in Paris without the accent, they'll point you towards a seafood shop), but it's really the only way we can make the datasets comparable in this limited and hardly scholarly context.
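
For the curious, here is a minimal sketch of one way such folding could be done in Python. It is not the code used for the chart; the function name `fold_to_basic_latin` and the hand-picked `EXTRA_FOLDS` table are assumptions for illustration. Letters such as "ł" or "ß" have no Unicode decomposition, so they need an explicit mapping.

```python
import unicodedata

# Hypothetical, hand-picked folds for letters that NFKD cannot decompose.
EXTRA_FOLDS = str.maketrans({"ł": "l", "ß": "ss", "ø": "o", "æ": "ae", "œ": "oe"})

def fold_to_basic_latin(text):
    """Lower-case text and fold accented letters onto their plain versions."""
    lowered = text.lower().translate(EXTRA_FOLDS)
    # NFKD splits e.g. "è" into "e" followed by a combining grave accent.
    decomposed = unicodedata.normalize("NFKD", lowered)
    # Drop the combining marks, keeping only the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold_to_basic_latin("Congrès"))
```

With this folding, "congrès" and "congres" are counted as the same word, which is exactly the loss of information described above.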

Note that to determine the shape of the individual graphs I followed the same methodology as last time, outlined here.
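
The linked methodology is not reproduced here, so the following is only a rough sketch of the general idea, under my own assumptions: each letter occurrence is placed at the midpoint of its slot in the word (0 = start, 1 = end) and counted into a fixed number of bins. The function name `letter_position_histograms` and the default of five bins are hypothetical choices, not taken from the original analysis.

```python
from collections import defaultdict

def letter_position_histograms(words, bins=5):
    """Map each letter a-z to a list of `bins` counts over relative word position."""
    hist = defaultdict(lambda: [0] * bins)
    for word in words:
        n = len(word)
        for i, ch in enumerate(word):
            if not ("a" <= ch <= "z"):
                continue  # assumes input was already lower-cased and folded
            # Midpoint of the letter's slot, strictly between 0 and 1.
            rel = (i + 0.5) / n
            hist[ch][min(int(rel * bins), bins - 1)] += 1
    return dict(hist)

sample = ["the", "parliament", "voted"]
print(letter_position_histograms(sample, bins=3)["t"])
```

Passing every token of a corpus (rather than each distinct word once) weights common words more heavily, which is one of several design choices a real analysis would have to settle.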

There are a lot of interesting features in the chart; I could stare at the thing for hours (and I have... well, 45 minutes, anyway). I'll just name a few here:

Vowels: Unsurprisingly, Spanish, Italian and Portuguese have the vowels "a" and "o" shifted towards the end of words, plus "e" and "i" in Italian. Non mi credi? È vero!
Foreign letters: Some languages use certain letters only in foreign or borrowed words (French, Spanish and Portuguese: k, w; German: x, y;* Italian: j, k, w, x, y; Polish: q, v, x; and Finnish: b, c, f, q, w, x, z). Look at the effect on "q": Finnish is the only language not to have it towards the beginning, because the most common Finnish words containing it in this corpus are foreign names like "Jacques", with no grammar words like "que".
D: English is alone in having "d" most commonly* at the end of words, thanks to the past tense. Who'd have guessed? Well, anyone who thought about it, I suppose.
H: To my eye, "h" shows the most difference in distribution patterns: the most representative words are the, chaque, nicht, ha, che, senhor, tych and puhemies.
L: Because of their articles, French and Spanish have much more front-heavy "l"s. Le phénomène, es la verdad. Similarly, due to their grammar, German and Finnish have their "n"s towards the end.
P: "P" is mostly* at the beginning of words in all eight languages; due to its source, this particular corpus has a preponderance of cognates of president, parliament and political, but even removing these words leaves the phenomenon intact. This makes intuitive sense to me (n.b. IANAL -- I am not a linguist); as a bilabial plosive, it's awkward at the end of words; consider the word "pep". The same phenomenon occurs to a lesser extent for p's voiced analogue, the letter "b". [relevant Sesame Street]
W, Y, Z: Letter frequencies aren't the point of this chart, but they're interesting in their own right. Look how much more often "w", "y" and "z" are used in Polish than in any other language (the word for "all" is "wszystkich"), and I never would have guessed "f" to be such a particularly English letter (though it's obvious in retrospect, given that among the most common words are "of" and "from").

Here's the original graphic:
* Thanks to readers for pointing out that my quickly googled source on German orthography was quite wrong, and that my phrasing seemed to be saying that only English ever had a d as the last letter of a word (which would, of course, not be la verdad). Also, since I deal with Unicode and UTF-8 difficulties every day, I forgot that some might be confused that I seemed to be claiming the Latin language had 26 letters.

4 comments:

  1. In French, we have a whole class of verbs such as "attendre", "tendre" that finish with a "d" as in:
    il attend, il tend...

  2. Spanish also has words that end in d. The first set, imperatives directed at the informal second person plural (vosotros, as opposed to ustedes), are probably going to be rare encounters in a corpus derived from parliamentary texts; I imagine any imperatives you encounter will be more formal. The others are words like ciudad (city), salud (health), etc., but they aren't terribly numerous.

    I think it might be fun to compare a random generator based on this method with some based on Markov chains to see what gets the best regional flavor for randomly generated words.

  3. The D in German is pretty extreme (der, die, das, and to a lesser extent dem, den, des).

  4. Did you think about applying this knowledge to create a language detector? For each language, one could instantiate a 26-dimensional (at least) "language" vector with the mean position values of the letters. Then, for any given text of unknown language, create its vector and compare it to each "language" vector. The cosine measure can be used to compute pairwise similarities (sim → 1 means the 2 vectors are similar). Then you pick the languages with the highest cosine values: these are the most likely languages to be found in the input text :-)

    I'd be interested in comparing this statistical method to other lexicon-oriented / machine learning methods (e.g., https://code.google.com/p/language-detection).

    http://en.wikipedia.org/wiki/Language_identification

    Cheers,

    Guillaume

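
The detector idea in the comment above can be sketched roughly as follows. This is an untested-on-real-data illustration, not something from the original post: the function names, the choice of (i + 0.5) / n as a letter's relative position, and the 0.5 default for unseen letters are all assumptions.

```python
import math
import string

LETTERS = string.ascii_lowercase

def mean_position_vector(words):
    """26-dim vector of each letter's mean relative position (0 = start, 1 = end)."""
    totals = dict.fromkeys(LETTERS, 0.0)
    counts = dict.fromkeys(LETTERS, 0)
    for word in words:
        n = len(word)
        for i, ch in enumerate(word):
            if ch in totals:
                totals[ch] += (i + 0.5) / n
                counts[ch] += 1
    # Assumption: a letter never seen is treated as sitting mid-word (0.5).
    return [totals[c] / counts[c] if counts[c] else 0.5 for c in LETTERS]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_languages(text_words, profiles):
    """Return (language, similarity) pairs, most similar profile first."""
    target = mean_position_vector(text_words)
    scores = {lang: cosine(target, vec) for lang, vec in profiles.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy profiles built from single sentences; a real profile would use a large corpus.
en = "the president of the parliament voted and asked for the report".split()
es = "la presidenta del parlamento ha votado y pedido el informe".split()
profiles = {"en": mean_position_vector(en), "es": mean_position_vector(es)}
print(rank_languages("the report of the president".split(), profiles))
```

One caveat worth keeping in mind: since every component lies between 0 and 1, all the cosines will tend to be close to 1, so subtracting 0.5 from each component before comparing might separate languages better.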

