Yesterday as of this writing (late May 2015), the U.S. Office of the Director of National Intelligence released 103 documents written by Osama bin Laden, captured during the raid on Abbottabad. Of course, my first thought was, “Great, a corpus!”
Bearing in mind that these are translations, here’s my preliminary text analysis of the infodump. If you want to see how I did the analysis, check out this IPython notebook.
First of all, to get a taste of the flavor of the whole collection, here’s some randomly generated text based on it using a Markov chain:
The secretary of Muslims, but it the new environments at your news, always to take statements from our command of the issue is no harm, or is preferable to convince any operation must take Hamzah arrives here, such expression without an exhortation and all the world. This should be using such as a documentary about Ibn ‘Abbas… In the list you and so that viewed it comes from God. We, thank you and those companions and reads and ready to live in Gaza? Connect them all that he returned to spread the hearts [of the women who may respond to hearing them. He is what it is established system. Usually, these conditions, even with it. He might take me to give support such that it in the path for an end is a dangerous and they are you wish I am just a brother to be upon him very important of the family and that rose up the frontiers area (Islamic Maghreb); so were the reality of those who follows this great hypocrite. I am blessed with his companions… Furthermore, To the world as in missing the tenth anniversary in Algeria more religious issues.
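For anyone curious how this kind of gibberish gets made, here's a minimal sketch of a word-level Markov chain in plain Python. It is an illustration of the general technique rather than the notebook's actual code: the function names `build_chain` and `generate`, the `order` parameter, and the restart-on-dead-end behavior are all assumptions made for this example.

```python
import random
from collections import defaultdict


def build_chain(text, order=1):
    """Map each run of `order` consecutive words to the words seen right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return chain


def generate(chain, length=50, seed=None):
    """Walk the chain at random, emitting roughly `length` words."""
    rng = random.Random(seed)
    states = list(chain.keys())
    state = rng.choice(states)
    output = list(state)
    for _ in range(length - len(state)):
        followers = chain.get(state)
        if not followers:
            # Dead end (a state only seen at the very end of the text):
            # jump to a random known state and carry on from there.
            state = rng.choice(states)
            followers = chain[state]
        next_word = rng.choice(followers)
        output.append(next_word)
        # Slide the window forward by one word.
        state = state[1:] + (next_word,)
    return " ".join(output)
```

Calling `generate(build_chain(corpus_text), length=200)` on the concatenated documents would give output in the same spirit as the paragraph above. With `order=1` each word depends only on the one before it, which is why the result is locally plausible and globally nonsense; a higher `order` produces more coherent text but starts copying the source verbatim.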
Now the word clouds. Everybody loves word clouds. Except for the people who hate word clouds because they’re semi-quantitative at best. Well, you can’t have everything.
Here’s the word cloud of the vocabulary in all of the documents:
Unsurprisingly, OBL uses a lot of religious terms; ‘god’ is used so frequently that it shows up in most of the top results when we look at bigrams (two-word phrases):
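Counting bigrams takes only a few lines. The sketch below is one simple way to do it and makes some assumptions of its own: the helper name `top_bigrams` is made up for this example, the tokenizer is a crude regex, and no stop words are removed, so real output would be dominated by pairs like "of the" unless you filter them first.

```python
import re
from collections import Counter


def top_bigrams(text, n=10):
    """Return the n most common pairs of adjacent words, with their counts."""
    # Lowercase and keep runs of letters and apostrophes (straight or curly).
    words = re.findall(r"[a-z'’]+", text.lower())
    # Pair every word with the one that follows it. This ignores sentence
    # boundaries, so the last word of one sentence pairs with the first
    # word of the next.
    pairs = zip(words, words[1:])
    return Counter(pairs).most_common(n)
```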
Most of the documents are short letters, but a couple of them are really long (about 20 pages, double spaced). The two long ones are “A Letter to the Sunnah People in Syria” and “Letter to Shaykh Abu Abdallah dated 17 July 2010”.
The total length of the correspondence is 74,908 words. Here’s how that length compares to some well-known novels:
I also estimated the reading level of the (and I stress, translated) text using the Flesch-Kincaid formula. Take it with a grain of salt, since it sometimes gives weird results, like scoring Shakespeare below children’s books:
It’s about as hard to read as Mark Twain, so it’s got that going for it, which is nice.
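The Flesch-Kincaid grade level is just arithmetic on three counts: 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. Here's a rough sketch of it. The formula is the standard one, but the syllable counter is a crude vowel-group heuristic invented for this example (real tools use pronunciation dictionaries or better rules), and the function names are assumptions, so expect the numbers to differ a bit from other implementations.

```python
import re


def count_syllables(word):
    """Very rough syllable estimate: count groups of consecutive vowels."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent ("made"), except in "-le" words ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level; assumes text has at least one word."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)
```

Because the score only looks at sentence length and syllable counts, short punchy lines full of short words score low no matter how archaic the vocabulary, which goes some way toward explaining the Shakespeare weirdness.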
I also did some quick topic modelling using the NMF (non-negative matrix factorization) algorithm, which finds sets of words that tend to occur together and uses them to separate the 103 documents into similar groups, or clusters. Here are the results, in no particular order; the topics were named by me, based on nothing but intuition.
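To make the NMF idea concrete, here's a toy sketch in pure Python. In practice you would reach for a library and a TF-IDF matrix; this version uses the classic multiplicative update rules on a tiny made-up word-count matrix so that it runs with no dependencies. Everything in it (the function names, the six-word vocabulary, the four fake documents) is invented for illustration and has nothing to do with the real corpus.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factor v (docs x terms) into w (docs x k) times h (k x terms), all non-negative."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n_docs)]
    h = [[rng.random() + 0.1 for _ in range(n_terms)] for _ in range(k)]
    for _ in range(iterations):
        # Update topic-word weights: h <- h * (w^T v) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_terms)]
             for i in range(k)]
        # Update document-topic weights: w <- w * (v h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n_docs)]
    return w, h


def top_words(h, vocab, n=3):
    """For each topic, list the n words with the largest weights."""
    return [[vocab[j] for j in sorted(range(len(vocab)), key=lambda j: -row[j])[:n]]
            for row in h]


# Made-up data: two "religious" documents and two "financial" ones.
vocab = ["god", "prayer", "mercy", "money", "transfer", "euros"]
counts = [
    [3, 2, 1, 0, 0, 0],
    [2, 3, 2, 0, 0, 0],
    [0, 0, 0, 3, 2, 1],
    [0, 0, 0, 2, 2, 3],
]
w, h = nmf(counts, k=2)
# Each document's cluster is the topic with the largest weight in its row of w.
dominant = [max(range(2), key=lambda t: row[t]) for row in w]
topics = top_words(h, vocab)
```

The rows of `h` are the topics (a weight for every word), and the rows of `w` say how much of each topic a document contains. Naming the topics is still a human job: the algorithm hands you word lists, not labels.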
Word clouds made with Tagxedo