Monday, September 30, 2013

Word Cloud of Animal Farm and Nineteen Eighty-Four by George Orwell

I decided to post two at once by the same author this week. Warning: the Nineteen Eighty-Four word cloud is an animated GIF; if it hurts your eyes or your brain, scroll further down the page and you'll see a non-animated version.

I made up my own design instead of using an existing book cover, because I found none that suggested themselves to me for a word cloud. The texture and colors are supposed to be reminiscent of a communist flag, and hopefully it's obvious that the cloud is in the form of a pig's head.

Once the trivial words are removed, the list contains mostly character names (Napoleon drives away Snowball, and thus his name is mentioned more often), types of animals (pigs, hens) and farm features (windmill, barn). The words that remain are very evocative of the character of the book, such as comrade, orders, rebellion and commandments.

This is a short book, at about 30,000 words. There are 3,900 distinct words, but once the Porter algorithm is used to identify word stems (so that "thinking", "thinks" and "think" are counted together, but "thought" is not), the number drops to 2,900. The algorithm is imperfect, but it produces about as many false positives as false negatives, and is more than reasonably accurate.
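The idea of counting stems rather than words can be sketched roughly as follows. This is only an illustration, not the Porter algorithm itself: `crude_stem` is a hypothetical stand-in that strips a handful of suffixes, and the sample sentence is invented.

```python
import re

# Hypothetical stand-in for a real stemmer: strips a few common suffixes.
# The actual Porter algorithm applies many more rules than this.
SUFFIXES = ("ing", "ed", "s")

def crude_stem(word):
    for suffix in SUFFIXES:
        # Keep at least three letters so short words like "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def count_words_and_stems(text):
    """Return (number of distinct words, number of distinct stems)."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    stems = {crude_stem(w) for w in words}
    return len(words), len(stems)

sample = "He thinks she is thinking what I think, but I thought otherwise."
print(count_words_and_stems(sample))  # (11, 9)
```

Here "thinks", "thinking" and "think" collapse into one stem while "thought" stays separate, which is the same behavior described above.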

The title occurs 43 times, and the book has an oft-recurring six-word phrase: "four legs good, two legs bad" appears 13 times. For about 0.25% of all words in a book to be the same six-word phrase is highly unusual, especially without any of the words being as common as "the" or "and".
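The percentage above is simple arithmetic, and the counting behind it can be sketched like this. The function name `phrase_share` and the tokenizing rule are my own assumptions for illustration, not the concordance software's actual method.

```python
import re

def phrase_share(text, phrase):
    """Return (occurrences, fraction of all words taken up by the phrase)."""
    words = re.findall(r"[a-z']+", text.lower())
    target = re.findall(r"[a-z']+", phrase.lower())
    n = len(target)
    # Slide a window of n words across the text and compare to the phrase.
    hits = sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == target)
    return hits, hits * n / len(words)

# Figures quoted in the post: 13 occurrences of a six-word phrase
# in a book of roughly 30,000 words.
print(f"{13 * 6 / 30_000:.2%}")  # 0.26%
```

With the rounded book length of 30,000 words this comes out at 0.26%, in line with the roughly 0.25% quoted.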

The words used only once are similar in character to the most common words: the personal (Simmonds, Caesar), pastoral (matchwood, piebald, stockbreeder, windowsill) and atmospheric (tunefully, conciliatory, bloodshed).

Non-animated version:

Again, I could not find a book cover or movie poster that spoke to me for a word cloud. I was fooling around with some typography and came up with the following:

I wasn't sure it was clear that the background of the word cloud was supposed to be static, so I animated it.

The protagonist's name, Winston Smith, features heavily (526 times), as does that of his antagonist, O'Brien (205). Julia's name appears less than half as often as O'Brien's (100), but to be fair Winston doesn't learn it until partway through the book.

Nineteen Eighty-Four is famous for the introduction of Newspeak, but those words appear relatively rarely: doublethink is used 30 times, compared to telescreen at 92 and Oceania at 60. Speakwrite and thoughtcrime are used 13 and 11 times, respectively.

As in Animal Farm, there is a long, repeated phrase: "Oranges and lemons, say the bells of St. Clement's" appears eight times -- exactly as often as the organization of which Julia is a member, the "junior anti-sex league".

Two-word phrases are far more common, such as the culturally iconic "Big Brother" (78 times). The word "party" appears 70 times in recurring 2-grams such as "party member" and "inner party".

"Two and two make five" appears three times.

Among the words used only once are bastards, homosexuality, tribunal, silk, monopoly, romantic and intercourse.

The book is 100,000 words long; there are 8,500 individual words and 5,700 unique word stems determined by the Porter algorithm.

Word cloud created using Tagxedo.

Monday, September 23, 2013

Word Cloud of A Clockwork Orange by Anthony Burgess

Practice makes perfect; this is a replacement for the word cloud of this book I made two weeks ago, and I think it's much better looking. It is still based on the original, iconic Penguin cover:

A Clockwork Orange makes for an interesting word cloud, because so many of the words are in Nadsat; six of the top ten non-trivial words are in Anthony Burgess's invented Russian/Cockney youth dialect, including veck ("guy", from Russian chelovek), viddy ("to see") and horrorshow (an Anglicization of the Russian khorosho, "good").

The most common non-trivial word is "brother(s)"; it occurs 259 times, 86 of them as the phrase "O my brothers," which is how the sociopathic narrator, Alex, addresses the reader. Two memorable phrases from the book and the Kubrick movie are "ultra-violence" and "Ludwig van", which occur relatively few times: 15 and 12, respectively.

The title (which is never uttered in the movie) appears nine times, since it's the title of a book within the book; in chapter two Alex mockingly reads it aloud before he and his gang rape the author's wife to death.

Alex uses the misspelling "heighth" 21 times, but gets it right twice, all in similar contexts.

About half of the words in the book are only used once, an exceptionally high proportion. Many of these are Nadsat, which is an interesting challenge, since the reader has to understand the meaning from context. Some are easy derivatives of English words ("chickiwick", "clopclopclop"); others are more challenging ("choodessny", "oobivat").
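For anyone curious how the share of once-used words might be measured, here is a minimal sketch. It assumes that "half of the words" means half of the distinct words; the function name and the sample text are invented for illustration.

```python
import re
from collections import Counter

def once_used_share(text):
    """Fraction of distinct words that occur exactly once in the text."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    once = [word for word, n in counts.items() if n == 1]
    return len(once) / len(counts)

# Invented sample: "the" and "and" repeat; "cat", "dog", "bird" occur once.
print(once_used_share("the cat and the dog and the bird"))  # 0.6
```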

There are about 59,000 words in A Clockwork Orange, of which about 14,000 are non-trivial (that is, not words like "the" or "in"). There are about 5,500 unique words.

For more info:

Wednesday, September 18, 2013

Frequency of Scripps National Spelling Bee winning words in Google Books, 1925-2008

Please note that this is a logarithmic scale; every vertical division to the right means the word is ten times more common, two divisions mean it is 100 times more common, etc.

If you want to see a standalone image at full resolution, it's hosted here.

The diagonal black line shows the overall trend: the Scripps National Spelling Bee winning words have become more and more unusual as the years pass.

As noted above, this is a logarithmic scale, and thus there cannot be a zero value. (It's a math thing.) Words without bars at all did not appear at all in the Google Ngram Viewer for that year. According to Google, this means the word appeared fewer than 40 times in their corpus for that year.
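The two properties of the logarithmic scale mentioned here can be sketched in a few lines. The function names are hypothetical, and the numbers are made up rather than taken from the Ngram data.

```python
import math

def divisions_apart(freq_a, freq_b):
    """How many log10 divisions separate two positive frequencies."""
    return math.log10(freq_a / freq_b)

def bar_length(freq):
    """Log10 position of a bar, or None when the frequency is zero."""
    # log10(0) is undefined, so a word with no hits gets no bar at all.
    return math.log10(freq) if freq > 0 else None

print(divisions_apart(500, 5))  # about 2: a word 100 times more common
print(bar_length(0))            # None
```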

The only word in this list that does not appear in the Google Ngram Viewer for any year is esquamulose, the winning word of 1962. It is an adjective meaning "not covered in scales".

The list of winning words was from Wikipedia; for more information about the Scripps National Spelling Bee, see the Wikipedia page or their home page.

No smoothing was used in the Google Ngram Viewer, in order to get the most accurate results. The Google Ngram Viewer only goes to 2008 at present, which is why the graph stops there; the winning words since then were Laodicean, stromuhr, cymotrichous, guetapens and knaidel.

Monday, September 16, 2013

Word Cloud of Fight Club

The design of this word cloud is based, not on the book cover, but on the iconic poster for the film:

Unsurprisingly, four of the five most common non-trivial words are the names of the two main characters (besides the unnamed protagonist): “Tyler” (the ninth most common word overall, surpassing “is” and “it”), “Marla” and “Durden”. The nameless “mechanic” is the densest non-trivial word, appearing 70 times in only three chapters.

Also unsurprisingly to anyone familiar with the book or the movie, “fight” is the most common non-trivial non-eponym, appearing 232 times; 45% of these appearances alone are in Chapter 5, where the actual Fight Club is introduced. 69% of the time, the next word is “club”; 13% of the time, the preceding word is “about” (as in “talk about fight club”).
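Percentages like these come from looking at each word's neighbors. A rough sketch of that kind of count follows; the function name, the tokenizing rule and the sample text are my own assumptions, not the concordance software's method.

```python
import re

def neighbor_shares(text, word, next_word, prev_word):
    """Return (count of word, share followed by next_word, share preceded by prev_word)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positions = [i for i, t in enumerate(tokens) if t == word]
    if not positions:
        return 0, 0.0, 0.0
    followed = sum(1 for i in positions
                   if i + 1 < len(tokens) and tokens[i + 1] == next_word)
    preceded = sum(1 for i in positions
                   if i > 0 and tokens[i - 1] == prev_word)
    total = len(positions)
    return total, followed / total, preceded / total

# Invented sample text, not a quotation from the book.
sample = "You do not talk about fight club. They fight. Fight club is new. A fight club."
print(neighbor_shares(sample, "fight", "club", "about"))  # (4, 0.75, 0.25)
```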

There are approximately* 50,000 total words, and approximately 5,100 unique words. The title, “fight club”, appears 162 times.

The words that appear only once in Fight Club but still are more common than in the Brown corpus are a real grab-bag of scientific, sexual, slang and posh terms, including “chlorofluorocarbons”, “amyl”, “sodomy”, “cunnilingus”, “bahzillion”, “wedgie”, “wainscoting”, “vichyssoise” and all three words in the phrase “Dakapo halogen torchiere”.

For more information, see the Wikipedia entries about the novel, the film or the author. You can also see *the method I used to determine the non-trivial words and their frequencies.

Word cloud created using Tagxedo.

Monday, September 9, 2013

Word Cloud of The Catcher in the Rye

This word cloud is based not on a published book cover, but on an artist’s homage that is, in my opinion, far more striking and evocative than any of the “official” versions. The artist, M. S. Corley, was kind enough to give me permission to adapt his work; please look at the rest of his oeuvre at his website, The Art of M.S. Corley.

Three of the top five words in The Catcher in the Rye – “goddam”, “hell” and “damn” – are curse words, which of course is one of the reasons this is one of the most banned books in school libraries. In fact, continuing down the list with “chrissake”, “bastard”, “crap”, “sonuvabitch”, etc., my rough calculation is that just over 5% of the book consists of words that would not be acceptable at a 1951 supper table.

But they and many of the top words, like “lousy” and “terrific”, are essential to the voice Salinger gives his protagonist, Holden Caulfield. One of the most thematic and memorable words in the book is “phony”, but Salinger uses it relatively sparingly and strategically: 35 times, ranking it #46, behind Holden’s iconic “hat”, featured on this cover.

Holden calls everyone “old”, such as “old Phoebe” and “old Stradlater”, but the concordance software omits the most common 1,000 words in the English language so the word clouds aren’t full of words like “the” and “I”. “Old” appears 397 times (more than “or”!), which would have put it in the #1 spot, beating “goddam” with 245 appearances. Interestingly, when compared to the word frequencies in the Brown corpus of written English, “old” appears no more often in The Catcher in the Rye than it does elsewhere (it is 31st when ranked by frequency and also 31st by “keyness” compared to Brown).

“Phoebe” is the seventh-most-used word in the book; Holden mentions his sister’s name 115 times. “Rye” appears only seven times, most often as “If a body catch a body comin’ through the rye.” The book’s title, and the word “catcher”, appear just once, during Holden’s conversation with Phoebe in Chapter 22.

Among the words that appear only once are “bassackwards”, “oversexed”, the brand name “Tattersall”, the curiously spelled “wutchamacallit” (the more common “whatchamacallit” does not appear) and several phonetically caricatured French words like “voolay voo” (when Holden imitates the speech of Janine, a singer in the Wicker Bar). There are approximately* 75,000 words, and 3,500 unique words.

After I finished making this word cloud, I found out there is a plan to publish a posthumous sequel to The Catcher in the Rye in 2015. I have absolutely no comment to make about this.

For more information, please see the Wikipedia article about the book, the Wikipedia article about the author, or this analysis of the themes in the book, including the hat. You can also see the method I used to determine the words and their frequencies. Last week, I posted a boring, traditional word cloud of this book; I’ve removed it, but you can see it here if you want.

* See the method file for an explanation of why these counts are approximate.

Word cloud created using Tagxedo.

Monday, September 2, 2013

The Myth of Criticism


Studies showing praise is more effective than criticism: see the bibliography at (Note that it’s not that simple: different kinds of praise are more effective than others, some kinds of praise can be counterproductive, and some kinds of criticism can be productive.)

Studies outlining the bias that leads us to believe negative feedback is more effective than positive feedback:
•    P. E. Schaffner (1985) Specious learning about reward and punishment. Journal of Personality and Social Psychology, 48, 1377-86
•    D. Kahneman & A. Tversky (1973) On the psychology of prediction. Psychological Review, 80, 237-51
•    A. Tversky & D. Kahneman (1974) Judgment under uncertainty: Heuristics and biases. Science, 185, 1124-31

Dish soap > vinegar > honey for catching flies:

