Monday, November 17, 2014

Visualizing word and letter frequencies in Gadsby, a novel without the letter 'e'

In 1939, Ernest Vincent Wright published the novel Gadsby (gee, I wonder where he came up with that name...), 58,124 words (by my count), none of which contain the letter 'e'.

The cover is actually more colorful than the plot.

Here are a few of the features of the English language Wright was deliberately ruling out by avoiding its most common letter:
  • "The": the most common word in English (about 5% of all words in most books);
  • The pronouns "he", "she", "we", "they", "me", "her", "them";
  • The common functional words "when", "where", "these", "those", "every";
  • Most past-tense verbs, "walked", "went", "loved";
  • "Sleeplessness". Hey, I like that word.
The copyright to Gadsby expired because Wright's estate didn't apply for a renewal, so you can find the entire lipogram (that's the term for this kind of writing) here or here. I tried to read it, but it was just too difficult. Not entirely due to the missing letter, but because it's really, really uninteresting. You want a good lipogram, try A Void (which I couldn't find an electronic version of to analyze), a lipogrammatic translation of a French lipogrammatic novel. How impressive is that?

As a data-centric sort of fellow, my immediate thought was to wonder how this constraint affected the word and letter frequencies compared to 'normal' English. Obviously (I posited, correctly), word choice would be affected much more profoundly. On reading it, I saw there were a lot of Anglo-Saxon words and irregular verbs ("said", "had", "was", etc.). So, using Python's Natural Language Toolkit, I calculated word frequencies and compared them to the Brown corpus -- after I'd removed every word containing the letter 'e' from the latter. I used the standard technique of Log Likelihood keyness (basically, it's the confidence that a difference in frequency is 'real' instead of random) to determine the significance of word frequency differences (I've put the frequencies and comparisons using different metrics of all 3934 unique words in Gadsby in a Google Doc if you're interested):
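For anyone who wants to try this at home, here is a minimal sketch of the comparison, not the actual analysis code (that lives in the repo linked at the end). It assumes the common two-corpus log-likelihood (G2) formulation and plain lists of word tokens as input; the function names and the sign convention (positive means overrepresented in the first corpus) are my own choices for illustration.

```python
import math
from collections import Counter


def without_e(words):
    """Lowercase the tokens and drop any that contain the letter 'e'."""
    return [w.lower() for w in words if "e" not in w.lower()]


def log_likelihood(a, b, c, d):
    """G2 for a word seen a times in a corpus of c tokens
    and b times in a corpus of d tokens."""
    # Expected counts if the word were equally frequent in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:  # a zero count contributes nothing to the sum
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2


def keyness(study_words, reference_words):
    """Signed G2 per word: positive if overrepresented in study_words.
    The sign convention is an assumption made for this sketch."""
    study, ref = Counter(study_words), Counter(reference_words)
    c, d = sum(study.values()), sum(ref.values())
    scores = {}
    for word in set(study) | set(ref):
        a, b = study[word], ref[word]
        sign = 1 if a / c >= b / d else -1
        scores[word] = sign * log_likelihood(a, b, c, d)
    return scores


# Toy usage; with NLTK installed, the reference list could instead be
# built from nltk.corpus.brown.words().
study = without_e(["Big", "big", "the", "folks", "of"])
reference = without_e(["of", "of", "of", "big", "there"])
scores = keyness(study, reference)
```

A large positive score flags a word like "big" as a Gadsby favorite; a large negative one flags a word like "of" as conspicuously scarce.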

The overrepresented words contain character names, but also "big" (as a replacement for "large"? "enormous"?) and "folks" ("people"?). The underrepresented words are the ones I found interesting, however: "of", "to" and "in" are very common words in English, and to have their usage reduced that much implies that even though they do not contain the letter "e", they are used in tandem with words containing "e" -- such as "the". So I analyzed how often each word has a neighboring "e"-word in Brown, and made a quasi-volcano plot (the area of the circles is the frequency in Gadsby):

You can see there's a palpable tendency for the over- and underrepresented words to be adjacent to "e"-containing words, whereas in that mishmash in the middle (words that have comparable frequencies in Gadsby and Brown), the probability of e-adjacency is far more spread out.

How much of this is simply due to the word "the"? Here's a volcano-ish plot restricting the analysis to frequency of "the"-adjacence in the full Brown:

We basically see the same pattern, but lower down the graph because we're using a more restrictive metric. The spike in the top middle shows that words that are often "the"-adjacent in Brown are, unsurprisingly, rare in Gadsby.
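The adjacency measure behind both plots can be sketched roughly as follows. This is an illustration under assumptions, not the original code: it takes a flat list of word tokens with punctuation already stripped, ignores sentence boundaries, and counts a word occurrence as adjacent if the token immediately before or after it matches. The function name and the `marker` parameter are hypothetical.

```python
from collections import Counter


def adjacency_rate(tokens, marker=lambda w: "e" in w):
    """For each word, the fraction of its occurrences that have a
    neighboring token (previous or next) for which marker() is true."""
    tokens = [t.lower() for t in tokens]
    total = Counter()
    adjacent = Counter()
    for i, word in enumerate(tokens):
        total[word] += 1
        prev_hit = i > 0 and marker(tokens[i - 1])
        next_hit = i + 1 < len(tokens) and marker(tokens[i + 1])
        if prev_hit or next_hit:
            adjacent[word] += 1
    return {w: adjacent[w] / total[w] for w in total}


tokens = "the cat sat on a mat of the king".split()

# Default marker: any neighbor containing the letter 'e'.
e_rates = adjacency_rate(tokens)

# Stricter marker: only count neighbors that are exactly "the".
the_rates = adjacency_rate(tokens, marker=lambda w: w == "the")
```

Swapping the marker is all it takes to go from the first plot's metric to the second's, which is why the second set of rates can never be higher than the first.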

Letter frequencies

Well, that was fun. Next item: what happens to letter frequencies (again, here's a Google Doc)? Let's compare Gadsby to Brown-without-e-words:

That was unexpected (by me, anyway). The only vowel to be more frequent in Gadsby is the relatively little-used "u"! The others, "a", "i", and "o", are all less used in Gadsby. It appears that the slack must be taken up by moderate-frequency consonants. Let's have a look at the log-likelihoods:
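If you'd like to poke at the letter comparison yourself, a minimal sketch might look like this. It assumes word lists as input and only counts the 26 unaccented letters a-z; the function names are made up for illustration, and the repo linked at the end may well do it differently.

```python
import string
from collections import Counter


def letter_shares(words):
    """Share of each letter a-z among all letters in the word list."""
    counts = Counter(
        ch for w in words for ch in w.lower() if ch in string.ascii_lowercase
    )
    total = sum(counts.values())
    if total == 0:
        return {ch: 0.0 for ch in string.ascii_lowercase}
    return {ch: counts[ch] / total for ch in string.ascii_lowercase}


def share_difference(study_words, reference_words):
    """Positive values: letter takes a bigger share in study_words."""
    study = letter_shares(study_words)
    ref = letter_shares(reference_words)
    return {ch: study[ch] - ref[ch] for ch in string.ascii_lowercase}


# Toy usage with two tiny "corpora".
diff = share_difference(["big", "gang"], ["of", "if", "big"])
winners = sorted(diff, key=diff.get, reverse=True)
```

Because shares in each corpus sum to one, the differences sum to zero: any letter that gains has to do so at another letter's expense.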

I would never have predicted "g" and "f" to be the biggest winner and loser, respectively! Here are the top 10 g-containing words:

  Rank  Word    Gadsby freq.  Brown freq.*
    26  gadsby      364             0
    32  big         297            32
    59  young       187            35
    74  good        129            73
    84  got         113            44
    86  long        108            68
    87  girls       106            13
    90  go          104            57
    92  girl        100            20
    94  right        99            56

* Brown without 'e'-containing words, normalized to same length as Gadsby.
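The normalization in the footnote, and the way the table keeps each word's overall rank while filtering on a letter, can be sketched like this. The helper names are hypothetical and the counts are toy numbers; how the original analysis rounded the scaled frequencies is not something I'm asserting here.

```python
def normalize_counts(counts, target_total):
    """Scale word counts so they sum to target_total
    (e.g. the number of words in Gadsby)."""
    source_total = sum(counts.values())
    return {w: c * target_total / source_total for w, c in counts.items()}


def top_words_with_letter(counts, letter, n=10):
    """Rank all words by frequency, then keep the first n that contain
    the letter. The rank reported is the overall rank, as in the table."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    rows = [
        (rank, word, count)
        for rank, (word, count) in enumerate(ranked, start=1)
        if letter in word
    ]
    return rows[:n]


# Toy usage: scale a 40-token reference sample down to 20 tokens,
# then list the top "g" words in a small study sample.
reference_scaled = normalize_counts({"of": 30, "big": 10}, target_total=20)
g_rows = top_words_with_letter({"a": 50, "big": 30, "of": 20, "go": 10}, "g", n=2)
```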

Of course, there's the main character, Mr. Gadsby himself (interesting aside: Wright never calls him that, because even though "Mr" contains no "e", it's short for a word that does, and that would be cheating). Now let's see the "f"-containing words in the Brown corpus:

 Rank  Word    Frequency
  2    of        36406
  9    f         12431
  14   for        9485
  39   from       4370
  64   if         2199
  88   first      1359
 102   after      1070
 107   before     1011
 144   life        709
 157   off         637

Interestingly, most of these words do not contain "e", but they are often "e"-adjacent; "of", for example, is preceded or followed by an "e"-containing word 85.7% of the time in Brown.

Okay, let ~~me~~ us try this: I ~~hope~~ am hoping that you ~~enjoyed~~ ~~liked~~ did cotton to this blog post; it was an intriguing ~~subject~~ topic to look at, for its linguistic traits. Gadsby is a fun study!

Whew! That was exactly as hard as it looked!

Methodology: here is a GitHub repo of the analysis, and nbviewer docs of the word and letter frequency analyses. Tools: Python with IPython, NLTK, Pandas, Matplotlib and Seaborn; Microsoft Excel; Adobe Photoshop.


Please leave comments & corrections here. Courtesy is appreciated.
