Visualizing word and letter frequencies in Gadsby, a novel without the letter ‘e’
In 1939, Ernest Vincent Wright published the novel Gadsby (gee, I wonder where he came up with that name…), 58,124 words (by my count), none of which contain the letter ‘e’.
Here are a few of the features of the English language Wright was deliberately ruling out by avoiding its most common letter:
- “The”: the most common word in English (about 5% of all words in most books);
- The pronouns “he”, “she”, “we”, “they”, “me”, “her”, “them”;
- The common functional words “when”, “where”, “these”, “those”, “every”;
- Most past-tense verbs, “walked”, “went”, “loved”;
- “Sleeplessness”. Hey, I like that word.
The copyright to Gadsby expired because Wright’s estate didn’t apply for a renewal, so you can find the entire lipogram (that’s the term for this kind of writing) here or here. I tried to read it, but it was just too difficult. Not entirely due to the missing letter, but because it’s really, really uninteresting. You want a good lipogram, try A Void (which I couldn’t find an electronic version of to analyze), a lipogrammatic translation of a French lipogrammatic novel. How impressive is that?
As a data-centric sort of fellow, my immediate thought was to wonder how this constraint affected the word and letter frequency compared to ‘normal’ English. Obviously (I posited, correctly), word choice would be affected much more profoundly. On reading it, I saw there were a lot of Anglo-Saxon words and irregular verbs (“said”, “had”, “was”, etc.). So, using Python’s Natural Language Toolkit, I calculated word frequencies and compared them to the Brown corpus, after I’d removed every word containing the letter ‘e’ from the latter. I used the standard technique of Log Likelihood keyness (basically, it’s the confidence that a difference in frequency is ‘real’ instead of random) to determine the significance of word frequency differences (I’ve put the frequencies and comparisons using different metrics of all 3934 unique words in Gadsby in a Google Doc if you’re interested):
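For the curious, here is a minimal sketch of how a log-likelihood keyness score can be computed for a single word. It assumes the common two-term G2 formulation; the exact variant used in the linked notebooks may differ. The function name `log_likelihood`, the sign convention (positive means overrepresented in the first corpus) and the toy counts are all illustrative assumptions, not figures from the analysis.

```python
import math


def log_likelihood(count_a, total_a, count_b, total_b):
    """Signed G2 keyness of one word between corpus A and corpus B.

    count_*: occurrences of the word in each corpus (raw counts)
    total_*: total number of words in each corpus
    Assumption: positive = overrepresented in A, negative = underrepresented.
    """
    combined = count_a + count_b
    # Expected counts if the word were equally likely in both corpora
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)

    g2 = 0.0
    if count_a > 0:
        g2 += count_a * math.log(count_a / expected_a)
    if count_b > 0:
        g2 += count_b * math.log(count_b / expected_b)
    g2 *= 2

    # Attach a sign so that over- and underrepresentation can be told apart
    if count_a / total_a < count_b / total_b:
        g2 = -g2
    return g2


# Toy numbers, purely for illustration
same_rate = log_likelihood(10, 100, 10, 100)       # identical relative frequency
only_in_a = log_likelihood(50, 1000, 0, 1000)      # word appears only in corpus A
only_in_b = log_likelihood(0, 1000, 50, 1000)      # word appears only in corpus B
```

The larger the absolute value, the less likely the frequency difference is down to chance; a word with the same relative frequency in both corpora scores zero.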
The overrepresented words contain character names, but also “big” (as a replacement for “large”? “enormous”?) and “folks” (“people”?). The underrepresented words are the ones I found interesting, however: “of”, “to” and “in” are very common words in English, and to have their usage reduced that much implies that even though they do not contain the letter “e”, they are used in tandem with words containing “e” — such as “the”. So I analyzed how often each word has a neighboring “e”-word in Brown, and made a quasi-volcano plot (the area of the circles is the frequency in Gadsby):
You can see there’s a palpable tendency for the over- and underrepresented words to be adjacent to “e”-containing words, whereas in that mishmash in the middle (words that have comparable frequencies in Gadsby and Brown), the probability of e-adjacency is far more spread out.
How much of this is simply due to the word “the”? Here’s a volcano-ish plot restricting the analysis to frequency of “the”-adjacency in the full Brown:
We basically see the same pattern, but lower down the graph because we’re using a more restrictive metric. The spike in the top middle shows that words that are often “the”-adjacent in Brown are, unsurprisingly, rare in Gadsby.
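Both adjacency measures above boil down to the same question: for each word, what fraction of its occurrences sit next to a word of some kind? Here is a rough sketch of that calculation, not necessarily how the linked notebooks do it. The helper `adjacency_rates` is a made-up name, the token list is a toy example, and the sketch assumes lowercased tokens and ignores sentence boundaries.

```python
from collections import Counter


def adjacency_rates(tokens, is_target):
    """Fraction of each word's occurrences that have a neighboring target word.

    tokens: list of lowercased words, in reading order
    is_target: function deciding whether a neighbor counts
               (e.g. "contains an e", or "is the word 'the'")
    """
    totals = Counter()
    adjacent = Counter()
    for i, word in enumerate(tokens):
        totals[word] += 1
        prev_hit = i > 0 and is_target(tokens[i - 1])
        next_hit = i + 1 < len(tokens) and is_target(tokens[i + 1])
        if prev_hit or next_hit:
            adjacent[word] += 1
    return {word: adjacent[word] / totals[word] for word in totals}


# Toy sentence, for illustration only
tokens = "the king of the hill sat on a rock and sang".split()

# "e"-adjacency: any neighbor containing the letter e
e_rates = adjacency_rates(tokens, lambda w: "e" in w)

# "the"-adjacency: the stricter version, neighbor must be "the" itself
the_rates = adjacency_rates(tokens, lambda w: w == "the")
```

Passing the test as a function means the same loop handles the broad "e"-adjacent case and the narrower "the"-adjacent one.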
Well, that was fun. Next item: what happens to letter frequencies (again, here’s a Google Doc)? Let’s compare Gadsby to Brown-without-e-words:
That was unexpected (by me, anyway). The only vowel to be more frequent in Gadsby is the relatively little-used “u”! The others, “a”, “i”, and “o”, are all less used in Gadsby. It appears that the slack must be taken up by moderate-frequency consonants. Let’s have a look at the log-likelihoods:
I would never have predicted “g” and “f” to be the biggest winner and loser, respectively! Here are the top 10 g-containing words:
| Rank | Word | Gadsby freq. | Brown freq.* |
|---|---|---|---|
| 26 | gadsby | 364 | 0 |
| 32 | big | 297 | 32 |
| 59 | young | 187 | 35 |
| 74 | good | 129 | 73 |
| 84 | got | 113 | 44 |
| 86 | long | 108 | 68 |
| 87 | girls | 106 | 13 |
| 90 | go | 104 | 57 |
| 92 | girl | 100 | 20 |
| 94 | right | 99 | 56 |

* Brown without 'e'-containing words, normalized to same length as Gadsby.
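The footnote under the table packs in three steps: dropping every 'e'-containing word from the reference corpus, rescaling its counts to Gadsby's length, and then pulling out the words that contain a given letter. Here is a small sketch of those steps, plus a letter counter, under the assumption that both texts are already tokenized. All function names and the two toy token lists are invented for illustration; they are not taken from the actual analysis.

```python
from collections import Counter


def drop_e_words(tokens):
    """Keep only the words that contain no letter e."""
    return [w for w in tokens if "e" not in w.lower()]


def scale_counts(counts, target_total):
    """Rescale counts so that they sum to target_total."""
    total = sum(counts.values())
    return {key: value * target_total / total for key, value in counts.items()}


def letter_counts(tokens):
    """Count alphabetic characters across all tokens, case-insensitively."""
    return Counter(ch for w in tokens for ch in w.lower() if ch.isalpha())


def top_words_with(letter, counts, n=10):
    """The n most frequent words containing the given letter."""
    hits = [(w, c) for w, c in counts.items() if letter in w]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)[:n]


# Toy stand-ins for the two corpora
gadsby_tokens = "a big dog got a good long walk".split()
reference_tokens = "the big dog went for a walk if a dog got a bone".split()

reference_no_e = drop_e_words(reference_tokens)
gadsby_words = Counter(gadsby_tokens)
# Reference counts normalized to the same length as the lipogram
reference_words = scale_counts(Counter(reference_no_e), len(gadsby_tokens))

g_words = top_words_with("g", gadsby_words)
gadsby_letters = letter_counts(gadsby_tokens)
```

Rescaled counts like these are handy for side-by-side tables; a significance score such as log-likelihood would normally be computed from the raw counts and corpus sizes instead.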
Of course, there’s the main character, Mr. Gadsby himself (interesting aside: Wright never calls him that, because even though “Mr” contains no “e”, it’s short for a word that does, and that would be cheating). Now let’s see the “f”-containing words in the Brown corpus:
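| Rank | Word | Frequency |
|---|---|---|
| 2 | of | 36406 |
| 9 | f | 12431 |
| 14 | for | 9485 |
| 39 | from | 4370 |
| 64 | if | 2199 |
| 88 | first | 1359 |
| 102 | after | 1070 |
| 107 | before | 1011 |
| 144 | life | 709 |
| 157 | off | 637 |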
Interestingly, most of these words do not contain “e”, but they are often “e”-adjacent; “of”, for example, is preceded or followed by an “e”-containing word 85.7% of the time in Brown.
Okay, let ~~me~~ us try this: I ~~hope~~ am hoping that you ~~enjoyed~~ ~~liked~~ did cotton to this blog post; it was an intriguing ~~subject~~ topic to look at, for its linguistic traits. Gadsby is a fun study!
Whew! That was exactly as hard as it looked!
Methodology: here is a GitHub repo of the analysis, and nbviewer docs of the word and letter frequency analyses. Tools: Python with IPython, NLTK, Pandas, Matplotlib, Seaborn and Plot.ly; Microsoft Excel; Adobe Photoshop.