Monday, November 24, 2014

Projections of White Christmases until the year 2100, based on a climate model

Below is a climate model projection of what areas of North America will be snow-covered on December 25 of each year between 2014 and 2100:

A few things should be pointed out. First and foremost is:

  1. Further to point #1 above, the point of this kind of climate model is not to accurately predict the weather every single day for 87 years, even though that's what the model contains. The point is to experiment, and experimental science is built on prediction. Evaluating those predictions makes for better models down the road. I'm no climatologist, so I'll let the Oregon Climate Change Research Institute explain Why We Use Climate Models.
  2. In the map above, white is 100% snow coverage, and the white becomes more and more transparent at the fringes from 99% to 1% snow coverage, until the bare background is 0% snow coverage. The resolution of the climate model is only 0.44 degrees, so the fit isn't exact at the coastlines.
  3. The data is from the Canadian Centre for Climate Modelling and Analysis (CCCma), hosted at Environment Canada. The exact model is the Fourth Generation Canadian Regional Climate Model (CanRCM4), run under the RCP 8.5 scenario.
  4. That global warming kinda sneaks up on you, doesn't it? It's gradual, but when it loops back down to 2014, it's pretty obvious. I imagine people in Grande Prairie, Alberta are looking forward to the end of the 21st century.
  5. Here is a post on my other, nerdier blog about how to make maps in Python based on the CCCma's NetCDF files. There are plenty of examples out there on plotting these files, but not with the format CCCma uses.
  6. There's also code on my GitHub, with links to nbviewer notebooks.
  7. Tools used: Python with IPython, netCDF4, Matplotlib, Basemap and PIL; Photoshop; Gfycat.
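The transparency scheme described in point 2 can be sketched in a few lines. This is only an illustration, not the code from the GitHub repo: the function name `snow_to_rgba` and the sample grid are made up, and it assumes snow coverage arrives as percentages from 0 to 100.

```python
def snow_to_rgba(snow_pct_grid):
    """Map a 2-D grid of snow coverage (0-100 %) to white RGBA pixels.

    Every pixel is pure white; only the alpha channel changes, so 100 %
    coverage is opaque and 0 % lets the background show through.
    """
    overlay = []
    for row in snow_pct_grid:
        out_row = []
        for pct in row:
            pct = min(max(float(pct), 0.0), 100.0)  # guard against out-of-range values
            out_row.append((1.0, 1.0, 1.0, pct / 100.0))
        overlay.append(out_row)
    return overlay


# Hypothetical 2x2 grid, not real model output.
grid = [[0.0, 50.0], [99.0, 100.0]]
overlay = snow_to_rgba(grid)
```

A nested list like `overlay` has the (rows, columns, 4) shape that Matplotlib's `imshow` accepts for RGBA images, so it could be drawn on top of a base map.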

Tuesday, November 18, 2014

An Introduction to Data Visualization

An introduction to the practice of data visualization, with theory, examples, and good humor.

This is a studio rerecording of a presentation I was asked to give to McGill University graduate students from many disciplines in Montreal, Canada in November 2014.

Here's a link to the SlideShare version if you'd rather read than listen to me.

I give shoutouts to Alberto Cairo, Nathan Yau and Edward Tufte, without whom I'd be much less well informed.

Monday, November 17, 2014

Visualizing word and letter frequencies in Gadsby, a novel without the letter 'e'

In 1939, Ernest Vincent Wright published the novel Gadsby (gee, I wonder where he came up with that name...), 58,124 words (by my count), none of which contain the letter 'e'.

The cover is actually more colorful than the plot.

Here are a few of the features of the English language Wright was deliberately ruling out by avoiding its most common letter:
  • "The": the most common word in English (about 5% of all words in most books);
  • The pronouns "he", "she", "we", "they", "me", "her", "them";
  • The common functional words "when", "where", "these", "those", "every";
  • Most past-tense verbs, "walked", "went", "loved";
  • "Sleeplessness". Hey, I like that word.
The copyright to Gadsby expired because Wright's estate didn't apply for a renewal, so you can find the entire lipogram (that's the term for this kind of writing) here or here. I tried to read it, but it was just too difficult. Not entirely due to the missing letter, but because it's really, really uninteresting. You want a good lipogram, try A Void (which I couldn't find an electronic version of to analyze), a lipogrammatic translation of a French lipogrammatic novel. How impressive is that?

As a data-centric sort of fellow, my immediate thought was to wonder how this constraint affected the word and letter frequency compared to 'normal' English. Obviously (I posited, correctly), word choice would be affected much more profoundly. On reading it, I saw there were a lot of Anglo-Saxon words and irregular verbs ("said", "had", "was", etc.). So, using Python's Natural Language Toolkit, I calculated word frequencies and compared them to the Brown corpus -- after I'd removed every word containing the letter 'e' from the latter. I used the standard technique of Log Likelihood keyness (basically, it's the confidence that a difference in frequency is 'real' instead of random) to determine the significance of word frequency differences (I've put the frequencies and comparisons using different metrics of all 3934 unique words in Gadsby in a Google Doc if you're interested):
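For anyone who hasn't met log-likelihood keyness before, here's a minimal sketch of the two-corpus version commonly used in corpus linguistics. It's an assumption that the analysis used exactly this form; the function name `log_likelihood` is mine. The example numbers for "big" come from the g-word table further down, with both corpora treated as 58,124 words long because the Brown counts there are normalized to Gadsby's length.

```python
import math


def log_likelihood(count_a, total_a, count_b, total_b):
    """Two-corpus log-likelihood (G2) for a single word.

    count_a, count_b: occurrences of the word in corpus A and corpus B.
    total_a, total_b: total number of words in each corpus.
    """
    combined = count_a + count_b
    # Expected counts if the word were equally likely in both corpora.
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


# "big": 297 occurrences in Gadsby vs. 32 in the length-normalized Brown subset.
big_score = log_likelihood(297, 58124, 32, 58124)
```

The score is zero when a word has the same relative frequency in both corpora and grows as the frequencies diverge. By the usual chi-squared convention with one degree of freedom, values above 3.84 correspond to p < 0.05. The raw score doesn't say which corpus has more of the word, so the direction has to be read off the counts.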

The overrepresented words contain character names, but also "big" (as a replacement for "large"? "enormous"?) and "folks" ("people"?). The underrepresented words are the ones I found interesting, however: "of", "to" and "in" are very common words in English, and to have their usage reduced that much implies that even though they do not contain the letter "e", they are used in tandem with words containing "e" -- such as "the". So I analyzed how often each word has a neighboring "e"-word in Brown, and made a quasi-volcano plot (the area of the circles is the frequency in Gadsby):

You can see there's a palpable tendency for the over- and underrepresented words to be adjacent to "e"-containing words, whereas in that mishmash in the middle (words that have comparable frequencies in Gadsby and Brown), the probability of e-adjacency is far more spread out.
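The "e"-adjacency measure on the vertical axis of that plot can be sketched like this. It's a simplified stand-in, not the code from the repo: the function name `e_adjacency` is made up, the whole token list is treated as one sequence, and how the real analysis handled punctuation and sentence boundaries is not something I'm assuming here.

```python
from collections import Counter


def e_adjacency(tokens, letter="e"):
    """For each word, the fraction of its occurrences that have a
    neighbouring word (before or after) containing `letter`."""
    words = [t.lower() for t in tokens]
    total = Counter()
    adjacent = Counter()
    for i, word in enumerate(words):
        total[word] += 1
        before = words[i - 1] if i > 0 else ""
        after = words[i + 1] if i + 1 < len(words) else ""
        if letter in before or letter in after:
            adjacent[word] += 1
    return {word: adjacent[word] / total[word] for word in total}


# Toy sentence for illustration only, not Brown corpus data.
sample = "the king of the hill sat in a big room full of maps".split()
rates = e_adjacency(sample)
```

Restricting the check to the single word "the" instead of any "e"-containing word only means replacing the `letter in before or letter in after` test with an equality test against "the".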

How much of this is simply due to the word "the"? Here's a volcano-ish plot restricting the analysis to frequency of "the"-adjacence in the full Brown:

We basically see the same pattern, but lower down the graph because we're using a more restrictive metric. The spike in the top middle shows that words that are often "the"-adjacent in Brown are, unsurprisingly, rare in Gadsby.

Letter frequencies

Well, that was fun. Next item: what happens to letter frequencies (again, here's a Google Doc)? Let's compare Gadsby to Brown-without-e-words:

That was unexpected (by me, anyway). The only vowel to be more frequent in Gadsby is the relatively little-used "u"! The others, "a", "i", and "o", are all less used in Gadsby. It appears that the slack must be taken up by moderate-frequency consonants. Let's have a look at the log-likelihoods:
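The letter-frequency comparison itself boils down to something like the following sketch. The helper names `drop_words_with` and `letter_frequencies` are hypothetical, and splitting on whitespace is a simplification of proper tokenizing.

```python
from collections import Counter
import string


def drop_words_with(text, letter="e"):
    """Remove every whitespace-separated word that contains `letter`."""
    return " ".join(w for w in text.split() if letter not in w.lower())


def letter_frequencies(text):
    """Relative frequency of each letter a-z, ignoring all other characters."""
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: counts[ch] / total for ch in string.ascii_lowercase}


# Toy text for illustration; the real comparison uses the Brown corpus.
filtered = drop_words_with("The big dog ran home")
freqs = letter_frequencies(filtered)
```

Running the same `letter_frequencies` over Gadsby and over the filtered reference text gives two dictionaries that can be compared letter by letter.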

I would never have predicted "g" and "f" to be the biggest winner and loser, respectively! Here are the top 10 g-containing words:

  Rank  Word    Gadsby freq.  Brown freq.*
    26  gadsby      364             0
    32  big         297            32
    59  young       187            35
    74  good        129            73
    84  got         113            44
    86  long        108            68
    87  girls       106            13
    90  go          104            57
    92  girl        100            20
    94  right        99            56

* Brown without 'e'-containing words,
normalized to same length as Gadsby.

Of course, there's the main character, Mr. Gadsby himself (interesting aside: Wright never calls him that, because even though "Mr" contains no "e", it's short for a word that does, and that would be cheating). Now let's see the "f"-containing words in the Brown corpus:

 Rank  Word    Frequency
  2    of        36406
  9    f         12431
  14   for        9485
  39   from       4370
  64   if         2199
  88   first      1359
 102   after      1070
 107   before     1011
 144   life        709
 157   off         637

Interestingly, most of these words do not contain "e", but they are often "e"-adjacent; "of", for example, is preceded or followed by an "e"-containing word 85.7% of the time in Brown.

Okay, let us try this: I am hoping that you did cotton to this blog post; it was an intriguing topic to look at, for its linguistic traits. Gadsby is a fun study!

Whew! That was exactly as hard as it looked!

Methodology: here is a GitHub repo of the analysis, and nbviewer docs of the word and letter frequency analyses. Tools: Python with IPython, NLTK, Pandas, Matplotlib and Seaborn; Microsoft Excel; Adobe Photoshop.
