Baby Boom: An Excel Tutorial on Analyzing Large Data Sets

tl;dr: I wrote a data science tutorial for Excel for the good folks at Udemy: click here!

The usual progression I've seen in data science is the following:

  1. Start out learning data analysis with Microsoft Excel
  2. Switch to a more powerful analysis environment like R or Python
  3. Look down one's nose at everybody still using Excel
  4. Come to realize, hey, Excel's not so bad
I'll admit, I was stuck at Step 3 for a few weeks, but luckily I got most of my annoying pooh-poohing (if you're not a native English speaker, that expression might not mean what you think it means) out of my system decades ago when I was a proofreader (hence my nickname, if you were curious).

I think most mature data scientists see Excel as an essential and useful part of the ecosystem; I think the way it brings you so close to your raw data is essential in the early stages to develop data literacy, and later on when you're munging vectors and dataframes it can still be useful to fire up a .csv and have a look-see with no layers of abstraction above it.

Feedback is welcome. I'm not involved with the rest of the Excel course, but I have taken the Complete Web Developer course from Udemy and recommend it. I get absolutely no money for referrals or anything like that (or for page visits for my tutorial for that matter), so this is honest, cross my heart.

Dialogue plot of Star Trek: The Original Series

First, the plot. Hover over the points to see the character names.

Why Star Trek? Well, I'm working on an in-depth analysis of all of Shakespeare's plays, so I'm vetting my method on Star Trek because (a) the size of the corpus is much smaller so each step in development takes less time and (b) I'm, sadly, more immediately familiar with the minutiae of Star Trek because reruns were on every day after school when I was growing up, so I'm more able to notice trends and problems.

This isn't the finished product, but I thought it was interesting enough to warrant an interim blog post. All of the guest characters along the bottom appeared in one episode (except for a handful like those in both parts of The Menagerie and Harcourt Fenton Mudd, who appeared in two). Trelane (if you're too young for TOS, he's sort of like a proto-Q from TNG) has the most dialogue per episode of any TOS character, guest or regular (if you've seen the episode, this will not surprise you). The super-speed Scalosian Queen Deela is the female character with the most dialogue; in fact, most of the high-dialogue guest stars are antagonists. Edith Keeler is the largest Kirk-love-interest part (ah, Joan Collins in the '60s); in general, Kirk was attracted to women due to the size of things other than their vocabularies, it seems (sorry, sorry, couldn't resist).

I tend to think of TOS as an ensemble drama, but Kirk is really the only regular with more dialogue than most of the main guest stars. Kirk and Spock are the only characters who appear in all 79 episodes (McCoy is missing from one... I challenge you to leave a comment below saying which episode that is). Uhura is in more episodes than the rest of the supporting cast, but speaks less ("Hailing frequencies open, Captain" is only four words, after all). Interestingly, Yeoman Janice Rand has more dialogue per episode than any supporting character except Scotty, but she's way down the vertical axis because she was fired after 15 episodes, either (a) because they'd exhausted her flirtiness potential with Kirk, (b) because she was showing up to work drunk, or (c) because she objected to being sexually assaulted by a TV executive, depending on the version of events.

Finally, the Enterprise computer voice has slightly more words per episode than Nurse Chapel; they were voiced and played, respectively, by the same actress, Majel Barrett, beloved of Trek fans and of series creator Gene Roddenberry.

I got the scripts from; they appear to be fan-transcribed scripts (hey, in the '60s, that's all you could do. I myself made one in 1996 of my favorite X-Files episode, Jose Chung's From Outer Space). They're rather error-prone (as is to be expected), so if you want to see the gory details of how I cleaned them up and made the graph in Bokeh, check out this GitHub repo or go directly to this IPython notebook.
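For the curious, here is a minimal sketch of the kind of counting behind a plot like this one. It is not the code from the notebook: it assumes each spoken line in a transcript looks like `NAME: dialogue` (optionally `NAME [OC]: dialogue`), and the function names `words_per_character` and `words_per_episode` are made up for illustration. Real fan transcripts will need more cleanup than this.

```python
import re
from collections import Counter, defaultdict

# Assumed line format: an all-caps speaker name, an optional [tag], a colon, then speech.
SPEAKER_LINE = re.compile(r"^([A-Z][A-Z' .\-]*?)(?:\s*\[[^\]]*\])?:\s*(.+)$")


def words_per_character(script_text):
    """Return a Counter mapping speaker name -> words spoken in one transcript."""
    counts = Counter()
    for line in script_text.splitlines():
        match = SPEAKER_LINE.match(line.strip())
        if not match:
            continue  # stage directions, blank lines, anything not in NAME: form
        speaker, speech = match.groups()
        # Drop inline stage directions such as "(smiling)" before counting words.
        speech = re.sub(r"[\(\[][^\)\]]*[\)\]]", " ", speech)
        counts[speaker.strip()] += len(speech.split())
    return counts


def words_per_episode(scripts):
    """scripts: dict of episode title -> transcript text.

    Returns {speaker: (episodes with dialogue, mean words per such episode)}.
    """
    totals = Counter()
    episodes = defaultdict(set)
    for title, text in scripts.items():
        for speaker, n_words in words_per_character(text).items():
            totals[speaker] += n_words
            episodes[speaker].add(title)
    return {s: (len(episodes[s]), totals[s] / len(episodes[s])) for s in totals}
```

The two numbers returned per character correspond to the two axes discussed above: how many episodes a character speaks in, and how much they say per episode.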

Visualizing 10 unusual causes of death in the CDC mortality database

Let me make two things clear right up front:
  1. The metrics I used to decide what causes of death are unusual are purely subjective, i.e. which of the thousands of causes I skimmed through caught my eye and made me go, "Huh."
  2. It is in no way my intention to make fun of anyone's death. I find these causes of death unusual, not amusing.
Also: I am a great believer in reproducible data science, so as always, I've made available everything anyone would need to reproduce (or extend!) my results in an IPython notebook (nbviewer version or faster-loading html version) and this GitHub repo folder.

The U.S. Centers for Disease Control maintains a data service called WONDER (Wide-ranging OnLine Data for Epidemiologic Research); among its databases is the Compressed Mortality File tracking underlying cause of death from 1968 to 2012.

The causes of death are taken from the International Classification of Diseases (which contains an enormous number of causes of death that are not what I would call diseases, such as being struck by a train). It went through revisions in 1979 and 1999, so the categories do not match up cleanly through every year. For example, after 1978 "transvestitism" is no longer listed as a possible cause of death. (I'm not making this up. There are no deaths attributed to transvestitism in this database, but it's there in the schema, so perhaps it was assigned to someone before 1968.)

Tools used: Python (with pandas and plotly) and Photoshop.
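The basic move for every chart below is the same: pick out the cause-of-death labels containing a keyword, then total the deaths per year and per label. Here is a rough standard-library sketch of that step, not the pandas code from the notebook. The column names `year`, `cause` and `deaths` and both function names are assumptions of mine, not the actual WONDER export format.

```python
import csv
from collections import defaultdict


def load_rows(path):
    """Read a CSV export with the (assumed) columns: year, cause, deaths."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def deaths_by_year(rows, include, exclude=()):
    """Total deaths per (year, cause label).

    A label is kept when it contains `include` and none of the `exclude`
    phrases; all matching is case-insensitive.
    """
    include = include.lower()
    exclude = [phrase.lower() for phrase in exclude]
    table = defaultdict(int)
    for row in rows:
        label = row["cause"].lower()
        if include in label and not any(phrase in label for phrase in exclude):
            table[(int(row["year"]), row["cause"])] += int(row["deaths"])
    return dict(table)
```

The `exclude` argument is there for cases like the animal chart, where a plain keyword match also drags in unrelated vehicle-collision categories. Keeping the full label in the key, rather than collapsing everything into one total, is what lets the revision boundaries show up as changes in wording.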

1. Dental caries

If my dentist had told me cavities could result in death, I might have flossed more often. We can see the change in cause-of-death definitions, as the categories from 1999 on have slightly different wording.

2. Weather or storm

Here we see even more clearly the divide between cause of death classifications. I don't know what they called these deaths before 1979, or what the big event was in 1980. I wonder about 2005; could it be Hurricane Katrina? The ICD-10 lists 'hurricanes' as a separate cause of death, but you always have to allow for human error in assigning these categories.

3. Migraine

A good friend of mine has her life encompassed by her migraines; I had no idea they could result in death. You can see that after the 1979 revision, increased knowledge of this condition led to parsing into further categories.

4. Spacecraft

This one I find somewhat puzzling. Since the database starts in 1968, we just miss the Apollo 1 fire the year before, but what about the seven deaths aboard the space shuttle Challenger in 1986?

5. Conjunctivitis

I had conjunctivitis, or 'pinkeye', at least once as a child, and it was no big deal, so I was bemused at the first season South Park episode in which the entire town is so afraid of pinkeye, they confuse it with zombification. (As an aside, this was the first time I'd ever heard of pirated media, as a coworker of mine downloaded it off Usenet in 1997.) Turns out it can be deadly, and there are many, many categories of conjunctivitis deaths. (The graph's defaults don't have enough different colors to differentiate them all, but I think the forest matters more than the trees here.)

6. Cleft palate or cleft lip

I find it encouraging that the death rate for this condition appears to have gone down. My dad grew up in the '50s with a girl with a "hare lip", as he called it, and hearing stories about it as a kid I felt so bad for her. Had I known it was a cause of death (and quite a more substantial one than many others on this list), my secondhand suffering would have been even worse.

7. Elbow

I tried to think of the most unlikely part of the body to result in death. Here it is. If you're wondering why there are six causes of death in the legend and four in the graph, it means two had no entries during the time period. Also, the cause 'Of elbow' means it was a subgroup of a supergroup that does not appear in the database I downloaded (I could have downloaded the supergroup fields, but I didn't; the file was huge enough already).

I'm assuming there was no outbreak of elbow deaths in the '80s and '90s, and that the higher bars are due to differences in criteria of classification. 'Enthesopathy' (a disorder of bone attachments) only appears in 1979, and its diagnosis drops down for all bones in 1999. If you're curious, you can see the graph in my gist notebook.

8. Animal

This one was a little tricky. There are 13 categories of vehicle occupants in collision with 'pedestrian or animal' to remove, and then I thought to check specific animals like dogs and bees (cat fanciers will be happy to know there is no category devoted to death by feline, and yes, bees are animals.)

9. Ingrown nail

I actually had a pretty badly infected ingrown toenail as a kid. Still, it appears my odds were pretty good, as there are fewer than two deaths per decade attributed to it.

10. War

You wouldn't think war would be an unusual cause of death, the world being what it is, but I find the low numbers attributed to it unusual. There's absolutely no increase when the Iraq war starts in 2003. Make of it what you will.

Most decade-specific words in Billboard popular song titles, 1890-2014

Chart first, then explanation (click to enlarge):
The inspiration for this post came from my being too lazy to set my iPod to shuffle, and then noticing it played a bunch of songs in a row from the 1930s and '40s that started with the letters "in" ("In the Wee Small Hours of the Morning," "In the Still of the Night", etc.) Naturally, being a data nerd, my first thought was to quantify the phenomenon.

The data comes not from Billboard itself, but from; I don't know much about the data source, but it certainly looks thorough and painstaking, and up to date. If you'd like to know a little more about my methodology (like a quick explanation of the metric, "keyness"), see the code I used and/or see the actual songs that correspond to these words, head on over to my other, nerdier blog, prooffreaderplus.
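Before any keyness can be calculated, the titles have to be binned by decade and split into words. Here's a bare-bones sketch of that step. It is not the code from the other blog, and it makes two choices of my own for illustration: a word counts once per title, and hyphenated words like "red-nosed" stay in one piece.

```python
import re
from collections import Counter, defaultdict

# Words may contain internal apostrophes or hyphens ("don't", "red-nosed").
WORD = re.compile(r"[a-z]+(?:['-][a-z]+)*")


def decade_of(year):
    """1934 -> 1930, 2014 -> 2010."""
    return year // 10 * 10


def words_by_decade(songs):
    """songs: iterable of (year, title) pairs.

    Returns {decade: Counter mapping word -> number of titles containing it}.
    """
    counts = defaultdict(Counter)
    for year, title in songs:
        words = WORD.findall(title.lower())
        # set() so that a title repeating one word over and over counts once.
        counts[decade_of(year)].update(set(words))
    return counts
```

Each decade's counter can then be compared against all the other decades combined to find the words that are unusually typical of it.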

Observations about the results:
  • The 2010s seem both more vulgar ("hell" and "fuck") and more inclusive ("we" instead of the "you", "ya" and "u" of the 1990s and 2000s).
  • The 1990s and 2000s were the decades of neologisms, with "U", "Ya" and "Thang". "U" was so popular it occurred twice (but see the note on decade-binning on prooffreaderplus.)
  • Fun! Lots of the decades can be made into intelligible five-word sentences. For example: "Hell Yeah, We Die, Fuck!" (2010s). "Ya Breathe It Like U" (2000s), "You Get Up, U Thang" (1990s), "Don't Rock On Fire, Love" (1980s), "Sing, Moon, In A Swing" (1930s)
  • As anyone who listens to the radio in December knows, all the Christmas songs are oldies, and that shows in the results for the 1950s, with "Christmas" and "Red-nosed".
  • You can track genres with the keywords: "Rag" (1910s), "Blues" (1920s), "Swing" (1930s), "Boogie", "Polka" (1940s), "Mambo" (1950s), "Twist" (1960s), "Disco" (1970s), "Rock" (1970s and 1980s). After that, people realized you don't have to actually name the genre in the song title, people can figure it out by listening. (N'Sync must not have gotten that memo for 2001's "Pop".)
  • Who knew Billboard song rankings went back to the 1890s? It was a surprise to me. That fact, and the fact that there are fewer songs then, but not so few as to be negligible, influenced a lot of the choices in how I presented this data (read more here if you want). But those early decades seem to be more focused on first names ("Michael", "Reuben", "Casey") and familial relationships ("Uncle", "Mammy").
  • The first two decades -- the oldest ones compared to now -- both have the keyword "old". I blame time travel.
  • I find it interesting that there are short, common articles, adverbs, prepositions and pronouns in the list; these have a higher bar for keyness, since they're present in other decades: "When" (1900s), "A" (1930s), "In" (1930s), "On" (1980s), "Up" (1990s), "It" (2000s)
Now if you'll excuse me, I'm going to hunt through my iPod to see if there's even one song with "gems" in the title; it seems to have been popular in the 1910s.

Projections of White Christmases until the year 2100, based on a climate model

Below is a climate model projection of what areas of North America will be snow-covered on December 25 of each year between 2014 and 2100:

A few things should be pointed out. First and foremost is:

  1. Further to point #1 above, the point of this kind of climate model is not to accurately predict the weather every single day for 87 years, even though that's what the model contains. The point is to experiment, and experimental science is built on prediction. Evaluating those predictions makes for better models down the road. I'm no climatologist, so I'll let the Oregon Climate Change Research Institute explain Why We Use Climate Models.
  2. In the map above, white is 100% snow coverage, and the white becomes more and more transparent at the fringes from 99% to 1% snow coverage, until the bare background is 0% snow coverage. The resolution of the climate model is only 0.44 degrees, so the fit isn't exact at the coastlines.
  3. The data is from the Canadian Centre for Climate Modelling and Analysis, hosted at Environment Canada. The exact model is the Fourth Generation Canadian Regional Climate Model (CanRCM4), RCP 8.5.
  4. That global warming kinda sneaks up on you, doesn't it? It's gradual, but when it loops back down to 2014, it's pretty obvious. I imagine people in Grande Prairie, Alberta are looking forward to the end of the 21st century.
  5. Here is a post on my other, nerdier blog about how to make maps in Python based on the CCCma's NetCDF files. There are plenty of examples out there on plotting these files, but not with the format CCCma uses.
  6. There's also code on my GitHub, with links to nbviewer notebooks.
  7. Tools used: Python with IPython, netCDF4, Matplotlib, Basemap and PIL; Photoshop; Gfycat.
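The transparency scheme in point 2 (opaque white at 100% snow coverage, fading to the bare background at 0%) can be sketched as a tiny pixel-mapping function. This is an illustration only: I'm assuming the fade is linear, and `snow_rgba` and `snow_layer` are hypothetical names, not functions from the actual map code.

```python
def snow_rgba(coverage_percent):
    """Map snow coverage (0 to 100 percent) to a white RGBA pixel.

    Opacity is assumed to scale linearly with coverage: 0 percent is fully
    transparent, 100 percent is fully opaque. Out-of-range values are clamped.
    """
    clamped = max(0.0, min(100.0, float(coverage_percent)))
    alpha = round(255 * clamped / 100.0)
    return (255, 255, 255, alpha)


def snow_layer(grid):
    """Convert a 2-D grid (list of rows) of coverage values into RGBA pixels."""
    return [[snow_rgba(cell) for cell in row] for row in grid]
```

A layer built this way can be composited over a base map, one layer per year, so that partly covered grid cells at the fringes show the background through the white.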

An Introduction to Data Visualization

An introduction to the practice of data visualization, with theory, examples, and good humor.

This is a studio rerecording of a presentation I was asked to give to McGill University graduate students from many disciplines in Montreal, Canada in November 2014.

Here's a link to the slideshare version if you'd rather read than listen to me.

I give shoutouts to Alberto Cairo, Nathan Yau and Edward Tufte, without whom I'd be much less well informed.

Visualizing word and letter frequencies in Gadsby, a novel without the letter 'e'

In 1939, Ernest Vincent Wright published the novel Gadsby (gee, I wonder where he came up with that name...), 58,124 words (by my count), none of which contain the letter 'e'.

The cover is actually more colorful than the plot.

Here are a few of the features of the English language Wright was deliberately ruling out by avoiding its most common letter:
  • "The": the most common word in English (about 5% of all words in most books);
  • The pronouns "he", "she", "we", "they", "me", "her", "them";
  • The common functional words "when", "where", "these", "those", "every";
  • Most past-tense verbs, "walked", "went", "loved";
  • "Sleeplessness". Hey, I like that word.
The copyright to Gadsby expired because Wright's estate didn't apply for a renewal, so you can find the entire lipogram (that's the term for this kind of writing) here or here. I tried to read it, but it was just too difficult. Not entirely due to the missing letter, but because it's really, really uninteresting. You want a good lipogram, try A Void (which I couldn't find an electronic version of to analyze), a lipogrammatic translation of a French lipogrammatic novel. How impressive is that?

As a data-centric sort of fellow, my immediate thought was to wonder how this constraint affected the word and letter frequency compared to 'normal' English. Obviously (I posited, correctly), word choice would be affected much more profoundly. On reading it, I saw there were a lot of Anglo-Saxon words and irregular verbs ("said", "had", "was", etc.). So, using Python's Natural Language Toolkit, I calculated word frequencies and compared them to the Brown corpus -- after I'd removed every word containing the letter 'e' from the latter. I used the standard technique of Log Likelihood keyness (basically, it's the confidence that a difference in frequency is 'real' instead of random) to determine the significance of word frequency differences (I've put the frequencies and comparisons using different metrics of all 3934 unique words in Gadsby in a Google Doc if you're interested):

The overrepresented words contain character names, but also "big" (as a replacement for "large"? "enormous"?) and "folks" ("people"?). The underrepresented words are the ones I found interesting, however: "of", "to" and "in" are very common words in English, and to have their usage reduced that much implies that even though they do not contain the letter "e", they are used in tandem with words containing "e" -- such as "the". So I analyzed how often each word has a neighboring "e"-word in Brown, and made a quasi-volcano plot (the area of the circles is the frequency in Gadsby):

You can see there's a palpable tendency for the over- and underrepresented words to be adjacent to "e"-containing words, whereas in that mishmash in the middle (words that have comparable frequencies in Gadsby and Brown), the probability of e-adjacency is far more spread out.
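For anyone who wants the log-likelihood keyness spelled out, here is a minimal sketch of the common two-corpus form of the statistic: observed counts compared against the counts expected if both corpora used the word at the same rate. I'm assuming this textbook variant. The function names are mine, and the notebook may differ in details such as smoothing or the sign convention.

```python
import math


def log_likelihood(count_a, total_a, count_b, total_b):
    """G2 statistic for one word.

    count_a, count_b: occurrences of the word in corpus A and corpus B.
    total_a, total_b: total number of words in each corpus.
    """
    combined_rate = (count_a + count_b) / (total_a + total_b)
    expected_a = total_a * combined_rate
    expected_b = total_b * combined_rate
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def signed_keyness(count_a, total_a, count_b, total_b):
    """Positive when the word is overrepresented in A, negative when under."""
    g2 = log_likelihood(count_a, total_a, count_b, total_b)
    overused_in_a = count_a / total_a >= count_b / total_b
    return g2 if overused_in_a else -g2


# Example call using figures quoted in this post: "big" occurs 297 times in
# Gadsby (58,124 words) against 32 in Brown-without-e scaled to that length.
big_keyness = signed_keyness(297, 58124, 32, 58124)
```

A word used at exactly the same rate in both corpora scores zero. The bigger the gap in either direction, the larger the magnitude, which is what spreads words out along the horizontal axis of a volcano-style plot.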

How much of this is simply due to the word "the"? Here's a volcano-ish plot restricting the analysis to frequency of "the"-adjacence in the full Brown:

We basically see the same pattern, but lower down the graph because we're using a more restrictive metric. The spike in the top middle shows that words that are often "the"-adjacent in Brown are, unsurprisingly, rare in Gadsby.
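The adjacency measure behind both plots boils down to this: for each word, what share of its occurrences has a "trigger" word immediately before or after it? A short sketch follows, with the trigger test passed in so the same function covers both the any-"e"-word version and the "the"-only version. It treats the corpus as one flat list of tokens and ignores sentence boundaries and punctuation, which is a simplification on my part; `adjacency_share` is a made-up name.

```python
from collections import Counter


def adjacency_share(tokens, is_trigger=lambda token: "e" in token):
    """Return {word: fraction of its occurrences next to a trigger token}.

    A token counts as adjacent when the token right before it or right
    after it satisfies `is_trigger`. Tokens are lowercased first.
    """
    tokens = [token.lower() for token in tokens]
    flags = [is_trigger(token) for token in tokens]
    seen = Counter()
    adjacent = Counter()
    for i, word in enumerate(tokens):
        seen[word] += 1
        before = i > 0 and flags[i - 1]
        after = i + 1 < len(tokens) and flags[i + 1]
        if before or after:
            adjacent[word] += 1
    return {word: adjacent[word] / seen[word] for word in seen}


# The more restrictive metric: only the word "the" counts as a trigger.
def the_adjacency(tokens):
    return adjacency_share(tokens, is_trigger=lambda token: token == "the")
```

Plotting this share against each word's keyness gives the vertical axis of the quasi-volcano plots described above.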

Letter frequencies

Well, that was fun. Next item: what happens to letter frequencies (again, here's a Google Doc)? Let's compare Gadsby to Brown-without-e-words:

That was unexpected (by me, anyway). The only vowel to be more frequent in Gadsby is the relatively little-used "u"! The others, "a", "i", and "o", are all less used in Gadsby. It appears that the slack must be taken up by moderate-frequency consonants. Let's have a look at the log-likelihoods:

I would never have predicted "g" and "f" to be the biggest winner and loser, respectively! Here are the top 10 g-containing words:

  Rank  Word    Gadsby freq.  Brown freq.*
    26  gadsby      364             0
    32  big         297            32
    59  young       187            35
    74  good        129            73
    84  got         113            44
    86  long        108            68
    87  girls       106            13
    90  go          104            57
    92  girl        100            20
    94  right        99            56

* Brown without 'e'-containing words,
normalized to same length as Gadsby.

Of course, there's the main character, Mr. Gadsby himself (interesting aside: Wright never calls him that, because even though "Mr" contains no "e", it's short for a word that does, and that would be cheating). Now let's see the "f"-containing words in the Brown corpus:

 Rank  Word    Frequency
  2    of        36406
  9    f         12431
  14   for        9485
  39   from       4370
  64   if         2199
  88   first      1359
 102   after      1070
 107   before     1011
 144   life        709
 157   off         637

Interestingly, most of these words do not contain "e", but they are often "e"-adjacent; "of", for example, is preceded or followed by an "e"-containing word 85.7% of the time in Brown.
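The letter comparison rests on three small steps: drop the "e"-containing words from the reference corpus, count letters, and scale the reference counts to the size of Gadsby so the two are comparable. Here is a sketch of those steps with hypothetical helper names. The table footnote only says the word counts were normalized to the same length, so applying the same scaling to letter counts is my assumption.

```python
from collections import Counter


def drop_e_words(words):
    """Keep only the words that contain no letter 'e' (case-insensitive)."""
    return [word for word in words if "e" not in word.lower()]


def letter_counts(words):
    """Count the letters a to z across all words, ignoring case and punctuation."""
    counts = Counter()
    for word in words:
        counts.update(ch for ch in word.lower() if "a" <= ch <= "z")
    return counts


def scale_to(counts, target_total):
    """Rescale counts proportionally so that they sum to target_total."""
    total = sum(counts.values())
    return {key: value * target_total / total for key, value in counts.items()}
```

With the reference counts rescaled to Gadsby's total, each letter's two counts can go straight into a keyness calculation, one letter at a time.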

Okay, let me us try this: I hope am hoping that you enjoyed liked did cotton to this blog post; it was an intriguing subject topic to look at, for its linguistic traits. Gadsby is a fun study!

Whew! That was exactly as hard as it looked!

Methodology: here is a Github repo of the analysis, and nbviewer docs of the word and letter frequency analyses. Tools: Python with IPython, NLTK, Pandas, Matplotlib, Seaborn and; Microsoft Excel; Adobe Photoshop