Monday, March 30, 2015

Baby Boom: An Excel Tutorial on Analyzing Large Data Sets

tl;dr: I wrote a data science tutorial for Excel for the good folks at Udemy: click here!

The usual progression I've seen in data science is the following:

  1. Start out learning data analysis with Microsoft Excel
  2. Switch to a more powerful analysis environment like R or Python
  3. Look down one's nose at everybody still using Excel
  4. Come to realize, hey, Excel's not so bad

I'll admit, I was stuck at Step 3 for a few weeks, but luckily I got most of my annoying pooh-poohing (if you're not a native English speaker, that expression might not mean what you think it means) out of my system decades ago when I was a proofreader (hence my nickname, if you were curious).

I think most mature data scientists see Excel as an essential and useful part of the ecosystem. The way it brings you so close to your raw data is essential in the early stages for developing data literacy, and later on, when you're munging vectors and dataframes, it can still be useful to fire up a .csv and have a look-see with no layers of abstraction above it.

Feedback is welcome. I'm not involved with the rest of the Excel course, but I have taken the Complete Web Developer course from Udemy and recommend it. I get absolutely no money for referrals or anything like that (or for page visits for my tutorial for that matter), so this is honest, cross my heart.

Monday, March 23, 2015

Dialogue plot of Star Trek: The Original Series

First, the plot. Hover over the points to see the character names.

Why Star Trek? Well, I'm working on an in-depth analysis of all of Shakespeare's plays, so I'm vetting my method on Star Trek because (a) the size of the corpus is much smaller so each step in development takes less time and (b) I'm, sadly, more immediately familiar with the minutiae of Star Trek because reruns were on every day after school when I was growing up, so I'm more able to notice trends and problems.

This isn't the finished product, but I thought it was interesting enough to warrant an interim blog post. All of the guest characters along the bottom appeared in one episode (except for a handful, like those in both parts of The Menagerie, and Harcourt Fenton Mudd, who appeared in two). Trelane (if you're too young for TOS, he's sort of like a proto-Q from TNG) has the most dialogue per episode of any TOS character, guest or regular (if you've seen the episode, this will not surprise you). The super-speed Scalosian Queen Deela is the female character with the most dialogue; in fact, most of the high-dialogue guest stars are antagonists. Edith Keeler is the largest Kirk-love-interest part (ah, Joan Collins in the '60s); in general, Kirk was attracted to women due to the size of things other than their vocabularies, it seems (sorry, sorry, couldn't resist).

I tend to think of TOS as an ensemble drama, but Kirk is really the only regular with more dialogue than most of the main guest stars. Kirk and Spock are the only characters who appear in all 79 episodes (McCoy is missing from one... I challenge you to leave a comment below saying which episode that is). Uhura is in more episodes than the rest of the supporting cast, but speaks less ("Hailing frequencies open, Captain" is only four words, after all). Interestingly, Yeoman Janice Rand has more dialogue per episode than any supporting character except Scotty, but she's way down the vertical axis because she was fired after 15 episodes, either (a) because they'd exhausted her flirtiness potential with Kirk, (b) because she was showing up to work drunk, or (c) because she objected to being sexually assaulted by a TV executive, depending on the version of events.

Finally, the Enterprise computer voice has slightly more words per episode than Nurse Chapel; they were voiced and played, respectively, by the same actress, Majel Barrett, beloved of Trek fans and of series creator Gene Roddenberry.

I got the scripts from; they appear to be fan-transcribed scripts (hey, in the '60s, that's all you could do. I myself made one in 1996 of my favorite X-Files episode, Jose Chung's From Outer Space). They're rather error-prone (as is to be expected), so if you want to see the gory details of how I cleaned them up and made the graph in Bokeh, check out this GitHub repo or go directly to this IPython notebook.
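
For a rough idea of what the word counting looks like, here is a minimal sketch. It is not the code from the notebook. It assumes each line of dialogue in a transcript looks like `SPEAKER: words`, with stage directions in square brackets or parentheses; that format, the `words_per_speaker` helper, and the sample lines are all made up for illustration.

```python
import re
from collections import Counter

# Assumed line format: "SPEAKER: dialogue", optionally "SPEAKER [OC]: dialogue".
# Real fan transcripts are messier than this and need extra cleanup.
SPEAKER_LINE = re.compile(r"^([A-Z][A-Z' .-]*?)(?:\s*\[[^\]]*\])?:\s*(.*)$")
STAGE_DIRECTION = re.compile(r"\[[^\]]*\]|\([^)]*\)")


def words_per_speaker(transcript):
    """Count spoken words per character in one transcript (hypothetical helper)."""
    counts = Counter()
    for raw in transcript.splitlines():
        match = SPEAKER_LINE.match(raw.strip())
        if not match:
            continue  # scene headings, blank lines, anything without a speaker
        speaker, speech = match.groups()
        # Drop stage directions so they are not counted as dialogue
        speech = STAGE_DIRECTION.sub(" ", speech)
        counts[speaker.strip()] += len(speech.split())
    return counts


if __name__ == "__main__":
    # Invented sample lines, not taken from any actual transcript
    sample = "\n".join([
        "[Bridge]",
        "UHURA: Hailing frequencies open, Captain.",
        "KIRK [OC]: Kirk here.",
        "KIRK: Thank you, Lieutenant. (sits) Ahead warp one.",
    ])
    for speaker, n_words in words_per_speaker(sample).most_common():
        print(speaker, n_words)
```

Dividing each character's total by the number of episodes they appear in would give a words-per-episode figure like the one in the plot, though the real scripts need a lot more cleanup before the speaker names line up.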

