Use of the f-word in Eddie Murphy: Delirious

I had planned to take a break from blogging during the holidays, but today I saw a post in reddit's dataisbeautiful subreddit about the use of the f-word in movies, and I was inspired. The highest-ranked movie on the list that I had actually seen was Eddie Murphy: Delirious. I was 13 when it came out, but nobody I knew had HBO, so my best friend and I had to wait till it showed up in the Betamax tape rental place. We made a lo-fi audio recording (a microphone held up to the TV speaker), soon had it memorized, and spent several years quoting it in all sorts of inappropriate situations.

So, let's break down the use of the f-word during the movie. (I admit, I'm being a total wuss: Google hosts this blog and I'd rather not deal with any automated fallout from using profanity, so I'm going to asterisk out all the naughty words.) Some simple poor man's calculus (for each use of the word at time x, y equals the inverse of the average time gap to the previous and next use) shows the clustering of swearing during different parts of the film:

It would be great to know what parts of the movie those clusters correspond to: if you go to the bottom of the post, there's a reversed version of the graph that allows you to see the dialogue (lightly Bowdlerized, again, I'm sorry) line by line.
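For anyone who wants to try the "poor man's calculus" at home, here is a minimal sketch of one reading of it: y is taken as one over the mean of the gaps to the neighbouring uses. The function name, the handling of the first and last use (only one neighbour each), and the sample timestamps are all assumptions for illustration, not the code or data behind the graph.

```python
def local_density(times):
    """Return (time, density) pairs, where density = 1 / mean gap to neighbours."""
    times = sorted(times)
    result = []
    for i, t in enumerate(times):
        gaps = []
        if i > 0:
            gaps.append(t - times[i - 1])      # gap to the previous use
        if i < len(times) - 1:
            gaps.append(times[i + 1] - t)      # gap to the next use
        mean_gap = sum(gaps) / len(gaps) if gaps else 0.0
        # Assumption: with no neighbours, or a zero gap, fall back to 0.0
        # rather than dividing by zero.
        density = 1.0 / mean_gap if mean_gap > 0 else 0.0
        result.append((t, density))
    return result


# Made-up timestamps in seconds, purely for illustration.
for t, d in local_density([0, 10, 12, 30]):
    print(t, round(d, 3))
```

Tightly packed uses give small gaps and therefore tall spikes, which is what makes the clusters stand out.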

I've been learning how to do Natural Language Processing in Python, and while I didn't bring out the big guns, I thought it would be interesting to look at some of the simple patterns in word use in the movie:

Normally I would use a stop list to remove common words like "the" and "and", and a corpus to compare word frequencies, but I think the raw data is the most informative perspective, showing how the profanity rivals the most common syntactic words in Delirious. Here are the top N-grams (words that appear side-by-side):
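For readers who want to reproduce counts like these on their own transcript, here is a minimal sketch using only the standard library. The tokenizer (lowercase letters, apostrophes, and asterisks so that censored words stay in one piece) and the function names are assumptions for illustration; this is not necessarily how the charts in this post were made.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase tokens; '*' is kept so censored words survive."""
    return re.findall(r"[a-z'*]+", text.lower())


def top_ngrams(tokens, n, k=10):
    """Return the k most common runs of n adjacent tokens, with their counts."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams).most_common(k)


# Toy sentence, not a line from the movie.
tokens = tokenize("The cat and the cat and the dog")
print(top_ngrams(tokens, 1, 3))   # single words
print(top_ngrams(tokens, 2, 3))   # pairs of adjacent words
```

No stop list is applied, which matches the raw-data approach described above.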

I'm a contributor to the FullMovieGifs subreddit, so I couldn't resist the temptation to make one of Delirious. Hopefully Google doesn't OCR these things; if you want to see it larger, click on it.

Finally, here's a big, vertical version of the first graph in this post, which you can mouseover to read the lines of dialogue (is it still called dialogue when only one person's talking?) to your heart's content. If you can't see a really huge graph right underneath this sentence, click here to see it.

I think I'll be hearing from my mom about this post.
Update Jan. 1, 2014: Whaddaya know, my mom was fine with it.

Tuesday, December 17, 2013

Weight of small change in USA, EU, UK and Canada

I'm interested in how much the SMALL change weighs; I don't want to get into the dollar bill/coin debate.
The graph is interactive, so feel free to click and hover. [Blogger seems to be finicky with JavaScript; if you don't see a big interactive graph right underneath this sentence, click here.]
Bottom line: Brits need good pockets.

I started learning JavaScript a couple of months ago, and I'm comfortable enough to be able to lean heavily on a package and wrangle the API to give me what I want. Today Highcharts, tomorrow D3.js!
Coin weights are taken from Wikipedia.
If anyone prefers to see a simple non-interactive image, click on this:

Wednesday, December 11, 2013

Word clouds of the human genome: most and least frequent words

I work in genomics, so I thought it was time to geek out this week.

The words are taken from the reference human genome annotations in UniProt; I wanted to use the same font as the UniProt logo, but I couldn't find it, so I went with a similar Bauhaus-style font that had a serendipitous name (and was free): Monoglyceride, by Tepid Monkey Fonts.

Instead of limiting the cloud to the presently confirmed 20,272 genes, I used all 69,049 annotations (which still, BTW, covers less than 7% of the genome). You can see both data sets here; the difference between them is not dramatic, except that the unreviewed set contains the word "fragment" over 1000 times more often.

The annotation contains 505,128 "words" (any run of contiguous letters counts as a word; punctuation and numbers are removed), of which 14,689 are unique. The frequency is very unequally distributed: the top two words, "protein" and "fragment", take up almost 16% of the total words, while words that appear only once (we call them "hapax legomena" in text mining) are 40% of the number of unique words. (I made some histograms, but they're not that interesting to look at; maybe I was just unsuccessful at figuring out a way to communicate the distribution clearly.)
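The counting rules above are simple enough to sketch in a few lines of Python. This is a minimal illustration, not the script used for the word clouds: the function name is made up, the sample string is invented rather than taken from UniProt, and folding everything to lowercase is an assumption the description above does not confirm.

```python
import re
from collections import Counter


def vocabulary_stats(text):
    """Count words (runs of letters only) and summarize how skewed they are."""
    # Assumption: case is ignored, so "Protein" and "protein" are one word.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    total, unique = len(words), len(counts)
    hapaxes = sorted(w for w, c in counts.items() if c == 1)
    top_two = sum(c for _, c in counts.most_common(2))
    return {
        "total": total,
        "unique": unique,
        "hapaxes": hapaxes,
        "hapax_share_of_unique": len(hapaxes) / unique if unique else 0.0,
        "top_two_share_of_total": top_two / total if total else 0.0,
    }


# Invented annotation-like text, for illustration only.
sample = "Protein kinase 2 (Fragment); protein fragment, haponin-like protein"
print(vocabulary_stats(sample))
```

Because digits and punctuation act as separators, something like "haponin-like" counts as two words and "2" disappears entirely.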

Among the hapaxes are the following words, which I picked out for no other reason than they caught my eye as incongruous in some way:

My favourite one is "haponin." Now when someone asks me, "Hey, man, what's haponin?" I can respond, "It's a protein similar to Human Leukemia Differentiation Factor but without nuclease activity."

Thursday, December 5, 2013

Apparently it's a controversial... area.

I'm a Canadian. I'm proud to be a Canadian. I'm proud of my fellow Canadians. But gee whiz, we can sure be sensitive sometimes.

In my post two weeks ago, I pointed out how the Mercator projection exaggerates the surface area of Canada. Map-lovers loved my post; Canadians hated it. Many seemed to think I was trying to cast aspersions on Canada's proud place as the world's second-largest country.

Far from it. But as big a fan of Canada as I am, I'm also a fan of the truth: it's a tight race for second place. If we lost Labrador, we'd drop to fourth. This fact is rather disguised by the Mercator projection:

Poor China, they really get the shaft: they drop a place and end up looking a third as big in relation to their Russian neighbours. The United States is partially buffered from this ignominious fate by Alaska.
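For the curious, the size of the distortion follows from a standard property of the Mercator projection: at latitude φ, areas are inflated by a factor of 1/cos²(φ) relative to the equator. Here is a minimal sketch; the function name and the sample latitudes are mine, picked for illustration, and since real countries span a range of latitudes a single number per latitude is only a rough guide.

```python
import math


def mercator_area_inflation(latitude_deg):
    """Local area exaggeration of the Mercator projection, relative to the equator."""
    return 1.0 / math.cos(math.radians(latitude_deg)) ** 2


# Illustrative latitudes only: the equator, a parallel that crosses central
# China, and one that crosses both Canada and Russia.
for lat in (0, 35, 60):
    print(lat, round(mercator_area_inflation(lat), 2))
```

At 60 degrees north the factor is already about 4, which goes a long way toward explaining why the northern countries look so enormous next to their more southerly rivals.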

I understand why Google Maps uses Mercator: having north, south, east and west perfectly correspond to the edges of the map is handy when you're giving directions. Plus equal-area projections (there are many different ones, because there are many different ways to do this) just look weird, with their elongated, phallic Africa:

There are hybrid projections that do a pretty good compromise, but most of the best ones aren't rectangular, and that can be inconvenient if you don't happen to have, say, a hexagonal iPhone screen.

There. Go Canada. You're big, but my favourite fact about you is that you share the world's longest undefended border. Well, that and the fact that we have the world's largest island inside a lake inside an island inside a lake.

Oh, one more thing: Relevant xkcd.

