Saturday, December 28, 2013

Use of the f-word in Eddie Murphy: Delirious

I had planned to take a break from blogging during the holidays, but today I saw this post on reddit about the use of the f-word in movies in the dataisbeautiful subreddit, and I was inspired. The top movie on the list I had seen was Eddie Murphy: Delirious; I was 13 when it came out, but nobody I knew had HBO, so my best friend and I had to wait till it showed up in the Betamax tape rental place. We made a lo-fi audio recording (a microphone held up to the TV speaker), and soon had it memorized and spent several years quoting it in all sorts of inappropriate situations.

So, let's break down the use of the f-word (I admit, I'm being a total wuss, Google hosts this blog and I'd rather not deal with any automated fallout from using profanity, so I'm going to asterisk out all the naughty words) during the movie. Some simple poor man's calculus (for each use of the word at time x, y equals the inverse of the average of the times of the previous and next use) shows the clustering of swearing during different parts of the film:
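For anyone who wants to try the same trick, here's a minimal Python sketch of one way to read that formula. It assumes "the average of the times" means the average gap, in seconds, between a use and its neighbours on either side; the function name `local_density` and the timestamps are made up for illustration and are not the data behind the graph.

```python
def local_density(times):
    """Return (time, density) pairs for every use that has a neighbour on both sides.

    Density is 1 / (mean gap to the previous and next use), so tightly
    clustered uses score high. Assumes `times` is sorted and has no duplicates.
    """
    density = []
    for prev, cur, nxt in zip(times, times[1:], times[2:]):
        mean_gap = ((cur - prev) + (nxt - cur)) / 2  # same as (nxt - prev) / 2
        density.append((cur, 1 / mean_gap))
    return density


# Hypothetical timestamps in seconds: a burst, a long pause, another burst
example_times = [10.0, 12.0, 13.0, 60.0, 61.0, 62.0]
example_density = local_density(example_times)
```

The first and last uses are dropped because they only have one neighbour; a real version would need to decide how to handle those edges.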

It would be great to know what parts of the movie those clusters correspond to: if you go to the bottom of the post, there's a reversed version of the graph that allows you to see the dialogue (lightly Bowdlerized, again, I'm sorry) line by line.

I've been learning how to do Natural Language Processing in Python, and while I didn't bring out the big guns, I thought it would be interesting to look at some of the simple patterns in word use in the movie:

Normally I would use a stop list to remove common words like "the" and "and", and a corpus to compare word frequencies, but I think the raw data is the most informative perspective, showing how the profanity rivals the most common syntactic words in Delirious. Here are the top N-grams (words that appear side-by-side):
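Counting N-grams doesn't need the big guns either. Here's a rough sketch using only the Python standard library; the tokenizing rule (lowercase runs of letters and apostrophes) and the function name `top_ngrams` are assumptions for illustration, not necessarily how the tables were produced. As in the post, no stop list is applied.

```python
import re
from collections import Counter


def top_ngrams(text, n=2, k=5):
    """Return the k most frequent runs of n side-by-side words in text."""
    # Assumed tokenizer: lowercase, keep letters and apostrophes only
    words = re.findall(r"[a-z']+", text.lower())
    # Zip the word list against shifted copies of itself to form n-grams
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams).most_common(k)


# Made-up sample text, not the movie transcript
sample = "a b a b c"
top_pairs = top_ngrams(sample, n=2, k=1)
```

Setting `n=1` gives plain word frequencies, and `n=3` gives trigrams.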

I'm a contributor to the FullMovieGifs subreddit, so I couldn't resist the temptation to make one of Delirious. Hopefully Google doesn't OCR these things; if you want to see it larger, click on it.

Finally, here's a big, vertical version of the first graph in the blog, which you can mouseover to read the lines of dialogue (is it still called dialogue when only one person's talking?) to your heart's content. If you can't see a really huge graph right underneath this sentence, click here to see it.

I think I'll be hearing from my mom about this post.
Update Jan. 1, 2014: Whaddaya know, my mom was fine with it.

Tuesday, December 17, 2013

Weight of small change in USA, EU, UK and Canada

I'm interested in how much the SMALL change weighs; I don't want to get into the dollar bill/coin debate.
The graph is interactive, feel free to click and hover. [Blogger seems to be finicky with javascript; if you don't see a big interactive graph right underneath this sentence, click here.]
Bottom line: Brits need good pockets.

I started learning javascript a couple of months ago, and I'm comfortable enough to be able to lean heavily on a package and wrangle the API to give me what I want. Today Highcharts, tomorrow, D3.js! Coin weights are taken from Wikipedia.
If anyone prefers to see a simple non-interactive image, click on this:

Wednesday, December 11, 2013

Word clouds of the human genome: most and least frequent words

I work in genomics, so I thought it was time to geek out this week.

The words are taken from the reference human genome annotations in UniProt; I wanted to use the same font as the UniProt logo, but I couldn't find it, so I went with a similar Bauhaus-style font that had a serendipitous name (and was free): Monoglyceride, by Tepid Monkey Fonts.

Instead of limiting the cloud to the presently confirmed 20,272 genes, I used all 69,049 annotations (which still, BTW, cover less than 7% of the genome). You can see both data sets here; the difference between them is not dramatic, except that the unreviewed set contains the word "fragment" over 1000 times more often.

The annotation contains 505,128 "words" (any combination of contiguous letters is considered a word, punctuation and numbers are removed), and 14,689 unique words. The frequency is very unequally distributed: the top two words, "protein" and "fragment", take up almost 16% of the total words, while words that appear only once (we call them "hapax legomena" in text mining) are 40% of the number of unique words. (I made some histograms, but they're not that interesting to look at; maybe I was just unsuccessful at figuring out a way to communicate the distribution clearly.)
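The counting rule above is easy to sketch in Python. This version assumes that digits and punctuation act as word breaks and that case is ignored; both are guesses, and the sample string is invented rather than taken from UniProt.

```python
import re
from collections import Counter


def word_stats(text):
    """Count total words, unique words and hapax legomena in text."""
    # A "word" is any run of contiguous letters (assumption: case-insensitive)
    words = re.findall(r"[A-Za-z]+", text.lower())
    counts = Counter(words)
    hapaxes = sorted(word for word, count in counts.items() if count == 1)
    return {"total": len(words), "unique": len(counts), "hapaxes": hapaxes}


# Invented annotation-style text for illustration
stats = word_stats("Protein kinase 2 (Fragment); protein fragment")
```

Dividing `len(stats["hapaxes"])` by `stats["unique"]` gives the share of unique words that appear only once.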

Among the hapaxes are the following words, which I picked out for no other reason than they caught my eye as incongruous in some way:

My favourite one is "haponin." Now when someone asks me, "Hey, man, what's haponin?" I can respond, "It's a protein similar to Human Leukemia Differentiation Factor but without nuclease activity."

Thursday, December 5, 2013

Apparently it's a controversial... area.

I'm a Canadian. I'm proud to be a Canadian. I'm proud of my fellow Canadians. But gee whiz, we can sure be sensitive sometimes.

In my post two weeks ago, I pointed out how the Mercator projection exaggerates the surface area of Canada. Map-lovers loved my post; Canadians hated it. Many seemed to think I was trying to cast aspersions on Canada's proud place as the world's second-largest country.

Far from it. But as big a fan of Canada as I am, I'm also a fan of the truth: it's a tight race for second place. If we lost Labrador, we'd drop to fourth. This fact is rather disguised by the Mercator projection:

Poor China, they really get the shaft: they drop a place and end up looking a third as big in relation to their Russian neighbours. The United States is partially buffered from this ignominious fate by Alaska.

I understand why Google Maps uses Mercator: having north, south, east and west perfectly correspond to the edges of the map is handy when you're giving directions. Plus equal-area projections (there are many different ones, because there are many different ways to do this) just look weird, with their elongated, phallic Africa:

There are hybrid projections that do a pretty good compromise, but most of the best ones aren't rectangular, and that can be inconvenient if you don't happen to have, say, a hexagonal iPhone screen.

There. Go Canada. You're big, but my favourite fact about you is that you share the world's largest undefended border. Well, that and the fact that we have the world's largest island inside a lake inside an island inside a lake.

Oh, one more thing: Relevant xkcd.

Wednesday, November 27, 2013

Canada: strong and free, but maybe not as north as you think

Canada is farther north than the United States: everybody knows this, and for the most part it's true. An article in Monitor on Psychology says people tend to take these geographical mental shortcuts too far: most Americans are surprised to find that all of Florida is farther south than the Mexican border, for example.

So let's see how much of the United States is below Canada's most southerly city, Windsor, Ontario (I won't cheat and count the little islands in Lake Erie that belong to Canada):

For the record, the red area comprises 22% of the surface area of the contiguous United States (38% if you include Alaska), and 15% of its population. Windsor is just 25 km further north than the California-Oregon border.

The paper also states that both Americans and Canadians tend to imagine Europe more southerly than it is in relation to them (they equate Spain's latitude with the southern states, for example). Let's have a look, without that pesky Atlantic Ocean in the way:

Once again using the online tool Mapfrappe, I've marked the latitudes of Windsor and of the 60th parallel, which divides the Prairie provinces from Northern Canada. You'll notice Windsor, which has some cold winters, is even with northern Spain, which decidedly doesn't. That's another mental shortcut we all share: north = cold, but it's not that simple when you have a nice Gulf Stream warming your coastline.

The geographical comparison was less surprising to me than the demographic one. Six weeks ago, I posted a blog about Canadian population by latitude, whose data was a little coarse because Canada Post and Statistics Canada have copyrighted the most finely detailed geographical boundaries used in the census. A wonderful reader pointed me to the ISLSCP II Project, which lists the population of the entire planet for every quarter-degree of latitude and longitude -- albeit from 1995, but I'll take it. Have a look at the relative* populations by latitude of the United States, Canada and Europe:

The most northerly Canadian city with over half a million population is Edmonton, Alberta: it's at about the same latitude as Dublin, Manchester and Hamburg, and 15% of Europeans live farther north than this. (The demarcation of Europe and Asia is fuzzily defined; I chose it as including Istanbul and Moscow, which is north of Edmonton.) And the median latitude of population in Europe is 7 degrees higher than in Canada -- that's over 800 kilometers.

Thanks to these histograms I realized I'm as susceptible to that misfiring geography heuristic as anyone: in my mind, Hawai'i was about the same latitude as Sacramento, California, but it's over 500 kilometers farther south than the mainland United States.

Next week, I finish my latitudinal triptych with some sundry interesting tidbits I picked up while writing the last two.

*That means all the bars in each column add up to 100% of the population of the area; obviously, there are more people in Europe and the United States than in Canada.
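For the curious, here's a rough Python sketch of that normalization applied to gridded data. It assumes the grid has already been reduced to (latitude, population) pairs for one region; the function name, the one-degree band width and the sample cells are all made up for illustration.

```python
import math
from collections import defaultdict


def relative_population_by_latitude(cells, band_deg=1.0):
    """Sum population into latitude bands and express each band as a
    percentage of the region's total, so the bands add up to 100%.

    cells: iterable of (latitude_in_degrees, population) pairs.
    Returns {southern_edge_of_band: percent}, sorted south to north.
    """
    totals = defaultdict(float)
    for lat, pop in cells:
        band = math.floor(lat / band_deg) * band_deg
        totals[band] += pop
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("region has no population to normalize")
    return {band: 100 * pop / grand_total for band, pop in sorted(totals.items())}


# Invented quarter-degree cells, not real census figures
cells = [(42.125, 100), (42.375, 300), (53.625, 600)]
shares = relative_population_by_latitude(cells)
```

Because each region is normalized separately, the bars compare shapes of distributions, not head counts.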

Thursday, November 21, 2013

Canada and the Mercator Projection: Latitude and Attitude

This post owes its existence to the excellent online tool MAPfrappe, which allows you to draw on a map of the Earth and then move it around; you can save your drawings, such as my outline of Canada. It took me a while to do, so please feel free to play around with it to validate the time I spent judging how close to the squiggly borders was close enough!

Google Maps is a truly wonderful invention, but there is one flaw: the Earth is a sphere, and in order to fit it on a rectangular surface (like a computer screen), adjustments must be made. Google uses the Mercator Projection, which dates all the way back to 1569. It's famous, it's familiar, but there are many better ones.

The main weakness of Mercator is its exaggeration of surface area the closer you get to the poles; and of course Canada gets pretty close to the north pole. Compare the Mercator projection of Canada with an image from a globe: Ellesmere Island, the northernmost landmass, has over four times more area in the Mercator projection!
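To put a number on the stretching at a single latitude, here's a small Python sketch. It assumes a perfectly spherical Earth, where the standard Mercator projection scales lengths by 1/cos(latitude) in both directions, so areas are scaled by the square of that. This is the local factor relative to the equator, not the whole-island comparison with the globe image made above; the function name is made up.

```python
import math


def mercator_area_factor(lat_deg):
    """Local area exaggeration of the Mercator projection at a latitude,
    relative to the equator, on a spherical Earth: sec^2(latitude)."""
    return 1 / math.cos(math.radians(lat_deg)) ** 2


factor_equator = mercator_area_factor(0)   # no exaggeration at the equator
factor_60 = mercator_area_factor(60)       # the 60th parallel
factor_windsor = mercator_area_factor(42.3)
```

The factor grows without bound toward the poles, which is why Mercator maps are cut off before they get there.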

Mapfrappe allows us to see what happens to the outlines when we move them elsewhere in the map projection. Canada extends from latitudes 42.3°N (Windsor, Ontario) to 83°N (the northern tip of Ellesmere Island). So let's drag the map so that the northern tip of Ellesmere is now at 42.3°S:

Wait a minute, the sharp-eyed among you may now be objecting. Something's wrong with this map! Windsor projects below the South Pole! How can anything be below the South Pole? And Canada's longitudes, which Mercator is not supposed to affect, have been drastically increased: the country now spans over 90% of the globe! You're seeing an artifact of geometry: MAPfrappe does not recalculate the projection of every point of the outline (which would be a very computationally complex thing to do for what is essentially a whimsical exercise), it trapezoidally skews the projection according to its center. (I may be using the wrong terms to describe this: I'm a biochemist, not a mathematician.)

Who cares if parts of the globe where few people live are distorted in the Mercator projection, you may ask. It's a valid question. I'll just leave you with this: a comparison of the size of Canada with that of Africa on the Mercator projection (even leaving out the most northerly part) and on the globe. I think it's plausible this illusion may affect opinion and policy.

Next week: I take even more latitude with latitude.

Thursday, November 14, 2013

How not to evaluate a weather forecast

When I was a kid in the '70s, weather forecasting really wasn't very good, but sophisticated computer modeling has improved it much more than the human ability to gauge its accuracy. I'm not a meteorologist, but I did read a chapter in a book about it: Nate Silver's The Signal and the Noise: Why So Many Predictions Fail – but Some Don't (2012). And, of course, the great thing about weather forecasts is that their accuracy can easily be measured after the fact: here's a study which concludes another inescapable fact about human endeavours: you get what you pay for.

Saturday, November 9, 2013

Full movie GIF of 2001: A Space Odyssey

Just a quick post to leave this here; a redditor earlier in the week made a splash by converting a dozen or so movies to GIFs and posting them in his own subreddit, to which no one else was allowed to post. It was mentioned on Digg, and then the subreddit disappeared. But I was inspired. This isn't just "one out of every X frames", BTW, I captured an image every 5 seconds, leaving me with almost 1800 images, then I went through and reduced them to 300 or so, strategically chosen to skip the uninteresting parts and emphasize the important bits.

Wednesday, November 6, 2013

Some graphic juxtapositions

A friend asked when the "sundry" elements that are promised in the blog subtitle would come, so here we go; I will post some of my graphic design doodling.

First, since he's in the news a lot and seems not to say much, here's a morph between Rob Ford and Hodor from Game of Thrones. I know, it's a bit of a cheap... er, crack:

I've always thought Sky Ferreira's and Robert Pattinson's faces look scarily alike:

There's a very well-studied phenomenon in psychology called the Stroop Effect, where there is a decoupling of the contents and the display of a word. Try to say the colors of these words: BLUE YELLOW BROWN. I work a lot with different typefaces and it occurred to me that the same mind-f*ck might work:

User doomrobo on reddit pointed out that I arranged the substitutions in a Hamiltonian cycle, to which my response was, of course, "Yes, I totally meant to do that, I chose that arrangement because I definitely knew what a Hamiltonian cycle was before you mentioned it and did not have to google it just now. It is purely a coincidence that it happens to be the easiest way to do substitutions without the added effort of keeping track of them."

Finally, I work in a biochemistry lab and like many geek art fans, I love René Magritte, so I mocked up this version of La Trahison des images:

Oh, did I say "finally"? That was to lull you into a false sense of security so I could hit you with this awful, awful X-Files pun:

Thursday, October 31, 2013

Boston Celtics retired jerseys by year: when will they run out of numbers?

I'll admit I'm not a huge sports fan, but I am a huge numbers fan, and sports produces a lot of those. It also produces a lot of analysts: after all, there's lots of money riding on many of these numbers. So it's a bit of a challenge to find something original, and by definition it's going to be a bit frivolous.

It occurred to me that if teams keep retiring numbers and don't expand the pool of possible numbers, eventually they will run out. A bit of Googling revealed that the Boston Celtics have the most retired numbers of any major professional sports team. The NBA allows 100 numbers, from 1 to 99 and 00; they've retired 21 in the past 40 years, so a simple linear fit shows that at this rate they will run out in a couple of centuries.
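Here's the back-of-the-envelope version of that extrapolation as a Python sketch. It is a simplification, not the fit behind the graph: it uses a single constant rate instead of a regression on year-by-year data, and it assumes the 21 retirements mentioned above are also the total retired so far, which the paragraph doesn't actually state. The function name is made up.

```python
def years_until_exhausted(pool_size, retired_so_far, retired_in_window, window_years):
    """Estimate years until every jersey number is retired, assuming
    retirements continue at the average rate seen in the window."""
    rate_per_year = retired_in_window / window_years
    remaining = pool_size - retired_so_far
    return remaining / rate_per_year


# Figures from the paragraph above; retired_so_far=21 is an assumption
estimate = years_until_exhausted(
    pool_size=100, retired_so_far=21, retired_in_window=21, window_years=40
)
```

Under these assumptions the answer comes out well over a century, and it's very sensitive to the rate: a few extra retirements per decade would change it a lot.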

I wouldn't worry about this problem too much; the Celtics have already shown how to solve it. When they retired Jim Loscutoff's jersey, he requested that they not retire his number (18), so their banner reads "LOSCY" instead. Later, Dave Cowens spoiled the gesture by wearing the same number and having it retired.

It occurs to me that I've seen these kinds of stepwise and extrapolation graphs on xkcd (e.g. here and here), except of course Randall Munroe is much better at them than me. So I decided to do a little tribute and rework the first graph xkcd-style using Dan Foreman-Mackey's xkcd D3.js template. My javascript skills being what they are, this was by far the longest part of this project; but it was a labour of love. I hope everyone will forgive me.

Thursday, October 24, 2013

Pictures of Pavel Chekov with quotes by Anton Chekhov

Pavel Chekov, navigator of the starship Enterprise in the original TV series Star Trek (1966-1969), played by Walter Koenig.

Anton Chekhov, Russian playwright and short-story writer, 1860-1904, author of The Cherry Orchard.

It's notoriously easy to find misquotes on the Internet. I did my best to verify the sources of these quotes, and any that were really iffy did not make the cut, but it's possible some less than perfectly verified ones slipped through; if so, I apologize, and please let me know.

Wednesday, October 16, 2013

Population of Canada by latitude

Update: here's my final edit of the chart; I think the city labels are much less misleading now. I've come across a much more fine-grained data set, albeit from 1995; you can see it in my Nov. 27, 2013 blog post.

Here's the original, which seemed to imply that the bars were only made up of population from the indicated cities, whereas the bars indicate the population of the entire country at the same latitude of those cities:

A co-worker and friend happened to mention that Vancouver was further north than Montreal; I sort of knew that, but I was surprised to find out it was 400 km further north. So I was curious, and tried to find a histogram of Canadian population by latitude; maybe my Google fu was lacking, but I couldn't find one, so I decided to make one myself.

Little did I know what I would discover; that data is not easy to obtain. There is lots of population data available for download from the Statistics Canada website, but it does not contain geographical coordinates, and StatsCan uses its own defined areas called census subdivisions. They have available for download geographical boundary files, but they would have required an amount of computation rather disproportionate to the task of simply determining latitudes.

Luckily, StatsCan also makes the population available by Forward Sortation Area, the first three characters of the six-character Canadian postal code, e.g. the FSA of the Canadian parliament at postal code K1A 0A9 is K1A. So now it was just a matter of finding out the latitudes of FSAs or postal codes. Simple, right?

Wrong. Canada Post considers its postal codes intellectual property subject to copyright; a license to use and analyze it costs $892 a year for StatsCan's info, and over $5000 for many business products. They are suing a website for providing information on postal code geography. Universities used to be able to access Canada Post's geographical data, but no longer. I work for a university, and the reference library has someone who is able to take the publicly available ArcGIS files and determine the centroids using the expensive proprietary commercial software for which the university has a license.

So: the population data is divided into 1600 FSAs, which is pretty decent resolution. The centroid (geographical center) for most postal codes fits reasonably well within the 0.5 degree latitude (about 55 km) resolution of the graph, except of course for the very large FSAs the farther north you go. But in any case, these areas would have had to be aggregated somehow to even be visible on the scale (for example, if the northernmost FSA, X0A, were spread out among its 14 degrees of latitude), so I think this is a reasonable compromise.
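The binning itself is simple once you have a centroid latitude for each FSA. Here's a minimal Python sketch of the idea; the function names, the row layout and the sample rows are all invented for illustration, and the real centroids came out of the ArcGIS step, not from code like this.

```python
import math
from collections import defaultdict


def fsa_of(postal_code):
    """Return the Forward Sortation Area: the first three characters."""
    return postal_code.replace(" ", "").upper()[:3]


def population_by_latitude(fsa_rows, bin_deg=0.5):
    """Sum population into latitude bins of bin_deg degrees.

    fsa_rows: iterable of (fsa, centroid_latitude, population).
    Each FSA's whole population is assigned to the bin holding its centroid.
    Returns {southern_edge_of_bin: population}, sorted south to north.
    """
    bins = defaultdict(int)
    for _fsa, centroid_lat, population in fsa_rows:
        south_edge = math.floor(centroid_lat / bin_deg) * bin_deg
        bins[south_edge] += population
    return dict(sorted(bins.items()))


# Made-up FSAs, centroids and populations
rows = [("AAA", 45.42, 100), ("BBB", 45.49, 50), ("CCC", 45.51, 70)]
histogram = population_by_latitude(rows)
```

Assigning everything to the centroid's bin is exactly the compromise described above: it works well for compact urban FSAs and poorly for the huge northern ones.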

A note on the city labels: I tried to give the largest municipalities that contributed to the population in each bar of the histogram as an aid to understanding, not as a systematic data set. This became difficult for some of the larger FSAs; it was difficult to match the latitude of a town with the latitude of the centroid of its FSA. So in some cases, I may have used a town with a population of 2,000 when there was a town with 3,000 people at the extreme north or south of the FSA. And a note about Edmonton: it straddles two bars because the center of the city is almost exactly on the demarcation, 53.5 degrees north. Edmonton is a bit smaller than Calgary, but there are other sources of population in each latitude than the city mentioned, so do not draw the wrong conclusion from the size of the bars.

You can peruse the data I used in this Google Doc.

Comments are welcome, even, nay especially, critical ones.

EDIT 2013-10-16 14:49 GMT: Montreal straddles the 45.5 degree latitude, and by marking the 45.5-46.0 bar as "Laval", the graph appeared to be indicating that Laval had a larger population than Montreal. I've explained how the labels are generated, but it's an obvious conclusion to draw from a glance at the map without reading the methodology (and the methodology had to be tweaked for Edmonton and Montreal, which straddle the cusps of the graphs, and the centroids of the FSAs are problematic to begin with). Clarity is the most important thing, so I've updated the bar to read "Laval & Montréal". Thank you to the commenters in Reddit's dataisbeautiful forum for pointing this out.

EDIT 2013-10-16 15:33 GMT: When you're wrong, you're wrong, and I was wrong. My labels were utterly misleading. Now I have put the major contributor AND every Canadian city with over 100,000 population on the graph. I had intended the labels just as a geographical reference, but I definitely did not think through what fresh eyes coming to the graph would think.

EDIT 2013-10-16 21:53 GMT: These labels are really getting me in trouble. I produced the graph first without them, but I envisaged a torrent of "You should have indicated where these people live!" I've removed the most northerly ones, because again, they're misleading. Lesson learned: less is more.

EDIT 2013-10-16 22:41 GMT: Added hi-res version without labels. I think that's enough editing today. Enjoy! And thanks for all the feedback! The vast majority of it was very constructive, it's appreciated.
