Thursday, October 31, 2013

Boston Celtics retired jerseys by year: when will they run out of numbers?

I'll admit I'm not a huge sports fan, but I am a huge numbers fan, and sports produces a lot of those. It also produces a lot of analysts: after all, there's lots of money riding on many of these numbers. So it's a bit of a challenge to find something original, and by definition it's going to be a bit frivolous.

It occurred to me that if teams keep retiring numbers and don't expand the pool of possible numbers, eventually they will run out. A bit of Googling revealed that the Boston Celtics have the most retired numbers of any major professional sports team. The NBA allows 100 numbers, from 1 to 99 and 00; the Celtics have retired 21 in the past 40 years, so a simple linear fit shows that at this rate they will run out in a couple of centuries.
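For the curious, here is a back-of-the-envelope version of that extrapolation in Python. It is only a sketch, not the fit behind the graph: it assumes the 21 retirements are the whole tally so far and that the rate stays constant, so its answer is rougher than a fit on the year-by-year data. The function name is made up for illustration.

```python
# Figures quoted in the post; the constant retirement rate is an assumption.
TOTAL_NUMBERS = 100   # 00 plus 1 through 99
RETIRED = 21          # numbers retired so far (assumed to be the full tally)
YEARS_ELAPSED = 40    # period over which they were retired


def years_until_exhausted(total, retired, years_elapsed):
    """Years until no numbers are left, at a constant retirement rate."""
    rate = retired / years_elapsed   # numbers retired per year
    remaining = total - retired      # numbers still available
    return remaining / rate


years_left = years_until_exhausted(TOTAL_NUMBERS, RETIRED, YEARS_ELAPSED)
print(f"About {years_left:.0f} more years at {RETIRED / YEARS_ELAPSED:.3f} numbers per year")
```

With these round figures the pool lasts on the order of a century and a half, the same ballpark as the estimate above.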

I wouldn't worry about this problem too much; the Celtics have already shown how to solve it. When they retired Jim Loscutoff's jersey, he requested that they not retire his number (18), so their banner reads "LOSCY" instead. Later, Dave Cowens spoiled the gesture by wearing the same number and having it retired.

It occurs to me that I've seen these kinds of stepwise and extrapolation graphs on xkcd (e.g. here and here), except of course Randall Munroe is much better at them than I am. So I decided to do a little tribute and rework the first graph xkcd-style using Dan Foreman-Mackey's xkcd D3.js template. My JavaScript skills being what they are, this was by far the longest part of this project, but it was a labour of love. I hope everyone will forgive me.

Thursday, October 24, 2013

Pictures of Pavel Chekov with quotes by Anton Chekhov

Pavel Chekov, navigator of the starship Enterprise in the original
TV series Star Trek (1966-1969), played by Walter Koenig

Anton Chekhov, Russian playwright and short-story writer, 1860-1904,
author of The Cherry Orchard.

It's notoriously easy to find misquotes on the Internet. I did my best to verify the sources of these quotes, and any that were really iffy did not make the cut, but it's possible some less than perfectly verified ones slipped through; if so, I apologize, and please let me know.

Wednesday, October 16, 2013

Population of Canada by latitude

Update: here's my final edit of the chart; I think the city labels are much less misleading now. I've come across a much more fine-grained data set, albeit from 1995; you can see it in my Nov. 27, 2013 blog post.

Here's the original, which seemed to imply that the bars were made up only of the population of the indicated cities, whereas the bars indicate the population of the entire country at the same latitude as those cities:

A co-worker and friend happened to mention that Vancouver was further north than Montreal; I sort of knew that, but I was surprised to find out it was 400 km further north. So I was curious, and tried to find a histogram of Canadian population by latitude; maybe my Google fu was lacking, but I couldn't find one, so I decided to make one myself.

Little did I know what I would discover: that data is not easy to obtain. There is lots of population data available for download from the Statistics Canada website, but it does not contain geographical coordinates, and StatsCan uses its own defined areas called census subdivisions. Geographical boundary files are available for download too, but they would have required an amount of computation rather disproportionate to the task of simply determining latitudes.

Luckily, StatsCan also makes the population available by Forward Sortation Area (FSA), the first three characters of the six-character Canadian postal code; e.g. the FSA of the Canadian parliament at postal code K1A 0A9 is K1A. So now it was just a matter of finding out the latitudes of FSAs or postal codes. Simple, right?
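Pulling the FSA out of a postal code is the easy part. A minimal Python sketch, where the helper name is hypothetical and the only validation is a length check:

```python
def fsa_from_postal_code(postal_code: str) -> str:
    """Return the Forward Sortation Area: the first three characters."""
    # Drop the customary space and normalise case, e.g. "k1a 0a9" -> "K1A0A9".
    compact = postal_code.replace(" ", "").upper()
    if len(compact) != 6:
        raise ValueError(f"expected six characters, got {postal_code!r}")
    return compact[:3]


print(fsa_from_postal_code("K1A 0A9"))  # K1A
```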

Wrong. Canada Post considers its postal codes intellectual property subject to copyright; a license to use and analyze it costs $892 a year for StatsCan's info, and over $5000 for many business products. They are suing a website for providing information on postal code geography. Universities used to be able to access Canada Post's geographical data, but no longer. I work for a university, and the reference library has someone who is able to take the publicly available ArcGIS files and determine the centroids using the expensive proprietary commercial software for which the university has a license.

So: the population data is divided into 1600 FSAs, which is pretty decent resolution. The centroid (geographical center) of most FSAs fits reasonably well within the 0.5 degree latitude (about 55 km) resolution of the graph, except of course for the very large FSAs the farther north you go. But in any case, these areas would have had to be aggregated somehow to even be visible on the scale (for example, if the northernmost FSA, X0A, were spread out among its 14 degrees of latitude), so I think this is a reasonable compromise.
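The binning described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the spreadsheet actually used: each row is taken to be a (FSA, centroid latitude, population) tuple, the function names are made up, and the sample rows are invented placeholders rather than real data.

```python
import math
from collections import defaultdict

BIN_WIDTH = 0.5  # degrees of latitude per bar, as in the graph


def bin_floor(latitude, width=BIN_WIDTH):
    """Lower edge of the latitude bin that contains `latitude`."""
    return math.floor(latitude / width) * width


def population_by_latitude(rows, width=BIN_WIDTH):
    """Sum population into latitude bins keyed by each bin's lower edge."""
    totals = defaultdict(int)
    for _fsa, centroid_lat, population in rows:
        totals[bin_floor(centroid_lat, width)] += population
    return dict(sorted(totals.items()))


# Invented placeholder rows: (FSA label, centroid latitude, population)
sample_rows = [
    ("FSA_1", 45.42, 1000),
    ("FSA_2", 45.51, 2500),
    ("FSA_3", 45.97, 500),
    ("FSA_4", 53.50, 4000),
]
histogram = population_by_latitude(sample_rows)
for lower_edge, people in histogram.items():
    print(f"{lower_edge:.1f}-{lower_edge + BIN_WIDTH:.1f}: {people}")
```

Each FSA's whole population lands in the single bar containing its centroid, which is exactly the compromise described above for the very large northern FSAs.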

A note on the city labels: I tried to give the largest municipalities that contributed to the population in each bar of the histogram as an aid to understanding, not as a systematic data set. This became difficult for some of the larger FSAs; it was hard to match the latitude of a town with the latitude of the centroid of its FSA. So in some cases, I may have used a town with a population of 2,000 when there was a town with 3,000 people at the extreme north or south of the FSA. And a note about Edmonton: it straddles two bars because the center of the city is almost exactly on the demarcation, 53.5 degrees north. Edmonton is a bit smaller than Calgary, but each bar includes population from places other than the city mentioned, so do not draw the wrong conclusion from the size of the bars.

You can peruse the data I used in this Google Doc.

Comments are welcome, even, nay especially, critical ones.

EDIT 2013-10-16 14:49 GMT: Montreal straddles the 45.5 degree latitude, and by marking the 45.5-46.0 bar as "Laval", the graph appeared to be indicating that Laval had a larger population than Montreal. I've explained how the labels are generated, but it's an obvious conclusion to draw from a glance at the map without reading the methodology (and the methodology had to be tweaked for Edmonton and Montreal, which straddle the cusps of the graphs, and the centroids of the FSAs are problematic to begin with). Clarity is the most important thing, so I've updated the bar to read "Laval & Montréal". Thank you to the commenters in Reddit's dataisbeautiful forum for pointing this out.

EDIT 2013-10-16 15:33 GMT: When you're wrong, you're wrong, and I was wrong. My labels were utterly misleading. Now I have put the major contributor AND every Canadian city with over 100,000 population on the graph. I had intended the labels just as a geographical reference, but I definitely did not think through what fresh eyes coming to the graph would think.

EDIT 2013-10-16 21:53 GMT: These labels are really getting me in trouble. I produced the graph first without them, but I envisaged a torrent of "You should have indicated where these people live!" I've removed the most northerly ones, because again, they're misleading. Lesson learned: less is more.

EDIT 2013-10-16 22:41 GMT: Added hi-res version without labels. I think that's enough editing today. Enjoy! And thanks for all the feedback! The vast majority of it was very constructive, and it's appreciated.

Wednesday, October 9, 2013

[Word Cloud] Comic book superhero names


It's been a while since I had this idea, but I struggled to find a good corpus of names to work with. Comicvine has a nice list of characters in comics, but it would have taken a lot of manual processing to make sure the end result was not full of "McDuck".

I stumbled across a fan-curated list of favorite superhero names, and this seemed a decent compromise. I extracted all the names that fans had given four or five stars, separated them into morphemes (so Batman becomes "bat" and "man"), compiled a frequency list, made a shaped word cloud with a comic-style font at Tagxedo, did a little photoshopping, and voilà.
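The frequency-list step is simple to reproduce. In this Python sketch the names are assumed to have been split into morphemes already (the splitting itself is the hard, manual part), the function name is hypothetical, and the four sample names are illustrative rather than taken from the list used here.

```python
from collections import Counter


def morpheme_frequencies(split_names):
    """Count how often each morpheme occurs across all names."""
    counts = Counter()
    for morphemes in split_names:
        counts.update(m.lower() for m in morphemes)
    return counts


# Illustrative input: each name already split into its morphemes.
sample_names = [
    ["bat", "man"],
    ["super", "man"],
    ["bat", "girl"],
    ["captain", "america"],
]
frequencies = morpheme_frequencies(sample_names)
for morpheme, count in frequencies.most_common():
    print(morpheme, count)
```

A list like this, one morpheme per line with its count, is the kind of input a word-cloud tool can weight by size.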

No surprises that Man, Captain and Girl are most highly represented (and you can draw your own conclusions about Girl being more common than Woman). A co-worker I showed this to pointed out there are some interestingly serendipitous names that can be made from the way the algorithm put the morphemes together on this graphic: "Super fire she lad", "America devil ice", "Princess cat bird hawk". I would totally buy those comics.

I've posted the names I used in this Google doc. It's rather imperfect; if anyone has any better suggestions for a corpus, I would be very interested to hear them.

Word cloud created using Tagxedo.

Monday, October 7, 2013

Word Cloud of Vladimir Nabokov's Lolita

There was certainly no dearth of images to choose from for this book; a tasteful cover for a book about ephebophilia is a challenge that many designers are fascinated by. There is a web site with 185 published book covers, a recent book of 80 commissioned conceptual book cover designs, and plenty of fan cover designs around the Internet.

Finally, I had to go with the iconic heart-shaped sunglasses from the Kubrick film poster (reproduced on many book covers afterwards), despite its notoriously presenting Lolita as a seductress instead of Nabokov's pragmatic, desperate, abused girl. Perhaps we are seeing her through Humbert Humbert's flawed perception.

1962 movie poster

By far the most common word is the title character's name: she appears as Dolores (her given name) 65 times, Dolly 100 times, Lolita 240 times and Lo 273 times, for a total of 678 mentions -- in a 110,000 word book, that's an extremely high rate of 0.6% of all words used. Her name is even repeated eight times in a row by a rapturous narrator in Chapter 26 -- a chapter so short, Lolita's name makes up over 12% of the total words.
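The arithmetic is easy to check. A tiny Python sketch using only the counts quoted above (the 110,000-word total is the approximate figure given in this post):

```python
# Counts quoted in the post for each form of the title character's name.
mentions = {"Dolores": 65, "Dolly": 100, "Lolita": 240, "Lo": 273}
BOOK_WORDS = 110_000  # approximate length of the novel, per the post

total_mentions = sum(mentions.values())
share = total_mentions / BOOK_WORDS

print(f"{total_mentions} mentions, {share:.1%} of all words")
```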

The narrator's peculiar reduplicated name, "Humbert Humbert", appears in full 19 times; "Humbert" appears on its own another 87. Among the other strongly represented words are "Haze" (Lolita's and her mother's surname, an inspiration for many jeux de mots by HH), and of course "young", "child", and HH's neologism "nymphet". Another word which does not appear as often, but is overrepresented in comparison to the English corpus, is "old" -- a contrast, of course, to "young".

The longest oft-repeated or nearly repeated phrase (six times in one form or another) is the lyric of the half-remembered song, "Oh Carmen, ... the stars and the cars and the barmen", which first appears in the infamous Chapter 13 in which HH steals Lolita's apple (the symbolism is obvious) and maneuvers her onto his lap, where the physical contact makes him near-delirious. Another is "ladies and gentlemen of the jury", said in one form or another ten times, a reminder that HH fully expects to be judged by the reader.

The novel has about 110,000 words, 14,000 unique words, 10,000 unique word stems (e.g. counting "walk", "walking" and "walks" together), and 4,000 word stems used only once -- this is a high variety of words, typical for a master linguist like Nabokov. Many of these singletons are rare French and Latin words like "ensellure", "frétillement" and "quidquam", cultured words like "Chimène" (an opera) and "callypygean" (a classical reference referring to the buttocks), and plays on words like "honeymonsoon" and "dolorous" (referring to Lolita's given name).
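Counts like these can be produced with a short script. The Python sketch below is not the tool used for this post: the `crude_stem` helper is a hypothetical stand-in that strips only a few English suffixes, so a real stemmer would give somewhat different numbers, and the one-line sample text is invented for illustration.

```python
import re
from collections import Counter


def crude_stem(word):
    """Hypothetical stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three letters so short words like "is" are untouched.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def vocabulary_stats(text):
    """Total words, unique words, unique stems and stems used only once."""
    # Runs of letters only, so accented words such as "frétillement" survive.
    words = re.findall(r"[^\W\d_]+", text.lower())
    stem_counts = Counter(crude_stem(w) for w in words)
    return {
        "words": len(words),
        "unique_words": len(set(words)),
        "unique_stems": len(stem_counts),
        "singleton_stems": sum(1 for c in stem_counts.values() if c == 1),
    }


stats = vocabulary_stats("She walks, he is walking, they walk past the old house.")
print(stats)
```

In the sample sentence "walks", "walking" and "walk" collapse into one stem, which is why the stem count comes out lower than the unique-word count.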

"Incest" and "nubile" appear twice each, "tumescent" once, and "pedophilia" and "molest" do not appear at all.

Wikipedia article about the book.

Word cloud created using Tagxedo.
