Wednesday, December 11, 2013

Word clouds of the human genome: most and least frequent words

I work in genomics, so I thought it was time to geek out this week.

The words are taken from the reference human genome annotations in UniProt; I wanted to use the same font as the UniProt logo, but I couldn't find it, so I went with a similar Bauhaus-style font that had a serendipitous name (and was free): Monoglyceride, by Tepid Monkey Fonts.

Instead of limiting the cloud to the presently confirmed 20,272 genes, I used all 69,049 annotations (which still, BTW, cover less than 7% of the genome). You can see both data sets here; the difference between them is not dramatic, except that the unreviewed set contains the word "fragment" over 1000 times more often.

The annotations contain 505,128 "words" (any run of contiguous letters counts as a word; punctuation and numbers are removed), of which 14,689 are unique. The frequencies are very unequally distributed: the top two words, "protein" and "fragment", account for almost 16% of all words, while words that appear only once (we call them "hapax legomena" in text mining) make up 40% of the unique words. (I made some histograms, but they're not that interesting to look at; maybe I was just unsuccessful at figuring out a way to communicate the distribution clearly.)
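
For anyone who wants to try the same kind of count, here is a minimal Python sketch of the word rule described above. It is an illustration under assumptions, not the code behind the numbers in this post: the function name `word_stats` is made up, words are lowercased before counting (so "Fragment" and "fragment" are the same word), and digits and punctuation are treated as separators, which is only one way to read "removed".

```python
import re
from collections import Counter


def word_stats(text):
    """Count words, unique words, and hapax legomena in a block of text.

    Assumptions (not confirmed by the post): case is folded, and any
    non-letter character splits a word rather than being deleted in place.
    """
    # A "word" is any run of contiguous ASCII letters.
    words = re.findall(r"[A-Za-z]+", text.lower())
    counts = Counter(words)

    total = len(words)
    unique = len(counts)

    # Hapax legomena: words that occur exactly once.
    hapaxes = sorted(w for w, n in counts.items() if n == 1)

    # Share of all word occurrences taken by the two most frequent words.
    top_two = counts.most_common(2)
    top_two_share = sum(n for _, n in top_two) / total if total else 0.0

    # Share of the unique vocabulary that consists of hapaxes.
    hapax_share = len(hapaxes) / unique if unique else 0.0

    return {
        "total": total,
        "unique": unique,
        "top_two": top_two,
        "top_two_share": top_two_share,
        "hapaxes": hapaxes,
        "hapax_share": hapax_share,
    }


if __name__ == "__main__":
    # Toy input standing in for a few annotation strings.
    sample = "Protein kinase C (Fragment). Protein 53 fragment; haponin"
    stats = word_stats(sample)
    print(stats["total"], stats["unique"], stats["hapaxes"])
```

Run on the full annotation text, `top_two_share` and `hapax_share` correspond to the "almost 16%" and "40%" figures above, although different choices about case and separators would shift the exact counts.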

Among the hapaxes are the following words, which I picked out for no other reason than they caught my eye as incongruous in some way:

My favourite one is "haponin." Now when someone asks me, "Hey, man, what's haponin?" I can respond, "It's a protein similar to Human Leukemia Differentiation Factor but without nuclease activity."


