Monday, February 24, 2014

U.S. baby names, 1880-2012: Diversity

The United States Social Security Administration released a fascinating data set a few years ago: all of the names of newborns registered since 1880. There have already been some great analyses and visualizations of this data (there's a list below). Still, I think there are plenty more items of interest to be found in mining this data, or if that fails, surveying, dowsing and spelunking it.

Friends' first reaction to news I was looking at this data set was inevitably, "So you must see more and more unique names as the years progress?" (Usually followed by, "Do you see a decline in 'Adolf' after WWII?" Short answer: yes.) Name diversification is anecdotally evident --  nobody named their babies Rumer, or North (or even Ashley) in 1900 -- but it's nice to have evidence. I'm far from the first to explore this phenomenon, but I think I've come up with some interesting displays:


This data seems to support the view that there are more and more names out there: In the 1880s, over 8% of the population whose birth records ended up reported in the Social Security database (an important distinction, as we'll see!) were named John or Mary; the most popular names nowadays are closer to 1% of the total, and their share has decreased rapidly since the late 1960s. Other analysts have shown that girls' names have more of a bandwagon effect than boys', and these graphs seem to bear that out, with higher peaks when names like Linda or Jessica become very popular for a few years, then fade into relative obscurity.
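For readers who want to try this kind of measurement themselves, here is a minimal sketch of computing the most popular name's share of recorded births for each year and sex. The records and counts are made-up toy values, not SSA figures, and the function name `top_name_share` is my own invention for illustration; it is not necessarily how the graphs in this post were produced.

```python
from collections import defaultdict

# Hypothetical toy records as (year, name, sex, count).
# The counts are invented for illustration and are NOT real SSA numbers.
SAMPLE = [
    (1880, "John", "M", 50), (1880, "William", "M", 30), (1880, "James", "M", 20),
    (1880, "Mary", "F", 40), (1880, "Anna", "F", 10),
    (2012, "Jacob", "M", 12), (2012, "Mason", "M", 10),
    (2012, "Ethan", "M", 9), (2012, "Noah", "M", 9),
    (2012, "Sophia", "F", 11), (2012, "Emma", "F", 10),
    (2012, "Olivia", "F", 10), (2012, "Isabella", "F", 9),
]


def top_name_share(records):
    """Return {(year, sex): (top_name, share_of_recorded_births)}."""
    totals = defaultdict(int)  # (year, sex) -> total recorded births
    best = {}                  # (year, sex) -> (name, count) of the current leader
    for year, name, sex, count in records:
        key = (year, sex)
        totals[key] += count
        if key not in best or count > best[key][1]:
            best[key] = (name, count)
    return {key: (name, count / totals[key]) for key, (name, count) in best.items()}


shares = top_name_share(SAMPLE)
for (year, sex), (name, share) in sorted(shares.items()):
    print(year, sex, name, f"{share:.1%}")
```

Note that the denominator is only the births that made it into the records, which is exactly the caveat raised above.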

The Social Security database isn't perfect, nor do its curators claim it to be; in particular, statistics before World War II are suspect because of missing data and the way the data was collected. This is most evident when we try to visualize diversity by looking at how many new names per birth were introduced into the records:


The huge peak in the 1910s isn't an explosion of weird names; it's an artifact of the database. The Social Security Act was passed in 1935, and adults who then signed up for numbers gave their birth years; what you are seeing is evidence that there are a lot of names missing before then, which only stands to reason. How many people born in Appalachia or the Old West would the Social Security Administration find records of decades after their births? Once the system kicks into gear, we see an increase in overall diversity since the late 1960s, as expected.
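The "new names per birth" measure can be sketched as follows, assuming it means the number of names appearing in a given year that were never seen in any earlier year, divided by that year's recorded births. That definition, the function name `new_names_per_birth`, and the toy counts are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy records as (year, name, count); not real SSA numbers.
SAMPLE_YEARS = [
    (1910, "John", 50), (1910, "Mary", 50),
    (1911, "John", 40), (1911, "Mary", 40), (1911, "Elmer", 20),
    (1912, "John", 100), (1912, "Mary", 100),
]


def new_names_per_birth(records):
    """Return {year: (names not seen in any earlier year) / (recorded births that year)}."""
    names_by_year = defaultdict(set)
    births_by_year = defaultdict(int)
    for year, name, count in records:
        names_by_year[year].add(name)
        births_by_year[year] += count

    seen = set()   # every name encountered in earlier years
    ratios = {}
    for year in sorted(names_by_year):
        new_names = names_by_year[year] - seen
        ratios[year] = len(new_names) / births_by_year[year]
        seen |= names_by_year[year]
    return ratios


ratios = new_names_per_birth(SAMPLE_YEARS)
```

The first year of any series counts every name as new, so its value is meaningless on its own. More generally, any year with few recorded births and spotty coverage will show an inflated ratio, which is the kind of artifact described above.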

This is a fascinating data set, and in upcoming posts I'll mine it some more. For now, though, I'll spelunk one name: the hook of the 1989 film Heathers is that there were a lot of teenage girls with that name at that time, and the data shows that is the case; it also shows how dated the film is, because by now the name has faded into the obscurity from whence it came.


Friday, February 14, 2014

By popular demand for the Cancer infographic

I've been getting lots of feedback from Monday's post with an interactive graphic I made about diagnosis and mortality rates among different cancers. This is very gratifying, the positive and negative comments both; I'm learning a lot about people's first impressions of such things.

I've had a few repeated requests: yes, you have my permission to post this anywhere and use it for non-evil purposes as long as you give me and the sources credit. I made a big, nice, static version for when the interactive one is inconvenient: here it is.


I've included all the data you need to properly understand it non-interactively; bear in mind that it's a little less obviously intuitive if you can't mouseover and get that instant cognitive feedback.

I've also had some inquiries about posters, so I made a little store at zazzle.com and uploaded the above graphic. The posters cost around $13 (I think) and I get around $3 (I think), and I will give absolutely every penny of that to the American Cancer Society. You can put this on other products too; let me know if you want one, such as a postcard-size version to use as educational material. You can also get this printed on a mug, wrapping paper, throw pillow, even a skateboard, but I don't think that would really be appropriate. This is all new to me, so I'm not exactly sure how I can prove that I'll donate the money; if anyone has any ideas, feel free.

Thank you so much to Chris Kirk at Slate.com, Lauren F. Friedman at Business Insider, Andrew Sullivan at The Dish (Chart of the Day!) and Bryce Rudow at The Daily Banter for sharing my graphic.

Apparently if I paste some code here, a neat panel showing and linking to the zazzle poster will appear. Here goes nothin':



Tuesday, February 11, 2014

Interactive graphic: Cancer diagnosis and mortality rates

Since my site redesign, this graphic doesn't quite fit on the page anymore. I'm exploring different options; in the meantime, you can see it on a standalone page.


I'm sorry this post appears on the blog twice! I made corrections after sharp-eyed Slate and Reddit users noticed some disparities between the data and my presentation, and in doing so I accidentally cloned the entire post! Now both versions are being linked to, so I don't want to delete either of them; I will make sure they match exactly.*

Thanks to Chris Kirk, interactives editor at Slate.com, for reposting this graphic and dramatically increasing my number of Twitter and Facebook followers! I'll try not to let you down, new friends!

The idea for this graphic came from this homeschoolpromqueen tumblr post about how to visually categorize the data in this Edward Tufte blog post about how to present the data in this American Cancer Society publication about the number of cases of different types of cancer diagnosed in the United States every year. The 5- to 20-year mortality rates come from a 2002 study. It occurred to me that the visually most potent way to present such a large amount of data was to anchor it to something everyone knows intimately: our own anatomy.

Note that since 2002, the consensus on some of the 5- and 10-year mortality rates has changed, but to my knowledge that study is the only one to present mortality rates for virtually all types of cancers up to 20 years. Rather than cherry-pick data from different data sets, which could be misleading when compared, I used this one data set. I strongly encourage those interested to do some googling for more up-to-date numbers!

I learned a few things about cancer while doing this, things which may seem obvious in retrospect but nothing's obvious if you don't have the opportunity to think about it. For example, the more survivable cancers appear to be in organs you can remove, like the breast or prostate: that makes sense. I wonder why the most diagnoses are in some of these organs, too: knowing nothing about medicine, I wonder if any of this is because they're easier to diagnose. And the pie chart for pancreatic cancer is revealing; there's a huge five-year mortality rate, but it hardly increases at all after that; once you've gotten past a critical time period (which I'm guessing is much shorter than five years), your outlook vastly improves.
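The pancreatic observation can be put into simple arithmetic. If cumulative mortality from diagnosis is m5 at five years and m20 at twenty years, then the chance of dying between years 5 and 20, given that you survived to year 5, is (m20 - m5) / (1 - m5). Here is a small sketch; the figures are hypothetical placeholders, not numbers from the 2002 study, and `conditional_mortality` is just an illustrative helper name.

```python
def conditional_mortality(m_early, m_late):
    """Probability of dying between two time points, given survival to the first.

    Both arguments are cumulative mortality fractions (0 to 1) measured from diagnosis.
    """
    if not 0 <= m_early <= m_late <= 1:
        raise ValueError("expected 0 <= m_early <= m_late <= 1")
    if m_early == 1:
        raise ValueError("no survivors at the first time point")
    return (m_late - m_early) / (1 - m_early)


# Hypothetical figures for illustration only (NOT from the 2002 study).
m5, m20 = 0.95, 0.96
risk_after_5 = conditional_mortality(m5, m20)
print(f"{risk_after_5:.0%}")
```

With these made-up inputs, a 95% risk over the first five years becomes roughly a 20% risk over the following fifteen for those who get past year 5: a flat pie chart after five years means a much better outlook, but not a risk of zero.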

I started to learn Javascript last year, and most of the store of knowledge I've picked up has gone into this. Thank goodness for ready-made libraries like Raphael.js -- I ended up swapping out much of the Raphael for the increased functionality of "pure" Javascript, but I couldn't have gotten there without that intermediate step. Eventually, perhaps, I'll graduate to D3.

I'm utterly gratified by the inquiries I've had from teachers and public speakers who want to use this graphic in their presentations. You all absolutely have my permission; I just ask that you stick the name of my blog next to it, as well as acknowledgements of the source data. You can save this standalone web page (you'll need an active Internet connection for the Javascript library to work) or you can save this static standalone PNG graphic.

EDIT Tuesday, Feb 11, 7:46 EST: There were a few errors in the graphic that I have corrected. I humbly apologize; you should never proofread (or prooffread) your own work. The only error that was truly grievous was that I had lung cancer at about 2/3 the correct diameter. I believe (fingers crossed) everything is accurate now.

* Some users have reported that when they view the main blog site at www.prooffreader.com, the graphic appears twice in the top post and once in the bottom post; this appears to be a Google/Blogger idiosyncrasy (and probably my limited Javascript skills). When you view the individual posts, they each contain the graphic.
