Monday, February 11, 2019

Tuesday, January 15, 2019

The State of the Blog

15 January, 2019

I really should have written something before this, my apologies.

When I started this data-sciency blog (before I even understood what it was about!), two things were very different from today:

(a) The blogosphere was a viable way to get clicks (not that I monetized, I just wanted my work to be seen), before reddit became the Toxic Avengers Meeting Saloon and before Facebook and Twitter took over the non-billionaire-owned media. This is an oversimplification, obviously, but it was a factor.

(b) I had a lot more time and energy to devote to it, because I was a freelancer, learning new stuff every day and eager to share what I'd learned (whether I'd truly understood it or not). I've been working full time as a Python developer-slash-software engineer-slash-data science professional since June 2015, and this has left me with (1) little time or energy to devote to this, and (2) better skills, so that my threshold of what makes a publishable blog post has risen sharply.

All that to say that, in my mind, I haven't abandoned this, even though it may appear I have! I've started seven or eight projects for this space, and put them all on hiatus for one reason or another, usually some combination of my perfectionism and my running out of temporary free time to give the project some headspace.

I don't know what the future holds! (If there's one thing data science, which is all about predicting the future, has taught me, it's that we collectively really, really, really suck at predicting the future.) I REALLY hope I haven't written my last blog entry. Time will tell, I suppose (cliché alert).

In the meantime, feel free to browse what I was doing, mostly when I was a freelancer and wanted to showcase my skills to future employers. That was maybe the extra kick I needed? Who knows, it could happen again.

David "prooffreader" Taylor
(proofreader is misspelled, that's the joke)

Tuesday, February 21, 2017

So I don't just do data science

Certain changes in the blogosphere and the data science industry have interfered with my ambitions for posting more analyses on this blog. But I haven't given up. I'll explain more later.

In the meantime, I wrote the lyrics, (amateurishly) mixed and mastered the audio and edited the video for this:

Wednesday, September 21, 2016

Battle of the Data Science Venn Diagrams

Data science is a rather fuzzily defined field; some of the definitions I've heard are:
  • "Work that takes more programming skills than most statisticians have, and more statistics skills than a programmer has."
  • "Applied statistics, but in San Francisco."
  • "The field of people who decide to print 'Data Scientist' on their business cards and get a salary bump."
Personally, I've recently decided to avoid the controversy by calling myself a data spelunker. (Data miners are out of vogue anyway.)

As a field in search of a definition, it's unsurprising that you can find a lot of different attempts to define it.

As a field full of data nerds with a penchant for visualization, it's also unsurprising that a lot of them use Venn diagrams. (Fun fact: John Venn, who invented the eponymous diagrams, and his son filed a patent in 1909 for a lawn bowling machine.)

1. It all started with Drew Conway in 2010 (catching fire when he blogged it in 2013):

For Conway, the center of the diagram is Data Science. There's some controversy over what the bottom circle means (I'll address it farther down); all I can say is, if Conway meant something other than what I would call domain knowledge (e.g. physics), he chose the name Substantive Expertise very poorly. So assuming domain knowledge is at least part of what he meant, the idea is that a physicist, say, would have expertise in physics and math/stats knowledge, but lack hacking knowledge (I've met many physicists and I think that's less true than it used to be). Machine Learning experts tend to apply algorithms without an understanding of the domain they're analyzing (that sure as heck was my case when I first started building models in an industry that was totally new to me; I had to play a lot of catchup). And then people who can program and know their field but have no way to tell a statistically significant result from one arising from sheer coincidence are dangerous; they can arrive at some drastically wrong solutions and, for example, lose their companies lots of money.

Note that this isn't how a Venn diagram works. Hacking Skills, for example, should apply to that entire circle, and the part that doesn't intersect with anything should be labeled, e.g. "hackers". But that's a fairly minor point, it's obvious what he's getting across.

2. After Conway's was made but before it was blogged, Brendan Tierney made a diagram in 2012 that's kinda Venn-ish.

It... sure is busy. KDD stands for Knowledge Discovery and Data Mining, by the way. Despite that, Data Mining also has its own circle. I do appreciate what he did here, though, implying what makes data science worthy of its own field is the breadth of its required skills. Apparently one of those skills is Neurocomputing, which seems a little... specific.

3. Quick on Conway's heels, Ulrich Matter blogged his riff on it later the same month in 2013:
He's flipped it on the diagonal, specified the substantive expertise as Social Sciences (his field), changed hacking to computer science (you can see why someone would object to being characterized as a hacker, although I for one embrace it), and for some reason changed Math & Stats to Quantitative Methods. More importantly, he's moved Data Science where Machine Learning was in Conway's -- that's an interesting distinction, and one I've seen in the field. There are data scientists who specialize in one domain, and then there are generalists (who usually started out in one field but branched out, like me: I started in chemistry and now I'm in insurance). Also, he's apparently not comfortable with Danger Zone, changing it to... a question mark. But apparently what matters to Matter (so to speak) is in the center of the diagram: Data-driven Computational [Social] Science.

A... bit wordy, shall we say? He also made sure to insert Empirical into Traditional Research.

4. After the Edward Snowden news broke, Joel Grus supplied this tongue-in-cheek (or is it?) version. Now we're getting into more rarefied Venn territory, with four circles, the fourth being "evil".

5. In September 2013, Harlan Harris adapted this diagram to deal with data products instead of science.
The slices are no longer comparable to Conway because we've changed from science to products, but the categorizations are noteworthy (and they follow true Venn methodology, not being slices in themselves). Domain Knowledge remains, Computer Science/Hacking remains as Software Engineering, and crucially, Harris has added Predictive Analytics and Visualization to the Statistics circle. But not the actual tools they use, that's in the intersection with Software Engineering. Okay.

6. In January 2014, Steven Geringer provided a tweak that, instead of putting Data Science in the middle three-way intersection like Conway, calls all of it data science and calls the intersection Unicorn (i.e. a mythical beast with magical powers who's rumored to exist but is never actually seen in the wild.)

This is... a little weird, Venn-diagrammatically speaking. I think I know what he's getting at. When I first heard people referred to as data scientists, I often heard the riposte, "Aren't all scientists, by definition, data scientists?" True, there are no sciences that do not deal in data (insert psychiatry joke here), but still, data science, while quite nebulous, isn't just an umbrella term.

Plus, I'm sorry, but you can see the screengrab of his mouse arrow in his diagram. 

Edit: An earlier version of this post omitted to give Geringer credit where credit is definitely due: he was the first to remove the Danger Zone! (Great, now that song is going to be in my head all day.) Now people with subject matter expertise and computer skills can make Traditional Software without blowing the world up, or whatever. (My apologies to Mr. Geringer, and my thanks for his correction.)

7. In February 2014, Michael Malak added a fourth bubble, claiming Conway didn't mean domain knowledge when he said Substantive Expertise.
According to Malak, he's Inigo Montoya and we're all Vizzini when it comes to Substantive Expertise: "You keep using that word. I do not think it means what you think it means." Malak split it into Domain Expertise, knowledge of a domain, like Social Sciences. Maybe I'm dense, but I don't get the distinction. I'm also not sure what he's getting at with Holistic Traditional Research that, unlike Traditional Research, according to its placement doesn't include knowledge of the science you're researching? Am I reading that wrong? Holistic science is a thing, but it's not that thing. Anyways, Data Science is once again back in the unicorn position, and there are three danger zones (one of them double danger). Everyone be hatin' on the hackers.

8. My next example comes via Vincent Granville in April 2014, but he's reposting something by Gartner; I don't know the date of the original.

This is a Venn Diagram of Data Science Solutions, not data science itself; as such, Data Science is one of the circles, with other expertises (often not residing in the same person, but hopefully on the same team) being IT Skills and Business Skills. It kinda bothers me that the text labels are pointing to very specific positions in each slice, but the actual positions are arbitrary. That's business infographics for you.

9. Shelly Palmer guest-blogged for the Huffington Post in 2015, including this figure from a book he wrote:

Pretty standard computer-math-domain triad straight from Conway, but there's one revolutionary element: no danger zone. Now computer-and-domain geeks without stats can do Data Processing without everything going all to hell. Seems reasonable. EDIT: Sorry Shelly, Geringer beat you to it, you're just not very noteworthy anymore.

10. In November 2015, StackExchange Data Science user Stephan Kolassa came up with my personal favorite, adding Communications to Conway and changing his Substantive Expertise to Business:
For all his effort, he was rewarded with only 21 upvotes (I'm one of them) in this beta-release forum. His categories are pretty good, too. I think I fall under The Good Consultant. Or possibly The Mediocre Consultant. The Consultant Who Tries Really Hard? And yes, that's what a four-set Venn diagram looks like, not four circles like Malak's above, which do not contain all the combinations of intersections.
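For the curious, here's a quick back-of-the-envelope sketch of why four circles can't do the job. It assumes only two standard counting facts: a true Venn diagram of n sets needs 2^n regions (one per in/out combination, counting the outside), and n circles can cut the plane into at most n² − n + 2 regions. The function names are mine, made up for illustration.

```python
def regions_needed(n_sets):
    """Regions a true Venn diagram needs: one per in/out combination."""
    return 2 ** n_sets


def max_regions_from_circles(n_circles):
    """Most regions (including the outside) that n circles can cut the plane into."""
    if n_circles == 0:
        return 1  # no circles: the whole plane is one region
    return n_circles * n_circles - n_circles + 2


for n in range(1, 5):
    print(n, regions_needed(n), max_regions_from_circles(n))
# 1 2 2
# 2 4 4
# 3 8 8
# 4 16 14
```

Three circles are exactly enough; four circles come up two regions short, which is why a proper four-set diagram needs ellipses or other shapes.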

11. In 2016, Matthew Mayo blogged a diagram by Gregory Piatetsky-Shapiro:
Okay, this owes a debt to Tierney from four years prior, and although it purports to be a Venn diagram of data science, (a) it's not a Venn diagram, and (b) Data Science is inside one of the circles. It's good to see Big Data acknowledged, though. But... Calibri? Really? You went with the default font?

12. Finally (and I'm sure I don't have them all; if you know of any Venn diagrams I missed, please let me know!), later in 2016 Gartner redid their busy Data Solutions diagram, and made it prettier and confined to data science, as blogged by Christi Eubanks:
We've come full circle, back to Conway, except again Danger Zone is replaced, this time by Data Engineer. I like the callouts pointing to the edges better than their previous mess, as well.

13. Data Science Venn diagrams of the future:

Wikipedia's page on data science has the following totally-not-a-Venn-diagram:
Really, in my opinion, this is the way to look at data science. Maybe not these exact skills, but it really is a synergy of different disciplines. Unfortunately, skill in one discipline can sometimes mask serious deficiencies in another and give data science a bad name. (I may or may not have contributed somewhat to this phenomenon in my misspent youth, like, last year.)

Of course, then you'd need a really complicated Venn diagram. They do exist: here's one for seven sets:

Anyone want to give it a try?

Tuesday, April 19, 2016

Canadian postal codes that spell words in l33t

l33t (leet, 1337) is a simple substitution cipher that started on BBSes in the early 1980s (ah, how I remember my 600 baud modem) and substitutes numbers for a few letters, e.g. '3' for 'E'. My name in l33t (and there were many versions of l33t) might be D4v1d T4yl0r or D4vid T4y10r (depending on whether the 1 substituted for 'i' or 'l').

The code and the various simple dialects of l33t I used can be found in this Github gist.

I used a rather short crossword puzzle word list, and even so got many words I've never heard of ("genips"?). I actually had one of them ("kirtle") in a spelling bee when I was 10 years old (I got it wrong). Still, most of them are somewhat familiar. And yes, there's a postal code beginning with "V" in British Columbia that refers to lady parts.
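To give a flavor of the approach without opening the gist, here's a minimal sketch of the core idea. It is not the gist's code. The substitution table is one hypothetical dialect I've put together from the results below, and the function name is made up. It relies on the fact that Canadian postal codes alternate letter-digit-letter-digit-letter-digit and never use the letters D, F, I, O, Q or U (nor W or Z in first position).

```python
import re

# One hypothetical l33t dialect: letter -> digit that may stand in for it.
LEET_DIGITS = {
    "o": "0", "i": "1", "l": "1", "e": "3",
    "a": "4", "s": "5", "g": "6", "t": "7",
}

# Shape of a Canadian postal code (no space), with the unused letters excluded.
POSTAL_SHAPE = re.compile(
    r"^[ABCEGHJ-NPRSTVXY][0-9][ABCEGHJ-NPRSTV-Z][0-9][ABCEGHJ-NPRSTV-Z][0-9]$"
)


def leet_postal_code(word):
    """Return the postal-code-shaped l33t spelling of a six-letter word, or None."""
    if len(word) != 6 or not word.isalpha():
        return None
    chars = []
    for position, letter in enumerate(word.lower()):
        if position % 2 == 0:
            chars.append(letter.upper())       # 1st, 3rd, 5th stay letters
        elif letter in LEET_DIGITS:
            chars.append(LEET_DIGITS[letter])  # 2nd, 4th, 6th must become digits
        else:
            return None
    code = "".join(chars)
    return code if POSTAL_SHAPE.match(code) else None


print(leet_postal_code("kirtle"))  # K1R7L3
print(leet_postal_code("banana"))  # B4N4N4
```

This only checks that a word has the right shape; a candidate still has to be looked up against a list of postal codes that actually exist before it earns a place in the table.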

Edit: a couple of observations from redditors:

  • It doesn't appear on the list, but the postal code for the small town of Rosslyn Village, Ontario is P0T 2G0.
  • T4B 0R5 was assigned to Airdrie, Alberta; it would have been better if it had been assigned to Tabor! (Mmm, Tabor corn!)

ailing A1L1N6 Paradise, NL
bonito B0N1T0 Hants County (Shubenacadie), NS
blasts B1A5T5 Glace Bay, NS
bisect B1S3C7 Sydney Central, NS
begins B3G1N5 Eastern Passage, NS
begirt B3G1R7 Eastern Passage, NS
behest B3H3S7 Halifax Lower Harbour, NS
belike B3L1K3 Halifax Central, NS
belive B3L1V3 Halifax Central, NS
bemist B3M1S7 Halifax Bedford Basin, NS
bemixt B3M1X7 Halifax Bedford Basin, NS
bemata B3M4T4 Halifax Bedford Basin, NS
berime B3R1M3 Halifax South, NS
betime B3T1M3 Lakeside, NS
betise B3T1S3 Lakeside, NS
bezils B3Z1L5 Tantallon, NS
bezels B3Z3L5 Tantallon, NS
bezant B3Z4N7 Tantallon, NS
baning B4N1N6 Kentville, NS
banana B4N4N4 Kentville, NS
clasts C1A5T5 Charlottetown Southeast Prince Edward Island Provincial Government, PE
elates E1A7E5 Dieppe Moncton East, NB
gibers G1B3R5 Beauport North, QC
givens G1V3N5 Sainte-Foy Northeast, QC
givers G1V3R5 Sainte-Foy Northeast, QC
gelate G3L4T3 Saint-Raymond, QC
genoms G3N0M5 Sainte-Catherine-de-la-Jacques-Cartier, QC
genips G3N1P5 Sainte-Catherine-de-la-Jacques-Cartier, QC
garage G4R4G3 Sept-Îles Southeast, QC
hiking H1K1N6 Anjou East, QC
hikers H1K3R5 Anjou East, QC
hiring H1R1N6 Saint-Léonard West, QC
hirers H1R3R5 Saint-Léonard West, QC
hiving H1V1N6 Maisonneuve, QC
hegira H3G1R4 Downtown Montreal Southeast, QC
hereat H3R3A7 Mount Royal Central, QC
hexers H3X3R5 Hampstead, QC
haeing H4E1N6 Ville Émard, QC
haling H4L1N6 Saint-Laurent Inner Northeast, QC
halite H4L1T3 Saint-Laurent Inner Northeast, QC
halers H4L3R5 Saint-Laurent Inner Northeast, QC
halest H4L3S7 Saint-Laurent Inner Northeast, QC
halala H4L4L4 Saint-Laurent Inner Northeast, QC
haring H4R1N6 Saint-Laurent Central, QC
haslet H4S1E7 Saint-Laurent Southwest, QC
hawing H4W1N6 Côte-Saint-Luc West, QC
jebels J3B3L5 Saint-Jean-sur-Richelieu Central, QC
jalaps J4L4P5 Longueuil Southeast, QC
japing J4P1N6 Saint-Lambert North, QC
jawing J4W1N6 Brossard Northwest, QC
kirtle K1R7L3 Ottawa (West Downtown area), ON
kiting K1T1N6 Gloucester (Blossom Park / Hunt Club East / Leitrim), ON
kiters K1T3R5 Gloucester (Blossom Park / Hunt Club East / Leitrim), ON
kabiki K4B1K1 Cumberland Township, ON
liaise L1A1S3 Port Hope, ON
ligans L1G4N5 Oshawa Central, ON
limina L1M1N4 Whitby North, ON
liming L1M1N6 Whitby North, ON
linacs L1N4C5 Whitby Southeast, ON
lipins L1P1N5 Whitby Southwest, ON
liters L1T3R5 Ajax Northwest, ON
living L1V1N6 Pickering Southwest, ON
livens L1V3N5 Pickering Southwest, ON
livers L1V3R5 Pickering Southwest, ON
livest L1V3S7 Pickering Southwest, ON
lebens L3B3N5 Welland East, ON
lepers L3P3R5 Markham Central, ON
levins L3V1N5 Orillia, ON
levels L3V3L5 Orillia, ON
levers L3V3R5 Orillia, ON
levant L3V4N7 Orillia, ON
lexica L3X1C4 Newmarket Southwest, ON
lacing L4C1N6 Richmond Hill Southwest, ON
lacers L4C3R5 Richmond Hill Southwest, ON
lagers L4G3R5 Aurora, ON
laking L4K1N6 Concord, ON
lakers L4K3R5 Concord, ON
lamina L4M1N4 Barrie North, ON
laming L4M1N6 Barrie North, ON
lanose L4N0S3 Barrie South, ON
lanate L4N4T3 Barrie South, ON
lapels L4P3L5 Keswick, ON
larine L4R1N3 Midland, ON
latens L4T3N5 Mississauga (Malton), ON
lawine L4W1N3 Mississauga (Matheson / East Rathwood), ON
lawing L4W1N6 Mississauga (Matheson / East Rathwood), ON
lazies L4Z1E5 Mississauga (West Rathwood / East Hurontario / SE Gateway), ON
lazing L4Z1N6 Mississauga (West Rathwood / East Hurontario / SE Gateway), ON
milers M1L3R5 Scarborough (The Golden Mile / Clairlea / Oakridge / Birchmount Park East), ON
milage M1L4G3 Scarborough (The Golden Mile / Clairlea / Oakridge / Birchmount Park East), ON
mimics M1M1C5 Scarborough (Cliffside / Cliffcrest / Scarborough Village West), ON
miming M1M1N6 Scarborough (Cliffside / Cliffcrest / Scarborough Village West), ON
minima M1N1M4 Scarborough (Birch Cliff / Cliffside West), ON
minims M1N1M5 Scarborough (Birch Cliff / Cliffside West), ON
miners M1N3R5 Scarborough (Birch Cliff / Cliffside West), ON
miring M1R1N6 Scarborough (Wexford / Maryvale), ON
miseat M1S3A7 Scarborough (Agincourt), ON
misact M1S4C7 Scarborough (Agincourt), ON
macles M4C1E5 East York (Woodbine Heights), ON
macing M4C1N6 East York (Woodbine Heights), ON
macers M4C3R5 East York (Woodbine Heights), ON
macaws M4C4W5 East York (Woodbine Heights), ON
magics M4G1C5 East York (Leaside), ON
making M4K1N6 East Toronto (The Danforth West / Riverdale), ON
makers M4K3R5 East Toronto (The Danforth West / Riverdale), ON
malice M4L1C3 East Toronto (India Bazaar / The Beaches West), ON
maline M4L1N3 East Toronto (India Bazaar / The Beaches West), ON
manics M4N1C5 Central Toronto (Lawrence Park East), ON
manila M4N1L4 Central Toronto (Lawrence Park East), ON
marine M4R1N3 Central Toronto (North Toronto West), ON
marina M4R1N4 Central Toronto (North Toronto West), ON
mavies M4V1E5 Central Toronto (Summerhill West / Rathnelly / South Hill / Forest Hill SE / Deer Park), ON
mavins M4V1N5 Central Toronto (Summerhill West / Rathnelly / South Hill / Forest Hill SE / Deer Park), ON
mawing M4W1N6 Downtown Toronto (Rosedale), ON
maxima M4X1M4 Downtown Toronto (St. James Town / Cabbagetown), ON
maxims M4X1M5 Downtown Toronto (St. James Town / Cabbagetown), ON
nihils N1H1L5 Guelph Northwest, ON
native N4T1V3 Woodstock North, ON
pomelo P0M3L0 Algoma, Sudbury District and Greater Sudbury (Chelmsford), ON
pipits P1P1T5 Gravenhurst, ON
pecans P3C4N5 Greater Sudbury (Gatchell / West End / Little Britain), ON
penile P3N1L3 Greater Sudbury (Val Caron), ON
panics P4N1C5 Timmins Southeast, ON
paries P4R1E5 Timmins West, ON
parles P4R1E5 Timmins West, ON
paring P4R1N6 Timmins West, ON
psalms P5A1M5 Elliot Lake, ON
rococo R0C0C0 North Interlake (Stonewall), MB
realms R3A1M5 Winnipeg (Centennial), MB
rebops R3B0P5 Winnipeg (Chinatown / Civic Centre / Exchange District), MB
recons R3C0N5 Winnipeg (Broadway / The Forks / Portage and Main) Manitoba Provincial Government, MB
regime R3G1M3 Winnipeg (Minto / St. Mathews / Wolseley), MB
regina R3G1N4 Winnipeg (Minto / St. Mathews / Wolseley), MB
regive R3G1V3 Winnipeg (Minto / St. Mathews / Wolseley), MB
reject R3J3C7 Winnipeg (St. James-Assiniboia SE), MB
relics R3L1C5 Winnipeg (River Heights East), MB
relict R3L1C7 Winnipeg (River Heights East), MB
reline R3L1N3 Winnipeg (River Heights East), MB
relist R3L1S7 Winnipeg (River Heights East), MB
relive R3L1V3 Winnipeg (River Heights East), MB
remote R3M0T3 Winnipeg (River Heights Central), MB
remove R3M0V3 Winnipeg (River Heights Central), MB
remint R3M1N7 Winnipeg (River Heights Central), MB
remise R3M1S3 Winnipeg (River Heights Central), MB
remits R3M1T5 Winnipeg (River Heights Central), MB
remelt R3M3L7 Winnipeg (River Heights Central), MB
renigs R3N1G5 Winnipeg (River Heights West), MB
renins R3N1N5 Winnipeg (River Heights West), MB
repose R3P0S3 Winnipeg (Fort Garry NW / Tuxedo), MB
repine R3P1N3 Winnipeg (Fort Garry NW / Tuxedo), MB
repins R3P1N5 Winnipeg (Fort Garry NW / Tuxedo), MB
rerose R3R0S3 Winnipeg (Assiniboine South / Betsworth), MB
rerise R3R1S3 Winnipeg (Assiniboine South / Betsworth), MB
retime R3T1M3 Winnipeg (Fort Garry NE / University of Manitoba), MB
retire R3T1R3 Winnipeg (Fort Garry NE / University of Manitoba), MB
retene R3T3N3 Winnipeg (Fort Garry NE / University of Manitoba), MB
retake R3T4K3 Winnipeg (Fort Garry NE / University of Manitoba), MB
revise R3V1S3 Winnipeg (Fort Garry South), MB
revive R3V1V3 Winnipeg (Fort Garry South), MB
rewins R3W1N5 Winnipeg (Grassie / Pequis), MB
rewire R3W1R3 Winnipeg (Grassie / Pequis), MB
senile S3N1L3 Yorkton, SK
seniti S3N1T1 Yorkton, SK
saning S4N1N6 Regina Northeast and East Central, SK
sanest S4N3S7 Regina Northeast and East Central, SK
satori S4T0R1 Regina West, SK
satang S4T4N6 Regina West, SK
satara S4T4R4 Regina West, SK
saving S4V1N6 Regina Southeast, SK
strobe S7R0B3 Saskatoon Northwest, SK
stroke S7R0K3 Saskatoon Northwest, SK
striae S7R1A3 Saskatoon Northwest, SK
tibias T1B1A5 Medicine Hat South, AB
timing T1M1N6 Coaldale, AB
tiring T1R1N6 Brooks, AB
tabors T4B0R5 Airdrie West, AB
tables T4B1E5 Airdrie West, AB
tabers T4B3R5 Airdrie West, AB
tangle T4N6L3 Red Deer Central, AB
taping T4P1N6 Red Deer North, AB
tapirs T4P1R5 Red Deer North, AB
tapeta T4P3T4 Red Deer North, AB
tarocs T4R0C5 Red Deer South, AB
taring T4R1N6 Red Deer South, AB
taxies T4X1E5 Beaumont, AB
taxing T4X1N6 Beaumont, AB
taxite T4X1T3 Beaumont, AB
vibist V1B1S7 Vernon East, BC
vicars V1C4R5 Cranbrook, BC
vigils V1G1L5 Dawson Creek, BC
viking V1K1N6 Merritt, BC
vimina V1M1N4 Langley Township North, BC
vining V1N1N6 Castlegar, BC
vinals V1N4L5 Castlegar, BC
vitals V1T4L5 Vernon Central, BC
vixens V1X3N5 Kelowna East Central, BC
veloce V3L0C3 New Westminster Northeast, BC
velars V3L4R5 New Westminster Northeast, BC
venine V3N1N3 Burnaby (East Big Bend / Stride Avenue / Edmonds / Cariboo-Armstrong), BC
venins V3N1N5 Burnaby (East Big Bend / Stride Avenue / Edmonds / Cariboo-Armstrong), BC
venire V3N1R3 Burnaby (East Big Bend / Stride Avenue / Edmonds / Cariboo-Armstrong), BC
verist V3R1S7 Surrey North, BC
versal V3R5A1 Surrey North, BC
verste V3R5T3 Surrey North, BC
vesica V3S1C4 Surrey East, BC
vestal V3S7A1 Surrey East, BC
vexils V3X1L5 Surrey Lower West, BC
vexers V3X3R5 Surrey Lower West, BC
vacate V4C4T3 Delta Northeast, BC
vagina V4G1N4 Delta East Central, BC
vakils V4K1L5 Delta Central, BC
valise V4L1S3 Delta Southeast, BC

Tuesday, March 29, 2016

Most characteristic words in successful and unsuccessful petitions

In 2011, the White House launched We The People, a platform for citizens to submit and sign petitions. Once petitions reached a certain threshold (25,000 signatures at first, then raised to 100,000 in 2013), the Administration composed an official response. Famously, the White House even responded after a petition from 2012 for the U.S. government to build a Death Star received enough signatures.

I was curious to explore the differences between the successful petitions (i.e. the ones that garnered enough signatures to warrant a response) and the unsuccessful. There are, of course, many factors, including how well the petitioner publicized it, but I was particularly interested in the differences in words used.

Here are the results, followed below by some observations and finally the technobabble for those who want to know how and why I performed the analysis this way:


First of all, some numbers: there were 294 successful petitions with a total of 21,527 words (only counting each word once per petition, as explained below), and 3,868 unsuccessful petitions with a total of 304,706 words.

The words most characteristic of successful petitions have more extreme log-likelihood values; the top 25 range from 11.94 to 29.72, while the top 10 among unsuccessful petitions range from 5.06 to 12.91. This could be due to the fact that there are so many more unsuccessful petitions, and probably a greater range of reasons they were unsuccessful.

If you'd like to see a random selection of five petition titles containing each of the words listed in the graphic, head on over to my other, nerdier blog.

"Gun" is the most characteristic word in successful petitions. (Again, "successful" means there were enough signatures for the White House to write a response; few if any of the petitions' requests were enacted.) There were both successful pro- and anti-gun control petitions; this is an issue that people on both sides feel passionate enough about to participate in this project.

There are also four names from Netflix's Making a Murderer series: Avery (the defendant), Halbach (the victim) and both first and last names of Brendan Dassey, the other defendant.

The Westboro Baptist Church is a popular item, as are "tragedy," "Connecticut" and "CT" from the Newtown shootings. (Interestingly, "Newtown" had a somewhat lower log-likelihood keyness of 8.54, indicating it was a little more common in unsuccessful petitions than the other terms.)

Some superlative words seem to be common in successful petitions: "imperative" and "definitely".

As for the unsuccessful petitions, the most overrepresented word is "say". A skim through these petitions (and there are a lot of them) reveals that many of them are motivated by some perceived injustice of perception, e.g. "people say X, but I believe Y".

The word "I" is associated with unsuccessful petitions, as well. Those who write (possibly rant) in the first person don't get a lot of support; perhaps the fact that they reference themselves is an indication they don't have enough organizational support to get a lot of signatures.

"Genocide" turns up a lot in unsuccessful petitions. Most of the time, it's in the phrase "white genocide".

"2014" turns up more characteristically than any other year because of the number of unsuccessful petitions calling for a boycott of the Sochi Olympics and/or action regarding Ukraine.


There are plenty of ways one could approach this project of differentiating between the words contained in successful and unsuccessful petitions, like topic modeling or exploratory machine learning, but sometimes simple is best, especially since, as I mentioned, the specific words used in a petition are likely to be a secondary feature or at best a proxy feature (i.e. one that implies something else that is actually causal) in the determining factors of what makes a petition successful.

So, since we have two corpora (words used in successful petitions, words used in unsuccessful petitions), and a frequency metric, why not do a simple keyness analysis? Dunning log-likelihood keyness is one of two approaches (the other being chi-squared) to determining the significance of differences in item frequencies between two datasets; it's used a lot in corpus linguistics, although it's falling out of favor as more advanced techniques become computationally feasible.

The nice thing about keyness is that it is a measure of significance, combining the effect of ratio and absolute difference between frequencies, and the size of the corpora. This makes it unnecessary to eliminate stopwords, which is an arbitrary process that always rubs me the wrong way. If "and" is used very often, it still might be more significantly found in one corpus than another, but the threshold of difference in frequencies is much higher than for a less common word. Conversely, if an extremely rare word is found 10 times in one large corpus and once in another, it's unlikely to pass the threshold of significance even though the ratio between the frequencies is high.
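For readers who'd rather see it than read about it, here's a minimal sketch of the calculation. It is not the code from the gist: it assumes the two-term form of the log-likelihood statistic that's common in corpus linguistics, and naive whitespace tokenization. The function names are made up for illustration.

```python
import math
from collections import Counter


def document_frequencies(petitions):
    """Count each word at most once per petition (naive whitespace tokens)."""
    counts = Counter()
    for text in petitions:
        counts.update(set(text.lower().split()))
    return counts


def log_likelihood(a, b, total_a, total_b):
    """Two-term log-likelihood keyness for one word.

    a, b: the word's count in corpus A and corpus B.
    total_a, total_b: total word counts of corpus A and corpus B.
    """
    # Counts we'd expect if the word were equally frequent in both corpora.
    expected_a = total_a * (a + b) / (total_a + total_b)
    expected_b = total_b * (a + b) / (total_a + total_b)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / expected_a)
    if b > 0:
        g2 += b * math.log(b / expected_b)
    return 2 * g2


# Corpus sizes from this post; the word counts (40 and 100) are invented.
print(log_likelihood(40, 100, 21527, 304706))
```

A word with the same relative frequency in both corpora scores zero, and the score grows with both the ratio and the raw counts. The statistic itself doesn't say which corpus the word favors; for that, compare a with its expected count.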

Here is a Github gist containing the code used in this analysis.

Here's the 4.4 MiB csv file made from the SQL dump in March 2016.

Comments and criticism are very welcome. I never claim to have found the perfect solution for an analysis, and am always open to suggestion.

Wednesday, March 23, 2016

My personality, according to IBM Watson

I just signed up for a 30-day trial of IBM Watson's Bluemix, a set of mostly language processing APIs that are, in some cases, quite illuminating, and in other cases, rather entertaining.

One of the tools is Personality Insights, which will take any text and algorithmically predict the personality traits of the author. When I saw there was even a one-click tool to submit the RSS feed of a blog, how could I resist? I submitted

The results are... interesting. Like a horoscope, I wonder how much is overgeneralization; I hope there is some overfitting going on, because some of it is not at all flattering!

For one, my emotional range is only 26.8%. My reaction: Meh.

And while I agree I'm not the most outgoing person in the world, a 0.8% score for "extraversion" is a little extreme! Ah well, at least I can comfort myself that unlike IBM's Jeopardy!-winning supercomputer, I know how to spell "extroversion". Well, at least today I learned that the proper (but less common) spelling of what most people call "extroversion" is "extraversion", because Latin.


Values
    Conservation: 100%
    Self-enhancement: 91%
    Hedonism: 3%
    Openness to change: 1.3%
    Self-transcendence: 0.9%
Needs
    Structure: 66.7%
    Curiosity: 48.1%
    Challenge: 29.2%
    Ideal: 12.5%
    Self-expression: 4.9%
    Practicality: 4.5%
    Closeness: 4.3%
    Harmony: 3.6%
    Stability: 2.5%
    Love: 2.5%
    Liberty: 1.8%
    Excitement: 0.9%
The Big 5
    Openness: 97.6%  
        Imagination: 100%
        Authority-challenging: 100%
        Intellect: 98.8%
        Adventurousness: 97.7%
        Artistic interests: 1.5%
        Emotionality: 0.8%
    Conscientiousness: 72.6%  
        Cautiousness: 100%
        Achievement striving: 88.8%
        Self-discipline: 48.1%
        Self-efficacy: 9.5%
        Dutifulness: 2.9%
        Orderliness: 1.1%
    Emotional range: 26.8%  
        Self-consciousness: 32.2%
        Immoderation: 27.8%
        Susceptible to stress: 18.6%
        Prone to worry: 17.4%
        Fiery: 12.8%
        Melancholy: 2.6%
    Agreeableness: 2.1%  
        Cooperation: 95.6%
        Trust: 93%
        Uncompromising: 32%
        Sympathy: 1.6%
        Altruism: 0.9%
        Modesty: 0.7%
    Extraversion: 0.8%  
        Excitement-seeking: 0.9%
        Activity level: 0.8%
        Assertiveness: 0.8%
        Gregariousness: 0.8%
        Outgoing: 0.7%
        Cheerfulness: 0.6%
