In 2011, the White House launched We The People, a platform for citizens to submit and sign petitions. Once a petition reached a certain threshold (25,000 signatures at first, then raised to 100,000 in 2013), the Administration composed an official response. Famously, the White House even responded after a 2012 petition for the U.S. government to build a Death Star received enough signatures.
I was curious to explore the differences between the successful petitions (i.e. the ones that garnered enough signatures to warrant a response) and the unsuccessful. There are, of course, many factors, including how well the petitioner publicized it, but I was particularly interested in the differences in words used.
Here are the results, followed below by some observations and finally the technobabble for those who want to know how and why I performed the analysis this way:
First of all, some numbers: there were 294 successful petitions with a total of 21,527 words (only counting each word once per petition, as explained below), and 3,868 unsuccessful petitions with a total of 304,706 words.
The words most characteristic of successful petitions have more extreme log-likelihood values; the top 25 range from 11.94 to 29.72, while the top 10 among unsuccessful petitions range from 5.06 to 12.91. This could be because there are so many more unsuccessful petitions, and probably a wider range of reasons why they were unsuccessful.
If you'd like to see a random selection of five petition titles containing each of the words listed in the graphic, head on over to my other, nerdier blog.
"Gun" is the most characteristic word in successful petitions. (Again, "successful" means there were enough signatures for the White House to write a response; few if any of the petitions' requests were enacted.) There were successful petitions both for and against gun control; this is an issue that people on both sides feel passionate enough about to participate in this project.
There are also four names from Netflix's Making a Murderer series: Avery (the defendant), Halbach (the victim) and both first and last names of Brendan Dassey, the other defendant.
The Westboro Baptist Church is a popular item, as are "tragedy," "Connecticut" and "CT" from the Newtown shootings. (Interestingly, "Newtown" had a somewhat lower log-likelihood keyness of 8.54, indicating it was a little more common in unsuccessful petitions than the other terms.)
Some emphatic words seem to be common in successful petitions: "imperative" and "definitely".
As for the unsuccessful petitions, the most overrepresented word is "say". A skim through these petitions (and there are a lot of them) reveals that many of them are motivated by a perceived injustice in public perception, e.g. "people say X, but I believe Y".
The word "I" is associated with unsuccessful petitions, as well. Those who write (possibly rant) in the first person don't get a lot of support; perhaps the fact that they reference themselves is an indication they don't have enough organizational support to get a lot of signatures.
"Genocide" turns up a lot in unsuccessful petitions. Most of the time, it's in the phrase "white genocide".
"2014" turns up more characteristically than any other year because of the number of unsuccessful petitions calling for a boycott of the Sochi Olympics and/or action regarding Ukraine.
There are plenty of ways one could approach this project of differentiating between the words contained in successful and unsuccessful petitions, like topic modeling or exploratory machine learning, but sometimes simple is best. As I mentioned, the specific words used in a petition are likely to be a secondary feature, or at best a proxy feature (i.e. one that implies something else that is actually causal), among the factors that determine whether a petition is successful.
So, since we have two corpora (words used in successful petitions, words used in unsuccessful petitions), and a frequency metric, why not do a simple keyness analysis? Dunning log-likelihood keyness is one of two approaches (the other being chi-squared) to determining the significance of differences in item frequencies between two datasets; it's used a lot in corpus linguistics, although it's falling out of favor as more advanced techniques become computationally feasible.
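For reference, here is one formulation of log-likelihood keyness that is common in corpus linguistics. I'm assuming the two-term form for illustration; the code in the gist may compute it differently. Here a and b are the word's frequencies in the two corpora, c and d are the total sizes of the two corpora, and E_1 and E_2 are the frequencies you'd expect if the word were spread evenly across both:

```latex
G^2 = 2\left( a \ln\frac{a}{E_1} + b \ln\frac{b}{E_2} \right),
\qquad
E_1 = \frac{c\,(a+b)}{c+d},
\qquad
E_2 = \frac{d\,(a+b)}{c+d}
```

A term with a zero count contributes nothing to the sum, by the usual convention that 0 ln 0 = 0.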
The nice thing about keyness is that it is a measure of significance, combining the effect of ratio and absolute difference between frequencies, and the size of the corpora. This makes it unnecessary to eliminate stopwords, which is an arbitrary process that always rubs me the wrong way. If "and" is used very often, it still might be more significantly found in one corpus than another, but the threshold of difference in frequencies is much higher than for a less common word. Conversely, if an extremely rare word is found 10 times in one large corpus and once in another, it's unlikely to pass the threshold of significance even though the ratio between the frequencies is high.
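To make that concrete, here is a minimal Python sketch of the calculation. It is not the code from the gist: the function names are made up for illustration, the whitespace tokenizer is a deliberately naive assumption, and it uses the two-term form of the log-likelihood statistic. It counts each word at most once per petition, then compares the two sets of counts.

```python
import math
from collections import Counter


def document_frequencies(petitions):
    """For each word, count how many petitions contain it (once per petition).

    Assumption: lowercase + whitespace split stands in for real tokenization.
    """
    counts = Counter()
    for text in petitions:
        counts.update(set(text.lower().split()))
    return counts


def log_likelihood(a, b, c, d):
    """Two-term log-likelihood keyness.

    a, b: the word's frequency in corpus 1 and corpus 2
    c, d: the total number of counted words in corpus 1 and corpus 2
    """
    expected1 = c * (a + b) / (c + d)
    expected2 = d * (a + b) / (c + d)
    total = 0.0
    # A zero count contributes nothing (0 * ln 0 is treated as 0).
    if a > 0:
        total += a * math.log(a / expected1)
    if b > 0:
        total += b * math.log(b / expected2)
    return 2 * total


def keyness_table(corpus1, corpus2):
    """Return {word: keyness} for every word seen in either corpus."""
    freq1 = document_frequencies(corpus1)
    freq2 = document_frequencies(corpus2)
    size1, size2 = sum(freq1.values()), sum(freq2.values())
    return {
        word: log_likelihood(freq1[word], freq2[word], size1, size2)
        for word in set(freq1) | set(freq2)
    }
```

Calling keyness_table(successful_texts, unsuccessful_texts) on two lists of petition texts (hypothetical variable names) gives one score per word. Note that this score is unsigned: to tell which corpus a word is characteristic of, you would also compare its relative frequency in each.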
Here is a GitHub gist containing the code used in this analysis.
Here's the 4.4 MiB CSV file made from the whitehouse.gov SQL dump in March 2016.
Comments and criticism are very welcome. I never claim to have found the perfect solution for an analysis, and am always open to suggestion.