Thursday, May 7, 2015

Most characteristic words in pro- and anti-feminist tweets

Here, based on my analysis (which I'll get to in a moment), are clouds of the 40 words most characteristic of anti-feminist and pro-feminist tweets, respectively.


[word cloud: anti-feminist]   [word cloud: pro-feminist]

Word clouds may be only semi-quantitative, but they have other virtues, like recognizability and explorability. For the purists, there's a bar chart below.

I'll mostly talk about my results here; the full methodology is available on my other, nerdier blog, which links to all the code so you can reproduce this analysis yourself, if you so desire. (We call ourselves data scientists, and science is supposed to be reproducible, so I strongly believe I should empower you to reproduce my results if you want ... or improve on them!) Please also read the caveats I've put at the bottom of this post.

Full disclosure: I call myself a feminist. But I believe my only agenda is to elucidate the differences in vocabulary that always arise around controversial topics. As CGP Grey explains brilliantly, in socially networked, ideologically polarized groups like Republicans and Democrats or atheists and religious people, members mostly interact within their own group, only rarely participating in a rapprochement or (more likely) flame war with the other side. This is fertile ground for divergent vocabulary, especially in this case, when one group defines itself as opposed to the other (as if Democrats called themselves non-Republicans). I am not going into this project with a pro-feminist agenda, but of course I acknowledge I am biased. I worked hard to try to counter those biases, and I've made the code available for anyone to check my work. Feel free to disagree!

A brief (for me) description of the project: In January, I wrote a constantly running program that periodically searches the newest tweets for the terms 'feminism', 'feminist' or 'feminists' (at random intervals and to random depths, potentially as many as 1,500 tweets every 15 minutes), and collected almost 1,000,000 tweets up to April 2015. Then, with five teammates (we won both the Data Science and the Natural Language Processing prizes at the Montreal Big Data Week Hackathon on April 19, 2015), we manually curated 1,000 tweets as anti-feminist, pro-feminist or neither (decidedly not an obvious process; read more about it here). We used machine learning to classify the other 390,000 tweets (after we eliminated retweets and duplicates, i.e. anything that required only clicking instead of typing), then used the log-likelihood keyness method to find which words (or punctuation marks, etc.) were most overrepresented in each set.
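
If you'd rather not dig through the repository for the keyness step, here is a minimal sketch of the log-likelihood (G²) keyness calculation; the function names and the way the two corpora are passed in are simplified for illustration and don't match the repository code exactly.

    # Minimal sketch: log-likelihood (G2) keyness between two token lists.
    # Simplified for illustration; not the exact code in the repository.
    import math
    from collections import Counter

    def log_likelihood(freq_a, freq_b, total_a, total_b):
        """G2 for one token: freq_a/freq_b are its counts in corpora A and B,
        total_a/total_b are the corpus sizes in tokens."""
        expected_a = total_a * (freq_a + freq_b) / (total_a + total_b)
        expected_b = total_b * (freq_a + freq_b) / (total_a + total_b)
        ll = 0.0
        if freq_a > 0:
            ll += freq_a * math.log(freq_a / expected_a)
        if freq_b > 0:
            ll += freq_b * math.log(freq_b / expected_b)
        return 2 * ll

    def keyness(tokens_a, tokens_b, top_n=40):
        """The top_n tokens most overrepresented in corpus A relative to B."""
        counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
        total_a, total_b = len(tokens_a), len(tokens_b)
        scores = {
            tok: log_likelihood(counts_a[tok], counts_b[tok], total_a, total_b)
            for tok in counts_a
            if counts_a[tok] / total_a > counts_b[tok] / total_b
        }
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

Schematically, keyness(anti_tokens, pro_tokens) and keyness(pro_tokens, anti_tokens) give the two ranked lists behind the word clouds above (the real pipeline has more bookkeeping).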

And here are my observations:

1. Pro-feminists (PFs) tweet about feminism and feminist (adjective), anti-feminists (AFs) tweet about feminists, as a group.
Since they're search terms, at least one of those words was in every tweet, so their absolute log-likelihood values are inflated; I left them out of the word clouds. However, the differences between them are valid, and instructive (but see the caveats below). AFs seem to be more concerned with feminists as a collective noun (they tweet about the people they oppose, not the movement or ideology), while PFs tweet about feminism or feminist (usually as an adjective).
2. PFs use first- and second-person pronouns, AFs use third-person pronouns
Similarly to #1 above, and inevitably when one group defines itself as not belonging to the other, AFs tweet about feminists as a plural group of other people, while feminists tweet about and among themselves. Note that in NLP, pronouns are usually so common that they're considered "stopwords" and are eliminated from the analysis. But with 140-character tweets, I figured every word was chosen with a certain amount of care, so I kept them in (see the tokenization sketch after this list).
3. The groups use different linking words to define feminism
PFs talk about what feminism is for or about, why we need feminism, what feminism is and isn't, what feminists believe; AFs tweet about what feminists want, ask can someone explain why feminists engage in certain behaviors which they don't get, say feminists are too <insert adjective>, and often use the construction With <this, then that>.
4. PFs link to external content, AFs link to local and self-created content.
PFs link more, in general, to http content on other websites; AFs use the #gamergate hashtag, reference @meninisttweet, and link to @youtube videos rather than traditional media (that term doesn't appear in the word cloud, but it has a log-likelihood of 444 in favor of AFs). AFs also reference their platform, Twitter, a lot; feminists don't, presumably because they're also interacting in other ways.
5. AFs use more punctuation
Besides "feminists", the number-one token for AFs was the question mark; they have a lot of questions for and about feminists, many of them rhetorical. The exclamation point wasn't far behind, followed by the quotation mark, both to quote and to show irony. PFs start tweets with '+' and "=" (usually as '==>') for emphasis. Rounding out the non-alphabetic characters, AFs use 2 as a shorter form of 'to' or 'too', while PFs link more often to listicles with 5 items.
6. AFs tweet more about feminist history.
Unsurprisingly, PFs tweet about their goals, equality and rights, and defend themselves against accusations of misandry. But it's the AFs who tweet about modern and third-wave feminism, displaying knowledge about the history of the movement.
7. PFs use more gender-related terms
This one is all PF: they reference gender, genders, sexes, men and women more than AFs.
8. AFs use more pejorative terms
AFs use fuck, hate, annoying and, unfortunately, rape a lot; they also use derisive terms like lol, the "face with tears of joy" emoji and smh (shaking my head, not in the top 40 but still a high log-likelihood value of 484).
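
An aside on tokenization, relevant to points 2 and 5 above: here is a simplified sketch of a tweet tokenizer that keeps punctuation marks, #hashtags, @handles and stopwords (pronouns included) as separate tokens instead of discarding them. The regular expression is illustrative only and is not the exact tokenizer used in the project.

    # Illustrative only: keep punctuation, #hashtags, @handles and stopwords
    # as tokens rather than throwing them away.
    import re

    TOKEN_RE = re.compile(r"[#@]?\w+|[^\w\s]")  # a word (optionally #/@-prefixed), or one punctuation mark

    def tokenize(tweet):
        return TOKEN_RE.findall(tweet.lower())

    print(tokenize("Can someone explain why feminists do this?!"))
    # ['can', 'someone', 'explain', 'why', 'feminists', 'do', 'this', '?', '!']
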
Caveats:
  • Selection bias: the dataset does not include any tweets with pro- or anti-feminist sentiment that do not include the search terms 'feminist', 'feminists' or 'feminism'
  • Noise in the signal, part 1. It's difficult to analyze tweets for the underlying attitude (pro- or anti-feminist) of the author; it involves some mind-reading. We tried to mitigate this by using a "neither pro nor anti" category and classifying any tweet we had the slightest doubt about as such. Of course, that just shifts the noise elsewhere, but it hopefully keeps down the misclassifications between our two groups of interest, pro- and anti-feminist.
  • Noise in the signal, part 2. We used 1,000 tweets to predict the attitudes of 390,000 tweets. Obviously this is going to be an imperfect mapping of tweet to underlying attitude. This kind of analysis does not require anywhere near 100% accuracy (we got between 40% and 60%, depending on the metric, both better than random choice among our three classes, which would give about 33%; a minimal sketch of this kind of check appears after these caveats). The log-likelihood method is robust, and will tend to wash out misclassified words. In other words, we may not be confident that these top 40 words and tokens are the same top 40 that would result if we manually curated all 390,000 tweets, but we are confident they are significantly characteristic of the two groups we identified in our curated tweets.
  • If you have doubts as to my methods or results, great, that's what science is all about. Please feel free to analyze the code, the dataset, the manual curation, and the log-likelihood results linked to in my other blog.
  • It is not my goal to criticize or mock anti-feminists, and I hope I've kept my tone analytical. There's a Venn diagram between stuff feminists say (and of course they don't all say anywhere near the same thing), stuff anti-feminists say, and things I agree with, and it's not straightforward. What interested me here was the language. That said, I hope I've contributed a little bit to understanding the vocabulary surrounding the issue, and in general, I believe more knowledge is better than less knowledge.
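
Since the second caveat mentions classifier accuracy, here is a hedged sketch of how that check might look: train on the hand-labelled tweets, hold some out, and compare accuracy against the one-in-three chance level for three classes. The TF-IDF plus logistic-regression pipeline and the variable names are a simplification for illustration, not the actual model in the repository.

    # Hedged sketch of the accuracy check; not the actual model used.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    def evaluate(texts, labels):
        """texts: the ~1,000 curated tweets; labels: 1 (pro), -1 (anti), 0 (neither)."""
        X_train, X_test, y_train, y_test = train_test_split(
            texts, labels, test_size=0.2, random_state=0, stratify=labels)
        vec = TfidfVectorizer(lowercase=True)
        clf = LogisticRegression(max_iter=1000)
        clf.fit(vec.fit_transform(X_train), y_train)
        # Compare against ~0.33, the accuracy of random guessing over three classes.
        return accuracy_score(y_test, clf.predict(vec.transform(X_test)))
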
Word clouds made with Tagxedo.

26 comments:

  1. Interesting, though not surprising.
    Query: Why is the anti-feminist cloud recognizably less symmetrical? A consequence of the magnitude of the question-mark?

    1. I think you're right. The question mark is huge but doesn't take up much space, then the next seven words are still reasonably large. Whereas in the cloud on the right, the first three words take up a lot of space, but then the size drops off quickly, allowing the algorithm to fit smaller words to fill space more symmetrically.

  2. This is amazing, but I'm curious as to how much your self-identified feminism impacts your choices in the curation step. It's not that you specifically are biased; IIRC it's well known in psychology that prior biases can affect your thinking way more than you think. E.g., a feminist could be far more likely to consider agreeable and well-written tweets to be representative of feminism.

    In any case I think it would be fascinating to see how the results change when those with different ideologies do the curation step.

    1. I think prooffreader did an excellent job making unbiased observations from the data. If anything, the dataset was slightly skewed, in that the two groups are not directly comparable (i.e., apples to oranges).

    2. It was readily apparent, without them confirming it, that this was created by a feminist. Cognitive biases everywhere.

    3. Thanks for commenting, Marvin. Could you give me an example of how you think my cognitive biases were so readily apparent? I'm truly interested.

    4. This comment has been removed by the author.

    5. Well, I'm not typing out everything I had typed again; you can just look at the list here:

      http://en.wikipedia.org/wiki/List_of_cognitive_biases

      You hit pretty much every single one; I was just retyping them all in, obviously fruitlessly.

      Scientific studies done using the scientific method are scientific for a reason; your piece is not scientific at all. Essentially an op-ed. STRONG emphasis on opinion.

      Part III: http://teacher.nsrl.rochester.edu/phy_labs/appendixe/appendixe.html

      https://explorable.com/research-bias

    6. I'm sorry there was some sort of glitch in your posting. Unfortunately, posting a list of biases is in no way evidence that this project has demonstrated evidence of any of them. If you have specific examples with specific evidence, I'd love to hear it.

    7. > If you have specific examples with specific evidence, I'd love to hear it.

      That's a slightly insufferable response.

      Oh well, here goes the dime tour.

      Ambiguity effect—The tendency to avoid options for which missing information makes the probability seem "unknown".

      - You have incomplete information and are claiming it as a representative set.

      Attentional bias—The tendency of our perception to be affected by our recurring thoughts.

      - The mere fact you did this means it's on your mind/thoughts.

      Automation bias—The tendency to excessively depend on automated systems which can lead to erroneous automated information overriding correct decisions

      - Do I really have to?

      Availability cascade—A self-reinforcing process in which a collective belief gains more and more plausibility through its increasing repetition in public discourse (or "repeat something long enough and it will become true")

      - Literally


      Backfire effect—When people react to disconfirming evidence by strengthening their beliefs.

      - A hallmark of what you are doing.


      Bandwagon effect—The tendency to do (or believe) things because many other people do (or believe) the same. Related to groupthink and herd behavior

      - You have no business doing this study.


      Base rate neglect—The tendency to ignore base rate information (generic, general information) and focus on specific information (information only pertaining to a certain case)

      - Limiting an already limited sample based on your biases skews the data.


      Belief bias—An effect where someone's evaluation of the logical strength of an argument is biased by the believability of the conclusion

      - Once again, you shouldn't be doing this study, and your mere presence near the data corrupts it.


      Bias blind spot—The tendency to see oneself as less biased than other people, or to be able to identify more cognitive biases in others than in oneself

      - The mere fact that you didn't study these biases yourself and instead required me to call you out on them.


      I'm stopping here. You've already wasted dozens of dollars of therapy time, and this is just the A's and B's. And there are 19 letters to go yet.

    8. I learned a lot from your post (possibly including a few things you didn't intend). Thanks for taking the time!

  3. lol anyone can put fake data

    1. Here's a link to the 988,000 tweets I collected (250 MB): do you really think I faked them? http://dtdata.io/femtwitr/twitter_feminism_201501_201504.csv

    2. I admire your effort in trying to formulate a comment by only using words from one cloud.

  4. Do you have any gender distribution data on the tweets? I wonder if the word choice difference might be correlated to gender rather than their ideological stance.

  5. I think there's some bias in your sample, and while some conclusions you draw are still valid, others I don't think are supported in the data.

    Because you used only tweets explicitly using the word “feminism” or “feminists,” you've gathered a sample where all the anti-feminist tweets are about something/someone they disagree with, while all the feminist tweets are about someone/something “on their side.” I would expect the comments that disagree with something/someone to be more negative, dismissive, reactionary, etc. regardless of the specific ideologies involved. So I think that while you can draw conclusions about the AF and PF tweets in this sample, I don't think you can say anything about how “AFs tweet more about...” or “PFs use more...” in general, as you do here.

    We all know how polarized the internet is, and how mean and dismissive internet commenters and twitter users can be, when they're talking about those with whom they disagree. In this analysis, you're trying to draw conclusions about groups of people, but your data only uses tweets from each in certain circumstances, circumstances that, I would argue, tend to be more likely to produce certain types of language.

    I would be interested to see a parallel analysis, one in which you sampled feminists tweeting about, e.g. MRAs. In order to have valid conclusions, I think you need tweets in both groups about their own, and about the opposing ideology.

  6. The word "fuck" seems to repeat a lot on the side you've designated as anti-feminist, which makes the word-art feel aggressive and provocative. Is it possible that you've perhaps manually taught your algorithm to think "fuck" as anti-feminist?
    When I looked at your manually selected data, I noticed line such as this:
    (line 936) "5.609232117735424e+17,Just scrolled down this moron's anti feminism twitter now I need a fucking drink.,-1"
    That and other swearing lines like it seem to be tagged "anti" when in fact is pro-feminist?

    1. I certainly won't claim the curation is anywhere near perfect (we're only human, and it can be an error-prone and subjective process -- with a little more money, I'd use Mechanical Turk and cross-validate it), but I will point out for your specific example that the tokens were not stemmed or lemmatized, i.e. "fucking" was analyzed totally separately (and was used with similar frequencies by both classes) from "fuck".

    2. I have extensive knowledge of MTurk, and by using it, you will add another statistical bias called "hidden population". Once again, your results will not represent a true cross-section as they will only apply to the "hidden population" and not the "real population".

  7. > it's the AFs who tweet about "modern" and "third-wave" feminism, displaying knowledge about the history of the movement

    A great many antifeminists seem to refer to "modern" and "third-wave" feminism, etc, in purely rhetorical ways. The implication is that feminism was great until some time around 1985, but then somehow someone replaced all the good authentic feminists with crazy extremists while nobody was looking. "Why yes, I do happen to take the most antifeminist position possible on every issue in the feminist agenda that is actually up for serious debate in 2015, but goodness no, I'm not *against feminism!* I'm standing up for moderation and common sense against these lunatics who have *hijacked* feminism."

  8. Very interesting - although the results are unsurprising :P
    Thanks for doing this!

  9. "Selection bias: the dataset does not include any tweets with pro- or anti-feminist sentiment that do not include the search terms 'feminist', 'feminists' or 'feminism'"

    It's obvious feminists are gonna talk about feminism like Christians talk about the Bible. This data is not surprising, and because of that, it lacks importance.

  10. Thank you for this thought-provoking analysis and for the link to "This video will make you angry".

  11. This is a rather belated comment, but I figured you might be open to a small correction to your very enlightening post even now: the initials "SMH" usually stand for "so much hate", not "shaking my head".

  12. Wow, dude, you could have just not made a post; you are super wrong. I'm probably twice as old as you, and it's ALWAYS meant Shaking My Head. Like since the '90s. A very simple Google search would have demonstrated that: 42,000 results for hate vs. 66,000 for head.

    SMH is an acronym for “shaking my head”. ... SMH can stand for “so much hate” as well, although it is much less common than “shaking my head”.


Please leave comments & corrections here. Courtesy is appreciated.
