The Domination of "Dominatrix" among feminine -trix endings in American English

The idea for this analysis came from the wonderful Lexicon Valley podcast: in last year's episode "Sex Workers," University of Michigan professor Anne Curzan discusses the rise and fall of English words with feminine endings like -ess, -ette and -trix. Curzan points out that in recent decades the -trix suffix has become indelibly associated with the word dominatrix, at the expense of nearly all other words except a few legal terms.

This sounded quite plausible, which naturally made me suspicious, so I had to check -- and it turns out Prof. Curzan was right:


As you can see from this stream graph, the red area representing dominatrix starts to expand in the 1970s, and by the mid-2000s represents the majority of feminine-ending -trix words.

The data comes from the Corpus of Historical American English (COHA), a curated set of texts from books, magazines and newspapers spanning 1810 to 2010.

The purplish areas on top represent Latin words used in English texts, e.g. victrix, the feminine of victor. There is no hard-and-fast rule for distinguishing these from English words; I classified a word as Latin if its feminine form is much rarer in English than in Latin. The green areas at the bottom are words used mostly in legal contexts, often in probate law, to signify the feminine of executor, mediator and administrator.

That leaves the orange area, aviatrix, which interestingly peaks in the 1930s, the era of the mysterious disappearance of the most famous person ever given that title, Amelia Earhart. There appears to be a dip in the 1950s followed by a rise in the 1980s, but given the very low frequency of the word (all of the -trix words in the 1930s combined are about as frequent as extirpate or peregrinations), the dip may be exaggerated by sampling error.

A quick word about methodology: I removed words ending in -trix that are not feminine endings, such as matrix (even in Latin, it's a derivative of the already feminine mater; there is no word mator for it to be the feminine of). COHA is compiled on a per-decade basis, so I assigned each decade's count to the middle year of the decade, interpolated to points every 2.5 years and smoothed with a 10-year Hamming window (to damp the sampling error somewhat and get a sense of the signal behind the noise).
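For the curious, the resampling and smoothing step looks roughly like this in Python (a minimal sketch; the frequencies are random stand-ins, not the actual COHA counts):

    import numpy as np

    decades = np.arange(1810, 2011, 10)        # decade start years, 1810-2010
    midpoints = decades + 5                    # assign each count to the decade's middle year
    freq = np.random.rand(len(decades))        # stand-in for a word's per-decade frequency

    # Interpolate to a point every 2.5 years
    years = np.arange(midpoints[0], midpoints[-1] + 1, 2.5)
    interp = np.interp(years, midpoints, freq)

    # Smooth with a Hamming window spanning about 10 years (4 points at 2.5-year spacing)
    window = np.hamming(4)
    smoothed = np.convolve(interp, window / window.sum(), mode="same")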

Had I used a corpus larger than COHA, that would have helped with the sampling error, but I don't have access to any good candidates. The much-vaunted Google Ngrams corpus is, as in many applications, particularly misleading for this analysis -- as an uncurated corpus, it suffers greatly from availability bias. The Google Books files it is based on are heavily weighted towards the books found in libraries, especially university libraries, where there are many different editions of highly technical books and only a few representatives of anything else (for example, fiction and news). There is a supposedly fiction-only version of the corpus, but it gives very similar word frequencies, which suggests the classification algorithm behind it is problematic.

Here's what the above graph looks like from the Google Ngrams corpus:

You can see that the purple (Latin) and green (legal) areas are huge, which is to be expected when university books make up the bulk of the corpus. There is also a new blue area representing advanced mathematics texts, where the feminine forms of director, tractor and motor have specific technical meanings. The conclusion that directrix is almost as popular as dominatrix doesn't pass the smell test. The recent surge for dominatrix is still visible, although aviatrix shows no surge in the 1930s (as expected, given the paucity of news sources in this book-heavy corpus).


All of the so-called "Vatican" Ashley Madison clients are Virginians or Canadians with fat fingers

In the week or so since the Ashley Madison hack data dump, many news stories have reported its contents uncritically, written by journalists who have obviously never looked at the data.

One claim about the data that has been bandied about is that it contains numerous e-mail addresses from the Vatican; and indeed, there were 3 from vatican.com and 219 with the .va suffix (as a nation, the Vatican gets its own two-letter top-level domain).

However, even a casual perusal of these latter addresses reveals that something isn't right. Does the Vatican have schools named after cities in Virginia or ISPs with the same names as those in Canada?

The answer is simple: There are plenty of addresses in Virginia that end in .va.us or .va.gov, and the people who signed up left off the end (Ashley Madison never verified e-mails, so people didn't have to use a real one). And then because C and V are next to each other on the keyboard, some fat-fingered Canadians (a demofraphic to whuch I belomg) simply hit the wrong key.

I went through all 219 .va addresses that someone kindly posted to Pastebin and checked (with Google) whether there were .va.us, .va.gov or .ca equivalents (there were also fat-fingered, forgetful Californians who typed .va instead of .ca.gov), and I've arranged the results here. There was only one e-mail address that did not have an equivalent (thevatican.va), but that domain does not seem to exist (they use vatican.com instead). There were also 55 that I couldn't link elsewhere, but that still did not appear to belong to the Vatican.

So here's a lovely pie chart of the results (my peers hate pie charts, and I recognize their weaknesses, but they have the admirable advantage that everyone understands what they're looking at), as well as a list of the 219 .va domains reported from the Ashley Madison hack (I stripped the part before the @ for privacy's sake), along with my categorizations, which of course may contain errors; I'm sadly human.
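For what it's worth, once the addresses are categorized, the tallying and the pie chart take only a few lines (a sketch; the tab-separated input file is an assumption):

    from collections import Counter
    import matplotlib.pyplot as plt

    # One "domain<TAB>origin" pair per line, e.g. "nn.k12.va<TAB>Virginia"
    origins = [line.split("\t")[1].strip() for line in open("va_domains.tsv")]
    counts = Counter(origins)

    plt.pie(list(counts.values()), labels=list(counts.keys()), autopct="%1.0f%%")
    plt.title("Origins of the 219 .va addresses")
    plt.show()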

Disclaimer: I was brought up Catholic, but I totally rebelled, so I don't have a vested interest in protecting the Vatican from scandal, quite the opposite. I just believe in accurate reporting.



The domains:

Domain                   Origin
======================== ========
abingdonus.va            Virginia
acps.k12.va              Virginia
alexandria.lib.va        Virginia
alleghany.k12.va         Virginia
alum.va                  Virginia
apsva.va                 Virginia
augusta.k12.va           Virginia
campbell.k12.va          Virginia
ccpsd.k12.va             Virginia
ci.grottoes.va           Virginia
ci.manassas.us.va        Virginia
ci.richmond.va           Virginia
ci.staunton.us.va        Virginia
ci.va                    Virginia
city.suffolk.va          Virginia
city.va                  Virginia
co.arlington.va          Virginia
co.campbell.va           Virginia
co.frederick.va          Virginia
co.henry.va              Virginia
co.roanoke.va            Virginia
cohenrico.va             Virginia
cvgs.k12.va              Virginia
edu.va                   Virginia
edu.va                   Virginia
edu.va                   Virginia
edu.va                   Virginia
edu.va                   Virginia
floyd.k12.va             Virginia
gmail.k12.va             Virginia
henrico.k12.va           Virginia
henrico.k12.va           Virginia
henrico.k12.va           Virginia
hopewell.k12.va          Virginia
lcs.k12.va               Virginia
lcs.k12.va               Virginia
meck.k12.va              Virginia
med.va                   Virginia
med.va                   Virginia
med.va                   Virginia
nn.k12.va                Virginia
nn.k12.va                Virginia
nn.k12.va                Virginia
nn.k12.va                Virginia
nn.k12.va                Virginia
nn.k12.va                Virginia
nps.k12.va               Virginia
nps.k12.va               Virginia
nps.k12.va               Virginia
patrick.k12.va           Virginia
powhatan.k12.va          Virginia
powhatan.k12.va          Virginia
pps.k12.va               Virginia
pps.k12.va               Virginia
rcs.k12.va               Virginia
richmond.k12.va          Virginia
richmond.k12.va          Virginia
richmond.k12.va          Virginia
russell.k12.va           Virginia
sootsylvania.k12.va      Virginia
student.hampton.k12.va   Virginia
student.hampton.k12.va   Virginia
waynesbror.k12.va        Virginia
wcs.k12.va               Virginia
wcs.k12.va               Virginia
wcs.k12.va               Virginia
wcs.k12.va               Virginia
lva.lib.va               Virginia
racsb.state.va           Virginia
rrj.state.va             Virginia
spotsyvania.va           Virginia
tax.state.va             Virginia
va.med.va                Virginia
vadoc.va                 Virginia
vava.va                  Virginia
vba.va                   Virginia
vba.va                   Virginia
albertahealthservices.ca Canada
alpha.va                 Canada
alumni.va                Canada
cgocable.va              Canada
dildo.va                 Canada
dre.va                   Canada
hot.va                   Canada
hotmail.va               Canada
hotmail.va               Canada
hotmail.va               Canada
hotmail.va               Canada
live.va                  Canada
live.va                  Canada
live.va                  Canada
live.va                  Canada
live.va                  Canada
live.va                  Canada
live.va                  Canada
live.va                  Canada
live.va                  Canada
live.va                  Canada
live.va                  Canada
live.va                  Canada
live.va                  Canada
luve.va                  Canada
mail.va                  Canada
mail.va                  Canada
mail.va                  Canada
mail.va                  Canada
mail.va                  Canada
mail.va                  Canada
mail.va                  Canada
mail.va                  Canada
mail.va                  Canada
mail.va                  Canada
mail.va                  Canada
mail.va                  Canada
mail.va                  Canada
mail.va                  Canada
mail.va                  Canada
mail.va                  Canada
mail.va                  Canada
mail3.va                 Canada
mail3.va                 Canada
mail5.va                 Canada
ham.va                   Canada
mx4.va                   Canada
nao.va                   Canada
pp.va                    Canada
rainyday.va              Canada
sales.va                 Canada
shaw.va                  Canada
shaw.va                  Canada
skola.va                 Canada
skola.va                 Canada
sympatico.va             Canada
videotron.va             Canada
videotron.va             Canada
videotron.va             Canada
wer.va                   Canada
yahoo.com.va             Canada
yahoo.va                 Canada
yahoo.va                 Canada
yahoo.va                 Canada
yahoo.va                 Canada
yahoo.va                 Canada
yahoo.va                 Canada
yahoo.va                 Canada
yahoo.va                 Canada
yahoo.va                 Canada
yahoo.va                 Canada
yahoo.va                 Canada
bulk.va                  Unknown
gmail.com.www.va         Unknown
m01.va                   Unknown
m01.va                   Unknown
m01.va                   Unknown
mobileemail.va           Unknown
mobileemail.va           Unknown
mobileemail.va           Unknown
mobileemail.va           Unknown
mobileemail.va           Unknown
mobileemail.va           Unknown
nfksfbfn.va              Unknown
owen.va                  Unknown
owen.va                  Unknown
owen.va                  Unknown
owen.va                  Unknown
owen.va                  Unknown
owen2001.va              Unknown
sryrshrshrs.va           Unknown
student.va               Unknown
student.va               Unknown
student.va               Unknown
student.va               Unknown
student.va               Unknown
student.va               Unknown
student.va               Unknown
student.va               Unknown
utb.va                   Unknown
utb.va                   Unknown
v001.va                  Unknown
v001.va                  Unknown
v002.va                  Unknown
v002.va                  Unknown
v003.va                  Unknown
v005.va                  Unknown
v005.va                  Unknown
v006.va                  Unknown
v006.va                  Unknown
v006.va                  Unknown
v006.va                  Unknown
v101.va                  Unknown
v101.va                  Unknown
v103.va                  Unknown
1999yahoo.com.va         Unknown
a.va                     Unknown
atlas.va                 California
atlas.va                 California
atlas.va                 California
atlas.va                 California
atlas.va                 California
atlas.va                 California
atlas.va                 California
atlas.va                 California
atlas.va                 California
atlas.va                 California
atlas.va                 California
atlas.va                 California
atlas.va                 California
atlas.va                 California
atlas.va                 California
atlas.va                 California
atlas.va                 California
atlas.va                 California
atlas.va                 California
atlas.va                 California
atlas.va                 California
atlas.va                 California
atlas.va                 California
atlas.va                 California
thevatican.va            Fake

Osama bin Laden's infodump is as long as the first Harry Potter book, and other revelations of a preliminary text analysis

Yesterday, as of this writing (late May 2015), the U.S. Office of the Director of National Intelligence released 103 documents written by Osama bin Laden that were captured during the raid on Abbottabad. Of course, my first thought was, "Great, a corpus!"

Bearing in mind that these are translations, here's my preliminary text analysis of the infodump. If you want to see how I did the analysis, check out this IPython notebook.

First of all, to get a taste of the flavor of the whole collection, here's some randomly generated text based on it using a Markov chain:


The secretary of Muslims, but it the new environments at your news, always to take statements from our command of the issue is no harm, or is preferable to convince any operation must take Hamzah arrives here, such expression without an exhortation and all the world. This should be using such as a documentary about Ibn ‘Abbas… In the list you and so that viewed it comes from God. We, thank you and those companions and reads and ready to live in Gaza? Connect them all that he returned to spread the hearts [of the women who may respond to hearing them. He is what it is established system. Usually, these conditions, even with it. He might take me to give support such that it in the path for an end is a dangerous and they are you wish I am just a brother to be upon him very important of the family and that rose up the frontiers area (Islamic Maghreb); so were the reality of those who follows this great hypocrite. I am blessed with his companions… Furthermore, To the world as in missing the tenth anniversary in Algeria more religious issues.
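If you've never built one, a word-level Markov chain generator takes only a few lines of Python; here's a minimal sketch (the corpus file name is an assumption):

    import random
    from collections import defaultdict

    # Map each word to the list of words that follow it somewhere in the corpus
    words = open("obl_documents.txt", encoding="utf-8").read().split()
    followers = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        followers[current].append(nxt)

    # Walk the chain: start anywhere, then repeatedly pick a random follower of the current word
    word = random.choice(words)
    output = [word]
    for _ in range(200):
        word = random.choice(followers[word]) if followers[word] else random.choice(words)
        output.append(word)
    print(" ".join(output))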

Now the word clouds. Everybody loves word clouds. Except for the people who hate word clouds because they're semi-quantitative at best. Well, you can't have everything.

Here's the word cloud of the vocabulary in all of the documents:


Unsurprisingly, OBL uses a lot of religious terms; 'god' is used so frequently it's in most of the top results when we look at bigrams (two-word phrases):



For the purists, there are bar graphs at the end of the post.
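Counting the bigrams themselves only takes a few lines (a sketch, again assuming the released documents are concatenated into one text file):

    import re
    from collections import Counter

    text = open("obl_documents.txt", encoding="utf-8").read().lower()
    tokens = re.findall(r"[a-z']+", text)         # crude tokenization
    bigrams = Counter(zip(tokens, tokens[1:]))    # count adjacent word pairs
    for (w1, w2), n in bigrams.most_common(20):
        print(w1, w2, n)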

Here's the distribution of document length among the 103 letters/documents:


Most of the documents are short letters, but a couple of them are really long (about 20 pages, double spaced). The two long ones are A Letter to the Sunnah People in Syria and Letter to Shaykh Abu Abdallah dated 17 July 2010.
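Those lengths come from a simple per-document word count (a sketch; the directory layout is an assumption):

    import glob
    import matplotlib.pyplot as plt

    # Count the words in each released document, assumed to be one .txt file apiece
    lengths = []
    for path in glob.glob("obl_documents/*.txt"):
        with open(path, encoding="utf-8") as f:
            lengths.append(len(f.read().split()))

    print("total words:", sum(lengths))
    plt.hist(lengths, bins=20)
    plt.xlabel("Words per document")
    plt.ylabel("Number of documents")
    plt.show()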

The total length of the correspondence is 74,908 words. Here's how that length compares to some well-known novels:




Osama bin Laden's correspondence is about as long as the first Harry Potter book (which you may know as Harry Potter and the Sorcerer's Stone on the left side of the pond).

I also compared the reading level of the (and I stress, translated) text using the Flesch-Kincaid formula, which must be taken with a grain of salt since it sometimes gives weird results, like scoring Shakespeare below children's books:

It's about as hard to read as Mark Twain, so it's got that going for it, which is nice.
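For reference, the Flesch-Kincaid grade level is just a weighted sum of words per sentence and syllables per word; here it is as a function (the sentence and syllable counts in the example call are made-up placeholders, not measurements):

    def flesch_kincaid_grade(words, sentences, syllables):
        """Flesch-Kincaid grade level computed from raw counts."""
        return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

    # 74,908 words total; the other two counts are placeholders for illustration
    print(flesch_kincaid_grade(words=74908, sentences=3700, syllables=108000))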

I also did a quick round of topic modelling using the NMF (non-negative matrix factorization) algorithm, which groups the 103 documents into clusters of similar documents and surfaces the words that best characterize each cluster. Here are the results, in no particular order; I named the topics myself, based on nothing but intuition.

Topic 1, "feminine and prayerful": god, peace, dear, praise, sister, blessing, mercy, willing, letter, prayer

Topic 2, "addressing those in power": al, shaykh, brother, abu, letter, wa, mahmud, muhammad, god, informed

Topic 3, "family and god": allah, brother, mercy, ask, wa, al, praise, father, child, know

Topic 4, "jihad": god, said, people, ha, crusader, jihad, nation, war, ye, islam

Topic 5, "Arab spring": revolution, people, muslim, regime, ummah, egypt, opportunity, blood, ruler, wa
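The pipeline behind those topics looks roughly like the following (a minimal scikit-learn sketch, not my exact parameters; the document directory is an assumption):

    import glob
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import NMF

    docs = [open(p, encoding="utf-8").read() for p in glob.glob("obl_documents/*.txt")]

    vectorizer = TfidfVectorizer(stop_words="english", max_df=0.9)
    tfidf = vectorizer.fit_transform(docs)

    nmf = NMF(n_components=5, random_state=0).fit(tfidf)

    # Print the ten highest-weighted words for each of the five topics
    terms = vectorizer.get_feature_names_out()
    for i, component in enumerate(nmf.components_):
        top = component.argsort()[::-1][:10]
        print(f"Topic {i + 1}:", ", ".join(terms[j] for j in top))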
Again, if you want to see my methodology, look here. Finally, here are the bar graphs corresponding to the word clouds at the top of the post (click to enlarge):

Word clouds made with Tagxedo


Percentage of women in European national legislatures, 2014


Slovenia, Serbia and FYR Macedonia were kind of a surprise to me. Spain, too, a bit, and Belarus. Hungary, you got some explaining to do. France, vous me désappointez aussi.

As it does in every possible measure of national success, of course, Scandinavia rocks.

Mostly, this was a chance to take my hex grid choropleth of Europe out for a spin!

Most characteristic words in pro- and anti-feminist tweets

Here are, based on my analysis (which I'll get to in a moment), word clouds of the 40 words most characteristic of anti-feminist and pro-feminist tweets, respectively.


anti-feminist | pro-feminist

Word clouds may be only semi-quantitative, but they have other virtues, like recognizability and explorability. For the purists, there's a bar chart below.

I'll mostly talk about my results here; the full methodology is available on my other, nerdier blog, which links to all the code so you can reproduce this analysis yourself, if you so desire. (We call ourselves data scientists, and science is supposed to be reproducible, so I strongly believe I should empower you to reproduce my results if you want ... or improve on them!) Please also read the caveats I've put at the bottom of this post.

Full disclosure: I call myself a feminist. But I believe my only agenda is to elucidate the differences in vocabulary that always arise around controversial topics. As CGP Grey explains brilliantly, social networks of ideologically polarized groups like republicans and democrats or atheists and religious people mostly interact within the group, only rarely participating in a rapprochement or (more likely) flame war with the other side. This is fertile ground for divergent vocabulary, especially in a case like this one, where one group defines itself as opposed to the other (as if democrats called themselves non-republicans). I am not going into this project with a pro-feminist agenda, but of course I acknowledge I am biased. I worked hard to try to counter those biases, and I've made the code available for anyone to check my work. Feel free to disagree!

A brief (for me) description of the project: In January, I wrote a constantly running program that periodically searches the newest tweets for the terms 'feminism', 'feminist' or 'feminists' (at random intervals and to random depths, potentially as often as 1,500 tweets within 15 minutes), and by April 2015 it had collected almost 1,000,000 tweets. Then, with five teammates (we won both the Data Science and the Natural Language Processing prizes at the Montreal Big Data Week Hackathon on April 19, 2015), we manually curated 1,000 tweets as anti-feminist, pro-feminist or neither (decidedly not an obvious process; read more about it here). We used machine learning to classify the remaining 390,000 tweets (after eliminating retweets and duplicates -- anything that required only clicking instead of typing), then used the log-likelihood keyness method to find which words (or punctuation marks, etc.) were most overrepresented in each set.
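The log-likelihood (keyness) calculation itself is simple; here's a sketch of the standard formula with made-up example numbers (this illustrates the general method, not necessarily line-for-line what our code does):

    import math

    def log_likelihood(a, b, c, d):
        """Keyness (G2) of a word occurring a times in a corpus of c tokens
        and b times in a corpus of d tokens."""
        e1 = c * (a + b) / (c + d)   # expected count in corpus 1
        e2 = d * (a + b) / (c + d)   # expected count in corpus 2
        ll = 0.0
        if a:
            ll += a * math.log(a / e1)
        if b:
            ll += b * math.log(b / e2)
        return 2 * ll

    # Illustrative only: a word used 300 times in 1,000,000 anti-feminist tokens
    # and 50 times in 1,200,000 pro-feminist tokens
    print(log_likelihood(300, 50, 1_000_000, 1_200_000))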

And here are my observations:

1. Pro-feminists (PFs) tweet about feminism and feminist (adjective), anti-feminists (AFs) tweet about feminists, as a group.
Since they're search terms, at least one of those words appears in every tweet, so their absolute log-likelihood values are inflated and I left them out of the word clouds. However, the differences between them are valid, and instructive (but see the caveats below). AFs seem to be more concerned with feminists as a collective noun (they tweet about the people they oppose, not the movement or ideology), while PFs tweet about feminism or feminist (usually as an adjective).
2. PFs use first- and second-person pronouns, AFs use third-person pronouns
Similarly to #1 above, and inevitably when one group defines itself as not belonging to the other, AFs tweet about feminists as a plural group of other people, while PFs tweet about and among themselves. Note that in NLP, pronouns are usually so common that they're considered "stopwords" and eliminated from the analysis. But with 140-character tweets, I figured every word was chosen with a certain amount of care.
3. The groups use different linking words to define feminism
PFs talk about what feminism is for or about, why we need feminism, what feminism is and isn't, and what feminists believe; AFs tweet about what feminists want, ask whether someone can explain why feminists engage in certain behaviors they don't get, say feminists are too <insert adjective>, and often use the construction With <this, then that>.
4. PFs link to external content, AFs link to local and self-created content.
PFs link more, in general, to http content on other websites; AFs use the #gamergate hashtag, reference @meninisttweet, and link to @youtube videos rather than traditional media (youtube doesn't appear in the word cloud, but it has a log-likelihood of 444 in favor of AFs). AFs also reference their platform, Twitter, a lot; PFs don't, presumably because they're also interacting in other ways.
5. AFs use more punctuation
Besides "feminists", the number-one token for AFs was the question mark; they have a lot of questions for and about feminists, many of them rhetorical. The exclamation point wasn't far behind, followed by the quotation mark, both to quote and to show irony. PFs start tweets with '+' and "=" (usually as '==>') for emphasis. Rounding out the non-alphabetic characters, AFs use 2 as a shorter form of 'to' or 'too', while PFs link more often to listicles with 5 items.
6. AFs tweet more about feminist history.
Unsurprisingly, PFs tweet about their goals, equality and rights, and defend themselves against accusations of misandry. But it's the AFs who tweet about modern and third-wave feminism, displaying knowledge about the history of the movement.
7. PFs use more gender-related terms
This one is all PF: they reference gender, genders, sexes, men and women more than AFs.
8. AFs use more pejorative terms
AFs use fuck, hate, annoying and, unfortunately, rape a lot; they also use derisive terms like lol, the "face with tears of joy" emoji and smh (shaking my head, not in the top 40 but still a high log-likelihood value of 484).
Caveats:
  • Selection bias: the dataset does not include any tweets with pro- or anti-feminist sentiment that do not include the search terms 'feminist', 'feminists' or 'feminism'
  • Noise in the signal, part 1. It's difficult to analyze tweets for the underlying attitude (pro- or anti-feminism) of the author; it involves some mind-reading. We tried to mitigate this by using a "neither pro nor anti" category and classifying any tweet we had the slightest doubt about as such. Of course, that just shifts the noise elsewhere, but it hopefully keeps down the misclassifications between our two groups of interest, pro- and anti-feminist.
  • Noise in the signal, part 2. We used 1,000 tweets to predict the attitudes of 390,000 tweets. Obviously this is going to be an imperfect mapping of tweet to underlying attitude. This kind of analysis does not require anywhere near 100% accuracy (we got between 40% and 60%, depending on the metric, both of which are better than random choice, which would give 33%). The log-likelihood method is robust, and will tend to eliminate misclassified words. In other words, we may not be confident these top 40 words and tokens are the same top 40 words and tokens that would result if we manually curated all 390,000 tweets, but we are confident these top 40 words and tokens are significantly characteristic of the two groups we identified in our curated tweets.
  • If you have doubts as to my methods or results, great, that's what science is all about. Please feel free to analyze the code, the dataset, the manual curation, and the log-likelihood results linked to in my other blog.
  • It is not my goal to criticize or mock anti-feminists, and I hope I've kept my tone analytical. There's a Venn diagram between stuff feminists say (and of course they don't all say anywhere near the same thing), stuff anti-feminists say, and things I agree with, and it's not straightforward. What interested me here was the language. That said, I hope I've contributed a little bit to understanding the vocabulary surrounding the issue, and in general, I believe more knowledge is better than less knowledge.
Word clouds made with Tagxedo.

Baby Boom: An Excel Tutorial on Analyzing Large Data Sets

tl;dr: I wrote a data science tutorial for Excel for the good folks at Udemy: click here!


The usual progression I've seen in data science is the following:

  1. Start out learning data analysis with Microsoft Excel
  2. Switch to a more powerful analysis environment like R or Python
  3. Look down one's nose at everybody still using Excel
  4. Come to realize, hey, Excel's not so bad
I'll admit, I was stuck at Step 3 for a few weeks, but luckily I got most of my annoying pooh-poohing (if you're not a native English speaker, that expression might not mean what you think it means) out of my system decades ago when I was a proofreader (hence my nickname, if you were curious).

I think most mature data scientists see Excel as a useful part of the ecosystem; the way it brings you so close to your raw data is essential in the early stages for developing data literacy, and later on, when you're munging vectors and dataframes, it can still be useful to fire up a .csv and have a look-see with no layers of abstraction in the way.

Feedback is welcome. I'm not involved with the rest of the Excel course, but I have taken the Complete Web Developer course from Udemy and recommend it. I get absolutely no money for referrals or anything like that (or for page visits for my tutorial for that matter), so this is honest, cross my heart.


Dialogue plot of Star Trek: The Original Series

First, the plot. Hover over the points to see the character names.



Why Star Trek? Well, I'm working on an in-depth analysis of all of Shakespeare's plays, so I'm vetting my method on Star Trek because (a) the corpus is much smaller, so each step in development takes less time, and (b) I'm, sadly, more immediately familiar with the minutiae of Star Trek, because reruns were on every day after school when I was growing up, so I'm better able to notice trends and problems.

This isn't the finished product, but I thought it was interesting enough to warrant an interim blog post. All of the guest characters along the bottom appeared in one episode (except for a handful, like those in both parts of The Menagerie and Harcourt Fenton Mudd, who appeared in two). Trelane (if you're too young for TOS, he's sort of like a proto-Q from TNG) has the most dialogue per episode of any TOS character, guest or regular (if you've seen the episode, this will not surprise you). The super-speed Scalosian queen Deela is the female character with the most dialogue; in fact, most of the high-dialogue guest stars are antagonists. Edith Keeler is the largest Kirk-love-interest part (ah, Joan Collins in the '60s); in general, Kirk was attracted to women due to the size of things other than their vocabularies, it seems (sorry, sorry, couldn't resist).

I tend to think of TOS as an ensemble drama, but Kirk is really the only regular with more dialogue than most of the main guest stars. Kirk and Spock are the only characters who appear in all 79 episodes (McCoy is missing from one... I challenge you to leave a comment below saying which episode that is). Uhura is in more episodes than the rest of the supporting cast, but speaks less ("Hailing frequencies open, Captain" is only four words, after all). Interestingly, Yeoman Janice Rand has more dialogue per episode than any supporting character except Scotty, but she's way down the vertical axis because she was fired after 15 episodes, either (a) because they'd exhausted her flirtiness potential with Kirk, (b) because she was showing up to work drunk, or (c) because she objected to being sexually assaulted by a TV executive, depending on the version of events.

Finally, the Enterprise computer voice has slightly more words per episode than Nurse Chapel; they were voiced and played, respectively, by the same actress, Majel Barrett, beloved of Trek fans and of series creator Gene Roddenberry.

I got the scripts from www.chakotaya.net; they appear to be fan-transcribed scripts (hey, in the '60s, that's all you could do. I myself made one in 1996 of my favorite X-Files episode, Jose Chung's From Outer Space). They're rather error-prone (as is to be expected), so if you want to see the gory details of how I cleaned them up and made the graph in Bokeh, check out this GitHub repo or go directly to this IPython notebook.
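If you're curious how the hover labels work, here's a minimal Bokeh sketch of that kind of plot (the data points and axis choices are placeholders based on the description above, not the real numbers):

    from bokeh.plotting import figure, show
    from bokeh.models import ColumnDataSource, HoverTool

    # Placeholder data: dialogue per episode and episode counts for three characters
    source = ColumnDataSource(data=dict(
        words_per_episode=[900, 450, 300],
        episodes=[79, 79, 66],
        name=["KIRK", "SPOCK", "MCCOY"],
    ))

    p = figure(x_axis_label="Words per episode", y_axis_label="Episodes appeared in")
    p.circle("words_per_episode", "episodes", size=8, source=source)
    p.add_tools(HoverTool(tooltips=[("Character", "@name")]))
    show(p)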