Monday, September 28, 2015

Pope Francis's speech to Congress was more similar to the founding fathers' inaugural addresses than to those of Republicans and Democrats

Within minutes of Pope Francis's September 2015 address to the U.S. Congress, both Republicans and Democrats were claiming the speech vindicated their worldviews. This is not surprising, as modern Catholic values don't map onto the American binary: helping the poor (Dem) and immigrants (Dem), against abortion (Rep) and gay marriage (Rep), in favour of action against climate change (Dem), etc.

Surely there must be a way to quantitatively determine who wins this tug of war? Turns out there is: natural language processing.

By comparing the word choice and frequency in the Pope's speech to those of all presidential inaugural addresses since 1789, we can see which speeches are most similar. I chose the inaugural addresses because I thought they were closer in spirit to the Pope's address, outlining hopes and dreams for the nation, than, say, the more pragmatic State of the Union addresses.

Compare the dots closest to the Pope's in the center; mouseover to see the three most characteristic words each inaugural speech had in common with the Pope, compared to all other speeches.

Comparison of Pope Francis's address to Congress with presidential inaugural addresses throughout history
(mouseover to see names)

This analysis (and of course others might differ) seems to show that Democrats have a slight lead: 42% of Republican speeches are of above-average similarity to the Pope's, while the same is true of 50% of Democratic speeches. But get this: a whopping 100% of 'Other' speeches (by early presidents, including the founding fathers, before the modern two-party system started) are of above-average similarity to the Pope's compared to the rest.

There are a few reasons this could happen. Since, as we'll see below, the presidents who most often share the Pope's top terms are Republicans and Democrats, my hypothesis is that these same Republicans and Democrats also use a lot of terms the Pope didn't use, while the 'Others' generally stuck to a more restricted vocabulary that they share with the Pope, talking in generalities about the human condition rather than specifics about partisan issues. (This is borne out by the fact that the Jaccard similarity results very closely match the cosine similarity results; read more here if you want.)
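For readers who haven't met it, Jaccard similarity only asks which words two speeches share, ignoring how often each is used. Here is a minimal sketch in Python; the function name and the toy word lists are my own illustration, not taken from the gist.

```python
def jaccard(words_a, words_b):
    """Jaccard similarity: size of the shared vocabulary over the combined vocabulary."""
    a, b = set(words_a), set(words_b)
    union = a | b
    # Two empty vocabularies are treated as having zero similarity (an assumption).
    return len(a & b) / len(union) if union else 0.0


# Toy word lists, invented for illustration only.
speech_1 = ["dialogue", "people", "common", "good"]
speech_2 = ["people", "common", "good", "union", "constitution", "tariff"]
print(jaccard(speech_1, speech_2))
```

A speech that shares the Pope's words but also uses many words he didn't will score lower here, because the extra words enlarge the union without adding to the intersection.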


The code I used is in this gist, and a more detailed description of my methodology is in my other, nerdier blog. Briefly, I built a TF-IDF* matrix with 58 rows (one for each of the 57 presidential inaugural addresses, and one for the Pope) and 8896 columns (one for each unique word used at least once in any address, minus extremely common words like 'the' and 'and'). I then calculated the similarity of each pair of rows in the matrix, and projected the results onto two dimensions using the t-SNE** algorithm. I hacked the algorithm a bit so that it would reflect, in general, the distances to the Pope more faithfully at the expense of the distances between presidents. I used a hacked version of Dunning log-likelihood to determine the three most characteristic words in common.

* Term Frequency-Inverse Document Frequency
** t-Distributed Stochastic Neighbor Embedding
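To make the matrix-and-similarity step concrete, here is a rough sketch in plain Python. It is not the code from the gist: the stop-word list, the smoothed IDF formula, the function names and the three toy "speeches" are all assumptions made for illustration, and the t-SNE projection and Dunning log-likelihood steps are left out.

```python
import math
from collections import Counter

# Hypothetical, very short stop-word list; the real analysis used its own.
STOP_WORDS = {"the", "and", "of", "to", "a", "in", "we", "our", "is"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens only, drop stop words."""
    words = "".join(c.lower() if c.isalpha() else " " for c in text).split()
    return [w for w in words if w not in STOP_WORDS]


def tfidf_matrix(docs):
    """Return (vocabulary, rows): one TF-IDF row per document, one column per word."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = {w: sum(1 for c in counts if w in c) for w in vocab}
    # Smoothed IDF (one common variant, assumed here).
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for c in counts:
        total = sum(c.values()) or 1
        rows.append([c[w] / total * idf[w] for w in vocab])
    return vocab, rows


def cosine(u, v):
    """Cosine similarity between two rows of the matrix."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy documents, invented for illustration only.
docs = [
    "dialogue and solidarity for the common good of the people",
    "the common good of the people and the world",
    "taxes tariffs and the railroad",
]
vocab, rows = tfidf_matrix(docs)
for i in range(1, len(docs)):
    print(i, round(cosine(rows[0], rows[i]), 3))
```

With the real data, row 0 would be the Pope's address and the other 57 rows the inaugural addresses; the pairwise similarities are what gets handed to the projection step.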

Addendum: Top Pope words and most similar presidents for each word

A few things to note about words that were never used in presidential inaugural addresses: 'ibid' ranks so highly because the copy of the Pope's address that was published had inline references, 'merton' is a monk, Thomas Merton, and 'dorothy' is Dorothy Day, founder of the Catholic Worker Movement.
0.32  dialogue   
0.21  ibid       
0.14  people     Cleveland[1893][D], Cleveland[1885][D], Adams[1797][O] & 52 more
0.13  merton     
0.11  dorothy    
0.11  solidarity 
0.1   like       Pierce[1853][D], Bush[1989][R], F.D.Roosevelt[1941][D] & 24 more
0.1   family     Reagan[1985][R], Buchanan[1857][D], Polk[1845][D] & 15 more
0.1   luther     Clinton[1997][D]
0.09  social     Grant[1873][R], Harding[1921][R], Harrison[1889][R] & 19 more
0.09  good       F.D.Roosevelt[1937][D], Bush[1989][R], Jefferson[1801][O] & 45 more
0.09  martin     Reagan[1981][R], Clinton[1997][D]
0.09  human      Reagan[1985][R], Carter[1977][D], F.D.Roosevelt[1941][D] & 32 more
0.09  world      Clinton[1993][D], Truman[1949][D], Harding[1921][R] & 49 more
0.09  common     Obama[2009][D], Bush[2001][R], Eisenhower[1953][R] & 34 more
0.08  thomas     Bush[2001][R], Clinton[1993][D], Reagan[1981][R]
0.08  king       Obama[2013][D], Clinton[1997][D], Garfield[1881][R]
0.08  women      Wilson[1913][D], Obama[2009][D], Bush[1989][R] & 11 more
0.08  especially Monroe[1821][O], Eisenhower[1953][R], Taylor[1849][O] & 12 more
0.08  building   Hoover[1929][R], Eisenhower[1957][R], Nixon[1973][R] & 6 more
0.08  spirit     F.D.Roosevelt[1941][D], Carter[1977][D], Harrison[1841][R] & 38 more
0.08  moses      
0.08  lincoln    Reagan[1981][R], T.Roosevelt[1905][R], F.D.Roosevelt[1941][D] & 1 more
0.08  life       F.D.Roosevelt[1941][D], T.Roosevelt[1905][R], Wilson[1913][D] & 44 more
0.08  dream      Carter[1977][D], Clinton[1997][D], Reagan[1985][R] & 7 more
0.08  dignity    Reagan[1985][R], Bush[2005][R], Eisenhower[1957][R] & 14 more
0.08  god        Lincoln[1865][R], Reagan[1985][R], Nixon[1969][R] & 34 more
0.08  activity   Cleveland[1893][D], McKinley[1901][R], Truman[1949][D] & 2 more
0.08  want       Eisenhower[1957][R], Harding[1921][R], Coolidge[1925][R] & 15 more
0.07  culture    Hoover[1929][R], Obama[2009][D], Reagan[1981][R] & 3 more

Tuesday, September 1, 2015

The Domination of "Dominatrix" among feminine -trix endings in American English

The idea for this analysis came from the wonderful Lexicon Valley podcast. In last year's episode Sex Workers, U. Mich. professor Anne Curzan discusses the rise and fall of English words with feminine endings like -ess, -ette and -trix. Curzan points out that in recent decades the -trix suffix has become indelibly associated with the word dominatrix, at the expense of other words except for a few legal terms.

This sounded quite plausible, therefore I was suspicious and had to check -- and it turns out Prof. Curzan was right:

As you can see from this stream graph, the red area representing dominatrix starts to expand in the 1970s, and by the mid-2000s represents the majority of feminine-ending -trix words.

The data comes from the Corpus of Historical American English (COHA), a curated set of texts from books, magazines and newspapers from 1810 to 2010.

The purplish areas on top represent Latin words used in English texts, e.g. victrix, the feminine of victor. There is no hard and fast rule to differentiate these words from English words; I classified them thusly if the feminine in English is much rarer than the feminine in Latin. The bottom, green, areas are words mostly used in legal contexts, often in probate law, to signify the feminine of executor, mediator and administrator.

Which leaves the orange area, aviatrix, which interestingly peaks in the 1930s, the time of the mysterious disappearance of indubitably the most famous person to be given that title, Amelia Earhart. There appears to be a dip in the 1950s followed by a rise in the 1980s, but given the very low frequency of the word (all of the -trix words in the 1930s are about equal in frequency to the words extirpate or peregrinations), the exaggerated dip may be due to sampling error.

A quick word about methodology: I removed words ending in -trix that are not feminine endings, such as matrix (even in Latin, it's a derivative of the already feminine mater; there is no word mator for it to be the feminine of). COHA is compiled on a per-decade basis, so I assigned the middle year of the decade to each data point, interpolated every 2.5 years and smoothed with the Hamming algorithm with a window of 10 years (to smooth out the sampling error somewhat and get a sense of the signal behind the noise).
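The interpolate-then-smooth step can be sketched roughly as follows with NumPy. This is my reconstruction under stated assumptions, not the original script: the function name is made up, a 10-year window is taken to mean five samples on the 2.5-year grid, and the ends of the series are padded with their edge values so the output keeps the same length.

```python
import numpy as np


def smooth_decade_series(decade_starts, freqs, step=2.5, window_years=10.0):
    """Place each decade's value at mid-decade, interpolate, then Hamming-smooth."""
    # Assign the middle year of each decade to its data point.
    mid_years = np.asarray(decade_starts, dtype=float) + 5.0
    # Regular grid every `step` years between the first and last mid-decade points.
    grid = np.arange(mid_years[0], mid_years[-1] + step / 2, step)
    interp = np.interp(grid, mid_years, np.asarray(freqs, dtype=float))
    # Number of samples spanning the window; forced odd so it has a centre sample.
    n = int(round(window_years / step)) + 1
    if n % 2 == 0:
        n += 1
    window = np.hamming(n)
    window /= window.sum()  # normalise so the overall level is preserved
    # Edge padding is an assumption; it avoids the series sagging at both ends.
    padded = np.pad(interp, n // 2, mode="edge")
    smoothed = np.convolve(padded, window, mode="valid")
    return grid, smoothed


# Made-up per-decade frequencies for the 1810s through the 2000s, for illustration only.
starts = list(range(1810, 2010, 10))
toy_freqs = [0, 0, 1, 0, 2, 1, 0, 1, 3, 2, 4, 6, 9, 5, 2, 3, 4, 7, 8, 12]
grid, smoothed = smooth_decade_series(starts, toy_freqs)
print(len(grid), grid[0], grid[-1])
```

Because the window is normalised, a flat series stays flat; only the decade-to-decade jitter is damped.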

A corpus larger than COHA would have reduced the sampling error, but I don't have access to any good candidates. The much-vaunted Google Ngrams corpus is, as in many applications, particularly misleading for this analysis -- as an uncurated corpus, it suffers greatly from availability bias. The Google Books files it is based on are heavily weighted towards the books found in libraries, especially university libraries, where there will be many different editions of highly technical books and only a few representatives of anything else (for example, fiction and news). There is a supposedly fiction-only version of the corpus, but it actually gives very similar word frequencies, indicating their classification algorithm is problematic.

Here's what the above graph looks like from the Google Ngrams corpus:

You can see that the purple (Latin) and green (legal) areas are huge, which is to be expected when university books make up the bulk of the corpus. There is in addition a new, blue area representing advanced mathematics texts, where the feminine versions of director, tractor and motor have specific meanings in that domain. The conclusion that directrix is almost as popular as dominatrix doesn't pass the smell test. You can see that the phenomenon of the recent surge for dominatrix is still visible, although aviatrix does not have a surge in the 1930s (expected, given the paucity of news sources in this book corpus).
