Machine learning and cultural products: Shakespeare edition

Starting from my research in machine listening for music, I have become increasingly aware of applications of machine learning to cultural products, and of the pitfalls that accompany such work. I previously critiqued a study that clustered photographs of paintings by different artists using image features. Here is a new one: clustering Shakespeare’s plays into genres by word frequencies. (This work is published in: S. Allison, R. Heuser, M. Jockers, F. Moretti and M. Witmore, “Quantitative Formalism: an Experiment”, Pamphlets of the Stanford Literary Lab, Jan. 2011.)
On its face, this seems reasonable. As Allison et al. comment, certain words are closely associated with genres, like “castle” with “gothic”. However, they find that they can automatically and correctly cluster Shakespeare’s plays using the frequencies of only 37 words and punctuation marks:

“a”, “and”, “as”, “be”, “but”, “for”, “have”, “he”, “him”, “his”, “i”, “in”, “is”, “it”, “me”, “my”, “not”, “of”, “p_apos”, “p_colon”, “p_comma”, “p_exlam”, “p_hyphen”, “p_period”, “p_ques”, “p_semi”, “so”, “that”, “the”, “this”, “thou”, “to”, “what”, “will”, “with”, “you”, “your”
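
To make concrete what such features amount to, here is a minimal Python sketch that turns a text into relative frequencies of a few of the items listed above and measures the distance between two texts. It is not the method of Allison et al.: the tokenizer, the mapping of the p_* names to punctuation characters, the Euclidean distance, and the function names feature_vector and distance are all assumptions made for illustration.

```python
import re
from collections import Counter

# A small subset of the features listed above. Mapping the p_* names to
# punctuation characters is an assumption about what those names mean.
WORDS = ["a", "and", "the", "thou", "you", "my"]
PUNCT = {"p_comma": ",", "p_period": ".", "p_ques": "?", "p_exlam": "!"}


def feature_vector(text):
    """Return the relative frequency of each chosen feature in `text`."""
    # Crude tokenizer (an assumption): runs of letters, or one of , . ? !
    tokens = re.findall(r"[a-z]+|[,.?!]", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    names = WORDS + list(PUNCT)
    symbols = WORDS + list(PUNCT.values())
    return {name: counts[sym] / total for name, sym in zip(names, symbols)}


def distance(u, v):
    """Euclidean distance between two feature vectors with the same keys."""
    return sum((u[k] - v[k]) ** 2 for k in u) ** 0.5


# Made-up toy strings, not quotations from any play.
text_a = "Thou art a villain, and thou knowest it."
text_b = "You are my friend, and you know the way!"
print(distance(feature_vector(text_a), feature_vector(text_b)))
```

Nothing in these vectors refers to plot, character, or ending. Any clustering algorithm run on them groups texts by how often such tokens occur, and nothing more.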

At this point, it is reasonable to pause before claiming that the clustering, correct though it may be, results from genre recognition. Accepting such a conclusion entails accepting the words above and their frequencies as the mysterious ingredients that separate “tragedy” from “comedy”. Unfortunately, it appears Allison et al. accept just that, calling these word-frequency features the observable tips of the “icebergs” that are genres.

