The Great Shakespeare / My Little Pony Showdown: Part 2: RESULTS! · 6:34am Dec 12th, 2014

Grant Voth said, in his lecture series The History of World Literature,

Shakespeare created about 1000 characters in his 37 or 38 plays, and yet each character speaks with his own rhythms, his own accents, his own vocabulary, his own tricks of speech.... Shakespeare's characters are so individualized in terms of speech patterns that it has been said that in any given play, you could take the name tags off of all the speeches, drop them into a hat, shake them up, bring them out one by one, and if you had a good enough ear, you would be able to put all the speeches of a single character together, simply because there are no two characters who sound alike.

So I know the question you're all asking. Who wrote more-distinctive characters: Shakespeare, the Immortal Bard, "the man who invented humanity" [1], "the most influential person who ever lived" [2]--or the writers of My Little Pony: Friendship is Magic?

I can't answer that. What makes a character "distinctive" is a matter of opinion. But I can definitively answer this more-precise question: Which set of characters is more distinguished from each other by the frequency with which they used each individual word?

(This is a continuation of part 1, which discusses whether that precise question is meaningful.)

I found an ASCII file of the complete plays of Shakespeare, and downloaded the transcripts of every My Little Pony: Friendship is Magic episode (seasons 1-4) from mlp.wikia.com. I compiled all the lines for each character into a separate file [3]. Then I used the R library stylo, version, to compare them [4].

stylo finds which texts are most similar to each other in terms of how often they use different words. I split the files for each character into pieces, with similar numbers of words in each piece. Then, once for Shakespeare and once for My Little Pony, I threw all the pieces at stylo and asked it which files most resembled each other.

If a character's vocabulary is distinctive, stylo will say that the different pieces of that character's file are similar. The more that the files for each character cluster together, the more consistent and distinctive that character's vocabulary is. (I did this once with three of Chaucer's narrators from The Canterbury Tales, and the word frequencies matched up all the parts correctly.)

But first, I had some decisions to make.

Which words to count

The more words whose occurrences you count in each file, the more data you have to work with. But the more words you count, the more stylo’s comparison will be comparing the setting or the topic of conversation rather than the way that the characters talk. The default is to count just the most-common 100 words. Including words that group by settings or topics rather than by character would favor Shakespeare, since we’re looking at 95 different MLP episodes, but only 11 Shakespeare plays.

One might argue that Shakespeare’s characters have a more varied vocabulary than ponies, and therefore they require going beyond the default 100 words in order to express their distinct characters. To test this, I counted the number of distinct words needed to account for half of all the words in each set. For Shakespeare, this was 78 words; for MLP, 79. Despite the many claims that Shakespeare had an unusually large vocabulary, the two sets of texts had similar word frequency distributions [8]. Therefore I kept the word count at the default 100.
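The half-coverage count described above can be sketched in a few lines of Python. This is only an illustration, not the script used for the post: the function name `words_to_cover` and the crude regex tokenizer are assumptions, and a different tokenizer would give somewhat different numbers.

```python
from collections import Counter
import re

def words_to_cover(text, fraction=0.5):
    """Count how many of the most frequent words account for `fraction` of all tokens."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    target = fraction * len(tokens)
    covered = 0
    # Walk down the frequency list until the running total reaches the target.
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered >= target:
            return rank
    return 0  # empty text

sample = "the cat and the dog and the bird saw the fox"
print(words_to_cover(sample))  # "the" (4) + "and" (2) cover 6 of 11 tokens -> 2
```

Run on the full Shakespeare and MLP texts, a count like this is what would yield figures comparable to the 78 and 79 quoted above.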

I inspected the list of words stylo produced (one list for Shakespeare, one for MLP), and disallowed all proper nouns, catch-phrases that I recognized (“totally”, "awesome", "darling", "ain't", “ooh”, “wikey”, etc.), words that identify the relative ranks of Shakespeare’s characters ("master", "lord", “liege”, "sir", “grace”, “thee”, “thou”, “thine”, “you”, “ye”, “yours”, “we”, “our”), and anything else that seemed to give an unfair advantage in distinguishing characters by their surroundings or who they talked to rather than by who they were.

Deciding what counts as a catchphrase and what counts as character is a little arbitrary. I eliminated “apple” and “apples” but decided to allow "farm", "dragons", "book", "Wonderbolts", and "party" because part of what makes MLP characters distinct is that they all have different professions, hobbies, and interests. I could not identify any words denoting interests that distinguished Shakespeare's characters, who are mainly occupied with sex, drinking, and killing each other, so disallowing such words would unfairly handicap MLP. I eliminated “silly”, “ain’t”, and “shucks”, but allowed “ya” and “wanna” because more than one character uses them frequently.

Very few of any of these terms were in the first 100 words. I ended up eliminating 8 words from Shakespeare (you, thou, your, thy, our, we, thee, france) and 9 from MLP (twilight, rainbow, spike, apple, dash, pinkie, princess, rarity, applejack) before taking the most-common 100 remaining words.
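As a rough Python sketch of that last step, here is one way to drop a hand-picked exclusion list and then keep the most common remaining words. The helper `top_words` is hypothetical and is not how stylo handles its word list internally; only the nine excluded MLP words come from the paragraph above.

```python
from collections import Counter

def top_words(tokens, excluded, n=100):
    """Return the n most common words after dropping the excluded ones."""
    counts = Counter(t for t in tokens if t not in excluded)
    return [word for word, _ in counts.most_common(n)]

# The nine words removed from the MLP list, as given in the post.
MLP_EXCLUDED = {"twilight", "rainbow", "spike", "apple", "dash",
                "pinkie", "princess", "rarity", "applejack"}

# Toy input, purely illustrative.
tokens = "twilight said the book and the party and the farm".split()
print(top_words(tokens, MLP_EXCLUDED, n=2))  # ['the', 'and']
```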

Which characters to use

In my experience, stylo needs at least 1000 and preferably 2000 words per file to have much accuracy. Only 14 characters in Shakespeare and 9 in MLP have at least 5000 words each, allowing at least 2 files of 2500 words. (Twilight has the most words, at 31,000).

When I ran stylo, it clustered the Shakespeare characters together that were from the same plays: Iago with Othello, Antony with Brutus, etc. That gave Shakespeare an unfair advantage: stylo would only have a 50% chance of making a mistake with any of those characters.

So I chose only characters from Shakespeare's historical plays about English kings [5] with over 4000 words. Since having fewer characters in one tree would make it more likely for their parts to match up by chance, I used 10 Shakespeare characters and 10 MLP characters. I broke the Shakespeare characters up into files of about 3100 words each, and the MLP characters into files of about 5500 words each, deleting the last Twilight Sparkle file, so that each set had 25 files in all.
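Breaking a character's lines into pieces of similar size could look something like the Python sketch below. It is an assumption about the mechanics, not the procedure actually used: `split_evenly` is a hypothetical helper that picks the number of pieces closest to a target size and spreads any leftover words across the first few pieces.

```python
def split_evenly(tokens, target_size):
    """Split tokens into contiguous pieces of near-equal size, close to target_size."""
    pieces = max(1, round(len(tokens) / target_size))
    base, extra = divmod(len(tokens), pieces)
    out, start = [], 0
    for i in range(pieces):
        # The first `extra` pieces take one additional token each.
        size = base + (1 if i < extra else 0)
        out.append(tokens[start:start + size])
        start += size
    return out

parts = split_evenly(list(range(10)), target_size=3)
print([len(p) for p in parts])  # [4, 3, 3]
```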

Sample size

For statistical reasons, stylo can’t compare small files of different lengths. For a fair comparison, you must use the same number of words from each file. The smallest file in the sample was one half of the Duke of York, with 2054 words. I set the sample size to 2000 words, chosen at random without replacement, from each file.
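Drawing a fixed number of words without replacement is easy to sketch in Python. This is an illustration of the idea only, assuming each file has already been tokenized into a list of words; the name `sample_words` and the `seed` parameter are hypothetical.

```python
import random

def sample_words(tokens, k=2000, seed=None):
    """Draw k tokens at random, without replacement, from one file's tokens."""
    if len(tokens) < k:
        raise ValueError(f"need at least {k} tokens, got {len(tokens)}")
    rng = random.Random(seed)  # a seed makes a run repeatable
    return rng.sample(tokens, k)

# Toy file the size of the smallest one mentioned above (2054 words).
tokens = [f"w{i % 300}" for i in range(2054)]
print(len(sample_words(tokens, k=2000, seed=1)))  # 2000
```

Because every file contributes the same number of words, relative word frequencies are compared on an equal footing regardless of each file's original length.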

I decided to check each character's overall consistency rather than consistency across multiple plays. If Henry V came across very differently in Henry IV and Henry V, that could be due to character development. So for each sample, I shuffled the lines randomly. Shakespeare’s secondary characters (dukes and earls) match themselves very poorly across different plays [6], but match much better when the lines are shuffled.

Because I’m using random samples, the tree generated by clustering is different every time. Therefore, I’ll take the average score over the first 3 trees produced for each set.

Whether to adjust for MLP having multiple authors

stylo was designed to identify the true authors of disputed texts. But MLP scripts have many different authors. So stylo should detect differences between Rarity as written by M.A. Larson and as written by Meghan McCarthy. How much of a disadvantage will this give to MLP?

To check this, I first did cluster analysis on the lines written by different writers, using from 1000 to 7000 words per file [7]:

The resulting tree shows the scripts by the same writer cluster perfectly with each other. So which is a stronger influence on vocabulary: writer or character? To test this, I created one separate file for each combination of writer and character, and did a cluster analysis (no sampling) on all resulting parts with over 1000 words:

There is some clustering by writer, but much more clustering by character. I chose not to adjust for MLP characters having different writers. It seemed more useful anyway to test the consistency of the characters overall rather than the separate skill of each writer.

How to score

The measure of success will be the sum, over all files of lines, of each file's adjacency score. A file's adjacency score is 1 divided by the log base 2 of the number of leaves in the smallest subtree that contains both that file and another file from the same character. If a file is paired with another file from the same character, which might be written in list form as (A1 A2), its score is 1 / lg(2) = 1. If the tree is given by ((A1 B1) (A2 C2)), the score for A1 is 1 / lg(4) = ½. The score is designed to be inversely proportional to the amount of information required to find a matching file in the cluster tree.
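The scoring rule can be sketched in Python on trees written as nested tuples, mirroring the list form used above. This is an illustrative reconstruction, not the original scoring script. Two assumptions are built in: leaf labels are a character name followed by a piece number (as in "A1"), and a file with no same-character partner anywhere in the tree scores 0, a case the description above does not cover.

```python
import math

def leaves(tree):
    """Flatten a nested-tuple tree into its list of leaf labels."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]

def character(leaf):
    # Assumed labelling: character name plus a piece number, e.g. "A1".
    return leaf.rstrip("0123456789")

def adjacency_scores(tree):
    """Map each leaf to 1 / log2(leaves in the smallest subtree that also
    holds another file from the same character)."""
    scores = {}

    def visit(node):
        if isinstance(node, str):
            return
        # Visit children first so each leaf is scored at its smallest match.
        for child in node:
            visit(child)
        members = leaves(node)
        for leaf in members:
            if leaf in scores:
                continue
            if any(o != leaf and character(o) == character(leaf) for o in members):
                scores[leaf] = 1 / math.log2(len(members))

    visit(tree)
    # Assumption: a file with no partner anywhere in the tree scores 0.
    for leaf in leaves(tree):
        scores.setdefault(leaf, 0.0)
    return scores

print(adjacency_scores(("A1", "A2")))  # {'A1': 1.0, 'A2': 1.0}
print(adjacency_scores((("A1", "B1"), ("A2", "C2")))["A1"])  # 0.5
```

Summing the values of the returned dictionary gives the total score for a tree.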


The pictures below show the first tree produced for each set:

The maximum possible score is 25. The average scores over three trials were:

Shakespeare: 13.9 (14.8, 14.2, 12.8)
MLP FiM: 19.5 (18.8, 20.5, 19.1)

The Winner:

My Little Pony:

Friendship is Magic

I should've guessed Twilight was a Shakespeare fan.

Better luck next time, Bill.

The MLP score wasn't just higher than Shakespeare's score; it was closer to perfect than it was to Shakespeare's score. That's what I'd call a sound thrashing. To visualize the difference, here's a plot of just the Mane 6, using multi-dimensional scaling to try to project them onto a 2-dimensional plane but keep the high-dimensional distances similar. This uses 2600 words per datapoint [9].

And here's the same plot of the 6 Shakespeare characters with the most lines in these plays:

[1] Harold Bloom, Shakespeare: The Invention of the Human

[2] Stephen Marche, How Shakespeare Changed Everything

[3] This wasn't easy for Shakespeare. He loved to use the same names, and similar names, repeatedly. He has five different Antonios, three Rosalines and a Rosalind, two Ferdinands, two Portias, a Hortensio and a Hortensius, a Cassio and a Cassius, two Juliets and a Julia, two Gloucesters, two Franciscos, two Franciscans, and a Francisca. He has two different Francises and two different Bardolphs in Henry IV alone. There are five parts in The Merchant of Venice with names so similar that scholars usually collapse them into two or three, supposing the ones used least often are typos. I had to separate all such parts out by hand.

[4] I explained stylo here.

[5] Plus The Merry Wives of Windsor, which features the same characters

[6] Yes, I was careful to change the character name when a new character inherited the same earldom.

[7] This time I didn’t take random samples, but used all the words from each file. This makes the clustering stronger. But otherwise it wouldn’t be possible to include the authors who wrote only about 2000 words.

[8] Shakespeare's vocabulary was larger past the first 50% of all words used, using a total of 9040 unique words in 72,550 words, while MLP used 8440 unique words in 120,311 words. The MLP count was elevated some by bad HTML character translations and by words merged together. The Shakespeare count was elevated some by Shakespeare's many arbitrary contractions in the middle of words.

[9] I can't add more characters to the plot without horking up the projection; the number of constraints to satisfy is roughly 7*6/2 = 21 for 7 characters, but 10*9/2 = 45 for 10 characters.

