Stylometrics: Factoring out Equestria and point of view

My stylometric analysis gave us a sharp division between stories written by professional authors and stories from fimfiction. A large part of that division came from Equestrian vocabulary.

stylo produces a file, 'wordlist.txt', that lists all the words to be used for the analysis. You can take that file, add or remove words, and then check the box Features -> Existing wordlist to use that file instead of generating a new one.

I removed Equestrian vocabulary from wordlist.txt:

ponies pony hooves hoof mane equestria canterlot ponyville celestia anypony mare somepony everypony twilight stallion cutie sparkle horn mares filly pegasus rarity applejack unicorn raising magical wings magic spike library perfect apple bloom hay belle purple trotted spell alright dear darling sweetie colt crusaders snorted saddlebag elements fillies mac flank coat ears stallions crystal shining quill manehattan fillydelphia starswirl bits macintosh
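If you'd rather not edit the file by hand, the removal can be scripted. This is a minimal sketch, not anything stylo provides: it assumes wordlist.txt holds one word per line and that any lines starting with '#' are comments to be left alone. The function names and the shortened EQUESTRIAN list are mine, for illustration only.

```python
from pathlib import Path


def filter_wordlist_lines(lines, blocked):
    """Return the lines of a wordlist, minus any word found in `blocked`."""
    blocked_set = {w.lower() for w in blocked}
    kept = []
    for line in lines:
        word = line.strip()
        # Assumption: blank lines and '#' comment lines are not words,
        # so they pass through untouched.
        if not word or word.startswith("#") or word.lower() not in blocked_set:
            kept.append(line)
    return kept


def rewrite_wordlist(path, blocked):
    """Filter a one-word-per-line file in place (hypothetical helper)."""
    p = Path(path)
    lines = p.read_text(encoding="utf-8").splitlines()
    kept = filter_wordlist_lines(lines, blocked)
    p.write_text("\n".join(kept) + "\n", encoding="utf-8")


# A shortened version of the list above, just for the demonstration.
EQUESTRIAN = "ponies pony hooves hoof mane equestria".split()
sample = ["# comment", "the", "pony", "and", "Equestria", "said"]
filtered = filter_wordlist_lines(sample, EQUESTRIAN)
print(filtered)
```

You would call rewrite_wordlist("wordlist.txt", EQUESTRIAN) with the full list, then point stylo at the edited file as described above.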

Then I fiddled with the author set. Here's the current list:

Ray Bradbury, whose style I like a lot
Jeanette Winterson, whose style & content I used as inspiration for "Pony Play"
Harper Lee, whose book To Kill a Mockingbird I just finished
Truman Capote, who is sometimes suspected to have written To Kill a Mockingbird because he was Harper Lee's close friend and next-door neighbor, and because Harper Lee never published anything else despite still being alive today, 50 years later
Ernest Hemingway, because Hemingway
Harlan Ellison
Douglas Adams
Terry Pratchett
Arthur Conan Doyle
Hunter S. Thompson

Aquillo, whose style I like a lot
Bad Horse
Cold in Gardez

The result still divided pros from fans. Here's something interesting, though:

After removing the Equestrian vocabulary, Ghost's bureaucracy stories cluster with Douglas Adams and Terry Pratchett, except for two TP stories way off above.

What's distinctive about those two TP stories? They're in first-person present-tense. His others are all third-person past-tense. Looking for POV and tense, I saw it clustered my "Pony Play" with CiG's "For Whom we are Hungry". Those are the only second-person (present tense) stories in the set. It was clustering on the words "you" and "your". The first-person stories "Hard Reset 2", "My Harshwhinnial", "Written on the Body", and "Moments" formed another cluster.

So I removed these words from wordlist.txt:

i he you his her she my we me him us your i^m am


That helps some with the first/second/third-person POV. CiG's stories cluster together now, and horizon's fall in two clusters plus "Melt" (which is poetry), where before their stories were each spread out all over. But it doesn't factor out tense. stylo doesn't stem verbs, so past-tense stories have a bunch of past-tense verbs, and they will not cluster with present-tense stories.

Removing pronouns made the division between pros and fans even sharper. It moved Ernest Hemingway, Harlan Ellison, Hunter Thompson, & most of Ray Bradbury from the fan branch into the pro branch, it moved "Canterlot Carol" out of the pro branch far, far away into the fan branch, and it kicked "WTPWDestroy" out of Terry Pratchett's cluster. The only pros left in the fan branch are two short stories by Terry Pratchett, one short story by Ray Bradbury, and one by Harlan Ellison. The only fan left in the pro branch is "Whom the Princesses Would Destroy" by GhostOfHeraclitus.

I don't believe that the pros all cluster together because they're good and we're bad. stylo doesn't pick up on good versus bad. I wouldn't worry too much about it. The pro writers are all much older, and not writing pony fiction, and there are many reasons we might all have something in common that they don't.

UPDATE: I added contemporary authors David Foster Wallace, Jhumpa Lahiri, and Michael Chabon. They all went into the pros cluster. :twilightangry2:

FURTHER UPDATE: I went through wordlist.txt, ALL THE WAY TO THE END, and it still had lots of words like "twilight's", "celestia's", "twilicorn", "fluttershy's". Also lots of words that are probably biased, like "pie", "goin'", "car", "sniffed", "tv", "airplane", "george", "david", and so on. Terry Pratchett and Douglas Adams' works cluster together because stylo doesn't remove names like Zaphod or Ludmilla.

Sigh. There are just too many words whose usage is biased in Equestria for this to ever group fans and pros together--except when it does because of other irrelevant stuff like number of male and female characters, or country of origin.
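One way to hunt for biased words without reading wordlist.txt to the end is to compare how often each word turns up in the fan texts versus the pro texts. The sketch below is my own illustration, not something stylo does for you: it scores each word by the log of the ratio of its relative frequencies in two token lists, with a small smoothing constant (an arbitrary choice) so that words missing from one side don't divide by zero.

```python
import math
from collections import Counter


def relative_freqs(tokens):
    """Map each word to its share of the total token count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def most_biased(tokens_a, tokens_b, top=10, smoothing=1e-6):
    """Rank words by |log(freq in A / freq in B)|, largest first.

    Positive scores lean toward A, negative scores toward B.
    `smoothing` is an assumed constant, not a tuned value.
    """
    fa, fb = relative_freqs(tokens_a), relative_freqs(tokens_b)
    scored = []
    for word in set(fa) | set(fb):
        ratio = (fa.get(word, 0.0) + smoothing) / (fb.get(word, 0.0) + smoothing)
        scored.append((math.log(ratio), word))
    scored.sort(key=lambda pair: abs(pair[0]), reverse=True)
    return [(word, round(score, 2)) for score, word in scored[:top]]


# Toy data standing in for real corpora.
fan = "the pony trotted to the library and the pony smiled".split()
pro = "the man walked to the car and the man smiled".split()
for word, score in most_biased(fan, pro, top=6):
    print(word, score)
```

On real texts the top of this list would be the obvious pony words, but further down you'd expect the subtler ones like "cart", "pie", or "tv".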

On the bright side, the division between pros and fans says nothing at all about writing quality. Just the fictional universe in use.

Also note that each pro writer forms tight clusters. They write the same way every time. Now, we expect this to happen to some extent, because their texts are novels, while the fan texts are short stories. Novels have more words. Short stories have more randomness in their counts.

But I think the pros would cluster pretty tightly even if you looked at their short stories. Professional writers almost always find a single voice and stick with it their entire careers. Hemingway never wrote anything but Hemingway stories. John Updike never wrote anything but John Updike stories. Et cetera. You could suppose this means they've mastered their style. But some of the fan fiction authors from my earlier analysis also had all their stories clustered very tightly. Often, pros are pressured to write more of the same by their publishers, or their fans. Or they just don't see any need to change. Most professionals never even try a different point of view once they've been published.

This would be worth a post on its own, if I knew the answer. How come most professional writers go on publishing dozens of stories, decade after decade, without ever trying first person (if they write third person), or without ever trying third person (if they write first person)?

Look at the cluster at the bottom of the fan branch, with Lost Cities, Princess Luna Guards a Field, Where Have the Stars Gone, Elpis, & Twilight Sparkle Makes a Cup of Tea. Those are all moody, melancholy stories. Perhaps we're beginning to see some ability to cluster things by style.

Remember up above that I mentioned people have suspected Truman Capote wrote "To Kill a Mockingbird" because of circumstantial evidence?


Some background: Lee & Capote were next-door neighbors growing up. In 1959, Harper Lee was 33 and Truman Capote was 35. Capote was already famous from his novel Breakfast at Tiffany's, published in 1958. Harper Lee finished Mockingbird in 1959. It was published in 1960 and won the Pulitzer in 1961. In late 1959 Lee went on the road with Capote to help research his book In Cold Blood, which was published in 1966.

It would have been odd for Capote not to help Lee with Mockingbird in some capacity, particularly since immediately after writing it she spent four years helping Capote with his next book. The question is how much he helped her. This stylometric analysis suggests that he helped her a lot.

Or perhaps it was the other way around. Perhaps Lee helped Capote a lot.

Or perhaps they cluster together because both authors grew up in the same years in the same town. (Though I added Flannery O'Connor and Carson McCullers, and they grouped around Harlan Ellison.) Or because those are the only two "normal" authors amidst a collection of authors with very distinctive, immediately-recognizable styles.

Comments ( 32 )

Professional writers have to deal with stricter requirements from publishers and have professional editors; this, I believe, has a good amount of influence on the vocabulary their works use, and likely is the main cause of that professional to amateur divide. A fanfic author that wanted to be classified by that software among the professionals would likely have to stick to those same strict requirements, which I see as being equivalent to closely following a specific manual of style.

As for each professional author clustering together while fic authors are more spread, I believe the length of the compared works has as much, or more, influence in this as whether the author is professional or not. Larger works allow the author's voice to be better represented.

I'm not sure how many pony fic authors have multiple 50K+ words fics to use in such a comparison, but if there are more than a few, it might be interesting testing the software with such a sample.

I can't speak as a professional author (I wish I could), but at this point I like changing the voice to suit the story, and I've never felt beholden to first- or third-person in general. The nature of the story should dictate the narrative voice, IMHO, not the other way around.

I'm less flexible with tense, but I wrote one story on this site in present tense because I felt it fit the mood.


Larger works allow the author's voice to be better represented.

More specifically, longer works will smooth out the bumps in an author's style. As the length of a work grows, a style variation going one direction is more and more likely to be accompanied by a style variation going in the opposite direction. Longer works allow the style to regress towards the mean for that author.
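A quick simulation shows the effect. This is a toy model under a loud assumption: each word in a text is treated as an independent draw with a fixed probability of being some target word, which real prose certainly is not. The function name, the 5% rate, and the text lengths are all made up for illustration.

```python
import random
import statistics


def simulated_rate_spread(true_rate, length, trials=100, seed=0):
    """Std. dev. of one word's observed rate across simulated texts.

    Each simulated text has `length` words; each word is the target
    word with probability `true_rate` (independence is assumed).
    """
    rng = random.Random(seed)
    rates = []
    for _ in range(trials):
        hits = sum(1 for _ in range(length) if rng.random() < true_rate)
        rates.append(hits / length)
    return statistics.pstdev(rates)


short_spread = simulated_rate_spread(0.05, 1_000)    # short-story length
long_spread = simulated_rate_spread(0.05, 20_000)    # a longer work
print(f"spread at  1,000 words: {short_spread:.4f}")
print(f"spread at 20,000 words: {long_spread:.4f}")
```

The same author "voice" (the same underlying rate) produces much noisier counts in the short texts, which is one reason short stories would scatter on the chart while novels sit together.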

Say what you want, but I'm taking this as proof that Ghost is the best author on FiMFiction, and no one can convince me differently. (So don't even bother trying, Ghost.) :ajsmug:

So I'm dying to know: what are the remaining differences between us and the pros?

Do we use bigger words or shorter words? More or fewer different words? Adjectives? Obscenities? What?

(This has all built very nicely, BTW! :-) )

2369976 Nonsense. It merely proves that Ghost was the slightly premature reincarnation of Douglas Adams.

And don't you go feeding that big ego of his.

Very interesting, Bad Horse. I'm afraid I don't have anything interesting to say in reply except that there is one flaw in your body of data: Douglas Adams didn't write Starship Titanic, sort of.

Adams wrote 'Starship Titanic' the (distinctly mediocre) game. He was also contracted to write the accompanying book, but Adams never got along with deadlines. He realised that he wasn't going to make it, and with the book linked with the game, something needed to be done. So Terry Jones (ex-Python) stepped in and turned out the book in very short order (I think it was days or at most a few weeks). It shows.

That will tell you why that one DA book is in a different area to the others.
Reading it will also convince you that Adams didn't have much to do with it. I have a great deal of respect for Terry Jones, and writing a book that quick is in no way easy, but even so that book is a strong contender for the worst book I've ever read.

perhaps they cluster together because both authors grew up in the same years in the same town.

I suppose you'd see a similar effect if you analysed C.S. Lewis and Tolkien.

Also, all the professional authors you listed started writing a few decades ago. I imagine that if you selected some younger ones, you might start seeing some cross into the ponyfic cluster.

IDGI. I wanted to come in here and say something like:

Well, obviously that's because fanfic authors are less risk-averse, and open to experimenting with style and voice! If I were to say "a John Grisham novel", you know exactly what that looks like and feels like. Fanfic authors are not beholden to such samey consistency!

But then I looked at CiG, whose comedies and dramas are snuggled right next to each other. And I look at Bad Horse himself, who's all over the map. And then I'm just like "fukkit, I'ma go eat some cornbread."

I'm absolutely not the best author on FimFic. I'm not even the best author whose name starts with 'G.' I'm probably the best author whose name is a reference to pre-Socratic philosophy.


What I think may be the case is that this is possibly picking out some feature of the style, but I don't think it's style or quality. I think it's age. My writing occasionally skews much, much older sounding, I'm told, and so could cluster with older writers or writers who wrote some time in the past.

All Douglas Adams and I have in common is our inability to get the hang of deadlines.


Work has eaten me alive this week, but I did get to play around with the software a bit over last weekend (though not enough to write up anything). I added a bunch of other FIMFiction authors to test hypotheses I had. Here's one test run with some interesting results.

• I added Blueshift, since Bad Horse mentioned him in the original post.
• I added my two Mature stories, Social Lubricant and One Knight Stand, to see whether they would cluster with my other work. I also added a selection of darf's porn (and one of Eakin's), to see how strong the clustering effect of talking about sex would be.
• While I was over at darf's page, I grabbed the Pony Verse anthology to get a sort of poetry "baseline" (it contains works from ten different authors). I also grabbed Defoloce's "6 Deeds of Harmony", a long-form poetic epic. I labeled all of the poetry.
• While I was at defoloce's page, I grabbed his "Friendship is Optimal" novel, grabbed Iceman's original "Friendship is Optimal", and then went over to Eakin's page to get a selection of both his FiO work and regular pony work. (The Friendship is Optimal universe is mostly about humans in the human world, so I wanted to see whether that would produce a clustering effect.) I labeled all the FiO works.
• Finally, I played with the options a little, and discovered that you can change the number of most common words it considers. I discovered that the clustering effects appeared stronger the more words that it considered (i.e. the more data it examined per author), so I turned it up to 1000 instead of the default 100.
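For anyone wondering what "the N most common words" amounts to, here is a rough sketch. It is not stylo's code, and its tokenizer is an assumption of mine (lowercase, letters and apostrophes only); stylo's own tokenization may differ. It pools the counts from every text and keeps the top n.

```python
import re
from collections import Counter


def tokenize(text):
    """Crude tokenizer (an assumption): lowercase runs of letters/apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def most_frequent_words(texts, n):
    """Return the n most frequent words across all texts, most common first."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(n)]


texts = ["The pony said the word.", "The man said nothing."]
mfw = most_frequent_words(texts, 2)
print(mfw)
```

With n at 100 the list is mostly function words like "the" and "said"; pushing n toward 1000 pulls in more content words, which is where subject matter and tense start to show up.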

What I found out:

• By turning up the MFW from 100 to 1000, my own stories clustered much more strongly than they originally did, but into two very distinct groups. After some research and a wordlist comparison, I found the same thing Bad Horse did: this is explained almost entirely by half of my work being in present tense and half being in past.
• Either the cluster effect of poetry was weak, or Melt really is just that weird. More data needed.
• The clustering effect of Friendship Is Optimal stories seems pretty high, but Eakin's FiO and non-FiO stuff all clustered strongly together, so it's not overpowering authorship.
• The cluster effect of Mature stories is nowhere near as high as the cluster effect of authorship.
• Horizon clearly is not a changeling; factor out the tense shift and his stories all bind together strongly. The REAL changeling, as determined by his ability to appear anywhere on the chart, is actually Bad H——

(*mmph mmrph MMPH!*)

Personally, I'd be interested to see where Device Heretic falls in regards to all this, simply because I enjoy his writing so much.

And you know, you took out pony related words, but did you also take out their human counterparts, like hands, feet, fingers, pants, cars, shirts, etc? Because otherwise there would still be a lot of words that pony fanfic authors wouldn't be using that the pros would be, and this might be producing a false difference. Though perhaps not. I really have no idea.

If there truly is a difference, and the data isn't lying to us somehow, I wonder if in some measure it couldn't be due to the differing mentalities behind writing original work vs fanfiction. Writing about characters who you already love and know for an audience who already loves and knows them, as opposed to only you already knowing the characters, and possibly not loving them. Most fanfiction I think carries behind it a spirit of worship or celebration. You're not writing so much to express how you see life as you are because you enjoy the world and characters.

Perhaps it's because you're more creatively confined when it comes to fanfiction.

Or, what I personally consider to be the likeliest reason: where we get our teaching from. Like how a few blogs back you mentioned that many editors or proofreaders advocated using body language over anything else. Such ideas generally manage to pervade most of the writing community to some degree or other (thanks in part probably to large websites that host writing guides which advocate one style over another). And so a particular form of writing becomes what is most acceptable. The pros I don't think have that sort of foundation or force thrusting them in one direction, though I suppose you could argue for them it's the library of literature. But at least in bronydom, I think there's more of a Big Brother when it comes to "how to write" than for the pros writing their own original work.

It would be interesting to see where pros of other genres fall, as well as original works by current fanfic authors, to see if they fall in line with their fanfics or not.

See my post just above yours. I tried it with ponyfic authors only, and picked a selection of Friendship is Optimal works (an AU set in the human world with Equestria being an MMORPG ruled over by a friendly Celestia AI). That didn't seem to overwhelm the clustering effect of authorship.


And you know, you took out pony related words, but did you also take out their human counterparts, like hands, feet, fingers, pants, cars, shirts, etc?

I meant to, but I see that I didn't.

Did, just now. No effect other than to say Flannery O'Connor was secretly Ernest Hemingway.

There are just too many words that get used with a different frequency in Equestria. Cart, plastic, cow, school, girls, feather, stepped, carrot, animal...


All I know is that I have been clustered with "Repent, Harlequin!" and everything feels like it was worthwhile.


Look at the cluster at the bottom of the fan branch, with Lost Cities, Princess Luna Guards a Field, Where Have the Stars Gone, Elpis, & Twilight Sparkle Makes a Cup of Tea. Those are all moody, melancholy stories. Perhaps we're beginning to see some ability to cluster things by style.

Lost Cities and Twilight Sparkle Makes a Cup of Tea have, if I recall, no dialogue whatsoever. Princess Luna Guards a Field and Elpis only have one or two lines apiece. Where Have The Stars Gone has dialogue, but an abnormally small amount in among all of Celestia's churning thoughts. They're all extremely showy, scenic stories — I would be dubious that the emotional tone has much to do with it.

Granted, extreme showiness is a style unto itself.

Have you tried fanfic authors who have published professional works? The only ones that come to my mind are AugieDog, Skywriter and Applejinx, but maybe you know of others.

2370881 Skywriter forms one giant cluster, with Ghost's comedies stuck in the middle, and Philomeanie and Hoardsmiths as distant outliers.

This reminds me about a dumb movie which came out a few years ago:

It is Anonymous. It is one of those movies so stupid that upon hearing about it I decided never to watch it. The synopsis is that Shakespeare might not be some guy named Shakespeare. I agree that he might truly be John Smith. Then the synopsis listed a bunch of suspects ruled out by statistical analysis over a decade before the movie hit the screen.

Shakespeare is either the name of some guy, or is the pseudonym of an unknown person. Perhaps, in the future, we might find out who Shakespeare is, assuming that Shakespeare is not Shakespeare, but we shall not discover that Shakespeare is Marlowe, the Earl of Oxford, or Queen Elizabeth.

For those not remembering the trailer for Anonymous, it is as stupid as the trailer for Lucy:

“¡Humans only use 10% of their brains!”

¡Wrong! This movie is obviously stupid. I have no interest in ever watching it. The trailer for Anonymous causes the same reaction.

2371150 And that reminds me of one of my favorite jokes, which no one else ever laughs at: "Evidence suggests the Iliad was not written by Homer, but by someone else of the same name."

(The joke is that "Homer" only means "the guy who wrote the Iliad and the Odyssey". Tho the joke is less funny if I explain that I don't believe the same person wrote both stories, or indeed that any one person wrote either story...)


If you do not feel like getting murdered, do not bring up analyses about the authorship of religious texts.

On a sidenote, I understand why ponies could be religious, although they show no signs of religion, since they can meet their Goddesses and Discord (I figure that Q got bored, created the Universe of MLP:FIM, and incarnated himself into it as Discord), but ¿why do humans believe things without evidence and reject things with evidence? In Red States, one must believe things without evidence and reject things with evidence to win elections (it is no mystery why Red States are worse off than Blue States and theocracies are worse off than secular nations).

I'm surprised at how tightly my stories clustered. It makes me wonder if my writing is stale, somehow.

Also, seconding Horizon's motion that Bad Horse is the changeling.

2372392 I've never thought you were in a rut. I was very surprised that it clustered "Naked Singularity", "The Glass Blower", "I'm afraid of changeling", & "The carnivore's prayer" together. You can use the oppose() function to figure out why your stories cluster. It could be picking up on a single word that you reliably misspell.

PresentPerfect is equally changeliferous. It doesn't show up as well because I used fewer of his stories.


No effect other than to say Flannery O'Connor was secretly Ernest Hemingway.

I knew it! :trollestia:

Intriguing. But I wonder if the fact that Friendship is Optimal is still a fanfic matters here. I dunno.

Here's another idea. Try trimming all the entries to about the same word count. This might show us how much length affects things, as long as that's the only variable you alter.
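Trimming is easy enough to do before the texts ever reach stylo. A minimal sketch, with made-up function names: it cuts every text down to its first N whitespace-separated words, defaulting to the length of the shortest text in the set. Taking the first N words is just the simplest choice; a random sample of passages would be another.

```python
def trim_to_length(text, max_words):
    """Keep only the first `max_words` whitespace-separated words."""
    return " ".join(text.split()[:max_words])


def trim_corpus(corpus, max_words=None):
    """Trim every text in a {title: text} dict to the same word count.

    If `max_words` is not given, use the length of the shortest text.
    """
    if max_words is None:
        max_words = min(len(text.split()) for text in corpus.values())
    return {title: trim_to_length(text, max_words)
            for title, text in corpus.items()}


# Toy corpus standing in for real story files.
corpus = {"novel": "one two three four five six", "short": "one two three"}
trimmed = trim_corpus(corpus)
print(trimmed)
```

Running the clustering once on the originals and once on the trimmed copies would change only the length, which is the comparison suggested here.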

Given that stylo clustered My Harshwhinnial (my trollfic ponification of the legendarily bad My Immortal) together with Hard Reset 2 (my homage/reboot of Eakin's dark timelooping adventure) and 18th Brewmare (a Pratchettesque comedy that was widely mistaken for GhostOfHeraclitus' work while it was still being anonymously judged in the writeoffs), I can quite confidently state that what stylo measures is completely orthogonal to staleness.

What stylo is doing is tallying words throughout the entire length of a work, and then ranking them based on frequency of usage. As a general statistical principle, when you are trying to draw conclusions, more data will give you better information than less data.

True, and I'm not very knowledgeable on statistics, but I just figured that equal length would allow equal representation in the data. I mean, a 100,000 word story is going to have a higher frequency for pretty much any word than a 5,000 word story, unless the way it calculates the frequency prevents it from changing much due to length. Like if it calculates it as a percentage of the total word count, i.e. "the" has a frequency of 1000/5000=20%, and this stays consistent whether the author writes 5000 or 50000. *shrugs* :derpytongue2:

I haven't specifically examined stylo's code but I would be shocked if that's not already how they do it. Otherwise story length would override authorship as a clustering factor (because you're infinitely more likely to see approx. 10 "said"s in a 1000-word story than a 100,000-word story).
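The normalization being described is just count divided by total words. The sketch below illustrates that arithmetic only; it is not taken from stylo, whose code neither of us has checked here. The token lists are artificial, built to match the 1000-in-5000 example above.

```python
from collections import Counter


def relative_frequency(tokens, word):
    """Share of `tokens` that are `word` (count divided by total length)."""
    return Counter(tokens)[word] / len(tokens)


# Artificial texts: "the" makes up one word in five at both lengths.
short_story = ["the"] * 1_000 + ["filler"] * 4_000
long_story = ["the"] * 10_000 + ["filler"] * 40_000

print(relative_frequency(short_story, "the"))
print(relative_frequency(long_story, "the"))
```

The raw count grows tenfold between the two texts, but the relative frequency is the same, so length by itself would not pull them apart under this kind of normalization.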

But, for now, off to the writeoff! :twilightsmile:

2375448 2373660 Story length is very important in that four novels from the same author should cluster more tightly than four short stories.

If you took a bunch of stories from another fandom, like Harry Potter or Twilight (factoring out some of the common vocabulary they use), would they form their own separate clusters, or would they mix with stories from this fandom?

2372392 2373528 While trying to use stylo on the current write-off, I found it doesn't work with small stories, and it can't compare small stories to large stories. It fails completely on this 450-700 word write-off, grouping all the small stories together and all the large stories together.

PresentPerfect & I are all over the graph because we write very short stories. Your outliers are probably your shortest stories.

