Bad Horse

Stylometrics · 5:27am Aug 6th, 2014

In the last write-off, I guessed the authors of 5 stories, got 3 of them right, and was right with my 2nd-place guess on a fourth.

How did I do it? Simple: I'm smart. Smart enough to cheat.

I used "stylo", a stylometric analysis program developed in Poland. It counts the frequencies of different words, and tries to help you guess who wrote what based on which words they use more often.

Even though this is called "stylometrics", it doesn't look at sentence length, paragraph length, grammatical constructions, frequency of adjectives / adverbs / nouns / etc., punctuation, fraction of dialogue, or any of the other elements of style. It just counts word frequencies.
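To make "it just counts word frequencies" concrete, here is a minimal Python sketch of the idea. It is not stylo's code; the function name `relative_frequencies`, the tokenizing rule, and the sample sentence are all assumptions made up for illustration.

```python
import re
from collections import Counter

def relative_frequencies(text, top_n=5):
    """Return the top_n most frequent words with their share of all words."""
    # Assumption: a "word" is a run of lowercase letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    return [(word, count / total) for word, count in counts.most_common(top_n)]

# Invented sample text, only to show the shape of the output.
sample = "The pony saw the cat. The cat saw the pony and the dog."
freqs = relative_frequencies(sample, top_n=3)
for word, share in freqs:
    print(f"{word}: {share:.3f}")
```

A real run would compute such a table for every file in "corpus" and compare the rows, which is roughly what the plots further down are built from.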

This is fortunate and unfortunate. Fortunate, because word frequency is harder to disguise than style. That's why they use it. Stylometrics were developed to figure out who really wrote documents. Unfortunate, because I'd like to use it to see whether I successfully mimicked someone else's style, not their word frequencies.

Basic instructions:

1. Download R for your OS & install it. (Note: stylo will not run under 64-bit Cygwin.)

2. Create a directory, e.g., C:/writing/fiction/fan/write-off

3. Create a subdirectory "corpus".

4. Download lots of fimfiction text files into "corpus".

5. Optionally create this perl file (step 6 calls it unidecode.pl). It uses all the existing Perl libraries to convert Unicode to ASCII, none of which convert Unicode to ASCII, so I also put in search/replace lines for the most-important Unicode characters. If you don't use it, the results will be slightly, but barely noticeably, different.

use strict;
use warnings;

use Text::Unidecode;
use Unicode::String;
use Encode;

# Read text from stdin (or from files named on the command line) and print
# a mostly-ASCII version, one output line per input line or <br /> segment.
while (my $in = <>) {
    chomp($in);
    $in =~ s/\r$//;    # drop a trailing DOS carriage return
    my @sublines = split(/<br \/>/, $in);
    foreach my $line (@sublines) {

        # Non-breaking spaces and space entities become plain spaces.
        $line =~ s/Â\s+//g; # doesn't work
        $line =~ s/\x{00c2}\x{00A0}/ /g;
        $line =~ s/&#32;/ /g;
        $line =~ s/&#x20;/ /g;

        # Curly quotes and en dashes become ASCII punctuation.
        $line =~ s/“/"/g;
        $line =~ s/”/"/g;
        $line =~ s/‘/'/g;
        $line =~ s/’/'/g;
        $line =~ s/–/-/g;
        $line =~ s/\bâ€.//g; # catch more unicode crap
        $line =~ s/\bQQ([A-Z])/ $1/g;
        $line =~ s/\bQ([A-Z])/ $1/g;
        $line =~ s/\bQ(\s)/$1/g;

        # Collapse runs of whitespace and trim both ends.
        $line =~ s/\s{2,}/ /g;
        $line =~ s/^\s+//;
        $line =~ s/\s+$//;

        # Why can't I get rid of the darn unicode?!
        my $ascii = unidecode($line);
        my $u8 = Unicode::String::utf8($ascii);
        $ascii = $u8->latin1;

        print "$ascii\n";
    }
}

6. If you use files containing UTF-8, you'll get an error message when you run stylo. Convert them to ASCII like this (in Unix):
mkdir -p new
for x in *.txt; do
echo "$x";
dos2unix "$x";
( iconv -f UTF-8 -t ASCII -c "$x" | perl unidecode.pl > "new/$x" ) ;
done
mv new/* .
If you didn't create unidecode.pl, change the iconv line to
( iconv -f UTF-8 -t ASCII -c "$x" > "new/$x" ) ;

7. Run R.

8. Within R, do this:
> install.packages("stylo")
> q()

Now whenever you want to run stylo, run R, and then:
> library(stylo)
> setwd("C:/writing/fiction/fan/write-off")
> stylo(plot.font.size = 8)
Choose stuff in the GUI & click OK.
> q()

The GUI which pops up when you type stylo() lets you set various parameters. Use

INPUT = plain text
LANGUAGE = English (contr.)

Cluster analysis: Construct a "dendrogram" tree
PCA: Do principal component analysis & plot in 2 dimensions

I filled the directory "corpus" with the write-off entries, and stories by the prime suspects, and also some stories with styles that some stories tried to emulate. Specifically, these stories were supposed to have similar styles:

BHorse_Magician+Detective, CDoyle_*
BHorse_Burning Man Brony, HThompson_Fear+Loathing in Las Vegas
writeoff_The-18th-Brewmare, GOH_*, DAdams_*, TPratchett_*
writeoff_Elpis, HEllision_Deathbird

Here's the dendrogram tree. You're gonna have to zoom in:

And here's the PCA projected onto 2 dimensions. Rather than explain what that means, just know it's supposed to cluster similar stories together:

Some authors cluster tightly together, usually in both the dendrogram and the PCA (Conan Doyle, Harlan Ellison, Douglas Adams, Terry Pratchett, GhostOfHeraclitus, Silent Strider, Yipyapper). Others are all over the place (horizon, PresentPerfect, me).

The left side of the PCA analysis seems to indicate Englishness; we have Pratchett, Doyle, and Douglas Adams all on that side. They're also all clustered together in the dendrogram. It would be interesting to see where Blueshift falls. It could also be due to fimfiction formatting. Fimfiction text files have DOS line ends and Unicode punctuation marks. Converting them to ASCII as above just strips them out, joining contractions and words together.

(Or you could notice that all the professional authors plus GhostOfHeraclitus are on the left, while all the other fan-fiction authors are on the right.... Nah. I'm sure that's just coincidence. Dammit.)

I guessed Pascoite for "Naval Gazing" largely because it was stuck in between 3 Pascoite stories in the dendrogram. "Memories of a Star" was similar in content to a Silent Strider story, and was stuck right in the middle of all the Silent Strider stories (except the one it was similar in content to!) "My Big Brother, a Stranger Forever" was in the middle of a cluster of Present Perfect and Bad Horse stories, and I knew I didn't write it.

"The 18th Brewmare" was in a cluster of 4 stories by horizon and 3 by me. I should have believed the computer, but the footnotes and Pratchettness fooled me into thinking it was GOH. In the PCA diagram, it was right in the middle of GOH's stories.

So I guessed right when I trusted the stylo dendrogram results, and wrong when I didn't. The computer, looking only at word frequency, trounced me. (Except on all the other write-off entries; it guessed that either I or Pascoite wrote all the remaining ones that I didn't, and that GOH wrote the one that I wrote.) The dendrogram clusters stories by the same writer together more reliably than the PCA plot does. I think this is because the PCA, using only the first two dimensions, in this case uses only 20% of the information. (In retrospect, I should've used MDS (multi-dimensional scaling) instead of PCA.) On the other hand, the dendrogram can make them look spread-out when they aren't, when a cluster is broken across a branch in the tree that represents a small distance but looks like a large distance because the ordering of branches is arbitrary. (The stories by Yipyapper in the tree are an example; they all end up in the same branch if you reduce the number of stories being analyzed.)
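For readers who want to see what "clustering by word frequency" can look like underneath, here is a small Python sketch of a nearest-neighbour guess. It standardizes each word's frequency across texts and takes the mean absolute difference, in the spirit of Burrows's Delta. That choice of distance is an assumption, not necessarily the setting behind the plots above, and the text names and frequencies are invented.

```python
from statistics import mean, pstdev

def z_scores(table):
    """Standardize each word's column across all texts.

    table maps a text name to a list of relative word frequencies,
    with the same word order in every list.
    """
    names = list(table)
    n_words = len(table[names[0]])
    cols = [[table[name][j] for name in names] for j in range(n_words)]
    mus = [mean(col) for col in cols]
    # Guard against a zero standard deviation (a word used identically everywhere).
    sds = [pstdev(col) or 1.0 for col in cols]
    return {
        name: [(table[name][j] - mus[j]) / sds[j] for j in range(n_words)]
        for name in names
    }

def delta(a, b):
    """Mean absolute difference between two z-scored frequency vectors."""
    return mean(abs(x - y) for x, y in zip(a, b))

def nearest(table, target):
    """Name of the text whose word-frequency profile is closest to target."""
    z = z_scores(table)
    others = [name for name in table if name != target]
    return min(others, key=lambda name: delta(z[target], z[name]))

# Hypothetical frequencies of three common words in five texts.
table = {
    "AuthorA_1": [0.060, 0.030, 0.010],
    "AuthorA_2": [0.058, 0.031, 0.011],
    "AuthorB_1": [0.040, 0.045, 0.020],
    "AuthorB_2": [0.042, 0.044, 0.019],
    "mystery":   [0.059, 0.029, 0.012],
}
guess = nearest(table, "mystery")
print(guess)
```

A dendrogram goes further than this by repeatedly merging the closest texts into a tree, which is where the arbitrary branch ordering mentioned above comes from.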

Note that stylistic resemblances, even ones readers perceived strongly, were often not reflected in the word frequency clusters! For instance, everybody agrees that GhostOfHeraclitus' Dotted line stories are Pratchett-like, yet they cluster with "Thou Goddess" and "The Magician & the Detective", far from Terry Pratchett & Douglas Adams' cluster. Similarly, "Magician & Detective" is not near Conan Doyle. Douglas Adams' non-fiction is clustered with his fiction. "Elpis" was pretty close to "The Deathbird" in the PCA, which makes me a little happy, but "Deathbird" is even closer to "Magician & Detective", which is downright weird. And how did "Burning Man Brony", my Hunter Thompson imitation, instead end up right next to "Snowdrop and Nyx Get Drunk and Make Out"? The clustering seems to bear little resemblance to style perceived by readers.

Also, see how "Melt" is way, way off by itself in the PCA graph? That means horizon's weird.

Bad Horse · 1,747 views · #stylometrics #data
Comments ( 58 )

Fanfiction writing: It's officially rocket surgery.

EDIT: Also Horizon isn't weird, the graph just happens to think there are several different versions of him living inside his own head, which is just more of him to love.

I think the thing you should take away from this is that the program mistook one of your stories for GoH's writing.

Any way to get a list of often used words for each author out of it? I mean, other than the 'thes' and similar common words.

It's very surprising how accurate word frequencies are in stylometry, even with a relatively small corpus. It seems too simple to work, but somehow it does.

I wouldn't say that word frequencies can't be disguised like style can. If you're intentionally trying to dodge word frequency stylometry, it shouldn't be too difficult (tally your word frequencies and bust out a thesaurus on all your favourite words). Rather, authors will usually have several different styles depending on the context, but their building blocks (word choices) remain more or less the same. Aside from that, parsing English grammar is really, really, really hard. If word frequencies are "good enough", then avoiding having to consider the actual language makes things a lot easier.

Definitely an interesting subject!

Shit like this is exactly why I follow you. I just might try this thing on my own stories.

(Your stories aren't bad either, just generally a bit melancholy, but you already knew that. I'd probably still follow you anyway if you didn't write awesome blog posts.)

2347704 I think the easiest way to do this would be to compile lists of the words most used by the author and least used by a control sample. A ratio would work nicely.
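The ratio idea can be sketched in a few lines of Python. This is only an illustration: the function names, the add-one smoothing (used so that words missing from the control sample don't cause a division by zero), and both sample texts are assumptions.

```python
import re
from collections import Counter

def word_counts(text):
    """Count lowercase words (runs of letters and apostrophes)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def distinctive_words(author_text, control_text, top_n=3):
    """Rank the author's words by author rate divided by control rate."""
    author = word_counts(author_text)
    control = word_counts(control_text)
    a_total = sum(author.values())
    c_total = sum(control.values())
    vocab_size = len(set(author) | set(control))
    ratios = {}
    for word in author:
        # Add-one smoothing keeps the control rate above zero.
        a_rate = (author[word] + 1) / (a_total + vocab_size)
        c_rate = (control[word] + 1) / (c_total + vocab_size)
        ratios[word] = a_rate / c_rate
    return sorted(ratios, key=ratios.get, reverse=True)[:top_n]

# Invented sample texts.
author_text = "everypony said everypony knew the mare and the stallion"
control_text = "everyone said the man and the woman knew the dog"
top = distinctive_words(author_text, control_text, top_n=1)
print(top)
```

Common words like "the" end up with a ratio near or below 1, so they drop out of the top of the list without needing a stop-word list.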

2347638 I thought it was because he's not a literary changeling, able to imitate styles on a whim?

I was more curious if you could get the software to spit out that data automatically.

Nooo too much knowledge for this late at night! D:

Very fascinating though, and strangely unnerving and frustrating. For some reason I don't like the idea of computers being able to do this.

Anyway, if you ran that thing on my contest entries, I'm sure they'd be all over the map. My writing has shifted a lot.

And now everyone's gonna start doing this come writeoffs. Yay no more anonymity!

Well, I know what I'm doing tomorrow.

It is unusual that it cannot run under 64-bit and cannot handle Unicode nowadays. Someponies have distinctive punctuation. I myself like inverted punctuation such as ¿Question-Marks?, ⸘InterroBangs‽, and ¡Exclamation-Points!.

I love data-based literary analysis. Numbers are (marginally) harder to lie with.

I'm pretty sure that all you really need to do to determine if I wrote something is search for semicolons, as apparently I use them quite often. Temptation, a story which is 2,658 words long according to FIMFiction, has 31 semicolons (and only 113 full stops and 196 commas and 25 question marks, along with 4 ellipses, a lonely colon, and a whopping zero exclamation points); my unpublished story, which is about 10k words long, has slightly north of 100. Normal English composition is about 2% semicolons, though it varies somewhat, but Temptation's punctuation is 8.4% semicolons.
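The 8.4% figure can be reproduced from the counts quoted in the comment above. A quick Python check, assuming each ellipsis counts as a single punctuation mark:

```python
# Punctuation counts for "Temptation" as quoted in the comment.
counts = {
    "semicolon": 31,
    "full stop": 113,
    "comma": 196,
    "question mark": 25,
    "ellipsis": 4,
    "colon": 1,
    "exclamation point": 0,
}

total = sum(counts.values())
share = counts["semicolon"] / total
print(f"{counts['semicolon']} of {total} marks are semicolons: {share:.1%}")
```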

Fascinating. I had been meaning for a while to download the signature software and see what it could say about fimfiction, after reading about how it was used to show that Robert Galbraith and JK Rowling are the same writer (but Barack Obama is not Bill Ayers)

You might be conflated with me on that criterion, however.

So what you're saying is, the graph is correct?

Your stories, from a brief survey, don't seem that heavily laden with the finest of all punctuation marks, but maybe I'm just looking at the wrong ones.

To be fair, my own stories vary on that count; some of them are super semicolon heavy, others are not. Crêpes, for instance, has a mere 17 semicolons in 23.5k words; the first chapter of We Can't Turn Back Time has four semicolons in 2.9k words, while the second chapter, clocking in at only 685 words, has 6; the difference being that the first is very oriented on the outside, on things happening around them in the environment and conversation, while the second chapter is very internally driven.

Apparently, I think people think with semicolons. I know I think with semicolons.

Reason #1255 why I love Bad Horse, right there.


My stories are described as Pratchett-like, it's true, but I'm not sure my vocabulary is Pratchett-like. For instance, I suspect that a confounder is my reluctance to use phonetic accents or extensive regionalisms in my dialogue. I also shy away from onomatopoeia. Pratchett uses the first quite a bit, and the second less, but still more than me. My similarities to the Great Master, such as they are, are more macroscale.

Burning Man Brony[1], I have to say, doesn't sound like Hunter S. Thompson to me. It lacks the sort of manic energy that to me characterizes the man's writing. The teeth-on-edge, stimulant-soaked 4AM writing tinged with sweat, anxiety, and desperation. Your stuff is deliberate. Meditative. Sombre.

Interesting. One thing I've noticed is that I tend to mimic the style of writers that I'm reading at the moment (I tend to interleave writing with catching up with my library book pile before I get late notices). It explains a lot of my Bujold/Weber tendencies when I was working on Monster, and a swing to Pratchett footnote style on Tutor.

Author Interviewer

Dear Princess Celestia, today I learned that I am actually Bad Horse.

I suspect that a great deal of the complexity that this program seems to ignore will be reintroduced into the equation by the fact that there are only so many ways to use a word and so many contexts to use them in. Using any jargon, for example, will dramatically narrow the possibilities. There are some words that almost never get used outside of certain genres - fantasy, sci-fi, military, etc.
Adjectives and nouns are self-explanatory, but dialogue uses words like "hey" and "hello" that work better for finding out the chattiness of the author than trying to count quotation marks.

This is a pretty neat thing you've posted, BH. I'll have to play with it when I get home and have access to actual internet.


*dives out through stained glass window of Cadance, shattering it completely while somehow getting stuck*


Wait, can this computer sorcery be used on forum posts? I bet it can be used to detect sock puppets, keeping the world safe from catfishing.

Spoiler: I am Twifight Sparkill's alt :]

And you just made me spend a fair amount of time installing that software[1], downloading those same texts (at least the ones from FIMFiction), and converting all my fics, posted or not, into text files :twilightoops:

The results when analyzing my entries from the previous Writeoff were interesting; the software tells that all three are most likely written by either you, Pascoite, Horizon, or Ghost of Heraclitus :scootangel:

Also, if I write the next one in the style of Fireworks for a Princess, will you peg it as from PP or Pascoite? :trollestia:

(Though it's impressive; those four are the only ones among all my fics, published or not, that it doesn't precisely nail as mine. I guess my vocabulary bias due to English not being my native language is showing in a very particular way.)

BTW, one of the reasons for MLP fanfiction authors to be somewhat off compared with traditional ones might be the use of pony words, as well as the larger than normal usage of terms related to horses. I doubt Terry Pratchett or Douglas Adams ever wrote something like "everypony" in a book, but nearly every MLP fanfic will use some of the same subset of pony words :scootangel:

[1] If installing from the R environment homepage, the 64-bit Windows version works flawlessly, BTW.

BH, can you share an archive of the "corpus" directory you used for these results? Would love to add things to it starting from the same base you used.


Burning Man Brony[1], I have to say, doesn't sound like Hunter S. Thompson to me. It lacks the sort of manic energy that to me characterizes the man's writing.

Sigh. You're right. I wanted it to, but it doesn't. In retrospect, I don't think it could have worked well that way.

I assume that footnote you referenced will pop up later in a comment or blog post.

:facehoof: Sorry. Got interrupted when writing my post.

[1] Which is, despite not quite meeting what you set out to do, an excellent story with considerable depth. It's also quite Bad Horsean especially inasmuch as the writing is so sharp it draws blood.

A very BH trait, that.

BTW, I just added five MLP books from G.M. Berrow to the sample, and they land tightly grouped but smack dab in the middle of the fanfiction authors.

My own version is here. It lacks the professional writers, has a few more samples from Ghost (his Obiter Dicta separated in chapters), and a lot more samples from me (including unpublished stories). It also adds the five G.M. Berrow books I mentioned above.

If doing analysis about other authors here, though, it might be better to remove my samples. They tend to land so far from the average that they distort the charts. I'm not even joking, if you draw a diagonal line through the PCA chart with all my samples, the lower half has only two Present Perfect fics and all of my fics except one.

2348769 2348629 Read the new version of the post, which I posted while you wrote your comment, to include dos2unix and the unidecode.pl script. This will make fimfiction text files more like ASCII text files. It won't make much difference, because the punctuation gets stripped anyway, but I don't want a fimfiction / non-fimfiction dimension in the data due to all fim-fiction contractions being conjoined by the unicode-stripper.

Hm. This is definitely interesting! I think the dendrogram is meant to be most comprehensive; the two axes of the PCA analysis are the two biggest factors, but together they only explain about 20% of the sorting that it takes into account. And the dendrogram does cluster five of my seven stories pretty tightly — with the two outliers being the ones I myself would have ranked as outliers from my "usual style". (What surprises me is that My Harshwhinnial, which is at least as outrageously atypical as Melt — being a deliberate badfic which makes a wreck of English — falls squarely within my main cluster.)

Still, I think this isn't meant to be a "style" analysis. I think it's more like trying to tell apart (American) football teams by the in-game stride lengths of their players — it will loosely correlate with play style, because running and passing plays and defensive strategies will affect whether players are shuffling or sprinting, but it will more strongly correlate with the heights of the players, which is a useful metric for distinguishing teams but not for predicting what they'll do on the field.

I don't think PC1 is "Britishness," I think it actually maps much more successfully to date of birth. Language changes over time, so it's also pretty self-evident why that would make a significant difference in stylometry. The leftmost cluster is Doyle's books from the late 1800s, Harlan Ellison's works were from the 1960s (he's very PC1-left even though he's not British), the HHGTTG was published in the 1980s, and Pterry peaked around the 90s. The rest are contemporary, but if I'm not mistaken, you, Ghost, Present, and I are all in our late-ish 30s. My works are all over the place vertically, but they're all a bit right of 0 on the PC1 axis, except for Melt which was secretly ghostwritten by Robert Galbraith.

Oddly, The Lotus Eaters contains zero semicolons in ~3k words, which startled me because I thought I was a relative abuser. My other stories which I checked seem to run closer to the averages you cite.

2347638 2347735 2347882
Darn, my secret's out. :applecry:

2347638 2347735 2347882

Wait, what?

2348791 2348791 2348791
Would you three stop bickering? I'm trying to sleep.


Any way to get a list of often used words for each author out of it? I mean, other than the 'thes' and similar common words.

It always produces a file table_with_frequencies.txt that lists all the words it used and their frequencies in each file. The documentation, here, explains how to do what you want to do in section 7, "oppose()". The "classify()" function might also generate that info.

Oddly, running with the corpus from 2348769, minus his stories and adding in my own, Melt didn't end up an outlier relative to Horizon's other stories at all - in fact, it ended up pretty much smack-dab in the middle of them, and they were mostly fairly well clustered on PCA (with 18th Brewmare and Social Lubricant being the furthest off, though neither were super far away from each other).

GMB clustered fairly tightly as well but was way over on the right, which may or may not have compressed the left-hand side.

I have to second the demand for Bad Horse's corpus, but maybe I can find some of these online.

Anyway, for what it is worth:


Apparently most of my stories cluster with themselves and no one else, with the exception of Temptation and the Stars Ascendant, which hang around with stories from four different authors; The Collected Poems of Maud Pie, which is an outlier (possibly because it is a poetry collection written by Maud Pie); Better Lairs and Landscaping, which hangs out in a group with three other authors; and Proper Anatomical Terminology, which is written as a (fake) scientific paper, and clusters with Bad Horse's Trust and Pony Play and more closely still with Ghost's The Nature of War.

I really don't know how to interpret that last result; that is a very bizarre grouping of stories. :rainbowderp:


The PCA throws me over on the extreme left, here, (though, as you may note, the axes on this chart are quite different from Bad Horse's, being smaller - because of the exclusion of the professional stuff, presumably) but my stuff goes all the way from the top to the bottom. As I noted, Melt isn't an outlier at all on this chart, whereas it is an enormous one on Bad Horse's, which is more than a little odd and makes me think that the PCA is being misled by something.

You used PCA Covariance, which puts Melt close to the center; Bad Horse used PCA Correlation, which puts Melt as an outlier.

(And I really need to read the papers that explain what is being calculated in order to get a better grasp on what is exactly being presented in each chart.)

If I understood correctly, any such correlation between one of the axes and any kind of real world influence — place of birth, age, etc — is coincidental.

BTW, a better analogy might be gait analysis. Very little relationship with actual physical prowess (as long as it's not so deviant as to be pathological), but nearly unique.

Playing around with the cluster analysis, I found that one story of mine - We Can't Turn Back Time - falls in with my other stories on the cluster analysis when a large number of writers are used, but if you diminish the number of writers, it actually falls out of my "standard" tree and in with the other stories (Temptation, The Collected Poems of Maud Pie, The Stars Ascendant, We Can't Turn Back Time, Better Lairs and Landscaping). I dunno why that is; I guess with more writers, it picks up that it is more like my other stories than with a smaller sample size.

Ah, no wonder. Though the different charts both present value; covariance (correctly) clusters all of Horizon's stories, sticks all of my stories together on the far left (though there are some others which hang out over there), but it seems like several writers (GMB, Horizon, Present Perfect) cluster much better under covariance than under correlation. I guess my stories are a grouping under correlation... if you say "they're the stories that have wandered off away from the cluster".


I mean, I guess you can draw a continuous line around all of my stories without hitting anyone else's... but I'm not sure that any reasonable person is going to throw The Collected Poems of Maud Pie in with the rest of mine looking at that chart.

I guess that's what I get for cribbing off Maud.

Though in all fairness, correlation does group Ghost's stories quite well, and YipYapper's stories are all pretty close as well, though they're also close under covariance and much more distinct as a grouping.

Neither chart does a very good job of grouping Bad Horse's stories, and Pascoite is fairly all over the place as well (though covariance does do a bit better job with him, he still wouldn't be identifiable).

Again, I must note that this thing seems sensitive to some additions, though. Adding in YOUR stories, Silent Strider, makes the chart significantly messier:


It does make The Collected Poems of Maud Pie fall in with my other stuff, but it also seems to make everything get mixed in more, other than your own stuff, which is a fairly obvious swath with only a few things falling in with it.

Of course, all this REALLY says is that using more than one tool gives you better results... which we all already knew.

Well, unless we're talking about narcissists, anyway.


You used PCA Covariance, which puts Melt close to the center; Bad Horse used PCA Correlation, which puts Melt as an outlier.

I just read the StackExchange answer about when to use PCA with the covariance matrix, & when with the correlation matrix. The answer is that you should use the covariance matrix if the different dimensions use comparable units and have comparable variance. If you need a fast & dirty way of normalizing the data, use the correlation matrix.

In this case, all the dimensions are word frequencies, so I should've used PCA covariance.
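The practical difference between the two options is whether each word's column is rescaled to unit variance first. A small Python sketch of that difference, using invented frequencies for one common word and one rare word (the helper names are hypothetical, and no actual PCA is performed here):

```python
from statistics import mean, pstdev

def covariance_matrix(rows):
    """Population covariance matrix of the columns of rows."""
    cols = list(zip(*rows))
    mus = [mean(col) for col in cols]
    n = len(rows)
    return [
        [sum((a - mi) * (b - mj) for a, b in zip(ci, cj)) / n
         for cj, mj in zip(cols, mus)]
        for ci, mi in zip(cols, mus)
    ]

def standardize(rows):
    """Rescale every column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    mus = [mean(col) for col in cols]
    sds = [pstdev(col) for col in cols]
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]

# Each row is one story; columns are the relative frequencies of a
# common word and a rare word (made-up numbers).
rows = [
    [0.060, 0.0010],
    [0.050, 0.0030],
    [0.040, 0.0020],
    [0.055, 0.0040],
]

cov = covariance_matrix(rows)
# The correlation matrix is the covariance matrix of the standardized data.
corr = covariance_matrix(standardize(rows))

print("variances (covariance PCA sees these):", cov[0][0], cov[1][1])
print("variances (correlation PCA sees these):", corr[0][0], corr[1][1])
```

Under covariance the common word dominates because its frequency varies more in absolute terms; under correlation every word gets the same weight, so a handful of rare words can push one story far from the rest.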

(and also 2348791 from here on)
"Birthday" is a good hypothesis, but the first (and probably second and maybe third) dimension of any set involving ponyfiction and non-ponyfic stories will be poniness. I say this confidently because I already used the oppose() function to compare fan and non-fan authors. These are the top words most favored by fans vs. pros:


Yeah, I think gait analysis was what I was shooting for with my analogy but not quite hitting.

The more I look at these results, the more I trust the dendrograms over the 2d graphs. Keep in mind that the PCA is only looking at the two most significant factors upon which it based its decision, and represents only about 20% of the work the program's doing. It would be cool to see what exactly those PCA factors represent (who uses "the" and "of" most?) but in general the dendrograms seem to be doing a much better job.

The dendrograms are better, but that is not to say that PCA isn't useful; while it is a bit squashed, it can also see around some issues that the dendrograms can't, and as such bring stuff together which ends up on different branches of the dendrogram but which ends up fairly closely together on the chart - the dendrogram's weakness is that it has to make a choice about where something falls, but if it is borderline, it may end up very far away from something which is relatively close. You can see this with some analysis; some stories will "swap" which major branch of the tree they're on depending on what other stories are around. Melt, for instance, if you isolate it down to just you and Bad Horse, ends up in a tree with Hope, Alicorn Cider, Petty Pony Princesses, and Experience, but if you go with the set that I did in my first set of trees, while Melt still ends up with Hope, Pretty Pony Princesses and Experience are on an entirely separate branch of the original tree. I see a similar effect with We Can't Turn Back Time, which switches between the two major groups of my stories depending on how many other stories I'm throwing in with it.

Basically they're a good sanity check, I think; if something ends up on a totally different branch of the tree but close in PCA, that might alert you that you need to examine that one more closely.

I was just playing around with the thing, and looking at pairings of authors; some of them are very readily separable, while others are less so. You and Bad Horse, for instance, don't separate out very well, but, say, Ghost and I are much more separable, which you might not guess from the tree with all the authors. If you're trying to separate out just two people, it seems to work fairly well.

It may also be that if you were a bit clever and messed with what it was looking at a little you might get better results.


Well, there was a dimension clearly indicating whether the stories used, or not, unicode apostrophes in the corpus I uploaded earlier. I'm not really in the mood to set a Cygwin environment right now just to fix that, so instead I just used Notepad++ to do a quick search and replace on nearly 100 files, turning unicode apostrophes and double quotes into their ASCII equivalents. Not as good as the procedure that Bad Horse supplied, but works in a pinch.

The fact the whole procedure took less than a minute and needed just two searches is why I love notepad++ :twilightsmile:
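The same two-search fix can be done without an editor. A minimal Python sketch, assuming the only characters that matter are the four curly quote marks (the name `normalize_quotes` and the sample sentence are made up):

```python
# Map curly single and double quotes to their ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote / apostrophe
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
})

def normalize_quotes(text):
    """Replace curly quotes so contractions like I'm stay one token."""
    return text.translate(QUOTE_MAP)

sample = "\u201cI\u2019m fine,\u201d she said."
cleaned = normalize_quotes(sample)
print(cleaned)
```

Replacing rather than stripping keeps contractions intact, which avoids the joined-up words that appear when non-ASCII characters are simply dropped.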

I updated my corpus file with this basic de-Unicodifying. The difference is not that glaring, but it does serve to push G.M. Berrow closer to the center and causes my own work to mingle a bit better with that of the other fanfic authors; my first fic ever written, as well as my single (and clumsy :facehoof:) attempt at including poetry inside a fic, now end up outside the branch my other works claim as their exclusive domain, and the covariance graph now shows many of my fics making a beachhead for the center.

Bad Horse seems to be the hardest to get to group up; doing two-author comparisons, Bad Horse's pretty much always look mixed (though he and Ghost only have one mix-up - it thinks Ghost wrote Hope for some reason), whereas other people seem to do better in terms of clustering.

Clearly Bad Horse is a master forger. Why do you think he has all these tools?

"But how will we know the ransom letter is his?"

"This software identifies author by frequency of word usage!"

"Word usage, you say? Seems awfully simple."

"Yet amazingly effective, and impossible to thwart unless he's got a--"




the two axes of the PCA analysis are the two biggest factors, but together they only explain about 20% of the sorting that it takes into account.

Yes; I should've used multi-dimensional scaling instead, which tries to fit it all into 2 dimensions.

Well, there was a dimension clearly indicating whether the stories used, or not, unicode apostrophes in the corpus I uploaded earlier.

Check whether it isn't just fimfiction / not fimfiction. This dimension is heavy on pony words, and also on female vs. male pronouns.

For the record, I don't recommend trying to get this working on Mac OS X. It's possible, and I'm doing it, but I'm a systems administrator for a living.

R required me to install the XQuartz X11 package; dos2unix doesn't exist as a commandline command for your text converter and so I had to grab a perl script to make the same conversions; the Text::Unidecode PERL module (among others) isn't installed by default, and to install it CPAN required both sudo rights and to install Apple's CLI developer tools for access to 'make' … basically I've been yak shaving all morning. :facehoof:

2351354 dos2unix will have no effect--stylo will strip those endlines off. Sorry. I think the unidecode is useless also. The only concern is that a unicode apostrophe may be treated differently from a regular apostrophe, causing "I'm" to be parsed as "i'm", "im", or "i m". You can avoid all preprocessing problems by reading the stylo manual and choosing the safest contraction setting.

Eh, no big deal. In case anyone is insane enough to follow in my footsteps, here is the final text-cleaner.sh shell script that I wrote to replace your Step 6:

# This should be located in the stylo directory as something like "text-cleaner.sh",
# and "unidecode.pl" should be in that same directory with it.
# 20140807 horizon; ref. http://www.fimfiction.net/blog/359859/stylometrics
cd testcorpus
mkdir _new
for x in *.txt; do
echo "$x";
# dos2unix $x; ##this is not installed by default on Mac OS X, the following is equivalent:
perl -pe 's/\r\n|\n|\r/\n/g' "$x" > "_new/$x";
( iconv -f UTF-8 -t ASCII -c "_new/$x" | perl ../unidecode.pl > "_new/$x.final" ) ;
mv "$x" "$x.bak";
mv "_new/$x.final" "_new/$x";
done
mv _new/* .
rmdir _new
cd ..

(It also leaves backup files of the untouched originals just in case anything goes wrong.)

Note that this still does require the Text::Unidecode and Unicode::String modules to be installed. You can do so by typing "sudo cpan Text::Unidecode" and "sudo cpan Unicode::String" at the Terminal commandline, and then yak-shaving for a while to install all the support CPAN needs to build the modules. :rainbowwild:

Working though!

Well, this is interesting!

I started with the stories in the Bad Horse corpus, and added the rest of my (non-porn) stories (One Knight Stand was another major outlier because none of the rest of the stories are explicit), as well as a selection of Blueshift's more famous works due to BH's curiosity in the original post. Blueshift's stuff strongly clustered, but what's even more curious is that the additional data pushed Thou Goddess and The Magician And The Detective out into the professional-author clusters. MAD got thrown in with Doyle (as you were intending), and Thou Goddess got thrown in with Harlan Ellison (which I have no idea what to make of).

Dendrogram, multi-dimensional scaling (not embedding due to file sizes). The MDS suggests that GOH strongly leans toward the DouglasAdams!Pterry cluster, even though the tree doesn't show it.

Melt is still an outlier. :twilightsmile:

Here's something. I was cross-checking the files in the updated 2349436 corpus (which I was getting Unicode errors on, and had to run my program above) against the files in the Bad Horse corpus (which came in free of Unicode errors for me). I realized that the BH corpus file turned all my em dashes (—) into the letter a:

Not enough a not enough! a for yet it shone!
My reassurance fell upon a heart of stone.
I needed time enough to win Her back.
I sought out Dream, and struck by majesty,
Drawn by bashful whispers,
Once a| twice a| my muzzle kissed her,

Thou Goddess also had some odd artifacts from the BBCode:

hel[size=1.1em]p[/size] [size=0.9em]m[/size]e [size=1.1em]p[/size]lease
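Leftover tags like these can be stripped before analysis. The following is a minimal sketch, not part of anyone's script in this thread: strip_size_tags is a made-up name, and it only removes [size=...] and [/size] tags, leaving any other BBCode alone.

```shell
# Hypothetical helper: read text on stdin and delete BBCode size tags,
# keeping the text between them.
strip_size_tags() {
  # Matches "[size...]" and "[/size]"; -E works in both GNU and BSD sed.
  sed -E 's#\[/?size[^]]*]##g'
}
```

Fed the line above, it should give back "help me please".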

I replaced all the horizon stories with their clean Silent-Strider-processed equivalents, though, and basically nothing changed. Maybe it doesn't check articles.

Double-check your input, though.
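One quick way to do that check might be to count the lines in each file that contain anything outside plain ASCII, since that is where the curly apostrophes and mangled dashes show up. This is a sketch under assumptions: count_non_ascii_lines is a hypothetical name, and it relies on grep treating every byte as a character under LC_ALL=C.

```shell
# Hypothetical helper: for each file given, print "<count><TAB><file>",
# where count is the number of lines holding a byte that is neither
# printable ASCII nor whitespace.
count_non_ascii_lines() {
  for f in "$@"; do
    # grep -c exits non-zero when the count is 0, so guard it with "|| true".
    n=$(LC_ALL=C grep -c '[^[:print:][:space:]]' "$f" || true)
    printf '%s\t%s\n' "$n" "$f"
  done
}
```

Running count_non_ascii_lines corpus/*.txt (assuming your files live in "corpus") lists a count per file; anything above zero is worth opening and looking at.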

One thing I noticed is that, the smaller the story, the higher the chance it will become an outlier. Melt is your smallest story in the sample.

This works with my own stories too. If I add the three minifics from the previous Writeoff, all three become outliers, with Notebook in particular ending up much farther out than Melt. The program also guesses the wrong author for them in the dendrogram.

(This might make this kind of analysis considerably less effective for the minific rounds.)

It makes sense; with a smaller number of words to count, the chance of a small hiccup in word usage sending the values overboard is fairly high.
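A rough way to put numbers on that, under a simplifying assumption that is mine and not stylo's: treat each word in a story as an independent draw, so a word with true frequency p in a story of n words has a measured frequency with relative standard error sqrt((1 - p) / (p * n)). Real prose is burstier than that, so take this as a lower bound on the noise. The function name and the example figures below are made up for illustration.

```shell
# Hypothetical helper: relative standard error of a word's measured
# frequency, given its true frequency p and the story length n in words,
# under the independent-draws (binomial) assumption described above.
relative_se() {
  awk -v p="$1" -v n="$2" 'BEGIN { printf "%.3f\n", sqrt((1 - p) / (p * n)) }'
}
```

For a word that makes up 1% of the text, relative_se 0.01 1000 comes out at 0.315 while relative_se 0.01 100000 comes out at 0.031: about 31% noise in a 1,000-word minific versus about 3% in a 100,000-word story.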

(BTW, I didn't go after unicode errors in my corpus because I wasn't getting them. Seems like the native Windows 64 version does not have that issue, though it still counts contractions with ’ (unicode apostrophe) and with ' (ASCII apostrophe) as being different words.)

Surprisingly I was actually able to follow these instructions and successfully run the program using about 50 of my stories. I'm not seeing a ton of order among the chaos, though all of my dialogue-only stories are grouped together, probably due to the preponderance of quotation marks.

Now my question is, if I give you my email address (ponytones@gmail.com), do you think you could email me whatever your largest collection of fimfiction text files is so I can compare my stories to those of other authors?

4102514 You can download the fimfic archive torrent and get /all/ the files, but of course that is too many to use all at once. I'll email you a zip file of some stories.

When given stories by a single author, it will group them by tense and point of view, then setting. Small stories will be outliers.
