
Bad Horse

You shall love your crooked neighbor with your crooked heart. -- W. H. Auden



Author clusters question · 4:10am Dec 11th, 2016

I tried clustering fimfiction authors based on who follows them. See BIG BOX OF STUFF YOU DON'T NEED TO READ below for details.

What would you say, before looking, are the different categories of stories with romance, sex, and clop?


Here are the clusters that came out when k=5. Do these make sense to you? Can you come up with names for the clusters? I can't, because there are lots of author names here I've never heard of. These are (roughly) all the authors who had at least 1000 watchers in 2015, had been online in 2015, and don't crash my code or wreck my PCA.

5.1 ?

Jet Howitzer, GentlemanJ, StreakTheFox, JaydexTheShadowKnight, Crowley, Path_of_cloud, Kody910, ROBCakeran53, SamRose, Demon Eyes Laharl, Lithe Kamitatsy, DarkWing

5.2 Long-form adventure, crossover, & human?

Bucking Nonsense, GnollReader, ed2481, Speven Dillberg, Rust, psychicscubadiver, Harry Leferts, Law Abiding Pony, Loyal2Luna, BlackWing, AdmiralTigerclaw, Niaeruzu, Arad

5.3 the decent upstanding citizens of fimfiction:

mr maximus, Blueshift, Eakin, Cosmonaut, Midnightshadow, Wanderer D, bookplayer, The Descendant, Cold in Gardez, Chengar Qordath, Chuckfinley, Bad Horse, RavensDagger, Skywriter, bats, GaPJaxie, RainbowDoubleDash, AbsoluteAnonymous, DeiStar, Alexstrazsa, defender2222, DawnFade, Hoopy McGee, Squeak-anon, Kwakerjak, Ponydora Prancypants, SteampunkBrony, SleeplessBrony, InsertAuthorHere, Ponky, Steel Resolve, Andrew Joshua Talon, MrNumbers, Soundslikeponies, Conner Cogwork, Darth Link 22, Imploding Colon, xTSGx, Pegasus Rescue Brigade, JasonTheHuman, Tchernobog, Capn_Chryssalid, Daetrin, GhostOfHeraclitus, Dennis the Menace, Razorbeam, Come Hither

5.4 the cool kids:

Regidar, Akumokagetsu, shortskirtsandexplosions, GeodesicDragon, totallynotabrony, Yukito, Overlord-Flinx, RainbowBob, MythrilMoth, TittySparkles, BronyWriter, Art Inspired, Anonymous Pegasus, ocalhoun, Obselescence, Rated Ponystar, Kaidan, McSquirmy, Georg, The Abyss, Nostalgia Schmaltz, Lapis-Lazuli and Inky J, Aegis Shield, Mr101, Bad_Seed_72, Draconian Soul, Comet Burst, Shakespearicles, Loyal, Bronystories, Pen Stroke, Aragon, Marshal Twilight, The Wizard of Words, Spacecowboy, KnightMysterio, MerlosTheMad, milesprower06, the parasprite, RealityCheck, Brony2893, meme-asaurus, Daemon of Decay

5.5 cloppers? why are DisneyFanatic23 and obabscribbler here?

kudzuhaiku, Sir Hat, TheNewYorkBrony, Tatsurou, Justice4243, Holy, Pedro Hander, Zamairiac, Vengeful Spirit, obabscribbler, little big pony, Distorted Flare, Onomonopia, MadMaxtheBlack, DisneyFanatic23, whatmustido

k = 12

Here are the clusters with k=12. Do these categories make useful, finer discriminations than the previous 5? Can you come up with names for them?

12.1. people with >3000 followers and MrNumbers

shortskirtsandexplosions, totallynotabrony, Obselescence, mr maximus, Eakin, Georg, Aegis Shield, Wanderer D, The Descendant, Chengar Qordath, Pen Stroke, Hoopy McGee, MrNumbers

12.2. romance and classy clop

Cosmonaut, bookplayer, Chuckfinley, bats, AbsoluteAnonymous, Alexstrazsa, DawnFade, SleeplessBrony, Steel Resolve, Soundslikeponies, Darth Link 22, Tchernobog, Come Hither

12.3. cloppers

Sir Hat, Tatsurou, Holy, Zamairiac, little big pony, Distorted Flare, Onomonopia, MadMaxtheBlack, whatmustido

12.4. me and I guess other people too

Blueshift, Cold in Gardez, Bad Horse, Skywriter, GaPJaxie, RainbowDoubleDash, Kwakerjak, Ponydora Prancypants, InsertAuthorHere, Ponky, Imploding Colon, xTSGx, Capn_Chryssalid, Daetrin, GhostOfHeraclitus

12.5. long SF&F adventures, with lots of humans+crossovers

Bucking Nonsense, GnollReader, ed2481, psychicscubadiver, Harry Leferts, Law Abiding Pony, Loyal2Luna, AdmiralTigerclaw, Arad

12.6. people popular in 2013? also, 5.1

Jet Howitzer, StreakTheFox, JaydexTheShadowKnight, Crowley, Path_of_cloud, ROBCakeran53

12.7. clop and disney?

TheNewYorkBrony, Justice4243, Pedro Hander, Vengeful Spirit, obabscribbler, DisneyFanatic23

12.8. comic adventure?

Midnightshadow, RavensDagger, DeiStar, defender2222, Squeak-anon, Rust, SteampunkBrony, Andrew Joshua Talon, Conner Cogwork, Pegasus Rescue Brigade, Niaeruzu, JasonTheHuman, Dennis the Menace, Razorbeam

12.9. the other half of 5.1

GentlemanJ, Speven Dillberg, Kody910, SamRose, BlackWing, Demon Eyes Laharl, Lithe Kamitatsy, DarkWing

12.10. odd mix, nearly all taken from 5.4

Regidar, MythrilMoth, kudzuhaiku, ocalhoun, Rated Ponystar, The Abyss, Lapis-Lazuli and Inky J, Bad_Seed_72, Aragon, Marshal Twilight, the parasprite, Daemon of Decay

12.11. cutesy cloppy romance

Yukito, Overlord-Flinx, Art Inspired, Anonymous Pegasus, McSquirmy, Nostalgia Schmaltz, Shakespearicles, Loyal, Bronystories, The Wizard of Words, KnightMysterio, milesprower06, Brony2893, meme-asaurus

12.12. what do these people have in common?

Akumokagetsu, GeodesicDragon, RainbowBob, TittySparkles, BronyWriter, Kaidan, Mr101, Draconian Soul, Comet Burst, Spacecowboy, MerlosTheMad, RealityCheck

Here's a picture of the k=12 groups, plotted in colors using all 12 groups, at points given just by the first 2 principal components. It shows you something about which groups are close to each other.  The arrows show which authors watch which other authors. This shows that most clusters are highly intra-connected, but some (12.3, 12.9) are not.

I had another, similar graph which I originally posted on this blog, which by chance came out with the authors watched by most other authors in the upper left, surrounded by clusters each of which watched the super-popular authors in the upper left, but didn't watch any of the other clusters.  But it was in postimg, and is now lost forever.


1. For each pair of authors A & B, I computed the probability of a watcher of author A watching B = P(w(X,B) | w(X,A)), and vice-versa. (Here w(X,A) means "user X watches author A".)

2. I computed the probability of a watcher of author A watching a random person: P(w(X,_) | w(X,A)). This isn't the same as the above because it isn't computed for a specific person B.

3. I computed m(B,A) = P(w(X,B) | w(X,A)) / P(w(X,_) | w(X,A)). This is the increase in likelihood that someone who watches A will watch B that is due to the fact that it is B, not to the fact that people who watch A watch a lot of people.

4. Let m2(A,B) = sqrt(m(A,B) * m(B,A))
m2(A) is then a vector of likelihood ratios for author A, saying how similar A is to each other author in the set.

5. I normalized this to have a flat distribution using math. It's late. Details are boring.

6. I ran PCA on the vectors, looked at the variance each resulting PCA component accounted for, and decided to keep the first 4 component vectors of PCA. Don't ask me to explain this. You can Google PCA, but it is gonna take a long time to understand.

7. I did k-means clustering on the authors using those 4 PCA components. Again, Google it. To pick k, the number of clusters, I looked at how the within-groups sum of squares decreased as I increased k. I also listed some groups I thought should form, and had the program compute how close its groups were to mine. I decided k=5 and k=12 were pretty good.
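Steps 1 to 4 above can be sketched in a few lines of Python. This is only an illustration, not the code behind the post: the function name `lift_similarity`, the toy `watchers` data, and the reading of step 2 (the chance that a watcher of A watches a randomly chosen other author in the set) are all assumptions. The normalization, PCA, and k-means steps are left out.

```python
from math import sqrt

def lift_similarity(watchers):
    """watchers maps author -> set of user ids watching that author (hypothetical input)."""
    authors = sorted(watchers)

    def p_cond(b, a):
        # Step 1: P(w(X,B) | w(X,A)) = share of A's watchers who also watch B.
        return len(watchers[a] & watchers[b]) / len(watchers[a])

    def p_any(a):
        # Step 2 (assumed reading): chance that a watcher of A watches a
        # randomly chosen other author from the set.
        others = [b for b in authors if b != a]
        return sum(p_cond(b, a) for b in others) / len(others)

    def m(b, a):
        # Step 3: lift of B given A; 0.0 if A's watchers watch nobody else.
        base = p_any(a)
        return p_cond(b, a) / base if base > 0 else 0.0

    # Step 4: symmetric score m2(A,B) = sqrt(m(A,B) * m(B,A)).
    return {(a, b): sqrt(m(a, b) * m(b, a))
            for a in authors for b in authors if a != b}

# Toy data: three made-up authors and five made-up user ids.
watchers = {"A": {1, 2, 3, 4}, "B": {1, 2, 3}, "C": {4, 5}}
m2 = lift_similarity(watchers)
print(round(m2[("A", "B")], 3))  # 1.732: A and B share most of their watchers
print(round(m2[("B", "C")], 3))  # 0.0: B and C share no watchers
```

Each author's row of m2 values is the vector that the later steps normalize and feed to PCA.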

Bad Horse · 1,460 views · #math #fimfiction
Comments ( 63 )

Huh. So many authors.

Demon Eyes Laharl

the decent upstanding citizens of fimfiction

I don't understand how these two aren't a part of the same group.

Wonder how many of them are Mature fic writers. (Thinking because I'm mostly interested in the Light Side of the Horse, I don't follow them, so that's why I don't recognize them.)

(does some checking) Hm. Interesting that I don't show up in the Romance section (a phrase I *never* thought I'd say, ever). A quick search of Romance by Rating shows 2 of the top 10 as mine. (another phrase I never thought I'd say) Ponies is educational! :pinkiehappy:

Do the principal components seem to have any physical meaning? For example, is there a PC that splits clop writers from non-clop writers or splits different writers of different genres?

Also, it's interesting to note that shortskirtsandexplosions and his alt, Imploding Colon, partition into different clusters.

Oh God. I'm afraid to look.

4335559 See the figure I posted after you asked. If the dimensions have a simple meaning, I don't know what it is.

I suppose Imploding Colon groups separately bcoz ss&e made that account to post a different kind of stuff.

4335564 You aren't on there bcoz you didn't have 1000 followers in Oct. 2015. Also, we're tied for number of followers and 116th place. Dammit, I thought killing you at Bronycon would stop you.

Didn't you know that killing me only made me stronger?

And it's probably just as well that I'm not on your list; I might have broken your algorithm.

This is an interesting analysis.

groups 12.5. and 5.2 : I know a few of these people post on the Spacebattles/Sufficient Velocity forums, and a few of the others seem to have science fiction stuff. Doesn't fit all of them, but I thought I'd throw it out there.

Based on: Bucking Nonsense, GnollReader, psychicscubadiver, Harry Leferts and AdmiralTigerclaw, I'm guessing 'section 5.2' is either 'human' or 'crossover'.

...if I'm correctly remembering what I've seen by those authors.

4335601 Good catch! I checked 4, and they all write SF & F adventure crossovers.

4335583 Interesting. High pc2 and low pc1 definitely predicts me having heard of the authors. I would consider both the red cluster (high pc2) and the cyan cluster (low pc1) to be popular authors whose fics often get featured, but they write very different types of stories, probably for different core audiences. Authors in the green cluster, with both high pc2 and low pc1, probably bridge those two audiences. For example, it contains authors of some very famous fics plus two fimfic moderators.

Apparently, the Fallout:Equestria universe must have messed with your analysis as kkat seems to be missing from your lists.

> I probably should have clustered stories instead of authors.

There can be a next time :derpytongue2:

Fun read! Thanks for describing your process to the detail that you did.

4335627 Oh, yeah... minimum of 4 stories. That was to cut down on the number of authors, but I guess it makes little sense. Better to up the watcher req.

12.1. people with >3000 followers and MrNumbers

I was playing DnD when I read this, and my group learned I wasn't paying attention when I read that K12 grouping and fell over laughing.

Question: How do these group clusters look when you solve for "Follower to Fic ratio", my personal favourite metric?

Huh... didn't realize I was a clopper. Although, given all this mature shit I wrote recently, I guess it makes sense.

Can you plot which of these authors follow each other on the same chart?

I suspect these should also clump into groups, but more visually.

I've been following you since we met at a tiny MLP convention in 2013ish near D.C. and I hadn't gotten any kind of "fame" yet. I think Lee Tockar was the biggest name there, or was that Tsitra360? It was like sub-200 people total...

Anyway, following you all this time has finally paid off because this graph is awesome... and I have no idea what it means. Can someone get in touch with the folks from Reddit's "Explain it like I'm 5" please? :pinkiehappy:

Kidding aside, so I talked regularly with people from 5.4, 12.10, 12.11, 12.12, and almost no one outside those groups. It makes sense authors that hang out tend to promote each other or have similar interests, which means similar stories for followers who end up following them. I just wonder why when K=5 all the people I regularly chat with are grouped so nicely and all in one spot, and when K=12, suddenly my friends are split into three very different groups.

I'll be having a talk with my "friends", clearly they're not as good a friend as I thought if they ditch me as soon as K >= 6. :pinkiecrazy:

Also two things are clear:
1. You need to take over the algorithm for the "Similar stories" side bar on this site because that thing is honestly just fucking awful. (And in contrast, this graph and the algorithms you used seem spot-on.)
2. I've finally found someone who loves doing math more than me.

That code change you mentioned that kicked GhostofHeraclitus out of your group did a lot more than just that. Most egregiously, it halved the membership of 12.7, and put those authors in 12.3 instead. It also threw Bucking Nonsense and Daemon of Decay in there as well, neither of whom really fit that well there, IMHO. There are other discrepancies as well, but it's currently 3am, and I need sleep.

I was sent here by Kaidan.

Can't figure out much more than him, other than this looks like how closely plotted authors share followers...

For that 12.12 group I took the liberty of going through all the names and gathering all the characteristics that stand out.

Draconian Soul,T/E/M,BLOGGER
Comet Burst,E/T/M,BA

The most common similarity I can see is that they all either have a high frequency of mature stories (M in the top two slots), a focus on mature stories (MF - mature focus/QM - quality mature), or have a lot of effort put into their blog (BLOGGER).

It's all subjective though, and someone who knows the authors better might have something more to add.


MEMES - memes
BA - been away, taken a break, temp banned, or had a period where they just weren't active
DEAD - left the site, deleted their content, banned, etc.
FETISH - writes some fetishes
PS - popular series

T - teen
M - mature
E - everyone
EM - mix sharing the same slot because I didn't want to redo it with more slots

Something that I might find interesting is an investigation of circular relationships in follower lists.
Could use that to pull a network of how authors know each other and see how it coincides with their groupings.
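One way that check could look, as a rough sketch: find pairs of authors who watch each other, then see how often both halves of a pair land in the same cluster. The names `mutual_follows` and `same_cluster_share`, and the toy data, are made up for illustration; none of this comes from the post's actual code.

```python
def mutual_follows(follows):
    """follows maps each author to the set of authors they watch (hypothetical input)."""
    pairs = set()
    for a, watched in follows.items():
        for b in watched:
            # A pair counts only if the watch goes both ways.
            if a != b and a in follows.get(b, set()):
                pairs.add(tuple(sorted((a, b))))
    return sorted(pairs)

def same_cluster_share(pairs, cluster_of):
    """Fraction of mutual pairs whose two authors share a cluster label."""
    if not pairs:
        return None  # nothing to measure
    same = sum(1 for a, b in pairs if cluster_of[a] == cluster_of[b])
    return same / len(pairs)

# Toy data: placeholder author names and invented cluster labels.
follows = {"a": {"b", "c"}, "b": {"a"}, "c": set()}
cluster_of = {"a": 1, "b": 1, "c": 2}
pairs = mutual_follows(follows)
print(pairs)                                  # [('a', 'b')]
print(same_cluster_share(pairs, cluster_of))  # 1.0
```

A share well above what random cluster labels would give suggests the clusters track who knows whom.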

I probably should have clustered stories instead of authors. :fluttershysad:

Yeah... :unsuresweetie:

4335704 is that an extension you are using? What does it do?

Nosey Hound, one of my scripts.

You can use it to see a list of someone's followers and will highlight when any changes occur (arriving/leaving/name changes).

Surprising that MrNumbers and me never share a group -- one would think that, seeing how we might as well be two halves of the same person, we'd have more in common. Oh, well -- he clearly won this one. I have never read anything from the people in 12.10, but apparently we're buddies now?

If a man is to be judged by his company, I fear for the guys at 12.10. Being compared to me must be a hell of an insult. I once lost a fight against a lamppost.


Oh, yeah, Nosey Hound is pretty great. My favorite script for the site by far (which is kind of depressing for Selbi, I assume) -- although I gotta admit, I don't recognize that screenshot you shared. I have the "List" option, but what's up with the arrows? Is that a new version or something?

Comment posted by Trick Question deleted Dec 11th, 2016
Comment posted by Trick Question deleted Dec 11th, 2016
Comment posted by Trick Question deleted Dec 11th, 2016

Okay! I'm deleting my previous three rambling comments and summarizing with maybe less stupidity this time.

I haven't used PCA to turn a set of binary data into a smaller set of real-valued data. It looks valid, but it feels strange. k-means is not the first thing that I would have considered for data like these.

If the resulting clusters are meaningful (they need not be meaningful in a way that makes sense to us after PCA has been used), the three rational factors I'd expect to correlate would be:

0) "type" of story
1) the dates when that author's stories featured (especially the first long feature)
2) who follows that author

The first one you've imagined, but I think the second one is more likely relevant. "Also Liked" box usually contains stories that featured on the same day that your story featured, because readers on a site like this are fugitive. Somepony is likely to follow a bunch of authors at the same time for the short duration they use Fimfiction, then leave the site forever and not come back.

An easy way to test an approximation for the second one would be to compare the authors' join dates within-cluster and see if they correlate. Could cluster 5.1 simply be old-timers...?

The third one is probably weak and may be totally confounded by the approach you're using, but might be a factor just because a lot of readers follow-back each other. I suspect some popular authors do this too.

Maybe, I don't know.

Usually when you hover over a name, you can sniff their followers from their user cards. In the current version the result gets filled in under them on your already opened list instead of spawning a whole new instance of the script.

(I'll later have an update that improves that)


which is kind of depressing for Selbi, I assume

I'm sorry Selbi. ;-;

In k5 I'm in a group with everyone I know, basically. In your graph, I'm in a group where I don't know a single soul.


I figure a major confounder is the age of the person, i.e. the Year of Peak Watcher Accumulation. This isn't a factor that is interesting to humans, but the PCA probably picks up on it a lot, since people leave and enter the fandom, or at least this site. I'm not entirely sure how you fix this. I think you'd remove some of this bias if you only focused on those Watchers[1] who are still active (for some metric of active).

[1] In the Genesis 6:4 sense. Maybe.

Author Interviewer

I literally do not understand a single word in this post. All I know is I'm not in it, so I am confused and sad. :C

4335731 Some of this is explained in the BIG BOX OF STUFF YOU DON'T NEED TO READ, which you need to read.
- The clustering is based on real-valued data, which starts with the fraction of people who watch author A who also watch author B.
- Each author is thus a point in an N-dimensional space, so k-means makes sense.
- The only data used is who follows what author.
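To make the "points in space" idea concrete, here is a small plain-Python k-means that also reports the within-groups sum of squares the post used to choose k. It is a sketch under assumptions: the points are invented 2-D toy data rather than the 4 PCA components, and the deterministic farthest-point start is my choice for reproducibility, not necessarily what the original code did.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=100):
    # Farthest-point initialisation (an assumption, chosen to be deterministic).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(xs) / len(members) for xs in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers

def within_ss(points, k):
    # Sum of squared distances from each point to its nearest center.
    centers = kmeans(points, k)
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

# Toy data: two obvious blobs, so the drop in within-SS flattens after k=2.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
for k in (1, 2, 3):
    print(k, round(within_ss(points, k), 3))
```

On real data the curve rarely has such a clean elbow, which is presumably why a second check against hand-picked groups was used as well.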

As for dates, at first authors segregated into some groups by dates--there was one entire group of "literary authors of 2012", with device heretic, AbsoluteAnonymous, and I forget who else. So I used only users who were members by Jan 1 2013, and had visited fimfiction within the past year the last time I'd checked their user page sometime in 2015. That helped.

It might be worth noting that most of the romance authors in the "romance and classy clop" grouping wrote major fics for mane six ships, and many are mods of mane six shipping groups:

bookplayer - AppleDash, TwiJack, mod of both groups
bats - TwiDash, TwiJack, mod of both groups
AbsoluteAnonymous - early PinkieDash and RariJack (plus a high rated AppleDash fic) (never mind, I was thinking of TAW there.)
DawnFade - early AppleDash
Steel Resolve - Flarity, AppleDash, TwiPie, RariPie, mod of the Flarity group and other shipping groups
Soundslikeponies - TwiDash
Darth Link 22 - TwiJack
Tchernobog - AppleDash, mod of the AppleDash group

I can't check the rest right now, but I know SleeplessBrony has written TwiDash. And Chuckfinley's Alien Shipping Syndrome gets linked all over the place by shipping writers.

With regard to methodology, I'm not quite sure PCA is the best way to go forward after you have the m2 matrix. You treat this as a general vector describing the author's audience wrt other authors' audiences, but what you really have is a pairwise distance matrix. From there you could directly apply some sort of hierarchical clustering. While PCA is a nice way of representing the data in a lower dimensional space, some type of energy minimization of nodes in an n-dimensional space might be more appropriate.

Of course, while these changes might make the analysis more theoretically sound, they would probably be a lot of effort to implement, and I'm not sure how much they would change the results. Is there good agreement between the Euclidean distances between authors from the PCA and their m2 scores?
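That agreement could be checked with something like the sketch below, which correlates each pair's Euclidean distance in the reduced space with its m2 score. Everything here is assumed for illustration: the function names, the toy coordinates, and the toy m2 values. Since m2 is a similarity, good agreement should show up as a strongly negative correlation.

```python
from math import dist, sqrt

def pearson(xs, ys):
    # Plain Pearson correlation; assumes neither list is constant.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def distance_vs_similarity(coords, m2):
    """coords: author -> point in PCA space; m2: (author, author) -> similarity."""
    pairs = sorted(m2)
    distances = [dist(coords[a], coords[b]) for a, b in pairs]
    scores = [m2[pair] for pair in pairs]
    return pearson(distances, scores)

# Toy data: invented coordinates and scores for three placeholder authors.
coords = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (5.0, 0.0)}
m2 = {("A", "B"): 1.7, ("A", "C"): 1.0, ("B", "C"): 0.1}
r = distance_vs_similarity(coords, m2)
print(r < 0)  # True: closer authors have higher m2 in this toy example
```

A rank correlation would be a reasonable alternative if the relationship is monotone but not linear.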

Yeah, I skimmed that at first then went back later after my fourth post. :derpytongue2: I'm not in the best frame of mind for reading at the moment (rather nauseous after coming off of a migraine yesterday, which is why the LOR took forever).

I still suspect a correlation between top-story post-dates for authors in the same cluster, however. I don't know an easy way to pull that data other than to write a script that will fetch it.

I'm curious about how well-correlated the regular Writeoff peeps are, but too nauseous even to check that now.

Imma take a nap.

4335995 The write-off peeps are probably very well-correlated, but few have enough watchers to be on that chart. Even PP and horizon aren't on it.


From there you could directly apply some sort of hierarchical clustering.

Did that already. Not very easy to use with large numbers of authors, esp. as the divisions made in the clustering are pretty arbitrary, so some people who are very close together end up far apart. Hierarchical clustering is only helpful if you do it a bunch of times until you see which clusters are stable.

While PCA is a nice way of representing the data in a lower dimensional space, some type of energy minimization of nodes in an n-dimensional space might be more appropriate.

Did that already, in 2 dimensions. I don't think it's as revealing as PCA or MDS.


Can you plot which of these authors follow each other on the same chart?

Done; see post.

4335826 At first authors segregated into some groups by dates--there was one entire group of "literary authors of 2012", with device heretic, AbsoluteAnonymous, and I forget who else. So I used only users who were members by Jan 1 2013, and had visited fimfiction within the past year the last time I'd checked their user page sometime in 2015. I also eliminated authors inactive for more than 1 year. That helped.

4335841 Post hoc uno verbo ad verbum ego non intellego. Et ego scio quod non ita sum confusus tristisque recedit.

4335747 If it's accurate, can you come up with a name for group 12.11? There's 7 different romance-and-sex groups, and I don't know if there are real differences between them all, or I just set k too high.

4335672 It's based on who follows you. When your stories go over ~1/3 mature+sex, you end up in one of the mature+sex groups. Not all mature+sex is clop, but I don't read enough of it to know what's "clop" and what's, well, whatever else there is in M+S. There are actually not many people who write only stories with explicit sex. The only ones in this group of 131 are Come Hither, Nostalgia Schmaltz, and SleeplessBrony. whatmustido is only like 1/4 mature+sex. You're just over 1/2.

Author Interviewer

;____; Moooooom! Make Bad Horse stop using weird words I don't understand!

Sweetie, stop teasing your brother. :ajbemused:

4336072 If it were just romance, I'd say "The Fluff Cluster", but since sex is involved, why not just the "Touchy-Feelsy Squad"? :moustache:

4336053 I guess there's a reason why most people default to PCA + k-means clustering when analyzing complex datasets.


There's 7 different romance-and-sex groups, and I don't know if there are real differences between them all, or I just set k too high.

Given what I pointed out to Georg, a better name for 12.2 might be "Ship Captains."

4336186 Or Captain Stubing. (too esoteric for the younger crowd?)
4336169 ... yeah, just what I was going to say.



Post hoc uno verbo ad verbum ego non intellego. Et ego scio quod non ita sum confusus tristisque recedit.


Man, I want 1000 followers so that I can show up on nerdy diagrams.

My ego already wanted 1000+ followers anyway (obviously), but now you've created a pointless yet irresistible bonus reward, damn you!

Alas! The pain of comprehending the words of a genius is too great for us who are foolish and mundane!

And now my time has not been wasted. Lel.

This post reminded me of people who ain't really around any more. Like TittySparkles, who wrote pervy, delicious things akin to the nutritional value of a Twinkie. Also tentacles. Bronystories, who didn't write much pervy crap in the past few years, but had a lot of data on MLP porn (BH I think if you wanted me to pay attention to the big words you should have put PORN in big, tantalizing letters so as to get my attention. Because that's something I would take effort to understand. Sure.). And RainbowBob. Where the FUCK is Bob?! I miss him soooooooooo much. I have like no one to talk to about the stupidest shit anymore. Like the grimdark comedy of 40k. Or of crazy smut. Or Battlebots. Cuz no one else likes Battlebots. Or general nonsense. He was the best online buddy I could ask for. There is a sponge shaped hole in my heart in his absence.

*intense moth crying/mourning/lots of flailing on back*


Author Interviewer

Bob got a life and left, thus proving the rest of us have no life. :C

Interesting. This is totally off the cuff — I'd have to double-check our feature list for specific numbers — but among your author data set, having a Royal Canterlot Library feature strongly correlates with group membership in 12.4, 12.1, 12.2, or 12.10 (starting with most probable).

4337506 In the ball-and-stick diagram in 4336053, names of authors with stories in the RCL are in italics. They're all around Cold in Gardez and Blueshift in the upper left except for Bad_Seed7213. Numbers after names are # of stories on EQD, also clustered in the same area.

A really interesting thing about that is those authors all appear very similar in a heat map--their distance vectors are all similar--yet in a multi-dimensional scaling plot, they spread out while most of the other authors cluster together in a lump. So there's lots more variation in who follows the authors outside that cluster, but it's noise, and cancels out when you combine the dimensions. I interpret this as meaning that good writing can be objectively identified (using the subjective judgements of readers) because their readers' preferences have a systematic, discriminating structure across authors.
