Statistics · 3:21am Apr 20th, 2016
There are three kinds of lies: lies, damned lies, and statistics. I’m about to engage in the third one.
That is, I remembered about the Great FimFiction Dump and decided to finally do something interesting with it.
To start with, I wrote a script to unwrap the stories from EPUB and normalize the text for further analysis [1], squishing it into neat little 20-megabyte JSON packets together with the accompanying metadata [2]. The script is still chewing through them, which is going to take a while but will considerably simplify any further activity. In the meantime, I’m wondering just what kind of useful things I can do with that sort of data. Various naive metrics are only moderately interesting [3]. While I could just feed it all into one clusterizer or another, I doubt that simply grouping stories together and hoping something interesting emerges is a particularly promising idea.
There is only one possibility I can currently identify which doesn’t seem like a complete waste of time: Taking a set of stories a given reviewer judges good, and then using them to train a classifier to fish for similar texts among the stories which never received attention, i.e. never passed a certain views/upvotes threshold. But for this, I need a list of story IDs to train it on.
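To make the idea concrete, here is a minimal sketch of the "fish for similar texts" step. It is not a trained classifier in any serious sense: it just ranks candidate stories by cosine similarity between their bag of words and the summed bag of words of the stories a reviewer liked. The function names and the toy texts are made up for illustration, and a real attempt would probably want proper term weighting and a real learning algorithm.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count word occurrences."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(liked_texts, candidate_texts):
    """Rank candidates (dict of story id -> text) by similarity to the liked set.

    Returns a list of (similarity, story_id) pairs, most similar first.
    """
    centroid = Counter()
    for text in liked_texts:
        centroid.update(bag_of_words(text))
    scored = [(cosine(bag_of_words(text), centroid), story_id)
              for story_id, text in candidate_texts.items()]
    return sorted(scored, reverse=True)


# Toy data, purely hypothetical.
liked = ["the pony sailed the airship over the mountains",
         "an airship adventure in the mountains"]
candidates = {1: "airship crew crossing mountains",
              2: "tea party gossip cake"}
ranking = rank_candidates(liked, candidates)
```

The candidate texts would be the low-attention stories below the views/upvotes threshold, and the top of the ranking would be the ones worth a human look.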
Everything else that comes to mind, while perfectly possible, seems kind of pointless. It should be trivial to filter, for example, Displaced stories, but that wouldn’t be very useful. There’s probably some possibility in sentiment analysis, but it’s not clear what kind of sentiment would be interesting to investigate.
Any ideas for a particular property I could fish for? Preferably, with a long list of stories I can work from.
3.9 gig (zipped) of Ponyfiction. No wonder I'm behind.
3882812
At just ~130000 stories, there's much less of it than I expected, to be honest. :)
I dunno, I think leaving stop words in might actually be more interesting. I'd be interested to know how readability metrics, or the proportion of a story's total words that are stop words, relate to its overall story rank, upvotes, and views. Anecdotally, I'm very likely to ignore stories like that. I know the site tends to have more generous tastes than I do, but I'd still expect a pattern.
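For what it's worth, the stop-word proportion is cheap to compute if the stop words are left in. A rough sketch, assuming a plain-text story and a stop-word list of your choosing (the tiny list below is only a placeholder, not a recommended one):

```python
import re

# Placeholder stop-word list for illustration; a real one would be much longer.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "to", "in",
                        "is", "it", "that", "was", "i"})


def stop_word_ratio(text, stop_words=STOP_WORDS):
    """Fraction of the words in `text` that are stop words (0.0 for empty text)."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(word in stop_words for word in words) / len(words)


ratio = stop_word_ratio("The cat sat in the hat")  # 3 stop words out of 6
```

That ratio could then go into the same table as rank, upvotes, and views to look for a relationship.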
There's also a general perception, I think, of "downvote tax" on certain story categories. I think this is a thing other people have looked at, but it's still interesting. If you can identify stories with similar word sets and frequencies, and see whether certain tags lead to different audience reception, that could be interesting (though also pretty hopelessly confounded by author effects, I expect, unless you can pick some word sets that'll let you compare over very large numbers of stories).
If you're really looking to have some fun, you could start modeling author effects for story views, or even AR processes based on the success of an author's previous stories.
If you want to train a classifier, you'll need a very large set of stories to begin with. I would suggest raiding the folders of review groups (Seattle angels, The Royal Guard, Equestria Daily) to get a large, if imperfect, starting sample. Do you want to establish the dimensions by hand, or will you let your software figure them out itself?
3882933
Good point. I'll try a different cleanup method next time; after all, the original epubs aren't going anywhere. I think I'll start with trying to fish something out of a simple bag of words, though...
I think this one is pretty much hopeless, except when discounting actual story content entirely.
I don't think I've got the right data for this sort of thing. This would need follower counts and follower count change statistics, which would require me to scrape Fimfiction myself. I'm not exactly up to it. :)
3882950
Establishing things by hand sounds way too much like work! :)
3883208
Nah, not really. Don't think so, anyway. I mean, if you wanted to model them for specific authors, you might well need that—but I'm talking about looking at how much of the variability in story votes can be explained by authors in general, or by previous stories by the same author. For that, I think you just need to know the authorship on the stories (and the date of publication, if you're looking for an AR setup, though this is going to get screwed up by multi-chapter stories published over a long period of time). In its simplest form, we're just talking about something like:
Y_ij = A_i + V_ij
A_i ~ Normal (mu_A, tau_A)
V_ij ~ Normal (mu_V, tau_V)
All A_i's and V_ij's mutually independent
Where Y_ij is, say, the upvote count on Story j by Author i, A_i is the portion of upvotes attributable to the author, and V_ij is the number of upvotes attributable to the story itself. This is probably a pretty bad model (it's not bothering to model A or V beyond a rudimentary probability statement, it's ignoring a lot of important information like genres and word counts, and it's assuming a Normal distribution on things that are pretty clearly skewed), but it's a starting place. Yes, having follower counts would let you build a more sophisticated model for the A_i's, but you could probably learn something just leaving it purely random as well.
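As a rough illustration of how the model above could be fitted from nothing but (author, upvotes) pairs, here is a sketch using the classic one-way ANOVA method-of-moments estimates of the two variance components. It reads tau_A and tau_V as variances, which is an assumption, and since mu_A and mu_V only ever appear as a sum, it works with a single grand mean. The function name and the toy numbers are made up.

```python
from collections import defaultdict


def variance_components(rows):
    """Estimate (author variance, story variance) from (author_id, upvotes) pairs.

    Uses the one-way random-effects ANOVA estimators, with the usual
    correction for authors having different numbers of stories.
    """
    groups = defaultdict(list)
    for author, upvotes in rows:
        groups[author].append(upvotes)

    k = len(groups)                                  # number of authors
    n_total = sum(len(g) for g in groups.values())   # number of stories
    if k < 2 or n_total <= k:
        raise ValueError("need 2+ authors and at least one author with 2+ stories")

    grand_mean = sum(sum(g) for g in groups.values()) / n_total
    means = {author: sum(g) / len(g) for author, g in groups.items()}

    ss_between = sum(len(g) * (means[a] - grand_mean) ** 2 for a, g in groups.items())
    ss_within = sum((y - means[a]) ** 2 for a, g in groups.items() for y in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)

    # Effective group size for unbalanced data.
    n0 = (n_total - sum(len(g) ** 2 for g in groups.values()) / n_total) / (k - 1)

    var_author = max(0.0, (ms_between - ms_within) / n0)  # clamp negative estimates
    return var_author, ms_within


# Toy data, purely hypothetical: two authors with two stories each.
var_author, var_story = variance_components(
    [("a", 10), ("a", 12), ("b", 20), ("b", 22)])
author_share = var_author / (var_author + var_story)
```

`author_share` is then the fraction of the variability in upvotes that this crude model attributes to the author rather than the individual story. Given how skewed upvote counts are, it would probably make sense to run it on log-transformed counts.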
3883227
Ah. Yeah, I need to remember (and/or learn) more math than I thought I could get away with. :) On the other hand, this wouldn't involve chewing text, just the index file would be enough to compute that...
3883239
True, dat.
3883208
Defining dimensions is fun, but I see that that might be a way of introducing some sort of bias.
On the other hand, the algorithms that work on simple sets of coordinates would be simple, and you could probably use some well-established image recognition methods.
3883242
Hahahaha!
So I prepared a csv for ANOVA (Gee, I still remember what the acronym means, who'd have thought. I might even remember some R syntax!) by extracting the interesting values and computing reasonable derived values: (upvotes - downvotes) / views, for one.
Then I sorted the csv by the above derived value and was surprised to find that for several stories it was above 1. The one with the highest value was the little gem with story id 289579, which happens to have mature/sex tags (it looks like an adventure story otherwise), so I think I'm not allowed to link to it.
At the time of writing, it has 44 views, but 51 upvotes and 8 downvotes. The same oddity is also observed on https://www.fimfiction.net/story/56407 and 167757, which has apparently been deleted since it got into the archive. Looks like blind upvoting and downvoting without ever opening the story is so much a thing, it can produce ultra-skewed numbers like that.
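For anyone who wants to poke at the same oddity, here is a small sketch of the check. It assumes a csv with columns named id, upvotes, downvotes and views, which are made-up names rather than the ones in the actual index, and it flags every story that has collected more votes than views. The first sample row uses the current numbers quoted above; the second is invented.

```python
import csv
import io


def net_vote_ratio(upvotes, downvotes, views):
    """(upvotes - downvotes) / views, or None when the story has no views."""
    return (upvotes - downvotes) / views if views else None


def suspicious_stories(csv_file):
    """Yield (story_id, ratio) for rows where total votes exceed views."""
    for row in csv.DictReader(csv_file):
        up = int(row["upvotes"])
        down = int(row["downvotes"])
        views = int(row["views"])
        if up + down > views:
            yield row["id"], net_vote_ratio(up, down, views)


# Hypothetical column layout; in practice this would be an open file.
sample = io.StringIO(
    "id,upvotes,downvotes,views\n"
    "289579,51,8,44\n"
    "1,5,1,100\n"
)
flagged = list(suspicious_stories(sample))
```

Swapping the `io.StringIO` for `open("stories.csv", newline="")` (file name hypothetical) would run it over a real export.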
3883227
I... understood that and kept nodding along.
Jesus. My mind's permanently scarred.
3884032
Give it ten more years, you'll forget it just like I did. :)