Statistics · 3:21am Apr 20th, 2016

There are three kinds of lies: lies, damned lies, and statistics. I’m about to engage in the third one.

That is, I remembered about the Great FimFiction Dump and decided to finally do something interesting with it.

1. I.e. remove all punctuation, clean out stop words, lowercase everything, that sort of thing.
2. And gzipping them, because I think there’s about 20 GB of raw text here.
3. While debugging it, I did observe that Sombra is mentioned in about 5% of all stories, while Hitler is mentioned in 0.5%. Which is in itself rather notable.

To start with, I wrote a script to unwrap the stories from EPUB and normalize the text for further analysis,¹ squishing it into neat little 20-megabyte JSON packets together with accompanying metadata.² While it chews through them (which is going to take a while, but will considerably simplify any further activity), I’m wondering just what kind of useful things I can do with that sort of data. Various naive metrics are only moderately interesting.³ I could just feed it all into one clusterizer or another, but I doubt that simply grouping stories together and hoping something interesting emerges is a particularly promising idea.
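For illustration, the cleanup and packing step could look roughly like the sketch below. This is not the actual script: the stop-word list is a tiny placeholder, and the function names and the `id` / `meta` / `text` keys are assumptions made up for the example.

```python
import gzip
import json
import re

# Placeholder stop-word list (assumption); a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "that"}


def normalize(text):
    """Lowercase the text, strip punctuation, and drop stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def write_packet(path, stories):
    """Write stories (dicts with hypothetical 'id', 'meta', 'text' keys)
    as a single gzipped JSON file."""
    packet = [
        {"id": s["id"], "meta": s["meta"], "words": normalize(s["text"])}
        for s in stories
    ]
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        json.dump(packet, handle)


def read_packet(path):
    """Read a packet written by write_packet back into a list of dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```

Keeping the word order in `words` rather than storing counts leaves room for anything that cares about sequence later, at the cost of bigger packets.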

There is only one possibility I can currently identify which doesn’t seem like a complete waste of time: taking a set of stories a given reviewer judges good, and then using them to train a classifier to fish for similar texts among the stories which never received attention, i.e. never passed a certain views/upvotes threshold. But for this, I need a list of story IDs to train it on.
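As a very rough sketch of that idea, here is a similarity ranking rather than a properly trained classifier: it builds one bag-of-words centroid from the liked stories and ranks low-view stories by cosine similarity to it. The function names, the `id` / `views` / `words` keys, and the `max_views` cutoff are all assumptions for the example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count mappings."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_overlooked(liked, candidates, max_views):
    """Rank candidate stories below max_views by similarity to the liked set.

    liked: list of word lists for stories the reviewer judged good.
    candidates: list of dicts with hypothetical 'id', 'views', 'words' keys.
    Returns story ids, most similar first.
    """
    centroid = Counter()
    for words in liked:
        centroid.update(words)
    scored = [
        (cosine(centroid, Counter(c["words"])), c["id"])
        for c in candidates
        if c["views"] < max_views
    ]
    return [story_id for _, story_id in sorted(scored, reverse=True)]
```

Raw counts will be dominated by common words, so anything serious would at least want TF-IDF weighting and a set of negative examples to train against.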

Everything else that comes to mind, while perfectly possible, seems kind of pointless. It should be trivial to filter, for example, Displaced stories, but that wouldn’t be very useful. There’s probably some possibility in sentiment analysis, but it’s not clear what kind of sentiment would be interesting to investigate.

Any ideas for a particular property I could fish for? Preferably, with a long list of stories I can work from.

Oliver · 789 views · #discuss #statistics

Comments ( 12 )


Georg #1 · Apr 20th, 2016 · 1

3.9 gig (zipped) of Ponyfiction. No wonder I'm behind.

Oliver #2 · Apr 20th, 2016

3882812

At just ~130000 stories, there's much less of it than I expected, to be honest. :)

Bradel #3 · Apr 20th, 2016

I dunno, I think leaving stop words in might actually be more interesting. I'd be interested to know how readability metrics, or the proportion of a story's total words that are stop words, relate to its overall story rank, upvotes, and views. Anecdotally, I'm very likely to ignore stories like that. I know the site tends to have more generous tastes than I do, but I'd still expect a pattern.

There's also a general perception, I think, of "downvote tax" on certain story categories. I think this is a thing other people have looked at, but it's still interesting. If you can identify stories with similar word sets and frequencies, and see whether certain tags lead to different audience reception, that could be interesting (though also pretty hopelessly confounded by author effects, I expect, unless you can pick some word sets that'll let you compare over very large numbers of stories).

If you're really looking to have some fun, you could start modeling author effects for story views, or even AR processes based on the success of an author's previous stories.
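The stop-word proportion suggested above is cheap to compute from the raw text, before any cleanup throws the stop words away. A minimal sketch, assuming a placeholder stop-word list and a hypothetical function name:

```python
import re

# Placeholder stop-word list (assumption); swap in a fuller one for real use.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "on", "is", "it"}


def stop_word_ratio(text):
    """Return the fraction of words in text that are stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(1 for w in words if w in STOP_WORDS) / len(words)
```

The resulting per-story ratio could then be compared against upvotes or views from the index.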

Orbiting Kettle #4 · Apr 20th, 2016

If you want to train a classifier, you'll need a very extensive set of stories to begin with. I would suggest raiding the folders of review groups (Seattle angels, The Royal Guard, Equestria Daily) to get a large if imperfect starting sample. Do you want to establish the dimensions by hand, or will you let your software figure them out itself?

Oliver #5 · Apr 20th, 2016

3882933

I think leaving stop words in might actually be more interesting.

Good point. I'll try a different cleanup method next time; after all, the original epubs aren't going anywhere. I think I'll start with trying to fish something out of a simple bag of words, though...

There's also a general perception, I think, of "downvote tax" on certain story categories.

I think this one is pretty much hopeless, except when discounting actual story content entirely.

If you're really looking to have some fun, you could start modeling author effects for story views, or even AR processes based on the success of an author's previous stories.

I don't think I've got the right data for this sort of thing. This would need follower counts and follower count change statistics, which would require me to scrape Fimfiction myself. I'm not exactly up to it. :)

3882950

Do you want to establish the dimensions by hand or do you let your software figure them out itself?

Establishing things by hand sounds way too much like work! :)

Bradel #6 · Apr 20th, 2016

3883208

If you're really looking to have some fun, you could start modeling author effects for story views, or even AR processes based on the success of an author's previous stories.
I don't think I've got the right data for this sort of thing. This would need follower counts and follower count change statistics, which would require me to scrape Fimfiction myself. I'm not exactly up to it. :)

Nah, not really. Don't think so, anyway. I mean, if you wanted to model them for specific authors, you might well need that—but I'm talking about looking at how much of the variability in story votes can be explained by authors in general, or by previous stories by the same author. For that, I think you just need to know the authorship on the stories (and the date of publication, if you're looking for an AR setup, though this is going to get screwed up by multi-chapter stories published over a long period of time). In its simplest form, we're just talking about something like:

Y_ij = A_i + V_ij
A_i ~ Normal (mu_A, tau_A)
V_ij ~ Normal (mu_V, tau_V)
All A_i's and V_ij's mutually independent

Where Y_ij is, say, the upvote count on Story j by Author i, A_i is the portion of upvotes attributable to the author, and V_ij is the number of upvotes attributable to the story itself. This is probably a pretty bad model (it's not bothering to model A or V beyond a rudimentary probability statement, it's ignoring a lot of important information like genres and word counts, and it's assuming a Normal distribution on things that are pretty clearly skewed), but it's a starting place. Yes, having follower counts would let you build a more sophisticated model for the A_i's, but you could probably learn something just leaving it purely random as well.
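One simple way to get numbers out of a model like this is the classical one-way random-effects ANOVA estimator, sketched below. It is a method-of-moments stand-in, not a full fit of the model as written: it estimates only the two variances (treating the tau's as variances is an assumption), and the function name and the `(author_id, upvotes)` input format are made up for the example.

```python
from collections import defaultdict


def variance_components(rows):
    """Estimate between-author and within-author variance.

    rows: iterable of (author_id, upvotes) pairs (hypothetical input format).
    Returns (var_author, var_story) using the unbalanced one-way
    random-effects ANOVA estimator.
    """
    groups = defaultdict(list)
    for author, y in rows:
        groups[author].append(y)

    k = len(groups)
    n_total = sum(len(ys) for ys in groups.values())
    if k < 2 or n_total <= k:
        raise ValueError("need 2+ authors and at least one author with 2+ stories")

    grand_mean = sum(sum(ys) for ys in groups.values()) / n_total
    ss_between = 0.0
    ss_within = 0.0
    for ys in groups.values():
        mean = sum(ys) / len(ys)
        ss_between += len(ys) * (mean - grand_mean) ** 2
        ss_within += sum((y - mean) ** 2 for y in ys)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    # Effective group size for unbalanced data.
    n0 = (n_total - sum(len(ys) ** 2 for ys in groups.values()) / n_total) / (k - 1)
    # Clamp at zero: the moment estimator can go negative on noisy data.
    var_author = max(0.0, (ms_between - ms_within) / n0)
    return var_author, ms_within
```

The share of variability attributable to authors is then `var_author / (var_author + var_story)`. Given how skewed vote counts are, running this on something like log(1 + upvotes) is probably more sensible than on raw counts.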

Oliver #7 · Apr 20th, 2016

3883227

but I'm talking about looking at how much of the variability in story votes can be explained by authors in general, or by previous stories by the same author.

Ah. Yeah, I need to remember (and/or learn) more math than I thought I could get away with. :) On the other hand, this wouldn't involve chewing text, just the index file would be enough to compute that...

Bradel #8 · Apr 20th, 2016

3883239

On the other hand, this wouldn't involve chewing text, just the index file would be enough to compute that...

True, dat.

Orbiting Kettle #9 · Apr 20th, 2016

3883208
Defining dimensions is fun, but I see that that might be a way of introducing some sort of bias.

On the other hand the algorithms to work on simple sets of coordinates would be simple and you can probably use some well established image recognition methods.

Oliver #10 · Apr 20th, 2016

3883242

Hahahaha!

So I prepared a CSV for ANOVA (gee, I still remember what the acronym means, who'd have thought; I might even remember some R syntax!) by filtering out the interesting values and computing reasonable derived values: (upvotes - downvotes) / views, for one.

Then I sorted the CSV by the above derived value and was surprised to find that for several stories it was above 1. The one with the highest value was the little gem with story id 289579, which happens to have mature/sex tags (it looks like an adventure story otherwise), so I think I'm not allowed to link to it.

At the time of writing, it has 44 views, but 51 upvotes and 8 downvotes. The same oddity is also observed on https://www.fimfiction.net/story/56407 and 167757, which has apparently been deleted since it got into the archive. Looks like blind upvoting and downvoting without ever opening the story is so much a thing, it can produce ultra-skewed numbers like that.
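For reference, the derived value and the sort amount to something like the sketch below. The column names `id`, `views`, `upvotes` and `downvotes` are assumptions about the CSV layout, not the real index format, and stories with zero views are simply skipped.

```python
import csv
import io


def net_vote_ratio(upvotes, downvotes, views):
    """Return (upvotes - downvotes) / views, or None when views is zero."""
    if views == 0:
        return None
    return (upvotes - downvotes) / views


def rank_by_ratio(csv_text):
    """Parse CSV text and return (ratio, story_id) pairs, highest ratio first."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        ratio = net_vote_ratio(
            int(row["upvotes"]), int(row["downvotes"]), int(row["views"])
        )
        if ratio is not None:
            rows.append((ratio, row["id"]))
    return sorted(rows, reverse=True)
```

Any story that comes out above 1 has more net votes than recorded views.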

GhostOfHeraclitus #11 · Apr 20th, 2016

3883227
I... understood that and kept nodding along.

Jesus. My mind's permanently scarred.

Oliver #12 · Apr 20th, 2016 · 1

3884032

Give it ten more years, you'll forget it just like I did. :)
