
Oliver


Let R = { x | x ∉ x }, then R ∈ R ⟺ R ∉ R... or is it?


Statistics · 3:21am Apr 20th, 2016

There are three kinds of lies: lies, damned lies, and statistics. I’m about to engage in the third one.

That is, I remembered about the Great FimFiction Dump and decided to finally do something interesting with it.

To start with, I wrote a script to unwrap the stories from EPUB and normalize the text for further analysis,[1] squishing it into neat little 20-megabyte JSON packets together with accompanying metadata.[2] It is still chewing through them, which is going to take a while but will considerably simplify any further activity. In the meantime, I’m wondering just what kind of useful things I can do given that sort of data. Various naive metrics are only moderately interesting.[3] While I could just feed it into one clusterizer or another, I doubt simply grouping stories together and hoping something interesting emerges is a particularly promising idea.

[1] I.e. remove all punctuation, clean out stop words, lowercase everything, that sort of thing.
[2] And gzipping them, because I think there’s about 20 GB of raw text here.
[3] While debugging it, I did observe that Sombra is mentioned in about 5% of all stories, while Hitler is mentioned in 0.5%. Which is in itself rather notable.
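
A minimal sketch of what such a cleanup-and-pack step might look like, not the actual script. The stop-word list, the record layout (`meta`, `tokens`) and the function names are all assumptions made for illustration; the EPUB unwrapping itself is left out.

```python
import gzip
import json
import re

# Tiny illustrative stop-word list (an assumption); a real run would use a longer one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "was"}

def normalize(text):
    """Lowercase the text, strip punctuation, and drop stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def write_packet(path, stories):
    """Write an iterable of (metadata, text) pairs as one gzipped JSON file."""
    records = [{"meta": meta, "tokens": normalize(text)} for meta, text in stories]
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        json.dump(records, fh)

def read_packet(path):
    """Load a packet written by write_packet back into a list of records."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)
```

Keeping the token lists rather than a re-joined string is a design choice of this sketch: it makes a later bag-of-words count a one-liner, at the cost of somewhat larger packets.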

There is only one possibility I can currently identify which doesn’t seem like a complete waste of time: Taking a set of stories a given reviewer judges good, and then using them to train a classifier to fish for similar texts among the stories which never received attention, i.e. never passed a certain views/upvotes threshold. But for this, I need a list of story IDs to train it on.
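
One way this idea could be prototyped before committing to a real classifier is sketched below. It swaps in a much simpler technique, ranking candidates by cosine similarity to the summed bag-of-words counts of the liked stories, so it is a stand-in and not a trained classifier. The function names, the input shapes and the `max_views` threshold are assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_overlooked(liked, candidates, max_views):
    """Rank low-view stories by similarity to the stories a reviewer liked.

    liked: list of token lists.
    candidates: dict mapping story_id -> (views, token list).
    Only candidates with fewer than max_views views are considered.
    """
    centroid = Counter()
    for tokens in liked:
        centroid.update(tokens)
    scored = [
        (cosine(centroid, Counter(tokens)), story_id)
        for story_id, (views, tokens) in candidates.items()
        if views < max_views
    ]
    return [story_id for _, story_id in sorted(scored, reverse=True)]
```

Raw counts favour long stories and common words; some form of TF-IDF weighting would be the obvious next refinement.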

Everything else that comes to mind, while perfectly possible, seems kind of pointless. It should be trivial to filter, for example, Displaced stories, but that wouldn’t be very useful. There’s probably some possibility in sentiment analysis, but it’s not clear what kind of sentiment would be interesting to investigate.

Any ideas for a particular property I could fish for? Preferably, with a long list of stories I can work from.

Oliver · 789 views · #discuss #statistics
Comments ( 12 )

3.9 gig (zipped) of Ponyfiction. No wonder I'm behind.

3882812

At just ~130000 stories, there's much less of it than I expected, to be honest. :)

I dunno, I think leaving stop words in might actually be more interesting. I'd be interested to know how readability metrics, or the proportion of a story's total words that are stop words, relate to its overall story rank, upvotes, and views. Anecdotally, I'm very likely to ignore stories with poor readability or a lot of filler. I know the site tends to have more generous tastes than I do, but I'd still expect a pattern.
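
The stop-word proportion mentioned here is cheap to compute if the raw text is kept. A minimal sketch, assuming a small hand-picked stop-word list and a hypothetical function name:

```python
import re

# Illustrative stop-word list (an assumption); substitute a fuller one in practice.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "was"}

def stop_word_ratio(text):
    """Fraction of the words in `text` that are stop words (0.0 for empty text)."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(w in STOP_WORDS for w in words) / len(words)
```

The resulting ratio per story could then be compared against upvotes or views from the index.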

There's also a general perception, I think, of "downvote tax" on certain story categories. I think this is a thing other people have looked at, but it's still interesting. If you can identify stories with similar word sets and frequencies, and see whether certain tags lead to different audience reception, that could be interesting (though also pretty hopelessly confounded by author effects, I expect, unless you can pick some word sets that'll let you compare over very large numbers of stories).

If you're really looking to have some fun, you could start modeling author effects for story views, or even AR processes based on the success of an author's previous stories.

If you want to train a classifier, you'll need a very large set of stories to begin with. I would suggest raiding the folders of review groups (Seattle angels, The Royal Guard, Equestria Daily) to get a large if imperfect starting sample. Do you want to establish the dimensions by hand or do you let your software figure them out itself?

3882933

I think leaving stop words in might actually be more interesting.

Good point. I'll try a different cleanup method next time; after all, the original EPUBs aren't going anywhere. I think I'll start with trying to fish something out of a simple bag of words, though...

There's also a general perception, I think, of "downvote tax" on certain story categories.

I think this one is pretty much hopeless, except when discounting actual story content entirely.

If you're really looking to have some fun, you could start modeling author effects for story views, or even AR processes based on the success of an author's previous stories.

I don't think I've got the right data for this sort of thing. This would need follower counts and follower count change statistics, which would require me to scrape Fimfiction myself. I'm not exactly up to it. :)

3882950

Do you want to establish the dimensions by hand or do you let your software figure them out itself?

Establishing things by hand sounds way too much like work! :)

3883208

If you're really looking to have some fun, you could start modeling author effects for story views, or even AR processes based on the success of an author's previous stories.

I don't think I've got the right data for this sort of thing. This would need follower counts and follower count change statistics, which would require me to scrape Fimfiction myself. I'm not exactly up to it. :)

Nah, not really. Don't think so, anyway. I mean, if you wanted to model them for specific authors, you might well need that—but I'm talking about looking at how much of the variability in story votes can be explained by authors in general, or by previous stories by the same author. For that, I think you just need to know the authorship on the stories (and the date of publication, if you're looking for an AR setup, though this is going to get screwed up by multi-chapter stories published over a long period of time). In its simplest form, we're just talking about something like:

Y_ij = A_i + V_ij
A_i ~ Normal (mu_A, tau_A)
V_ij ~ Normal (mu_V, tau_V)
All A_i's and V_ij's mutually independent

Where Y_ij is, say, the upvote count on Story j by Author i, A_i is the portion of upvotes attributable to the author, and V_ij is the number of upvotes attributable to the story itself. This is probably a pretty bad model (it's not bothering to model A or V beyond a rudimentary probability statement, it's ignoring a lot of important information like genres and word counts, and it's assuming a Normal distribution on things that are pretty clearly skewed), but it's a starting place. Yes, having follower counts would let you build a more sophisticated model for the A_i's, but you could probably learn something just leaving it purely random as well.
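
For the model above, the share of vote variance attributable to authors can be estimated with the classical one-way ANOVA method-of-moments estimator. The sketch below assumes tau_A and tau_V are variances and that the input is a list of (author, upvotes) pairs; the function name and input shape are hypothetical.

```python
from collections import defaultdict

def author_variance_share(rows):
    """Estimate (var_author, var_story, author_share) from (author_id, upvotes) rows.

    Uses the unbalanced one-way random-effects ANOVA estimator; a negative
    author-variance estimate is truncated to zero.
    """
    groups = defaultdict(list)
    for author, y in rows:
        groups[author].append(y)
    k = len(groups)
    n_total = sum(len(g) for g in groups.values())
    if k < 2 or n_total <= k:
        raise ValueError("need at least two authors and some author with 2+ stories")

    grand_mean = sum(sum(g) for g in groups.values()) / n_total
    means = {a: sum(g) / len(g) for a, g in groups.items()}
    ss_between = sum(len(g) * (means[a] - grand_mean) ** 2 for a, g in groups.items())
    ss_within = sum((y - means[a]) ** 2 for a, g in groups.items() for y in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    # Effective group size for unbalanced data.
    n0 = (n_total - sum(len(g) ** 2 for g in groups.values()) / n_total) / (k - 1)

    var_author = max(0.0, (ms_between - ms_within) / n0)
    total = var_author + ms_within
    share = var_author / total if total else 0.0
    return var_author, ms_within, share
```

Since upvote counts are heavily skewed, applying this to something like log(1 + upvotes) may be more sensible than using raw counts.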

3883227

but I'm talking about looking at how much of the variability in story votes can be explained by authors in general, or by previous stories by the same author.

Ah. Yeah, I need to remember (and/or learn) more math than I thought I could get away with. :) On the other hand, this wouldn't involve chewing text, just the index file would be enough to compute that...

3883239

On the other hand, this wouldn't involve chewing text, just the index file would be enough to compute that...

True, dat.

3883208
Defining dimensions is fun, but I see that that might be a way of introducing some sort of bias.

On the other hand, the algorithms to work on simple sets of coordinates would be simple, and you can probably use some well-established image recognition methods.

3883242

Hahahaha!

So I prepared a CSV for ANOVA (Gee, I still remember what the acronym means, who'd have thought. I might even remember some R syntax!) by filtering out the interesting values and computing reasonable derived values: (upvotes - downvotes) / views, for one.

Then I sorted the csv by the above derived value and was surprised to find that for several stories it was above 1. The one with the highest value was the little gem with story id 289579, which happens to have mature/sex tags, (looks like an adventure story otherwise) so I think I'm not allowed to link to it.

At the time of writing, it has 44 views, but 51 upvotes and 8 downvotes. The same oddity is also observed on https://www.fimfiction.net/story/56407 and 167757, which has apparently been deleted since it got into the archive. Looks like blind upvoting and downvoting without ever opening the story is so much a thing, it can produce ultra-skewed numbers like that.
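
A minimal sketch of that derived value and the sort, for anyone who wants to repeat the check. The column names (`id`, `views`, `upvotes`, `downvotes`) and the sample rows are made up for illustration and are not taken from the archive.

```python
import csv
import io

def vote_ratios(csv_file):
    """Return (story_id, ratio) pairs sorted by ratio, highest first.

    ratio = (upvotes - downvotes) / views; stories with zero views are skipped.
    """
    out = []
    for row in csv.DictReader(csv_file):
        views = int(row["views"])
        if views == 0:
            continue
        ratio = (int(row["upvotes"]) - int(row["downvotes"])) / views
        out.append((row["id"], ratio))
    return sorted(out, key=lambda pair: pair[1], reverse=True)

# Made-up sample data standing in for the real index.
sample = io.StringIO("id,views,upvotes,downvotes\nA,100,30,5\nB,10,14,1\nC,0,2,0\n")
ranked = vote_ratios(sample)
# A ratio above 1 means more net votes than recorded views.
suspicious = [story_id for story_id, ratio in ranked if ratio > 1]
```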

3883227
I... understood that and kept nodding along.

Jesus. My mind's permanently scarred.

3884032

Give it ten more years, you'll forget it just like I did. :)
