Bad Horse


Beneath the microscope, you contain galaxies.

What tags correlate with popularity on Fimfiction · 3:44am Jul 3rd, 2015

Ever wondered which story tags are most likely to lead to a high view count? You can find out, sort of, by doing a linear regression. You take a big set of stories, and you want to come up with a linear function like

views = a1 * Adventure + a2 * Comedy + a3 * Romance + a4 * WasOnEQD + a5 * Words + a6 * AuthorFollowers + a7 * Age

that predicts views with the smallest error. A statistician might tell you this is the wrong thing to do, because some of these variables on the right, like the tags Adventure, Comedy, and Romance, are Booleans (they're either zero or one), and blah blah blah. I forget the details. I did it anyway.

I changed one thing: I did the regression on log(views) instead of on views. That means in the above equation, the left-hand side would instead say

log(views) = ...

log(x) is the number you have to raise e (about 2.718) to the power of to get x. The number 2.718 doesn't matter a lot for my purposes, so don't obsess over it (though it is a very interesting number if you're into that sort of thing). I made that change for a practical reason: it didn't work if I didn't. I think I posted this graph a long time ago:

On the horizontal axis we have authors, from least-viewed on the left to most-viewed on the right. The height of the curve at point X (the Xth least-popular author) shows the cumulative total number of views for all authors from least popular to X.

At the time I made this graph (May 2012), my dataset had a total of about 10 million page views, which I think is less than 1% of what fimfiction has now. But the pattern is probably the same: The line crosses the halfway point (5 million views) way, way over to the right. It shows that half of the views went to about the top 5% of the authors.
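The curve described above can be sketched in a few lines. This is only an illustration: the function name and the example numbers are made up, not taken from the 2012 dataset. It sorts authors by views, accumulates the totals, and reports what fraction of authors (counted from the most-viewed down) is needed to reach half of all views.

```python
import numpy as np

def top_author_share_for_half_of_views(views_per_author):
    """Fraction of authors, taken from the most-viewed down, needed to reach half of all views."""
    v = np.sort(np.asarray(views_per_author, dtype=float))[::-1]  # most-viewed first
    cumulative = np.cumsum(v)                                     # running total of views
    half = cumulative[-1] / 2.0
    # smallest number of top authors whose combined views reach the halfway point
    authors_needed = int(np.searchsorted(cumulative, half)) + 1
    return authors_needed / len(v)

# Invented example: 95 authors with 10 views each, 5 authors with 190 views each
example = [10] * 95 + [190] * 5
print(top_author_share_for_half_of_views(example))
```

In the invented example the five big authors hold exactly half of the 1,900 views, which is the same shape of result as the "top 5%" figure read off the graph.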

If you do a linear regression on views, and most of the views are of a few very popular stories, they dominate the results, so much so that all those other stories might as well not be there. Yet there aren't enough of those super-popular stories to give you good data. You end up saying that My Little Dashie's tags are popular.

So I used log(views) instead. Pretend I used a 10 where there's a 2.718 up above. Using log (base 10) of views on the left would mean that a story with 10,000 views would have a weight of 4 in the regression, and a story with 1,000 views would have a weight of 3, and a story with 100 views, a weight of 2. It just works better that way.
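For readers who want to try this, here is a minimal sketch of the kind of fit described above, assuming ordinary least squares via NumPy. The function name, the two-tag setup, and the data are invented for illustration; this is not the script that produced the results in this post.

```python
import numpy as np

def fit_log_views(features, views):
    """Least-squares fit of ln(views) = intercept + features @ coefficients."""
    features = np.asarray(features, dtype=float)
    X = np.column_stack([np.ones(len(features)), features])  # prepend an intercept column
    y = np.log(np.asarray(views, dtype=float))               # regress on ln(views), not views
    solution, *_ = np.linalg.lstsq(X, y, rcond=None)
    return solution[0], solution[1:]

# Invented data: the columns are the Adventure and Comedy tags (0 or 1)
tags = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [1, 0], [0, 1]])
true_intercept, true_coefs = 3.0, np.array([-0.25, 0.20])
views = np.exp(true_intercept + tags @ true_coefs)           # noise-free, so the fit is exact

intercept, coefs = fit_log_views(tags, views)
print(intercept, coefs)
```

A story with zero views has no logarithm, so real data would need such stories dropped or handled some other way before a fit like this.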

I wrote everything below back in 2012 or 2013 when I'd just started blogging, then forgot to post it. Then I forgot I'd forgotten to post it. Then I remembered, but thought maybe I would run the regression again with updated data. Then knighty changed the site and my scripts stopped working. Then he changed it again. Then my hard drive died and I nearly lost all the data I had. So I guess I should just post this. This data is on stories published on fimfiction that were still present in, I think, September 2012.


Here are the results of a multiple linear regression of log(views) vs. many variables, including the tags that had the biggest effects, for 6,232,055 views of 10,774 stories:

The regression coefficients for the variables affecting log(number of story views) are listed further down. They mean that we can estimate the number of views a story will get as

ln(views) = 3.10 -0.26*Ad + 0.19*Co + 0.24*EqD + ... etc.,

where EqD is one less than the number of stars that the story had on Equestria Daily, Ad is 1 if the story has the Adventure tag, Co is 1 if it has the Comedy tag, etc.

Regressing on the natural logarithm of views means that if your story got 5 stars on EqD, ln(views) will be (5-1)*0.24 higher, so views will be e^(4 * .24) = 2.6 times as high. This sounds low, but you have to remember that this is a multiple regression. It's comparing a story on EqD to a story not on EqD, all else being equal, where all else includes the number of stories the author has written, and the ratings and number of views per story on that author's other stories. Authors with stories on EqD typically have written many stories, and gotten good ratings, and get a lot of views on their non-EqD stories. EqD has a much bigger effect on an author's first EqD story, and that effect is carried over to their later stories, which ironically shows up in the regression as EqD having a smaller effect. (For a time-series analysis, you have to pay extra.)
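The arithmetic above is just exponentiating a coefficient times the variable's value. A small sketch, where view_multiplier is a hypothetical helper name, not part of any script used for this post:

```python
import math

def view_multiplier(coefficient, value=1.0):
    """Factor by which predicted views change when a variable takes the given value."""
    return math.exp(coefficient * value)

# 5 stars on EqD means EqD = 5 - 1 = 4 in the regression
eqd_five_stars = view_multiplier(0.24, 5 - 1)
print(round(eqd_five_stars, 1))
```

For a 0-or-1 tag the value is simply 1, so the multiplier is e raised to the tag's coefficient.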

Ad=>-0.261333185004963
Co=>0.192385004256984
Hu=>0.464993386845682
Ro=>0.306165019163999
Tr=>-0.116202254218391
EqD=>0.241215841712334
authorThumbsPerView=>1.05017494110653
age (number of stories released since)=>0.000345047336636074
chapters=>0.0335395898381275
celestia=>0.0998417122472206
cmc=>0.124656406316725
daring_do=>0.243264312144427
dinky=>0.374027835899342
discord=>-0.0435060974451179
twilight_sparkle=>0.211488196048721
main_6=>0.182411425683621
oc=>-0.477984288042451
queen-chrysalis=>0.802708589903285
completed=>0.215298443229625
one-shot=>-0.0856642429966141
number of stories by author=>0.0335689539913519

Some of the tags and characters are missing. Tough. They're missing because they didn't have a strong, consistent effect, and having too many independent variables (the variables on the right-hand side of the equation above) just makes the results all mushy. So I took them out. Don't ask for them; they don't exist in this regression. Any tag that's missing is a weak predictor of views.

You can compare things of the same type easily, like tags. The most popular tag is--surprise! Human! This may be because the #1 and #2 stories, My Little Dashie and Brony Hero of Equestria, both had the human tag, and had 3.6% of all the views on fimfiction between them. Or it may be because, guess what, a lot of bronies love Human in Equestria. Second-most-popular are Romance and Comedy. Least-popular are Adventure and Tragedy. Chrysalis was insanely popular shortly after her episode. OCs are hated.

(For the simple interpretation of the numbers for the tags, see my next blog post, Story tag results simplified.)

Comparing things from different types is hard, because the units are different. Age, for instance, is the number of stories newer than this one, so 0.00034 looks small, but it has a big impact, because some stories are "10,000 stories old". An authorThumbsPerView coefficient of 1.05 sounds large, but I normalized that variable to have an average value of zero, and a really good value is 0.05, so it (sadly) isn't a big factor at all. (We can, in fact, see from this data that for a story to be old is about 100 times as important as being good. Call it the My Little Dashie effect.)

To help you compare things of different types, this table gives the average impact per story of each factor on ln(views). (Like, if a story is "4000 stories old", the effect of that age on its predicted ln(views) is 0.000345 * 4000 = 1.38.) It's the same for variables that always take the values 0 or 1, so I didn't list them again.

EqD=>0.0615560855273941 (This is the average contribution per story when including stories that weren't on EqD.)
authorThumbsPerView=>0.0272467473415124
age=>2.42574351227837
chapters=>0.127522327980653
stories=>0.0391090680325234
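As a check on the numbers, here is a short sketch that redoes the "4000 stories old" example and compares the average impact of age with that of authorThumbsPerView. The values are copied from this post; the variable and function names are only for illustration.

```python
# Average impact per story on ln(views), copied from the table in the post
average_impact = {
    "EqD": 0.0615560855273941,
    "authorThumbsPerView": 0.0272467473415124,
    "age": 2.42574351227837,
    "chapters": 0.127522327980653,
    "stories": 0.0391090680325234,
}

def contribution(coefficient, value):
    """Effect of one variable on predicted ln(views): coefficient times value."""
    return coefficient * value

# A story that is "4000 stories old", using the age coefficient from the regression
age_example = contribution(0.000345047336636074, 4000)

# How much more age matters, on average, than the author's thumbs-per-view
age_vs_quality = average_impact["age"] / average_impact["authorThumbsPerView"]
print(round(age_example, 2), round(age_vs_quality))
```

The ratio comes out a little under 100, which is where the "about 100 times as important" remark comes from.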

Theta values available on request. :twilightsmile:

Comments ( 27 )

3201939 That might just be crazy enough to work. :pinkiecrazy:

3201939

Not if the market is efficient. This information is public....

I don't want to write basically anything except comedic adventures. And the occasional ridiculous psychedelic bullshit piece. If only I'd written them two years ago...

3202067 Just because it's public information doesn't mean the vast majority of the public will read it. (Their loss) Also, those of us who do read it, may not alter our reading preferences, so the total variation of the numbers due to observation will be negligible.

I wonder what the breakdown is if the Mature fics are removed from the equation. I presume most clopfics are tagged Romance. (I keep the flag on to save my sanity) I find it fascinating that 3.6% of the total story reads on Fimfiction are two stories, and wonder just where people got 'turned onto' the stories from. (Ok, I admit it. My ego wants to bend some of that stream in my direction. Sue me.)

Hm. Past Sins has more gross first chapter views than Brony Hero of Equestria, and BHOE has very, very short chapters, so if you measure words read instead of chapters...

Your model is valid and well-studied, and it happens to be the same as one used for studies ranging from political science to biophysics.

A statistician might tell you this is the wrong thing to do, because some of these variables on the right, like the tags Adventure, Comedy, and Romance, are Booleans (they're either zero or one), and blah blah blah. I forget the details. I did it anyway.

It's the non-boolean variables that (I assume, since I'm not one and don't know any) statisticians would question, since it's not clear that EqD rating, number of stories by author, author followers, etc are log-linearly correlated with views.

Any tag that's missing is a weak predictor of views.

How did you find that out? I assume you did a (log) linear regression and took out the variables with the smallest coefficients, though that could be dangerous. There's an easy way to tell if it's problematic, at least for the boolean variables. Try running a multinomial logistic regression from your boolean variables to the same boolean variables and remove all of the coefficients that correlate a variable with itself. If all the remaining coefficients are small, you're in the clear. Otherwise, your variables are strongly correlated, and you'd have to use a slightly more complicated model to adjust for that. Your current model would still work for most fics, but it would be less likely to be accurate for anomalous ones.

3201939
Don't forget, you also have to write it three years in the past :moustache:

Completed is more effective than Twi at getting popularity, and practically as useful as EqD. :twilightoops:

I'm not terribly surprised. Write stuff, and finish stuff :raritywink:

3202259
Too bad people hardly comment on finished stories :raritycry:

Fascinating. My only criticism is that using the rating on another site (eqd) as a variable is kind of cheating, since that implicitly selects for quality and for popularity-boosting features.

Hu=>0.464993386845682
oc=>-0.477984288042451

If most humans are OCs, does this mean the collective effect of the HiE obligatory tags is more mediocre than first glance suggests? The multipliers nearly cancel each other out. Of course, it then begs the question of how Hu got its positive correlation in the first place. Not quite sure what to make of this.

3202223 I don't see the connection with the Ising model. While you're looking, note the strong similarity between Ising models and Hopfield neural networks.

How did you find that out? I assume you did a (log) linear regression and took out the variables with the smallest coefficients, though that could be dangerous.

That, and I ran different regressions with different subsets of data, and took out the variables whose coefficients weren't stable.
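A rough sketch of that kind of stability check, assuming random subsets and ordinary least squares; the function name, subset scheme, and data are invented here and are not the original scripts. It refits the regression on several random subsets and reports how much each coefficient moves around.

```python
import numpy as np

def coefficient_spread(X, y, n_subsets=20, fraction=0.5, seed=0):
    """Standard deviation of each fitted coefficient across random subsets of the rows."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_rows = X.shape[0]
    subset_size = max(X.shape[1], int(fraction * n_rows))
    fits = []
    for _ in range(n_subsets):
        rows = rng.choice(n_rows, size=subset_size, replace=False)  # a random subset of stories
        coefs, *_ = np.linalg.lstsq(X[rows], y[rows], rcond=None)
        fits.append(coefs)
    return np.std(fits, axis=0)

# Invented data: intercept column plus two 0/1 tags, with noisy ln(views)
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(400), rng.integers(0, 2, 400), rng.integers(0, 2, 400)])
y = X @ np.array([3.0, 0.2, -0.3]) + rng.normal(0.0, 0.5, 400)
print(coefficient_spread(X, y))
```

A variable whose spread is large compared with its coefficient would be a candidate for removal; where to put that cutoff is a judgment call.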

There's an easy way to tell if it's problematic, at least for the boolean variables. Try running a multinomial logistic regression from your boolean variables to the same boolean variables and remove all of the coefficients that correlate a variable with itself.

What? Like

f(x has tag Ad) = b1*Ad + b2*Co + b3*Ro

then remove b1 * Ad and get

f(Ad) = b2 * Co + b3 * Ro

I didn't do that. Good idea, though.
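Here is a sketch of what that check might look like, with two assumptions worth stating loudly: it fits one binary logistic regression per tag (rather than a single multinomial model), and it uses plain gradient descent so that it needs nothing beyond NumPy. All names and data are invented.

```python
import numpy as np

def logistic_fit(X, y, steps=2000, learning_rate=0.5):
    """Plain gradient-descent logistic regression; returns (intercept, coefficients)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(A.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-A @ w))              # predicted probability of the tag
        w -= learning_rate * A.T @ (p - y) / len(y)   # gradient of the mean log-loss
    return w[0], w[1:]

def cross_tag_coefficients(tags, names):
    """For each tag, fit it from all the other tags and report those coefficients."""
    tags = np.asarray(tags, dtype=float)
    result = {}
    for i, name in enumerate(names):
        others = [k for k in range(len(names)) if k != i]  # drop the tag's own column
        _, coefs = logistic_fit(tags[:, others], tags[:, i])
        result[name] = {names[k]: float(c) for k, c in zip(others, coefs)}
    return result

# Invented data: Co is unrelated to the other tags; Ro agrees with Ad 80% of the time
rows = []
for ad in (0, 1):
    for co in (0, 1):
        rows += [(ad, co, ad)] * 8 + [(ad, co, 1 - ad)] * 2
result = cross_tag_coefficients(rows, ["Ad", "Co", "Ro"])
print(result)
```

Small off-diagonal coefficients suggest the tags are close to independent; a large one, like Ad predicting Ro in this made-up data, flags a pair whose separate effects the regression on views may not be able to tell apart.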

3202547 That's a good observation. Suppose that every Human story has an OC tag.

- If no other stories had a Human tag or an OC tag, the tags would cancel each other out. Those coefficients would also be unstable: they'd be different each time I ran the program with slightly different data, as long as they summed to -.013. So that didn't happen.

- If lots of other stories have a Human tag or an OC tag, then those values indicate the average effect on stories with one or both of the tags. It would mean that bronies are neutral about Humans, but don't count them as hated OCs.
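The first case can be illustrated with a tiny made-up example: when two tags always appear together, any pair of coefficients with the same sum gives identical predictions, so the regression has no way to choose between them. The data and the second coefficient pair below are invented; only 0.465 and -0.478 come from the post.

```python
import numpy as np

# Hypothetical data in which Human and OC always appear together
human = np.array([0, 1, 0, 1, 1, 0])
oc = human.copy()
X = np.column_stack([np.ones(6), human, oc])   # intercept, Human, OC

# Two different coefficient pairs that both sum to -0.013
pred_a = X @ np.array([3.0, 0.465, -0.478])
pred_b = X @ np.array([3.0, 1.000, -1.013])

same_predictions = bool(np.allclose(pred_a, pred_b))
rank = int(np.linalg.matrix_rank(X))
print(same_predictions, rank)
```

The design matrix has three columns but only rank 2, which is the linear-algebra way of saying that only the sum of the two coefficients is pinned down.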

Searching on fimfiction by tag:

Human: 14,351
OC: 29,173
Human, OC: 7,163
Human without OC: 7,188
OC without Human: 22,010

Most humans on fimfiction now aren't OCs. They're Equestria Girls, humanized ponies, or humans in crossovers. Spiderman isn't an OC, remember.

3202468 It's factoring that out. If I didn't include presence on EQD, the EQD stories would have inexplicably higher viewcounts, and that would get attributed to whatever tags EQD favors (e.g., it would make gore, sex, and romance score lower).

I don't know why gore and mature aren't up there. Maybe I didn't check them.

3202547 3202575
This is a rule from years ago, so I don't know if it's still enforced, but at one point it was stated that Humans don't require an OC tag because the "Human" tag implies the existence of an OC. This rule dated from before EqG, but I never saw any kind of update to it. (I believe that was actually in the site blog that gave us Dark Demon King Ravenblood Nightblade. Maybe that's why I still remember it. [Edit: It wasn't that blog, that blog talks about a similar weird thing with crossover. But I could've sworn I read it for human around the same time.])

3202583 I didn't even think of that. So I guess the idea is that because you included it in the equation, we're better able to ignore it when considering other tags.

I learned how much tags can affect a story's views when I realized how I myself searched the front page, and how elements like tags, title, picture and synopses affected my interest in stories. In essence, I judge stories by their cover (sad but true), and I figure most readers on fimfic do as well. After all, you simply don't have all the time in the world to see if a story with your favorite character and one with a pony you care little about are both worth the read. You're more likely to pick the story with your favorite character, unless something else influences you.

The data is very interesting, but I wonder if it's really possible to separate tags from the other variables (because they can influence each other), and thus get reliable data on them. For instance, I'm sure that we've all encountered at least one fic which had no tags that interested us, but its title or picture or synopses or word of mouth did. I think to get a reliable data set, you'd have to collect stories which were equal in all other departments so that you could be sure a difference in views was caused by the tags and not something else--but this seems to me impossible. How do you determine if two stories are equal in the quality of their titles?

3202701 I don't search the front page anymore. I only find stories thru recommendations or by wanting to read stories by a particular person. I'm more likely to read a story because its author made an interesting comment on something unrelated than because it's in the featured box, unless I'm specifically researching stories in the featured box, or it's an author I already know.

The data is very interesting, but I wonder if it's really possible to separate tags from the other variables (because they can influence each other), and thus get reliable data on them.

The results of a linear regression are a statement of fact. For this dataset, the Humor tag has a coefficient of 0.465, meaning stories with the Hu tag have on average a view count that is e^.465 = 1.6 times as great as "similar" stories without the Hu tag. The equation defines what is meant by "similar" stories. The other coefficients from the regression, not presented, give more statistics describing the reliability of this number using measurements that are meaningful for standard statistics distributions, which the underlying data may or may not adhere to.

How you interpret this statement of fact is up to you. I'm comfortable saying people are 1.6 times as likely to click on a comedy as a randomly-chosen non-comedy.

Fascinating. I'd really have thought short one-shots would win out over chaptered stories. I haven't seen any quality difference between the two, and the former are easier to digest (my current "best" story by thumbs and thumb-ratio is a short one).

3202552

While you're looking, note the strong similarity between Ising models and Hopfield neural networks.

An Ising Model is a probabilistic Hopfield Network. Given a population size, you can say that the number of views a story gets is directly proportional to the probability that a random viewer will have viewed a story. Your model effectively computes the probability that a random viewer will have viewed a story given characteristics of the story, which according to the Ising Model is equal to z*e^(a dot x + b), where a is your vector of parameters to compute, x is your vector of variables, b is a vector of biases (which is fixed in your case), and z is a normalization constant. Take the logarithm of both sides, and you get the equation that you were solving for.

f(Ad) = b2 * Co + b3 * Ro

That's right.

That, and I ran different regressions with different subsets of data, and took out the variables whose coefficients weren't stable.

That's clever. I think you'll throw out highly-correlated variables by doing this for exactly the reason you mentioned in 3202575, which leads to the gore/sex absence you mentioned in 3202583, but your model I think is a good fit for the variables you ended up selecting.

3202801
I still peruse the feature box from time to time, but I do very little actual reading anymore.

How you interpret this statement of fact is up to you. I'm comfortable saying people are 1.6 times as likely to click on a comedy as a randomly-chosen non-comedy.

This is essentially what I was getting at; not that the numbers aren't mathematically sound or "facts" (though I would have thought you had abandoned this word), but rather, the question for me is what does the "fact" mean precisely? You're comfortable with it meaning what it says; I'm simply not. Perhaps it's my experiences in physics, perhaps it's ignorance. You yourself said "all else being equal" in regards to other variables, and I don't believe it is. Not to mention there are other important factors at play which cannot really be tested accurately (word of mouth, synopses, title, etc). I'm sure a factor of 1.6 for humor tags is certainly an indication of its effect, but I'm not confident enough to take it at its face value, that's all.

3204246 Your model effectively computes the probability that a random viewer will view a story given characteristics of the story, which according to the Ising Model is equal to z*e^(a dot x + b), where a is your vector of parameters to compute, x is your vector of variables, b is a vector of biases (which is fixed in your case), and z is a normalization constant.

The Ising model is about the interactions between similar particles and an external field. For this to be an Ising model, a would be some variable of the story of interest, and x would be similar one-variable summaries of all adjacent stories. There are no interactions between the stories.

3208734
There would be interactions between stories according to an Ising Model if you treat each story as a particle and allow story-to-story coefficients to be non-zero. Particles in the Ising Model correspond to scalar observables (existence of a tag, a character, probability of a view, etc.). Stories have a valid configuration of observables, to which you tune your parameters. You're solving for the interactions that are most likely to result in the distribution of tags, characters, and views that stories have. No interactions between stories are required, and they definitely don't exist in the model you used or the one I described.

That said, interaction between stories is a thing you can solve for with the Ising Model abstraction.

I don't have access to my desktop at the moment, otherwise I would give the code for computing the model. I can give some rough pseudocode to show you the equivalence.

import theano, theano.tensor as tensor, numpy
inputs = theano.shared(numpy.array(...)) # row vectors of characters, tag, eqd rating, etc
views = theano.shared(numpy.array(...)) # column vector of views as probabilities
# assume some arbitrarily large number of viewers and just divide the actual number of views by that
# we only care about ratios of views, which will hold under the transformation
# it's important to note that the denominator can be arbitrarily high
# which means you can arbitrarily include people that don't read any stories

# one weight per input column (tag, character, etc.), not one per story
j = theano.shared(numpy.random.randn(inputs.get_value().shape[1], 1))

# log(p / (1-p)) = energy cost of flipping the view probability, so p = sigmoid(energy cost)
# since every input*views*j = 0 at views=0, the energy of views=0 is 0, and energy cost = input*j
model_views = tensor.nnet.sigmoid(tensor.dot(inputs, j))
# you can do the reverse as well, but in this case we're assuming that there is no reverse interaction

xent = tensor.nnet.binary_crossentropy
kldiv = lambda output, target: xent(output, target) - xent(target, target)
error = kldiv(model_views, views).mean()  # average over stories to get a scalar to minimize

# you get the most accurate Ising Model by finding the j that minimizes error
# the error is minimized when the cross-entropy of model_views against views is minimized
# which happens when views = sigmoid(inputs dot j)
# which happens when log(views / (1-views)) = log(views) - log(1-views) = inputs dot j
# since the denominator for views can be arbitrarily high, log(1-views) can be treated as 0
# and you end up solving for log(views) = inputs dot j
