What makes stories popular on fimfiction? · 10:15pm Feb 15th, 2018
Or more accurately: What factors other than writing quality make stories popular?
To answer this question, I extracted the metadata from all published stories on fimfiction.net in July 2016. This includes number of views, likes, dislikes and comments, tagged characters and genres, content rating (everyone, teen, mature), status (complete, incomplete, on hiatus, cancelled), date of publishing and last modification, and whether there is a story image.
For the purposes of this analysis I define popularity as the number of views. At the end I quickly mention what changes if we define it as likes or as likes per view instead. Some years ago, Bad Horse already wrote a blog post on the results of regression analysis of views, and another one on the effect of having a story image. But I wanted to redo the analysis with a larger and more recent dataset, and maybe try some additional statistical methods.
The idea of linear regression is to see how the dependent variable (in our case the number of views) changes when an explanatory variable (in our case for example the date of publishing, the status, ...) changes. Since the number of stories on fimfiction is quite large, the observed relationships are very likely to be close to the true underlying relationships. One problem of linear regression is that it only measures correlation, i.e. the strength and direction of the relationship between two variables, but not the causation, i.e. by what this numerical relationship is actually caused. For example it cannot tell us whether popular stories are more likely to have cover art because the cover art attracts readers, or if it is the other way around and authors retroactively add a cover if they get a lot of views and comments. For many things it's probably both. It also seems probable that views generate even more views, maybe indirectly via number of comments and likes. The model that we assume for linear regression doesn't take any of this into account. It is still useful because while reality is more complex, this model can be easily computed and understood, and is hopefully a reasonable approximation of the real mechanisms.
It is usually a bad idea to assume linear relationships for count data (e.g. views, words, etc.). Luckily, taking the logarithm of count variables and then fitting a linear model works pretty well, though this has the drawback that the coefficients of the model aren't as intuitive to interpret anymore. An increase of 0.4902 in log(views+1) has to be read multiplicatively: if log(views+1) goes from 2 to 2.4902, views go from 6.4 to 11.1, but an increase from 7 to 7.4902 means going from 1096 to 1789 views. In both cases, (views+1) is multiplied by e^0.4902 ≈ 1.63, where e = 2.718.
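The arithmetic above is easy to check yourself (a small Python sketch; the numbers are the ones from the examples above):

```python
import math

# A coefficient of 0.4902 on the log scale multiplies (views+1) by e^0.4902.
factor = math.exp(0.4902)  # about 1.63

# log(views+1) going from 2 to 2.4902:
low_before = math.exp(2) - 1       # about 6.4 views
low_after = math.exp(2.4902) - 1   # about 11.1 views

# log(views+1) going from 7 to 7.4902:
high_before = math.exp(7) - 1      # about 1096 views
high_after = math.exp(7.4902) - 1  # about 1789 views

print(round(factor, 2), round(low_before, 1), round(low_after, 1),
      round(high_before), round(high_after))
```

The same additive step on the log scale corresponds to very different absolute changes in views, but always the same multiplicative factor.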
The linear model we are looking at is

log(views+1) = b_0 + b_1 * first_published + b_2 * last_modified + b_3 * log(words+1) + ...
The names denote the observed variables, the b_i's are the coefficients that we want to determine. The model is called linear because it is linear in the coefficients, which are the things that we are fitting. I fitted the model with R and got the following coefficient values (column "Estimate"):
Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)           666.6123822  5.6366296  118.26  < 2e-16 ***
first_published        -0.0015203  0.0000196  -77.47  < 2e-16 ***
last_modified           0.0006194  0.0000190   32.54  < 2e-16 ***
logwords                0.2369656  0.0030533   77.61  < 2e-16 ***
statuscomplete          0.4901738  0.0076534   64.05  < 2e-16 ***
statusonhiatus          0.0367372  0.0119909    3.06  0.00219 **
statuscancelled        -0.0557838  0.0156382   -3.57  0.00036 ***
content_ratingteen      0.0375398  0.0078916    4.76    2e-06 ***
content_ratingmature   -0.0181064  0.0130343   -1.39  0.16480
has_img                 0.4436565  0.0079947   55.49  < 2e-16 ***
cat_2nd_Person          0.6327578  0.0382729   16.53  < 2e-16 ***
cat_Adventure          -0.3301956  0.0085079  -38.81  < 2e-16 ***
cat_Alternate_Universe -0.0545752  0.0076597   -7.12    1e-12 ***
cat_Anthro              0.2766786  0.0162611   17.01  < 2e-16 ***
cat_Comedy              0.0947635  0.0073482   12.90  < 2e-16 ***
cat_Crossover           0.0053128  0.0092744    0.57  0.56675
cat_Dark               -0.1427224  0.0087215  -16.36  < 2e-16 ***
cat_Drama               0.0452714  0.0190422    2.38  0.01744 *
cat_Equestria_Girls     0.3464103  0.0221853   15.61  < 2e-16 ***
cat_Gore               -0.4728614  0.0100367  -47.11  < 2e-16 ***
cat_Horror             -0.0917762  0.0406754   -2.26  0.02405 *
cat_Human               0.2005785  0.0077447   25.90  < 2e-16 ***
cat_Mystery            -0.1974305  0.0351835   -5.61    2e-08 ***
cat_Random             -0.1430096  0.0088447  -16.17  < 2e-16 ***
cat_Romance             0.0893156  0.0071721   12.45  < 2e-16 ***
cat_Sad                -0.0900292  0.0086594  -10.40  < 2e-16 ***
cat_Sex                 0.3924742  0.0110543   35.50  < 2e-16 ***
cat_Sci_Fi              0.0044226  0.0308068    0.14  0.88585
cat_Slice_of_Life      -0.0190334  0.0076663   -2.48  0.01304 *
cat_Thriller           -0.1061908  0.0409111   -2.60  0.00944 **
cat_Tragedy            -0.1933245  0.0106234  -18.20  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.884 on 82085 degrees of freedom
Multiple R-squared: 0.395, Adjusted R-squared: 0.395
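I did the fitting in R, but the mechanics of least squares are simple enough to sketch in plain Python. Here is a minimal one-predictor version on made-up data (the variable names mirror the model above; the numbers are synthetic, purely to show how the coefficients are obtained):

```python
import math

# Synthetic stand-in data: log(words+1) as the single predictor,
# log(views+1) as the response, generated to lie exactly on a line.
logwords = [math.log(w + 1) for w in (1000, 5000, 20000, 80000, 3000, 12000)]
logviews = [0.24 * lw + 1.0 for lw in logwords]

# Closed-form simple linear regression: slope = cov(x, y) / var(x).
n = len(logwords)
mx = sum(logwords) / n
my = sum(logviews) / n
slope = sum((x - mx) * (y - my) for x, y in zip(logwords, logviews)) \
        / sum((x - mx) ** 2 for x in logwords)
intercept = my - slope * mx

print(round(slope, 3), round(intercept, 3))  # recovers 0.24 and 1.0
```

The real model has many predictors, so R solves a matrix version of the same least-squares problem, but each coefficient is still "how much the response changes per unit of this variable, holding the others fixed".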
Or if you prefer plots:
first_published and last_modified are measured in days. This means that publishing a story one day later goes with an expected decrease in log(views+1) by 0.001520. status has three possible values. incomplete is taken as the baseline, so the coefficient of statuscomplete tells us that being complete goes with an expected increase in log(views+1) by 0.4902 compared to being incomplete. content_rating is similar, with "everyone" as the baseline. The different genre tags are modeled as not mutually exclusive, so they are simply 0-1-variables.
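To make the dummy coding concrete: the baseline value below is a made-up example, but the statuscomplete coefficient is the one from the table.

```python
import math

# Hypothetical baseline: an incomplete story expected at log(views+1) = 6.0.
baseline_log = 6.0
b_complete = 0.4901738  # statuscomplete coefficient from the fitted model

views_incomplete = math.exp(baseline_log) - 1
views_complete = math.exp(baseline_log + b_complete) - 1

# Being complete multiplies (views+1) by e^0.4902, i.e. about +63%.
ratio = (views_complete + 1) / (views_incomplete + 1)
print(round(views_incomplete), round(views_complete), round(ratio, 2))
```

Whatever the baseline is, the dummy coefficient always translates into the same multiplicative factor on (views+1).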
There are two other things worth noting in the table above. First, the remaining three columns of numbers. They are all about the same thing, namely trying to quantify the statistical significance of the fitted coefficient. The rightmost column is the p-value: given that the form of the model is correct and the coefficient is in truth zero, what is the probability that the fitted coefficient lands at least as far away from 0 as the one we observed? Since we have a lot of data, significance is not a problem for most variables.
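The three columns are directly related: the t value is just the estimate divided by its standard error, and the p-value comes from comparing that t value against a t distribution. A quick check with the statuscomplete row:

```python
# Values copied from the statuscomplete row of the regression table.
estimate = 0.4901738
std_error = 0.0076534

t_value = estimate / std_error
print(round(t_value, 2))  # matches the 64.05 in the table
```

With a t value of 64 and over 80,000 degrees of freedom, the p-value is astronomically small, hence the "< 2e-16".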
Second, the "Multiple R-squared" of 0.395. It tells us how well our model fits the data, and is the fraction of explained variance over total variance (it is between 0 and 1, where 1 means perfect fit). This model only measures the effect of subject choice, length, existence of a cover image, and time. It would therefore be a bit depressing if the R^2 was close to 1, since the model doesn't include the writing quality, and not even the quality of the title, cover image or description.
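The R² formula itself is short: one minus the residual sum of squares over the total sum of squares. A stdlib sketch on made-up numbers:

```python
# Made-up observed responses and model predictions, purely to show the formula.
observed = [2.0, 4.0, 6.0, 5.0, 7.0]
predicted = [2.5, 3.5, 5.5, 5.5, 6.5]

mean = sum(observed) / len(observed)
ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
ss_tot = sum((o - mean) ** 2 for o in observed)

r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))
```

An R² of 0.395 means the model's predictions account for about 40% of the spread in log views, leaving the majority to everything the model doesn't see.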
The numbers themselves aren't all that interesting: We need to interpret them. There are some pitfalls when interpreting the coefficients. For example, the "sex" tag always implies "mature", so it doesn't make much sense to consider the effect of "sex" alone. Mature mostly splits into gore stories and sex stories, so almost all of its effect is absorbed by those tags.
The binary variables are easy to compare. Ranked by absolute coefficient size:
2nd_Person   0.633
Gore        -0.473
has_img      0.444
Sex          0.393
EqG          0.346
Adventure   -0.330
Anthro       0.277
Human        0.201
Mystery     -0.197
Tragedy     -0.193
Random      -0.143
Dark        -0.143
Thriller    -0.106
Comedy       0.095
Horror      -0.092
Sad         -0.090
Romance      0.089
AlternateU  -0.055
Drama        0.045
Slice_of_L  -0.019
Crossover    0.005
SciFi        0.004
For how little it has to do with the actual story, has_img is very important. A general trend seems to be that people don't like negative stuff.
Among the other categorical coefficients, only "complete" is important when compared with the genre tags, with a coefficient of 0.490. It is rather intuitive that the longer a story has been published, the longer it has had time to gather views, and that given the publishing date, updating later means more activity, which also increases views. logwords also has quite a large effect on views (logviews increases by 0.237 when the number of words is multiplied by e = 2.72).
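A perhaps more intuitive way to read the logwords coefficient: multiplying the word count by some factor multiplies (views+1) by that factor raised to the power 0.237. For doubling the word count:

```python
import math

b_logwords = 0.2369656  # from the fitted model

# Doubling the words adds log(2) to log(words+1) (approximately, for large
# word counts), so log(views+1) increases by b_logwords * log(2).
factor_double = math.exp(b_logwords * math.log(2))  # equals 2 ** b_logwords
factor_e = math.exp(b_logwords)  # multiplying words by e = 2.718

print(round(factor_double, 3), round(factor_e, 3))
```

So doubling a story's length goes with roughly 18% more views in this model, all else equal.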
One way to compare different coefficients is by standardizing all variables to unit variance before fitting. For multi-categorical variables this can't be meaningfully done. But we can do it for the binary and continuous variables and get (sorted by absolute value):
Coefficients:
                         Estimate Std. Error  t value Pr(>|t|)
first_published        -0.5961998  0.0076959  -77.470  < 2e-16 ***
logwords                0.2675622  0.0034475   77.611  < 2e-16 ***
last_modified           0.2503494  0.0076936   32.540  < 2e-16 ***
cat_Gore               -0.1569475  0.0033313  -47.113  < 2e-16 ***
has_img                 0.1523097  0.0027446   55.494  < 2e-16 ***
cat_Sex                 0.1394495  0.0039277   35.504  < 2e-16 ***
cat_Adventure          -0.1330355  0.0034278  -38.810  < 2e-16 ***
cat_Human               0.0756429  0.0029207   25.899  < 2e-16 ***
cat_Dark               -0.0551465  0.0033699  -16.364  < 2e-16 ***
cat_Tragedy            -0.0534059  0.0029347  -18.198  < 2e-16 ***
cat_Anthro              0.0483905  0.0028440   17.015  < 2e-16 ***
cat_Random             -0.0479976  0.0029685  -16.169  < 2e-16 ***
cat_2nd_Person          0.0456012  0.0027582   16.533  < 2e-16 ***
cat_Equestria_Girls     0.0436107  0.0027930   15.614  < 2e-16 ***
cat_Comedy              0.0387398  0.0030040   12.896  < 2e-16 ***
cat_Romance             0.0374496  0.0030072   12.453  < 2e-16 ***
cat_Sad                -0.0310630  0.0029878  -10.397  < 2e-16 ***
cat_Alternate_Universe -0.0207665  0.0029146   -7.125 1.05e-12 ***
cat_Mystery            -0.0156822  0.0027947   -5.611 2.01e-08 ***
cat_Slice_of_Life      -0.0079047  0.0031838   -2.483 0.013039 *
cat_Thriller           -0.0072872  0.0028075   -2.596 0.009443 **
cat_Drama               0.0067175  0.0028255    2.377 0.017436 *
cat_Horror             -0.0063105  0.0027968   -2.256 0.024054 *
cat_Crossover           0.0017251  0.0030115    0.573 0.566753
cat_Sci_Fi              0.0003968  0.0027642    0.144 0.885850
(Intercept)            -0.2450464  0.0070541  -34.738  < 2e-16 ***
statuscomplete          0.4315341  0.0067378   64.047  < 2e-16 ***
statusonhiatus          0.0323423  0.0105564    3.064 0.002187 **
statuscancelled        -0.0491103  0.0137674   -3.567 0.000361 ***
content_ratingteen      0.0330489  0.0069476    4.757 1.97e-06 ***
content_ratingmature   -0.0159403  0.0114750   -1.389 0.164797
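The relationship between the raw and standardized coefficients can be checked directly: for a single predictor, the standardized coefficient equals the raw coefficient times sd(x)/sd(y). A stdlib sketch on made-up toy data:

```python
import math

def sd(xs):
    """Population standard deviation."""
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def slope(xs, ys):
    """Simple-regression slope: cov(x, y) / var(x)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
           / sum((x - mx) ** 2 for x in xs)

x = [1.0, 2.0, 4.0, 7.0, 11.0]
y = [2.0, 3.0, 6.0, 8.0, 13.0]

raw = slope(x, y)

# Standardize both variables to zero mean and unit variance, then refit.
zx = [(v - sum(x) / len(x)) / sd(x) for v in x]
zy = [(v - sum(y) / len(y)) / sd(y) for v in y]
standardized = slope(zx, zy)

print(round(standardized, 4), round(raw * sd(x) / sd(y), 4))  # equal
```

This is why standardizing makes coefficients comparable: each one now measures the response change per standard deviation of the predictor, regardless of the predictor's original units.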
I also tried using loglikes instead of logviews as dependent variable. There are almost no qualitative changes, except that "mature" has a much stronger positive effect while all other coefficients stay similar.
Changing it to loglikes - logviews (i.e. log((likes+1)/(views+1))) changes a lot:
One interesting thing about it is that now last_modified and first_published both have a positive effect. It makes sense: when you look at early classics, they have a very bad likes-to-views ratio by today's standards, probably because fewer readers had an account back then.