
Glitter Hamburger



What makes stories popular on fimfiction? · 10:15pm Feb 15th, 2018

Or more accurately: What factors except writing quality make stories popular?

To answer this question, I extracted the metadata from all published stories on fimfiction.net in July 2016. This includes number of views, likes, dislikes and comments, tagged characters and genres, content rating (everyone, teen, mature), status (complete, incomplete, on hiatus, cancelled), date of publishing and last modification, and whether there is a story image.

For the purposes of this analysis I define popularity as the number of views. At the end I briefly mention what changes if we define it as likes or as likes per view instead. Some years ago, Bad Horse wrote a blog post on the results of a regression analysis of views, and another one on the effect of having a story image. I wanted to redo the analysis with more, and more recent, data, and maybe try some additional statistical methods.

The idea of linear regression is to see how the dependent variable (in our case the number of views) changes when an explanatory variable (for example the date of publishing or the status) changes. Since the number of stories on fimfiction is quite large, the observed relationships are very likely to be close to the true underlying relationships.

One problem of linear regression is that it only measures correlation, i.e. the strength and direction of the relationship between two variables, but not causation, i.e. what actually produces that relationship. For example, it cannot tell us whether popular stories are more likely to have cover art because the cover art attracts readers, or whether it is the other way around and authors retroactively add a cover once they get a lot of views and comments. For many things it's probably both. It also seems probable that views generate even more views, maybe indirectly via the number of comments and likes. The model that we assume for linear regression doesn't take any of this into account. It is still useful: reality is more complex, but the model is easy to compute and understand, and is hopefully a reasonable approximation of the real mechanisms.

It is usually a bad idea to assume linear relationships for count data (e.g. views, words etc.). Luckily, taking the logarithm of count variables and then fitting a linear model works pretty well, with the drawback that the coefficients of the model are less intuitive to interpret. An increase of 0.4902 in \log(views+1) has to be read multiplicatively: if \log(views+1) goes from 2 to 2.4902, views go from 6.4 to 11.1, but an increase from 7 to 7.4902 means going from 1096 to 1789 views. In both cases (views+1) is multiplied by e^{0.4902} \approx 1.63, where e \approx 2.718.
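The multiplicative reading can be checked with a few lines of Python. This is only an illustrative sketch (the analysis itself was done in R), and the helper name `views_from_log` is made up for this example.

```python
import math


def views_from_log(log_views_plus_one):
    """Invert y = log(views + 1) to recover the number of views."""
    return math.exp(log_views_plus_one) - 1


coef = 0.4902            # example increase in log(views + 1)
factor = math.exp(coef)  # multiplier applied to (views + 1)

# The same additive step on the log scale, starting from two different levels.
low_before, low_after = views_from_log(2), views_from_log(2 + coef)
high_before, high_after = views_from_log(7), views_from_log(7 + coef)

print(f"factor on (views + 1): {factor:.3f}")
print(f"low:  {low_before:.1f} -> {low_after:.1f} views")
print(f"high: {high_before:.0f} -> {high_after:.0f} views")
```

The additive step is identical in both cases, but the absolute gain in views depends on where the story started.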

The linear model we are looking at is
\log(views+1) = b_0 + b_1*first\_published + b_2*last\_modified + b_3*\log(words+1) + \ldots
The names denote the observed variables, the b_i's are the coefficients that we want to determine. The model is called linear because it is linear in the coefficients, which are the things that we are fitting. I fitted the model with R and got the following coefficient values (column "Estimate"):

Coefficients:
                          Estimate  Std. Error t value Pr(>|t|)    
(Intercept)            666.6123822   5.6366296  118.26  < 2e-16 ***
first_published         -0.0015203   0.0000196  -77.47  < 2e-16 ***
last_modified            0.0006194   0.0000190   32.54  < 2e-16 ***
logwords                 0.2369656   0.0030533   77.61  < 2e-16 ***
statuscomplete           0.4901738   0.0076534   64.05  < 2e-16 ***
statusonhiatus           0.0367372   0.0119909    3.06  0.00219 ** 
statuscancelled         -0.0557838   0.0156382   -3.57  0.00036 ***
content_ratingteen       0.0375398   0.0078916    4.76    2e-06 ***
content_ratingmature    -0.0181064   0.0130343   -1.39  0.16480    
has_img                  0.4436565   0.0079947   55.49  < 2e-16 ***
cat_2nd_Person           0.6327578   0.0382729   16.53  < 2e-16 ***
cat_Adventure           -0.3301956   0.0085079  -38.81  < 2e-16 ***
cat_Alternate_Universe  -0.0545752   0.0076597   -7.12    1e-12 ***
cat_Anthro               0.2766786   0.0162611   17.01  < 2e-16 ***
cat_Comedy               0.0947635   0.0073482   12.90  < 2e-16 ***
cat_Crossover            0.0053128   0.0092744    0.57  0.56675    
cat_Dark                -0.1427224   0.0087215  -16.36  < 2e-16 ***
cat_Drama                0.0452714   0.0190422    2.38  0.01744 *  
cat_Equestria_Girls      0.3464103   0.0221853   15.61  < 2e-16 ***
cat_Gore                -0.4728614   0.0100367  -47.11  < 2e-16 ***
cat_Horror              -0.0917762   0.0406754   -2.26  0.02405 *  
cat_Human                0.2005785   0.0077447   25.90  < 2e-16 ***
cat_Mystery             -0.1974305   0.0351835   -5.61    2e-08 ***
cat_Random              -0.1430096   0.0088447  -16.17  < 2e-16 ***
cat_Romance              0.0893156   0.0071721   12.45  < 2e-16 ***
cat_Sad                 -0.0900292   0.0086594  -10.40  < 2e-16 ***
cat_Sex                  0.3924742   0.0110543   35.50  < 2e-16 ***
cat_Sci_Fi               0.0044226   0.0308068    0.14  0.88585    
cat_Slice_of_Life       -0.0190334   0.0076663   -2.48  0.01304 *  
cat_Thriller            -0.1061908   0.0409111   -2.60  0.00944 ** 
cat_Tragedy             -0.1933245   0.0106234  -18.20  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.884 on 82085 degrees of freedom
Multiple R-squared:  0.395,	Adjusted R-squared:  0.395

Or if you prefer plots:

first_published and last_modified are measured in days. This means that publishing a story one day later is associated with an expected decrease in log(views+1) of 0.001520. status has four possible values. incomplete is taken as the baseline, so the coefficient of statuscomplete tells us that being complete is associated with an expected increase in log(views+1) of 0.4902 compared to being incomplete. content_rating is handled the same way, with "everyone" as the baseline. The genre tags are not mutually exclusive, so each one is simply a 0-1 variable.
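To make the baseline idea concrete, here is a small Python sketch of how such 0-1 variables can be built. R does this automatically for factor variables. The function names and the three-tag list are hypothetical and only serve the illustration; the column names mirror the ones in the table.

```python
STATUS_LEVELS = ("incomplete", "complete", "onhiatus", "cancelled")


def encode_status(status, baseline="incomplete"):
    """One 0/1 indicator per status level, except the baseline level."""
    if status not in STATUS_LEVELS:
        raise ValueError(f"unknown status: {status!r}")
    return {
        f"status{level}": int(status == level)
        for level in STATUS_LEVELS
        if level != baseline
    }


def encode_tags(story_tags, known_tags):
    """Tags are not mutually exclusive, so each tag gets its own 0/1 column."""
    return {f"cat_{tag}": int(tag in story_tags) for tag in known_tags}


# A complete story tagged Comedy and Romance (made-up example).
row = {
    **encode_status("complete"),
    **encode_tags({"Comedy", "Romance"}, ["Comedy", "Dark", "Romance"]),
}
print(row)
```

An incomplete story gets 0 in all three status columns, which is why its effect ends up in the intercept and the other status coefficients are read relative to it.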

There are two other things of note in the table above. First, the remaining three columns of numbers. They are all about the same thing, namely quantifying the statistical significance of the fitted coefficient. The t value is the estimate divided by its standard error. The rightmost column is the p-value: assuming that the form of the model is correct and the true coefficient is zero, it is the probability that a fitted coefficient ends up at least as far from 0 as the one we observed. Since we have a lot of data, significance is not a problem for most variables.
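The relationship between these columns can be reproduced roughly in Python. This sketch replaces Student's t distribution with the standard normal, which is an approximation I am assuming is harmless here because the model has about 82,000 degrees of freedom. The function names are made up.

```python
import math


def t_value(estimate, std_error):
    """t statistic: how many standard errors the estimate is away from 0."""
    return estimate / std_error


def two_sided_p_normal(t):
    """Two-sided p-value under a standard normal approximation to Student's t."""
    return math.erfc(abs(t) / math.sqrt(2))


# Numbers taken from the statusonhiatus row of the table.
t = t_value(0.0367372, 0.0119909)
p = two_sided_p_normal(t)
print(f"t = {t:.2f}, p = {p:.5f}")
```

For the rows marked `< 2e-16` the p-value is simply too small for R to print.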

Second, the "Multiple R-squared" of 0.395. It tells us how well our model fits the data, and is the fraction of explained variance over total variance (it is between 0 and 1, where 1 means perfect fit). This model only measures the effect of subject choice, length, existence of a cover image, and time. It would therefore be a bit depressing if the R^2 was close to 1, since the model doesn't include the writing quality, and not even the quality of the title, cover image or description.
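For reference, R^2 can be computed from observed and fitted values as one minus the ratio of residual to total sum of squares. For a least-squares fit with an intercept this equals the explained fraction of variance. The sketch below uses made-up toy numbers, not the fimfiction data.

```python
def r_squared(observed, predicted):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_obs = sum(observed) / len(observed)
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    ss_res = sum((y - f) ** 2 for y, f in zip(observed, predicted))
    return 1 - ss_res / ss_tot


observed = [1.0, 2.0, 3.0, 4.0]        # toy values of log(views + 1)
predicted = [1.5, 1.5, 3.5, 3.5]       # toy fitted values
mean_only = [2.5] * len(observed)      # a "model" that always predicts the mean

print(r_squared(observed, predicted))
print(r_squared(observed, mean_only))
```

A model that always predicts the mean scores 0, and a perfect fit scores 1.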

The numbers themselves aren't all that interesting: We need to interpret them. There are some pitfalls when interpreting the coefficients. For example, the "sex" tag always implies "mature", so it doesn't make much sense to consider the effect of "sex" alone. Mature mostly splits into gore stories and sex stories, so almost all of its effect is absorbed by those tags.

The binary variables are easy to compare. Ranked by absolute coefficient size:

2nd_Person   0.633
Gore        -0.473
has_img      0.444
Sex          0.393
EqG          0.346
Adventure   -0.330
Anthro       0.277
Human        0.201
Mystery     -0.197
Tragedy     -0.193
Random      -0.143
Dark        -0.143
Thriller    -0.106
Comedy       0.095
Horror      -0.092
Sad         -0.090
Romance      0.089
AlternateU  -0.055
Drama        0.045
Slice_of_L  -0.019
Crossover    0.005
SciFi        0.004

For how little it has to do with the actual story, has_img is very important. A general trend seems to be that people don't like negative stuff: Gore, Tragedy, Dark, Horror and Sad all have negative coefficients.

Among the other categorical coefficients, only "complete" is large compared with the genre tags, with a coefficient of 0.490. The time variables behave as one would expect: the longer ago a story was published, the more time it has had to gather views, and for a given publishing date, a later update means more recent activity, which also increases views. logwords also has quite a large effect on views (logviews increases by 0.237 if the number of words is multiplied by e \approx 2.72).
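As a worked example of the logwords coefficient (same model and same rounded coefficient; the choice of doubling the word count is just an illustration):

```latex
\log(views+1) = \ldots + 0.237\,\log(words+1) + \ldots
\quad\Longrightarrow\quad
(views+1) \propto (words+1)^{0.237}

(words+1) \to e\,(words+1): \qquad (views+1) \to e^{0.237}\,(views+1) \approx 1.27\,(views+1)

(words+1) \to 2\,(words+1): \qquad (views+1) \to 2^{0.237}\,(views+1) \approx 1.18\,(views+1)
```

So, with everything else held fixed, the model expects a story that is twice as long to have roughly 18% more views.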

One way to compare different coefficients is by standardizing all variables to unit variance before fitting. For multi-categorical variables this can't be meaningfully done. But we can do it for the binary and continuous variables and get (sorted by absolute value):

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
first_published        -0.5961998  0.0076959 -77.470  < 2e-16 ***
logwords                0.2675622  0.0034475  77.611  < 2e-16 ***
last_modified           0.2503494  0.0076936  32.540  < 2e-16 ***
cat_Gore               -0.1569475  0.0033313 -47.113  < 2e-16 ***
has_img                 0.1523097  0.0027446  55.494  < 2e-16 ***
cat_Sex                 0.1394495  0.0039277  35.504  < 2e-16 ***
cat_Adventure          -0.1330355  0.0034278 -38.810  < 2e-16 ***
cat_Human               0.0756429  0.0029207  25.899  < 2e-16 ***
cat_Dark               -0.0551465  0.0033699 -16.364  < 2e-16 ***
cat_Tragedy            -0.0534059  0.0029347 -18.198  < 2e-16 ***
cat_Anthro              0.0483905  0.0028440  17.015  < 2e-16 ***
cat_Random             -0.0479976  0.0029685 -16.169  < 2e-16 ***
cat_2nd_Person          0.0456012  0.0027582  16.533  < 2e-16 ***
cat_Equestria_Girls     0.0436107  0.0027930  15.614  < 2e-16 ***
cat_Comedy              0.0387398  0.0030040  12.896  < 2e-16 ***
cat_Romance             0.0374496  0.0030072  12.453  < 2e-16 ***
cat_Sad                -0.0310630  0.0029878 -10.397  < 2e-16 ***
cat_Alternate_Universe -0.0207665  0.0029146  -7.125 1.05e-12 ***
cat_Mystery            -0.0156822  0.0027947  -5.611 2.01e-08 ***
cat_Slice_of_Life      -0.0079047  0.0031838  -2.483 0.013039 *  
cat_Thriller           -0.0072872  0.0028075  -2.596 0.009443 **  
cat_Drama               0.0067175  0.0028255   2.377 0.017436 * 
cat_Horror             -0.0063105  0.0027968  -2.256 0.024054 *  
cat_Crossover           0.0017251  0.0030115   0.573 0.566753    
cat_Sci_Fi              0.0003968  0.0027642   0.144 0.885850    
(Intercept)            -0.2450464  0.0070541 -34.738  < 2e-16 ***
statuscomplete          0.4315341  0.0067378  64.047  < 2e-16 ***
statusonhiatus          0.0323423  0.0105564   3.064 0.002187 ** 
statuscancelled        -0.0491103  0.0137674  -3.567 0.000361 ***
content_ratingteen      0.0330489  0.0069476   4.757 1.97e-06 ***
content_ratingmature   -0.0159403  0.0114750  -1.389 0.164797    
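What standardizing does to a coefficient can be seen in a one-variable toy example. The sketch below assumes that both the explanatory variable and the response are scaled to unit variance, and the data are made up. It shows that the standardized slope is the raw slope multiplied by sd(x)/sd(y).

```python
import statistics


def ols_slope(x, y):
    """Least-squares slope of y on x for a single explanatory variable."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def standardize(values):
    """Shift to mean 0 and scale to unit (sample) standard deviation."""
    mean, sd = statistics.fmean(values), statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# Toy data: a 0/1 variable (think has_img) and a log-scale response.
x = [0, 0, 1, 1, 1, 0, 1, 0]
y = [2.1, 1.7, 2.9, 2.4, 3.1, 2.0, 2.6, 1.5]

raw_slope = ols_slope(x, y)
std_slope = ols_slope(standardize(x), standardize(y))
rescaled = raw_slope * statistics.stdev(x) / statistics.stdev(y)

print(raw_slope, std_slope, rescaled)
```

This is why a rare tag with a large raw coefficient, such as 2nd_Person, drops far down the standardized ranking: its standard deviation is small.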

I also tried using loglikes instead of logviews as the dependent variable. There are almost no qualitative changes, except that "mature" has a much stronger positive effect; all other coefficients stay similar.
Changing it to loglikes-logviews (i.e. log((likes+1)/(views+1))) changes a lot:

One interesting thing about it is that now last_modified and first_published both have a positive effect. This makes sense: early classics have a very low likes-to-views ratio by today's standards, probably because fewer readers had an account back then.
