horizon

Not a changeling.
Confidence intervals (or, a layman's guide to story ranking) · 11:53pm Sep 4th, 2014

EDIT, 2014-10-10: With the recent site update, stories now show their overall site ranking in their Statistics tab. This is based on the same number that controls your sidebar display, so this post also explains how Ranking is calculated. It has been confirmed by knighty in the comments.

Estee recently noted the bizarre way in which upvote/downvote ratios seem to order themselves in the sidebar showing an author's top-rated stories. This is a topic I've briefly mentioned before, so it seemed like an opportune time to dig into the numbers and offer some insight.

Geeking out over FIMFiction statistics follows.

(First, a disclaimer: I have a math degree but I'm trying to present the concept in layman's terms. Statisticians/number geeks are free to sink their teeth into equations in comments.)

Here at the right is a snapshot of the top seven stories in my sidebar. Watch your sidebar for any length of time and it becomes obvious that those ratings are tied largely, or exclusively, to the upvote/downvote ratio of the story; you can watch stories rise and fall as they accumulate votes, and a single downvote can send stories plunging way down the list. Other factors such as views and comments don't seem to make much difference. But if you simply divide the numbers to get an upvote/downvote ratio, this is what you get for those seven stories:

333/4 = 83.25
258/3 = 86.00
138/1 = 138.00
904/28 = 32.29
425/12 = 35.42
283/8 = 35.38
66/0 = (infinitely good; division by zero)

If it were just about upvote/downvote ratio, those stories would be wildly out of order. The 0-downvote story should be on top by a large margin, and then it should go strictly from higher ratios to lower.

How come it doesn't work that way?

The short answer: it's based on a statistical technique called a confidence interval surrounding the ratio, rather than the ratio itself. (Source: Discussion here on fimfic a long time back, IIRC from knighty himself, probably in the comments section of yet another news post about the featurebox.)

The idea being: let's say you have a story with 10,000 upvotes and 45 downvotes. This is a pretty huge sample of readers, a significant fraction of FIMFiction. So you can be pretty confident of your actual up/down ratio being really close to what you get if you just mash those two numbers together.

Now take a look at that exact same story earlier on, when it only had 1,000 upvotes. Let's say at that point it had 4 downvotes. If you didn't know the "real" number of downvotes at 10,000 ups, and you had to make a guess of what it would be — based only on your 1,000-vote sample — then "10000 up:40 down" would be a pretty good guess, right? But when you're scaling up from a smaller data pool like that, statistics tells us that you probably won't get exactly 40 downvotes. Your 4-downvote story, scaled up to 10x the readers, might easily end up with 30 or 50 downvotes, or anywhere in between, because the data you've got right now only samples 1/10 of the people, and your smaller sample set might have accidentally included too many or too few downvoters.

A "confidence interval" is basically about saying, "This story will most likely have between X and Y downvotes if it got read enough to accumulate 10,000 upvotes," with a certain degree of confidence (usually 95%, as in "I'm giving numbers wide enough that 95% of my guesses will be correct if we go back and measure later").

Now take the exact same story, except you take your first sample when it has only 100 upvotes. Based on our special omniscient knowledge of 10,000 readers, we know it "should" have had 0.45 downvotes at that point — except that's impossible, because each individual person either downvotes it or not. Most likely, your 100-vote story will either end up with 0 or 1 downvotes. (You might even get really unlucky in your sample and have, say, 3.)

This is why confidence intervals are so important. "100 upvotes to 0 downvotes" is infinitely good because you can't divide by 0, but by turning it into a confidence interval, we can say the equivalent of, "Yeah, it'll probably have somewhere between 0 and 100 downvotes if we find 10,000 readers."

And thus the rub: everything I've seen about the sidebar measurements is consistent with placing stories by measuring the up/down ratios at the pessimistic end of their confidence interval rather than at their midpoint.

Looking at our three measurements of our sample story:

The "+10000/-45" story has no confidence interval because you're already sampling everyone. So at the 10k mark you know it has 45 downvotes, no guessing required.
The "+1000/-4" story has a small confidence interval. Pessimistically, at the 10k mark, you can predict it will have 50 (or fewer) downvotes, because it "should" accumulate 40 but you have to leave some room for error when you're scaling up.
The "+100/-0" story has a big confidence interval because you don't have much data. Pessimistically, at the 10k mark, you can predict it will have 100 (or fewer) downvotes, because you're scaling up zero and it "should" accumulate zero more, but you have to leave a LOT of room for error.
(And if you have a +10/-0 story, you're basically throwing darts; the margin of error overwhelms your real data.)

So in your sidebar list, they would be ranked by those pessimistic predictions (45, 50, and 100), putting them in this order:

10000 / 45
1000 / 4
100 / 0

Even though simple division shows that the 10,000 story has the "worst" ratio, and the 100 story has the "best"!

Once you connect those dots, a lot of answers fall into place. Such as why 0-downvote stories don't outperform stories with downvotes, and why more-read stories rise in your list despite higher downvote ratios (the "fudge factor" introduced by the confidence interval is smaller).
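If you want to check this against real numbers, here's a minimal PHP sketch (PHP to match the site code knighty posts further down in the comments; the 95% Wilson lower bound it uses is the formula he confirms there, though the sketch itself is mine, not the site's code). Run on the seven sidebar tallies above, it reproduces the sidebar order exactly:

// Lower bound of the 95% Wilson score interval for an up/down tally.
// A sketch for illustration, not the site's actual code.
function wilsonLower($up, $down, $z = 1.96)
{
    $n = $up + $down;
    if ($n == 0)
        return 0.0; // no votes, no information

    $phat = $up / $n;
    return ($phat + $z * $z / (2 * $n)
            - $z * sqrt(($phat * (1 - $phat) + $z * $z / (4 * $n)) / $n))
           / (1 + $z * $z / $n);
}

// The seven sidebar tallies from the post, in displayed order.
$stories = array(array(333, 4), array(258, 3), array(138, 1),
                 array(904, 28), array(425, 12), array(283, 8), array(66, 0));

foreach ($stories as $s)
    printf("%4d / %-2d -> %.4f\n", $s[0], $s[1], wilsonLower($s[0], $s[1]));

// Prints 0.9699, 0.9668, 0.9604, 0.9569, 0.9526, 0.9467, 0.9450:
// strictly decreasing, matching the sidebar order, with the 66/0
// story dead last despite its "infinitely good" raw ratio.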

Final note: Keep in mind that this doesn't affect "heat", which sends stories into the featurebox. (I don't think downvotes have any effect on heat at all.)

:twilightsmile:

Comments (34)

Insert usual lamentation about the lack of voting on blog entries here.

That makes a lot of sense. Thanks for the information! :twilightsmile:

Ugh, I should have recognized this for what it is. No wonder I was having so much trouble with determining how they were calculated. All the signs were there...

Anyway: they SHOULD do it this way. If you merely took the midpoint in the confidence interval, it wouldn't really be very different from just doing straight up ratios, which is bad. Taking the most pessimistic numbers, on the other hand, rewards stories with lots of votes and punishes stories with few votes, which is a good thing; something which has both a lot of votes and a good ratio is better than something with a few votes and a marginally better ratio.

I've always thought what we really should be doing, and the much easier thing to do, would be to just use the data to predict the true approval proportion according to some universal prior distribution for all stories on the site—i.e. take the standard Bayesian approach. And that's actually really, really easy to do. It's also, I think, really close to what we are doing already (though probably way less computationally intensive).

If you want to look at the distribution of a proportion, it follows something called a Beta distribution with two parameters, which we'll call s (# of successes) and f (# of failures). We've got obvious data for those two numbers for each story. Then... well, I could get into the mechanics of it if you all want, but basically you multiply that probability distribution by another Beta distribution that reflects your prior belief, before seeing a story, about what the approval on that story should look like (e.g. based on the population of all the other stories on the site, or just a flat guess for what sort of approval stories ought to have). Let's say this second Beta, our "prior" distribution, has parameters α and β. Then the "posterior" distribution—our final belief about the true approval proportion based on both the data and our prior beliefs—turns out to also be Beta, and the posterior Beta has parameters s+α and f+β. And our best estimate of the true proportion is just the mean of that Beta distribution (assuming squared error loss, from a decision-theoretic point of view), which is:

(s+α) / (s+α+f+β)

tl;dr if you just add about 3 to both the number of upvotes and the number of downvotes on every story, and compute the proportion of upvotes afterward, you'll get something pretty close to the site's ordering. Except, if memory serves, the site does something way more computationally intensive. Which is dumb, because this is a spectacularly easy problem.
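In code, that estimator is a one-liner. A minimal PHP sketch (PHP to match the site code posted later in this thread), with α = β = 3 as the placeholder prior from the tl;dr:

// Posterior-mean estimate of the true upvote proportion under a
// Beta(alpha, beta) prior -- the Bayesian score described above.
function bayesScore($up, $down, $alpha = 3, $beta = 3)
{
    return ($up + $alpha) / ($up + $down + $alpha + $beta);
}

// e.g. bayesScore(333, 4) = 336/343 ≈ 0.980
//      bayesScore(66, 0)  = 69/72  ≈ 0.958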

2428527
As far as I know, the site's actual means of rating stuff is not public information, though someone may have back-derived it. That being said, you can't really just add three to each; it won't come out right.

2428824
It'll get pretty close, and I'm pretty sure it's a better system than whatever we're using.

My memory is that Bad Horse did a blog post on this back about a year and a half ago, too, where a few different systems for doing this sort of thing were mentioned. But anything other than the Bayesian method seems pretty dumb to me, given that the Bayesian method is stupidly easy to do (just pick an alpha and a beta, and you're done), and you get reasonable results if you use it[1].

Now, like I said, I'm pretty sure this isn't what the site actually does. But it should be, given the above. I'm trying to leave out technobabble here (I've already footnoted one thing and deleted another), but the fact is, from my perspective (and from a decision theory perspective), this is a very standard, very basic problem. I know there are other solutions... but I don't have any idea why there are other solutions. Not only is the standard Bayesian approach going to get very close to approximating anything else you come up with, the whole footnoted thing below about admissibility basically means that the places in which the approaches differ are places in which the non-Bayesian approach is doing stupid things and providing inferior results.


[1] You get an admissible decision rule, anyway, meaning that no rule works better over the full range of true upvote proportions. You could get rules that work better locally, for certain proportion values—I think you basically get those by changing your parameter values on the prior Beta—but given that we're interested in doing this for every story, admissibility seems like a property we'd want, since we really do care about the full range of values.

2428824

Okay, I went and took a look at the Beta(3,3) method. It'll preserve rank order on Horizon's top ten stories, and it'll nearly do it on yours (except that it wants your 9th and 10th stories flipped). It really breaks down for the top stories on the site, though. So yes, it's not going to be a great way to predict whether your story is on the first page of top stories. And in fact, based on the results I'm getting from the top stories, I think I can pretty conclusively state that the Bayesian method is not how the site orders stories. To get the sort of results I'm seeing there, I think you'd have to weight the prior distribution so heavily that 0-downvote stories would essentially be locked out of the top few pages.

So the actual system is putting a substantially higher premium on number of votes than on distribution of votes—which sounds obvious, we all know that, but what I'm saying is that it's doing this more than it should if you wanted an admissible decision rule based purely on upvotes and downvotes. And I'm aware that the secret sauce is supposed to include other things like comments, views, times favorited, whatever. But I don't particularly see why it should, because again, you'd be adding a lot of computational complexity to this problem for almost no effect.

Really, the only decent arguments I can see against the Bayesian method are wanting to have something proprietary, or wanting to discourage people from trying to game the system by making the algorithm easy to figure out. But the first one doesn't seem compelling to me since, again, you're going to wind up with an inferior decision rule; and the second one doesn't seem compelling, since it should be obvious to anybody paying attention that the whole process is still driven by upvote/downvote ratio, and you'll always be able to game the system by hacking that. And the best way to get a decent upvote/downvote ratio, over the long haul, is always going to be just writing decent stories.

Cool! For a site where user rating plays such an important role in filtering, I find it bemusing that the calculations of heat and rank are kept so mysterious. Your explanation is very helpful in understanding the ranking system and having confidence in its results.

Lacking that, I have been using the (empirically effective) rule of thumb that a story is worth reading if upvotes >= 10% * views. :scootangel:
Notice HR2 scores very well on this metric. It just happens to be missing a chapter or two.

Isn't there some random element to it as well, seeing how the order changes anytime you reload?

2429103
Are you thinking of the "Popular Stories" sidebar on the front page? The order of that is 100% randomized. What I was talking about is the "Stories" list on the right-hand sidebar of an individual story listing or blog post, where that same author's other work shows up.

2428958
A) I can tell you for a fact that the only thing which matters is upvotes and downvotes for rating; absolutely nothing else whatsoever is taken into account.

B) You need somewhere in the realm of 240:0 to be the highest-rated story on the site, which isn't unreasonable - nothing really STAYS with that many upvotes and 0 downvotes for very long, so arguably it actually overweights having 0 downvotes, because most things with 0 downvotes have 0 downvotes because not many people looked at them. Nothing with more than 200 or so upvotes remains at 0 downvotes, IIRC.

I don't think there's any real issue with the rating system, honestly; it works as well as can be expected. Also, it could just be using some multiplier instead of doing some complicated calculation - I've noticed that there seems to be a sort of sweet spot at around 100:1 at about 900-1000 views - below that point you need more than 100 upvotes per downvote to hit the top, above that you need a bit less. As-is, getting lots of votes is rewarded, as is getting few downvotes. Seems reasonable enough.

2429198 Ahhhh, that makes more sense. xD

Clearly, I need to publish more things so that I can observe this in effect.

Confidence intervals sound like such a complicated notion for such a simple thing. A simple formula like score = upvotes / (downvotes + 2.5) is enough to mostly explain the relative ratings. A slightly better one is (upvotes ^ 1.2) / (downvotes + 2).

I made a spreadsheet with the top ten stories from 3 different authors. It turns out that the latter formula correctly predicts relative ratings with 100% success, and it certainly does so without using Bayesian shenanigans or Greek letters.

It’s editable by anyone; feel free to tinker with the formulas or add more data!
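For the curious, here's that second heuristic in PHP, spot-checked against the seven sidebar tallies from the post (my own quick check, separate from the spreadsheet):

// The empirical heuristic above: (upvotes ^ 1.2) / (downvotes + 2).
function heuristicScore($up, $down)
{
    return pow($up, 1.2) / ($down + 2);
}

// On the post's sidebar tallies this comes out strictly decreasing,
// matching the displayed order: 333/4 -> 177.3, 258/3 -> 156.7,
// 138/1 -> 123.2, 904/28 -> 117.6, 425/12 -> 101.9, 283/8 -> 87.5,
// 66/0 -> 76.3.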

2430017
I updated the spreadsheet with this new revelation.

What’s really disturbing is that, from the data we have, the random formula I pulled out of my plot is more accurate than the formula knighty said he was using. This seems to indicate that I made some mistake, but I can’t find it.

(Also, someone added the TopAllTime data, but I don’t think it is based on the same formula).

2430078
I've never quite managed to get that formula to work either and Knighty never replied confirming that it was the correct one. Maybe your formula is actually the correct one that is used today?

2430078
All ratings data is calculated in exactly the same way.

The formula is given in the Wikipedia article: https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval

2430082
2430078
2430017
2429952
2429198
2428527
So, I can now confirm that he uses the Wilson score, specifically the lower bound of a 95% confidence interval, to calculate rank order for ratings - it holds for at least the top 200 highest-rated stories on FIMFiction, and is thus probably good for everything on it.

For reference, he is using:

\[ \frac{\hat{p} + \dfrac{z^2}{2n} \pm z\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z^2}{4n^2}}}{1 + \dfrac{z^2}{n}} \]

Where z = 1.96, n is the total number of votes, and p̂ is upvotes/(upvotes+downvotes). He is taking the lower bound (that is to say, where that ± is actually a -), and simply ranking them in order from highest to lowest.
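One tidy consequence: for a story with zero downvotes, p̂ = 1, so the square-root term collapses to z²/(2n) and cancels the matching term in the numerator, leaving a lower bound of

\[ \frac{1}{1 + z^2/n} \]

For the post's 66:0 story, that's 1/(1 + 3.8416/66) ≈ 0.945, lower than every other story in the sidebar snapshot.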

2430371
Yeah, I thought it was something like that.

And let me reiterate. God, that is so painfully dumb.


ETA: Okay, I guess if computation is a complete non-issue, it's probably fine. But it's just so goddamn inefficient, for very marginal differences.

2430566
Your method honestly doesn't sound any less computationally expensive, really, and if they do it the way I suspect they do it (basically, the rating is calculated either whenever someone votes, or once every five minutes or whatever, and then stored as a variable) then the computational intensity of it is probably pretty irrelevant compared to sorting the stories.

2430566
2430737
I'm really not getting the "computationally expensive" bit. I don't see any iteration or other expensive operations. Unless I'm missing something, the time it would take for a CPU to compute that formula (probably nanoseconds) is going to be in every way trivial compared to the time it takes to fetch the input from the database (generally, some low number of milliseconds).

2431001
2430737
Okay, fair enough. Table access time is probably the biggest expense.

But given that my method is basically (Upvotes + 3) / (Upvotes + Downvotes + 6), it's hard for me to see anything as being a whole lot easier than that. And, again, admissible decision rule for a standard loss function.


2431046
You're focusing on a very narrow range of stories; remember, it needs to rank ALL stories, not just the top ones. If you have a story with, say, a 200:10 ratio, you're looking at that having a very different effect than something with a 60:1 ratio. And that's ignoring all the stories which have much larger percentages of downvotes.

2428958
Unless, of course, "the site" "wants" something other than the Bayesian distribution you're talking about.

(For example, I suspect num-views should be a factor as well -- e.g., if you've got a 200 up:1 down story at 400 views, that beats a 200 up:1 down story at 4,000 views.)

knighty
Site Owner

2430566 This is painfully late, but why are you insisting on the computational complexity of this and then having the gall to call me dumb for using it? You realise this is basically zero complexity? I calculate the value when you make a rating and it's cached in the database.

In regard to Fimfic, as people have said, it's indeed the Wilson score with a confidence rating of 95%. I am considering bumping this up to 97% because there are a few too many stories in the 200 range of likes getting into the top rated story list which I think could be pushed out, but it's kind of opinion at that point so I don't know.
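(For rough scale, since a zero-downvote story's lower bound is 1/(1 + z²/n): moving from 95% to 97% confidence raises z from 1.96 to about 2.17, so the upvote count needed to reach any given score grows by the factor 2.17²/1.96² ≈ 1.23. A story that needed roughly 240:0 for the top spot, per the earlier comment, would need roughly 294:0 afterward.)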

2525033
Well, on the gall side, it's probably from me being perpetually stressed out from my grad program and occasionally venting it inappropriately. I'm genuinely sorry about that. It's happened more times than I'd like to say lately, and to some people I really care about, and I'm really trying to fix that. So I really do apologize. I definitely feel like this is one of those points I've gotten sort of stridently dickish about at times over the last year.

As for the computational side—given that I've finally gotten a bit of a stress reprieve—I think I can agree that Wilson scoring can't be computationally too bad. Story voting happens a lot, but it happens on human timescales, so calculating and caching a more complicated value probably isn't going to happen often enough for there to be any serious demand on resources. And since its principal job seems to be creating an ordering for the top few hundred/thousand stories and creating such an ordering on genre-restricted subsets, it makes sense to use a tool that gets the results you want there, since nobody really cares about the validity of rank-order differences between the 20,000th most popular story on the site and the 21,000th.

That said, I do think that a standard Bayesian posterior estimator of the true proportion of upvotes on a story would still be simpler to execute, since it's just "( [upvotes]+a ) / ([total votes]+a+b )", and from a decision theory standpoint I do feel like it makes more sense to use a proper point estimate than to use one side on a confidence interval. And it's easy enough to adjust a and b as inputs to control how many low-vote stories make it up toward the top of the list. All this thus yields my aforementioned [stridency/dickishness].

But on the other hand, and this I haven't been good about admitting, Wilson scoring is doing some nice things, too. I'd say chief among these are that (1) you're happy with the results you're getting from it; and (2) it's the system that's organized voting thus far, meaning that a lot of the voting on top stories probably reflects readers' feelings about the rank-ordering given under that method and those voting patterns might look different if a different approach were used for ordering. And it's not like computing the scores on stories is at fault for the occasional site memory issues, when you stack it up against things like users with 40,000 comments in their user comments stack. I know this, so it really was just me being obstinate and obnoxious to be harping on that. So again, I'm sorry. The system we have is freaking fantastic, and much better than I'd expect to find anywhere else. And I appreciate that. I'm just sometimes a bit of a dick.

:facehoof:

In any case, having said all that, let me just say one other thing, that deserves a lot more airtime than me whining about particulars of decision theory and proportion estimators:

This website is amazing. It's beautiful to look at. It has fantastic functionality for reading and writing stories, and for making friends in the community of readers and writers. Bookshelves are awesome. I love this place, and I think it's a project you should be incredibly proud of. It just blows me away how nice this site is. I wish I were better about communicating that, in general. Because it's something that needs saying. I suspect you probably get a lot more flak from people like me about nit-picky issues like vote scoring than you get compliments on all the work you've done, and if so, that's just shameful. Whatever nitpicks I may have with this site, the fact of the matter is that it's an incredible piece of work. And it just keeps getting better.

2525033 2525324
I'd like to strongly agree with Bradel's last paragraph.

As for the question of 95% vs. 97% (which is less important, but I'm already here), I like the idea of 97 but I'm a little ambivalent. A change would reduce turbulence in the all-time top stories, which gives those rankings more meaning — but at the same time, that turbulence broadens people's exposure to stories, rather than pointing everyone at the same 10 ponyfics that everyone else didn't dislike enough to redthumb (since at the top levels that carries orders of magnitude more weight than a green). On a more micro level, like author sidebars, it would more strongly weight more-read stories to the top of the list, which makes a lot of sense.

knighty
Site Owner

2525820 Yeah, that's pretty much why I'm on the fence about it. There are good arguments on both sides.

knighty
Site Owner

2525820

function pnormaldist($qn)
{
    // Inverse of the standard normal CDF (the quantile, or "probit",
    // function), computed via a series approximation.
    // e.g. pnormaldist(0.975) ≈ 1.96.
    $b = array(
        1.570796288, 0.03706987906, -0.8364353589e-3,
        -0.2250947176e-3, 0.6841218299e-5, 0.5824238515e-5,
        -0.104527497e-5, 0.8360937017e-7, -0.3231081277e-8,
        0.3657763036e-10, 0.6936233982e-12);

    if ($qn < 0.0 || 1.0 < $qn)
        return 0.0; // out of range

    if ($qn == 0.5)
        return 0.0; // median of the standard normal

    $w1 = $qn;

    if ($qn > 0.5)
        $w1 = 1.0 - $w1; // exploit symmetry about 0.5

    $w3 = - log(4.0 * $w1 * (1.0 - $w1));
    $w1 = $b[0];

    for ($i = 1; $i <= 10; $i++)
        $w1 += $b[$i] * pow($w3, $i);

    if ($qn > 0.5)
        return sqrt($w1 * $w3);

    return - sqrt($w1 * $w3);
}

function Wilson( $up, $down )
{
    // Lower bound of the Wilson score interval at 95% confidence,
    // returned as a percentage.
    $confidence = 0.95;
    $z = pnormaldist(1 - (1 - $confidence) / 2); // two-sided: ≈ 1.96

    $num_ratings = $up + $down;
    $phat = $up * 1.0 / $num_ratings; // observed upvote proportion

    return ( 100 * ($phat + $z * $z / (2 * $num_ratings)
        - $z * sqrt(($phat * (1 - $phat) + ($z * $z) / (4 * $num_ratings)) / $num_ratings))
        / (1.0 + $z * $z / $num_ratings) );
}

For the record, this is the code
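(Two quick observations on the above, mine rather than knighty's: the function returns the bound as a percentage, so Wilson(333, 4) ≈ 96.99 and Wilson(66, 0) ≈ 94.50, matching the sidebar arithmetic in the post; and as written it divides by zero for a story with no votes at all, so presumably it's only ever invoked once a story has at least one rating.)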

2430371
(Bump for above)

2525820
2525910
For what it's worth, I find that the churn on the top stories keeps them fresh, and I'm okay with the idea that something which gets, say, 250 upvotes and 0 downvotes ends up near the top of the rated stories list. Generally speaking, stories like that don't stay there for very long unless they are very good, because being very highly rated tends to get more eyes on your story, and thus makes it more likely it will end up (eventually) eating a downvote and sinking down. While it may be true that the occasional "undeserving" story ends up highly rated from time to time, they make up only a small fraction of the highest-rated stories, and I, at least, end up peeking at them sometimes to see if they really are "deserving" or not.

I think seeing new faces in the top rated stories more frequently is worth the "cost" of the occasional relatively low vote-getting story ending up high on the list, and frankly, I see stories which are very highly rated which I don't consider to be all that great anyway, so I'm not super concerned that the odd random story ends up highly rated. Plus, having some new writer end up with their story being very highly rated is very exciting to them, I think - I've noticed a number of people showing great excitement when their stories ended up near the top of the list, and they weren't the "featured on X day" types, either. The idea that you can break into the top rated stories is a nice one, I think; the top viewed stories list is nearly impregnable, and thus much less interesting to watch and use.

I am very happy with the new bookshelves system, incidentally, as someone else noted. I'm glad that you keep trying to make the site a better, more user-friendly place.
