Bad Horse
Group Contributor

4501091 Yeah, I think so. I tried using just the lower half of the data to see what happened:

> v29 = read.table("votes-29.txt")
> f29=glm(c(t(v29[3])) ~ c(t(v29[1])), family=binomial(logit))
> f29

Call: glm(formula = c(t(v29[3])) ~ c(t(v29[1])), family = binomial(logit))
Coefficients:
(Intercept) c(t(v29[1]))
-0.912111 -0.007051

Degrees of Freedom: 28 Total (i.e. Null); 27 Residual
Null Deviance: 34.16
Residual Deviance: 34.16 AIC: 38.16
> q()
$ clisp
> (defun foo (intercept beta) (lambda (x) (exp (+ intercept (* beta x)))))
> (defun p (intercept beta votes) (let ((f (funcall (foo intercept beta) votes))) (/ f (+ 1 f)))) P
> (p -.65 -.055 10)
0.2314752 ;; just checking
> (p -.91 -.00705 10)
0.27279258
> (p -.91 -.00705 5)
0.27984107

A glance at the data shows that all this is telling us is that a bunch of stories with 9 votes and a bunch with 5 or 6 votes got into the finals.

RogerDodger
Group Admin

4499585
Could you explain how you'd go about solving this system? I don't have much experience with this class of problem.

It seems like you'd end up with inconsistent equations, e.g., S_a - S_b = c, S_a - S_b = d, c != d, so the system has no solution.

Bad Horse
Group Contributor

4501284

It seems like you'd end up with inconsistent equations, e.g., S_a - S_b = c, S_a - S_b = d, c != d, so the system has no solution.

That's right! When you're making a model, you always want to have many more datapoints than you have degrees of freedom in your model, because your model isn't an exact solution to a complete description of the world, but something that you want to generalize to data you haven't seen yet. Solving a system of equations finds a precise solution to a precisely-defined problem; creating a model looks for regularities in data.

The easiest way to make a model is like this:

1. Define the thing to be predicted as a function of other variables, including both the variables that will be given to you with each datapoint, and the variables you want to find: y = f(x1, x2, ...). In this case, you just need to know that the thing to be predicted is d (the difference in ranking in a slate), and other than the thing you're trying to predict, you only have the variables you want to find, S_a and S_b, so d = f(S_a, S_b). f is just a symbol so far.

2. Define a model that makes a prediction z for y, z = m(x1, x2, ...). Now you need to build an actual function. In this case, z = S_a - S_b.

3. Define an error measure as the sum of the squares of the differences between what your model predicts and what you observe for each datapoint. This gives you an error function E = sum over all i of [ m(x_i1, x_i2, ...) - f(x_i1, x_i2, ...) ] ^ 2 . (If you get data with each datapoint, this is where you fill in those values.)

4. Find the values for the variables you want to find that minimize the value of E. The general way to do this is to compute the partial derivative of E with respect to each of those variables, and then repeatedly make small changes to those variables one by one that always make E a little smaller. With this system, though, I think we can solve it exactly. We have

E = sum over all triplets (a, b, d) from our voting slates of this value V:
V = (S_a - (S_b + d))^2
  = S_a^2 - 2*S_a*(S_b+d) + (S_b + d)^2
  = S_a^2 - 2*S_a*S_b - 2*d*S_a + S_b^2 + 2*d*S_b + d^2
dV/dS_a = 2S_a - 2S_b -2d
dV/dS_b = 2S_b - 2S_a +2d

If we find a place where dV/dS_a = dV/dS_b = zero, that will be a maximum or minimum. (We're not going to find a maximum, and that's not obvious, but a maximum to an error function is a perverse kind of thing.) That's a set of equations that we do want to solve exactly, because it gives us the exact values for S_a and S_b that will minimize our error with the data we have. Kind of confusing there; we're solving an exact system of equations on one level to solve for parameters that will give us an approximate model on our original level.

This next step is a little weird and I'm a little sleepy, so I hope it's right. When I said
dV/dS_a = 2S_a - 2S_b -2d
the value of S_a has to be the same each time story #a is used, but the value of d is different for each triplet.
But remember E is the sum over all values V, and the derivative of a sum is the sum of its derivatives, so we can just say

dE/dS_a = sum_i = 1 to N over the N matchups of a versus b of [ 2S_a - 2S_b - 2d_i ] = 2NS_a - 2NS_b - 2*sum_i [ d_i ]
dE/dS_b = 2NS_b - 2NS_a +2*sum_i [ d_i ]

So now we can take all our triplets, add up sum_i [ d_i ] to get D_ab, then set both partial derivatives to zero:
NS_a - NS_b - D_ab = 0
NS_b - NS_a + D_ab = 0
NS_a = NS_b + D_ab
NS_b - (NS_b + D_ab) + D_ab = 0
0 = 0

Dammit. That wasn't supposed to happen. 4501070, tag, you're it!

Bradel
Group Contributor

4501629
Gimme a minute here. I'm trying to figure out your notation, and then how to approach the problem. I see what you guys are doing, and this has a nice, statistical closed-form answer I think. I just have to figure out what the best way to deal with this is.

Axis of Rotation
Group Contributor

Man guys, who knew that a simple writing contest would get so complicated? :derpytongue2:

So it seems that having participation drop off in the last week feels wrong, possibly not voting on the ultimate winner feels wrong, possibly not being able to vote on hotly discussed stories feels wrong, and not having a finalist round feels wrong to some and right to others. We all feel our sense of community is important, along with our ability to properly express our feelings in a vote, as well as just having fun, whether that's via discussing and reviewing and sharing thoughts or reading good stories (I've only read half of one fic so far, but was anxiously curious if any Celestia focused stories were submitted :P ).

No one really knows the answer here, but I believe trial and error will most effectively pave the way. If something doesn't really work, we'll just toss it out. As such, I'm fine with initiating some of these changes, because there is dissatisfaction, and the fact that I might disagree with it doesn't make it wrong and me right. People are unhappy, so let's try to do something about it, within reason. It's not like we can never return to any one configuration.

Lurker here who is interested in the writeoff but has not participated (yet). I am de-lurking to make a suggestion on how to calculate scores--namely, using a Simple Rating System (SRS). For more information on how the rating system works, see:
http://www.pro-football-reference.com/blog/?p=37
http://www.pro-football-reference.com/blog/?p=39
(apologies to those who may not follow the talk about American football on those pages)

SRS offers one main advantage over what seems to be the current scoring system in that it accounts for the relative strength of stories on one's ballot. For example, in the current scoring system, a top placing story in a ballot gets the same score regardless of whether it is the "cream of the crop" or the "cream of the crap." In contrast, SRS looks at how each of the stories on a ballot score on other peoples' ballots to see whether each individual ballot is a relatively strong or weak ballot, and metes out scores accordingly.

It should be relatively simple to implement. Let's say a ballot contains seven stories, and the user ranks those stories from best to worst (let's call the best story A and the worst story G). This produces 21 "games" between the stories (i.e. A beats B, A beats C, ..., B beats C, B beats D, ..., F beats G). These wins and losses go into a database with the results from others' ballots, and you could see, for example, that eight people thought A was better than B versus only two people who thought B was better than A. You could use this result to define a "game score" for the A vs B matchup (e.g. say, A is 6 "points" better than B). SRS would then calculate a rating for each story based on these "game scores." Conceptually, the system tries to rate the stories so that the game scores are consistent with the ratings. If A beat B by 6 points and B beat C by 3 points, then A should be rated 6 points higher than B and 9 points higher than C. Of course, you might also have the case where A played C and beat C by only 5 points. SRS generates its ratings by minimizing the inconsistencies between the game scores and ratings, and it seems to be able to do so with fairly simple matrix algebra (see the first link for a simple description of the implementation) (alternatively, it would also probably be easy enough to code a Monte Carlo method to determine ratings as well).
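In case a concrete example helps, here's a rough Python sketch of the iterative flavor of SRS (each rating is that story's average game score plus the average rating of its opponents, repeated until the numbers settle). The game scores below are made up purely for illustration, not taken from any writeoff:

from collections import defaultdict

# Made-up example data: (winner, loser, margin) "game scores" from ballots.
games = [('A', 'B', 6), ('B', 'C', 3), ('A', 'C', 5)]

margins = defaultdict(list)    # signed margins for each story
opponents = defaultdict(list)  # who each story was compared against

for winner, loser, margin in games:
    margins[winner].append(margin)
    opponents[winner].append(loser)
    margins[loser].append(-margin)
    opponents[loser].append(winner)

rating = {s: 0.0 for s in margins}

# SRS iteration: rating = average margin + average opponent rating.
for _ in range(100):
    rating = {s: sum(margins[s]) / len(margins[s])
                 + sum(rating[o] for o in opponents[s]) / len(opponents[s])
              for s in margins}

print(sorted(rating.items(), key=lambda kv: -kv[1]))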

Of course, as I am posting this late at night after a few drinks, I cannot guarantee that I have not overlooked an obvious flaw. This may also be equivalent to the solution proposed by 4501629

Bradel
Group Contributor

4501284
4501629

Okay, let's start out by defining some notation. This is going to get a bit ugly.

Let S_{i,k} be the score for the i'th story in the k'th person's voting records. We'll keep these subscripts throughout—i is always story and k is always voter (or slate).

Let d_{i,j,k} be the difference in score between the i'th and j'th story in the k'th slate.

Now, we've got two options. We could either try to obtain rankings based on scores, or based on pairwise differences. The first problem is much easier. Unfortunately, I'm pretty sure it's also not what we want, because it'll only give us the same ordering asymptotically, assuming good randomization of stories in slates. Without knowing the randomization is good, we have the good slate / bad slate problem, where if a story happens to show up in weak slates for the most part, it may get judged to be a very good story. Since we're dealing with 50-100 stories right now, and some of them are only getting 5-6 votes, this is a significant problem. On the other hand, with expanding slates, this may be less of an issue and it may be possible to switch to the computationally way, way easier score prediction problem—because as slate size increases, it's going to become much harder to have a "good" or "bad" overall slate. That's your law of large numbers kicking in—as the sample size gets larger, the sample mean story quality will approach the population mean story quality, i.e. the mean story quality for a particular writeoff.

Score prediction is easy. Every story gets k=1,...,K_i scores depending on how many slates it appears in. Take the average of these scores; that's our best guess for the true story score value. Rank order the best guesses of story scores. I'm guessing that's what we're doing right now.

Bad Horse is trying to do some least squares error minimization on differences. That's good stuff—my dad quite literally wrote the book on the linear algebra approach to solving that problem. Now, the two questions in my mind are, how would we do this and is this the right approach? Let's start with the second, because it could invalidate the first.

So what's our setup here? This took a lot of thinking about, before I could figure it out. Our data here (our predictor variable, if you will) are our differences. What we want to estimate (our response variable) are rankings. I think this'll work, but what we're trying to do is obtain a list of idealized S_{i}'s (scores for each story) that minimize the squared error Bad Horse was talking about. For readability, since I can't use math symbols here, let's do this in code.

To read this code, note that story_n is the total number of stories in a writeoff, comp_n is a story_n x story_n square matrix containing the number of comparisons between each pair of stories in the dataset (NB: the diagonal isn't being used, and can contain any values—for data storage simplicity, it might be a good place to store the predicted scores for each story under the least squares optimization), and d_{i,j,k} = -d_{j,i,k} for all i, j, and k.

sq_err_total = 0
FOR ( i FROM 1 TO (story_n - 1) ){
    FOR ( j FROM (i + 1) TO story_n ){
        FOR ( k FROM 1 TO comp_n[i,j] ){
            this_sq_err = ( S_i - (S_j + d_{i,j,k}) )^2
            sq_err_total = sq_err_total + this_sq_err
        }
    }
}

What we're trying to do is pick a vector of S_i's to minimize sq_err_total. You can do this the way Bad Horse was doing it, by taking a system of derivatives with respect to each S_i and solving for the story_n-dimensional minimum. If we consider what this'll look like, for each S_i we're going to get something like the following:

d(sq_err_total) / d(S_i) = sum_{j != i} sum_{k=1}^{comp_n[i,j]} ( 2*S_i - 2*( S_j + d_{i,j,k} ) )

And if we set that equal to zero, it's pretty straightforward to see that our estimate of S_i given all the other S_j's is:

S_i_hat = ( sum_{j != i} sum_{k=1}^{comp_n[i,j]} ( S_j + d_{i,j,k} ) ) / ( sum_{j != i} comp_n[i,j] )

That gives us a system of story_n equations we need to solve simultaneously, but the solution to that set of equations will be our best estimates of the story scores based on the difference data.
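If it helps to see that concretely, here's a rough Python sketch (not anything the site actually runs) of just iterating those S_i_hat equations to a fixed point. The d-values below are invented for illustration:

from collections import defaultdict

# diffs[(i, j)] holds the observed d_{i,j,k} values; d_{i,j,k} = -d_{j,i,k}.
# The data here is made up purely for illustration.
diffs = defaultdict(list)
for i, j, d in [('A', 'B', 1), ('A', 'B', 2), ('B', 'C', 1), ('A', 'C', 3)]:
    diffs[(i, j)].append(d)
    diffs[(j, i)].append(-d)

stories = {s for pair in diffs for s in pair}
S = {s: 0.0 for s in stories}

# Repeatedly set each S_i to its S_i_hat given the current S_j's.
for _ in range(200):
    for i in stories:
        terms = [S[j] + d for j in stories if j != i for d in diffs[(i, j)]]
        if terms:
            S[i] = sum(terms) / len(terms)

# Note the scores only come out determined up to an additive constant,
# so it's the differences (and hence the ordering) that carry the meaning.
print(sorted(S.items(), key=lambda kv: -kv[1]))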


Does that help? Is there anything I can clear up in that?

Bradel
Group Contributor

4501754
I took a quick browse, and I'm pretty sure they're doing the same thing I just laid out, basically—which is the regression model approach (and probably the only reasonable way to deal with this sort of problem, really). The d-scores (steps one story falls above or below another) are serving the same role as the point differentials in the SRS model.

Again, quick browse, but I'm pretty sure that's the case. They're gussying it up a little, though. All we really need, I think, are relative rankings—and I'm pretty sure that means we can solve the system of equations for the S_i's, stop there, and we've got everything we want. We don't have to worry about predicting future matchups. We could, but all the information we actually care about is contained in the S_i estimates.

Bradel
Group Contributor

4501788
I'm thinking about the meaning of the d_{i,j,k}'s under that model, and I intuitively set them up the way I was describing them before—as being based off a fixed-scale system, like a +10 to -10 rating (where, like I mentioned before, I think it makes sense to seed an invisible +10 and -10 into each slate to adjust for small slate sizes). Let me explain.

If we take the d_{i,j,k}'s to be number of steps different—which, yes, is what I said above—it's going to cause some funkiness with large ballots because you're going to see huge separations between stories on those. Bad Horse seemed to like this idea, but I'm pretty sure I'm going to disagree with him on that. I think doing it that way would basically wash out the contribution of all small slates and make only large slates matter, effectively creating a voting arms race where your vote only really counts if you vote on almost everything.

Putting it on a smooth, fixed scale (or asymptotically bounded, as above) means that we're predicting an observed difference between two stories on a ballot with two stories should be about the same as an observed difference of about seven stories on a slate of 20. I think that's much more reasonable than to say the 20-story slate gives a 7 and the two-story slate gives a 1. Part of that is because of the number size issue. Part of that is because—and this is pure intuition, I haven't thought this out—I'm suspicious that doing the "number of stories separating places" approach is going to mean that we can't meaningfully combine data across slates because the numbers we're looking at are so fantastically different. Consider a pair of stories that only appear together in small slates. Even if there's a big difference between story quality, we'll observe a very small difference in the d_scores. That could screw up the interrelationships with other story ranks, because now the equations would be saying stories close to A should also be close to B, since A was close to B in all our observed comparisons (since they all happened in small slates).

We can potentially do some reweighting in terms of how much information certain slates should contribute, I think, but I have a nagging suspicion that the d-scores need to mean the same thing for every slate, or the system will have some very weird issues.

RogerDodger
Group Admin

4501857
I don't think a fixed (or even asymptotically bound) scale would work. Consider:

Story A and B have "correct" ranks of 1st and 2nd respectively. On slate 1 they are the only two stories and ordered (A, B). On slate 2 they are two of twenty other stories and ordered (A, B, ...). Both slates correctly say A>B, but with a fixed scale, slate 2 says A>>B, even though the "correct" gap is quite small. This slate gets us further from the correct answer rather than closer, despite having no error.

The problem with slate 1 is that there's simply too little information. It only says one thing. We shouldn't be inferring anything else from it.

Instead we could simply say that all d_{i,j,k}'s are either 1 or -1 (or 1 or 0). Intuitively this seems problematic for the following case:

Story A and B have "correct" ranks of 1st and last respectively. On slate 1 they are the only two stories and ordered (A,B). On slate 2 they are two of twenty other stories and ordered (A, ..., B). Both slates only say A>B. It seems like slate 2 should say A>>B.

However, slate 2 also says every other story in it is > B, so B is still losing a lot more points from slate 2 than slate 1, which is appropriate.

This ends up with each slate of n stories giving T(n) = n(n-1)/2 bits of information, which seems correct since that's how many pairwise comparisons are involved.

RogerDodger
Group Admin

4501754
This sounds the same as what Bradel is explaining, though (I think) it was a bit easier to understand.

He says that the systems in the respective articles end up at the same place, though doesn't prove this. Is this simply observed by doing both calculations? Because the SRS calculation seems easier to grok, though that may just be my lack of stats skills showing.

Bradel
Group Contributor

4501912
I think we may be meaning different things by "fixed scale". I was using it to mean something like what it sounds like we're doing now, but with hard upper and lower limits (i.e. "fixing" the bounds of the scale). So the way I'm envisioning it, we'd have a 20-point scale (or whatever, really) going from +10 to -10. Based on a given voting slate, each story would be assigned a score relative to that slate, with stories being equally spaced in the fixed scoring range. Removing my bit about rescaling with invisible edge votes, that would mean:

With a 2-story slate, the votes would be {10, -10}
With a 3-story slate, the votes would be {10, 0, -10}
With a 4-story slate, the votes would be {10, 3.33, -3.33, -10}
With a 5-story slate, the votes would be {10, 5, 0, -5, -10}
With a 6-story slate, the votes would be {10, 6, 2, -2, -6, -10}
With a 7-story slate, the votes would be {10, 6.67, 3.33, 0, -3.33, -6.67, -10}
...

(My rescaling idea involves just pretending a 2-story slate is a 4-story slate and ignoring the 10's, or a 3-story is a 5-story and ignoring the 10's.) (I'm starting to think this adjustment factor is a bad idea, because of the number of "votes" larger slates get by having many more 1-to-1 comparisons. It would probably make more sense to give scores exactly as stated above.)
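(To be concrete, here's a little Python sketch of how those evenly-spaced fixed-scale scores could be generated for any slate size; the function name is just mine for illustration.)

def fixed_scale_scores(n, top=10.0, bottom=-10.0):
    # Evenly spaced scores from +10 down to -10 for a slate of n stories.
    if n < 2:
        return [0.0] * n
    step = (top - bottom) / (n - 1)
    return [top - i * step for i in range(n)]

# fixed_scale_scores(5) -> [10.0, 5.0, 0.0, -5.0, -10.0]
# d_{i,j} on a slate is then just scores[position of i] - scores[position of j]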

I'm not quite sure what you were talking about in the first example, because the way I'm thinking of this, your example would have A>>B on the first slate vs. A>B on the second slate. "Number of ranks different" would have d_{A,B} = 1 on both slates. The rescaling I'm thinking of would have d_{A,B,1} = 6.66 and d_{A,B,2} roughly equal to 1.

Slate 1 isn't going to contribute any information to the problem, except through the estimation of the score for the two stories on that slate. But since all of the stories are interrelated between the various slates, that information is going to get combined with other slates to estimate the scores for those stories, and the scores for those stories are going to play a role in estimating the scores for every story that's been compared to them, when the equations get solved simultaneously.

The thing that I'm worried about in my comment above, though, has to do with disagreements in how people rate stories. If we had perfect interrater reliability, then I'm pretty sure the way d's get defined wouldn't matter, since we'd always have a strict ordering. The troublesome bit is figuring out how to balance things so that voters' opinions are fairly comparable, even when they disagree. My intuition says the "Number of ranks different" method would give a lot more weight to the opinion of someone who had read a large slate of stories, because their d-values can become much larger than the d-values for someone with a smaller slate. That's the source of my comment about an arms race of story voting. I think giving more weight to the opinions of people who've read more stories is good, but I don't think it should necessarily wash out the contribution of people voting on smaller slates. If anything, the more I think about it, smaller slates may need higher weighting, because each slate is going to get counted in this model c(c-1)/2 times, where c is the number of stories in that slate.

Starting to drift off here, so I may have to pick this up tomorrow morning.

RogerDodger
Group Admin

4501932
What if slate 1 is ordered (B, A) (a reasonable mistake, since the two stories are of close quality)? Then d_{B,A,1} = 6.66, which will affect the final ranking more than d_{A,B,2} = 1.

Better way to phrase this: say the "correct" score of A and B are unknown and we're trying to calculate it. (What the system is actually doing.) Given slate 1 (B, A) and slate 2 (A, B, ...), isn't the correct resolution that A=B?

Titanium Dragon
Group Contributor

4500530

>> Cold in Gardez
I've added the prelim score numbers to the results page. They do match up pretty well with the public score numbers.

This is interesting to see, but the present display method is hard to read. I think adding a "Prelim" before the prelim score or a "final" before the final score would help.

As far as the ordering goes: they don't actually appear to be that good at predicting it. The top two scores don't even always both end up in the top 3 at the end (though the top score appears to in our limited sample size), and stories sometimes go way off from their order. For instance, The Gathering got the silver in I Regret Nothing, but wasn't even in the top 10 by prelim score. Sharpspark's Japanese-titled story only barely squeaked into the finals (it actually appears to have a lower prelim score than the top two stories that didn't make the finals?) but got 16th place (pretty middling). Octavia the Rebel had a pretty high prelim score but did quite poorly in the finals, and Gossamer fared only a bit better. Cold in Gardez's stories in The Best Medicine are also significant outliers.

>> Titanium Dragon
The process would more go along the lines of, you first rank the initial (say) 10 stories you're given (easy). Then you request one more story and slot it in where it fits (easy). Repeat until alcoholism.
I don't expect most people will be ranking 30+ stories anyway.

I don't expect most people will either, but I suspect that the people who fill out very large numbers of slates probably contribute as many data points as everyone else put together, despite being an extreme minority. Closing Time had 36 stories and 76 slates; if you did every story, you had what, 5 slates? If 7 people did 5 slates, they'd make up half of the data points. 7 people reviewed all the stories, suggesting that 7 people probably did fill out 5 slates each.

RogerDodger
Group Admin

4501961
Pearson correlations between prelim and public scores in I Regret Nothing and Great Expectations are 0.7633 and 0.6949, respectively.

This is pretty good considering the number of already identified issues with the current voting systems.

If your argument is that "they are not correlated enough, and therefore two-round voting is a good idea", you should consider that this same conclusion is possible when the finals round results are incorrect. So while it's interesting, I don't think this information is useful.

RogerDodger
Group Admin

4501956
Continuing this line of thinking:

If slates 1 and 2 are (B, A) and (A, C, B, ...), does A=B? I think yes. This changes as soon as we get more information about C.

Titanium Dragon
Group Contributor

4501970

If your argument is that "they are not correlated enough, and therefore two-round voting is a good idea", you should consider that this same conclusion is possible when the finals round results are incorrect. So while it's interesting, I don't think this information is useful.

I would assume that the finals round voting is likely to be more accurate than the first round of voting because it is heavily slanted towards people who made it to the finals. People who made it to the finals are the best writers and therefore are most likely to be the best judges of writing, and thus there is a higher signal-to-noise ratio. This might also explain why some of the stories by more sophisticated writers (Cold in Gardez, Bad Horse) do better in the finals than they do in the preliminary rounds.

RogerDodger
Group Admin

Stories by more sophisticated writers (Cold in Gardez, Bad Horse) do better in the finals than they do in the preliminary rounds.

Aside from this being demonstrably false, you're making some pretty questionable assumptions there.

Titanium Dragon
Group Contributor

4502038
We've got a small sample size, so I'm not sure if we can really demonstrate anything very convincingly.

I Regret Nothing: CiG's story won flat out in both metrics. Bad Horse's didn't make the top 10 in the prelims but got 2nd in the finals.

Great Expectations: Fits pretty well overall, other than my own story, which tied with Horizon's in the prelims but got drubbed in the finals, and Pros and Cons, which only squeaked into the finals and ended up in the middle.

The Best Medicine: Horizon clearly stomped everyone there. Some of CiG's stories did very well in both, while others (It's A Fine Line, If You Can't Cry) only barely made the finals but did quite well.

Closing Time: Mostly fit pretty well. Cold Hooves underperformed and Apple Knots overperformed a bit, but that may not be significant.

I also noted "some of the stories". It obviously wasn't all of them.

Of course, I may be seeing patterns where none exist. The sample size is, again, very small, and I'm not sure about the variance.

Caliaponia
Group Contributor

4498143

I like this idea, and would be interested in seeing it tried out.

Not duplicating voting effort and maximizing reading and discussion for all entries both seem like worthwhile goals. I also like that the idea would result in a shorter voting period, similar to what we just had. More compact events would open up the possibility of holding other events on the other half of the month (such as some of the alternate event ideas that have been bandied about).

Regarding the actual methods of improving the voting algorithm, I'm afraid my statistics isn't up to parsing the entirety of the approaches discussed. From what I've understood, though, I thought I'd toss out a couple ideas as far as the voter interface goes.

Ranking large or expandable ballots sounds fine, but in regards to the issue raised about the difficulty of ranking a whole ton of stories, it may make sense to give the reviewer the option to rank two (or more) fics as equal.

There was also a request for reviewers to be able to choose to rank an entry of their choice. That's not a bad idea (it would increase rankings), but maybe to even things out, in such a case they would also be requested to rank a suggested entry.

Also, regarding voting for one's own story, I think you should never vote / rank your own story, but to preserve anonymity, the system should auto-suggest a +1 (with your own being the +1, if it would've otherwise been on the ballot).

The one thing I would miss from the prelim round is having a nice-but-lower-ish bar to aim for, since hitting the top 3 is one heck of a tall order, with some of the talent we have around here. Maybe that could be mitigated with some of the extra badges that have been discussed.

4501915
4501815
4501629

Yes, it looks like SRS is equivalent to the error minimization method you all have proposed. I'm not sure of the proof showing the equivalence of the two methods described by the two football-reference pages. However, I actually think the error minimization/SRS method is not the best way to go forward. Namely, it relies on using some point difference (Bradel's d(i,j) scores) between the stories. In rank ordering the stories, however, the votes do not give a good indication of the actual point differential between stories. For example, a points-based ranking system would assume the same point difference between the 1st and 2nd stories in the list as between the 2nd and 3rd stories in the list, whereas it could be the case that #1 and #2 were close and way better than #3, or that #1 was way far ahead of #2 and #3. Therefore, any rating system should take into account the ordering of stories only, but not make assumptions about the point difference (e.g. if story A is ranked higher than story B, story A should not receive an additional bonus relative to story B if story C is rated between stories A and B).

With this in mind, I propose using a Maximum Likelihood rating system. In this system, the rating of story A (R_A) is defined such that the probability of a writeoff participant voting for story A over story B is given by the expression R_A / (R_A + R_B).

After users have submitted their story rankings, you would first look at the fraction of people who thought story n was better than story m for each pair of stories (n,m) in the contest. For example, let's have the following votes: A 8 - B 2, A 9 - C 1, B 5 - C 5 (i.e. 8 of 10 voters thought A was better than B, etc.). We define the likelihood of the observed voting results to be (for simplicity I'll write R_A as A, R_B as B, etc):
P = (A/(A+B))^8 * (B/(A+B))^2 * (A/(A+C))^9 * (C/(A+C)) * (B/(B+C))^5 * (C/(B+C))^5
and choose the ratings of stories A, B and C that maximize P. (the formula ignores combinatorics, though I don't know if that would have any big effect).

Maximizing P is equivalent to maximizing ln(P), so to calculate the (A,B,C) ratings, we need to calculate the partial derivatives of ln(P) with respect to A, B and C:
ln(P) = 8 ln(A) - 8 ln(A+B) + 2 ln(B) - 2 ln(A+B) +...
ln(P) = 17 ln(A) + 7 ln(B) + 6 ln(C) - 10 (ln(A+B) + ln(A+C) + ln(B+C))
d( ln(P) )/dA = 17/A - 10/(A+B) - 10/(A+C)
A = 17 / (10/(A+B) + 10/(A+C))
similarly
B = 7 / (10/(A+B) + 10/(B+C))
C = 6 / (10/(A+C) + 10/(B+C))
We find A,B and C by initializing A,B, and C at some value (e.g. A = B = C = 1), and iterating the above three equations.
newA = 17 / (10/(1+1) + 10/(1+1)) = 1.7
newB = 7 / (10/(1+1) + 10/(1+1)) = 0.7
newC = 6 / (10/(1+1) + 10/(1+1)) = 0.6
and repeating until the values converge.

For a large number of stories, the general formula for iteration is:
N = (# of times story N has been rated higher than another story) / sum over all M {(number of times story N has been compared to story M) / (N+M)}
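For anyone who wants to experiment, here's a rough Python sketch of that iteration. The slates below are made up purely for illustration, and it has the known caveat that a story with zero "victories" needs special handling, as the links below discuss:

from collections import defaultdict
from itertools import combinations

# Made-up example slates, each ordered from best to worst.
slates = [['A', 'B', 'C'], ['B', 'A'], ['C', 'A']]

wins = defaultdict(int)    # times a story was ranked above some other story
comps = defaultdict(int)   # comps[(n, m)] = times n and m appeared on a slate together

for slate in slates:
    for above, below in combinations(slate, 2):
        wins[above] += 1
        comps[(above, below)] += 1
        comps[(below, above)] += 1

stories = {s for slate in slates for s in slate}
rating = {s: 1.0 for s in stories}

# Iterate N = wins(N) / sum over M of [ comparisons(N, M) / (N + M) ] until it converges.
for _ in range(100):
    rating = {n: wins[n] / sum(comps[(n, m)] / (rating[n] + rating[m])
                               for m in stories if m != n and comps[(n, m)])
              for n in stories}

print(sorted(rating.items(), key=lambda kv: -kv[1]))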

Here's a page containing code for a modified version of the ML method (modified to handle "undefeated" stories or stories with no "victories") http://www.davemease.com/football/ and here is a link to a paper published in the American Statistician on the method http://www.davemease.com/papers/football.pdf

JaketheGinger
Group Contributor

A quick something to add here.

Not everyone in the group is a superb mathematician. I'm sure I'm not the only one who cannot even begin to comprehend these statistics and this weird-looking algebra.

As such I can't really give any opinion or comment on changes being made, since I have no clue what's going on.

Bad Horse
Group Contributor

4501857

Putting it on a smooth, fixed scale (or asymptotically bounded, as above) means that we're predicting an observed difference between two stories on a ballot with two stories should be about the same as an observed difference of about seven stories on a slate of 20. I think that's much more reasonable than to say the 20-story slate gives a 7 and the two-story slate gives a 1.

Oh, yeah, I just wasn't willing to do the math. How did you do it, anyhow? Are you using a binomial distribution?

Bad Horse
Group Contributor

4502052 When you say something "barely made the finals", how do you know?

Bradel
Group Contributor

4502084
My intuition on the ML system you're mentioning is that, especially if we're using the relative ranking voting method, we're probably violating one of the key assumptions for doing it that way. This looks similar to the core of a Bayesian approach, which would also depend on the likelihood equation—except that I don't think p is the likelihood equation. You've got it set up as a product of a bunch of Beta/Binomial kernels[1], where each comparison group acts as a trial and you combine across trials. But the joint distribution of all those trials (which is mathematically equivalent to the likelihood equation) probably can't be a simple product like that, because to get the joint distribution looking like that, you'd need all the trials to be independent of each other—e.g. I'm pretty sure knowing the probability A beats B and the probability B beats C can't tell you anything about the probability A beats C, for that equation to make sense[2].

I'm not sure what happens with this thing asymptotically. There's a decent chance that the values would converge to the right place asymptotically as the number of stories increased, as long as the number of stories individuals vote on didn't increase proportionally. Basically, the closer the system gets to individual one-on-one votes, the more correct that model is. But with slates of votes on multiple stories, especially when those slates include any significant fraction of the total (finite) population of stories, that independence assumption is going to break down more and more. I'm not really sure what happens at that point, but it's definitely going to screw with the results on some scale.


[1] The kernels should be all you need here, because ML is going to get to toss out all the combinatoric terms. When you log the likelihood to make it easier to differentiate, the combinatoric terms will become additive constants, which obviously get knocked out when you start taking derivatives.

[2] Digging in a little deeper, what needs to be independent here are the data, and the data are the number of successes on each trial type. So the question is: assuming A, B, and C appear together on 5 slates, if we know A beat B in 3/5 slates and B beat C in 3/5 slates, does that tell us anything about how often A beat C? Obviously it does; A must have beaten C on at least one slate, given the above information. For looking at things like actual one-on-one matchups, you should be fine satisfying the independence assumption, but when you're dealing with multiple "matches" coming from one slate of votes like this, you lose independence and the likelihood equation is going to change.

4502295
That's a very good point about the interdependence of rankings from voting slates. I guess extending the ML method from one-on-one matchups to writeoff-based voting slates is not as trivial as I thought. The overall algorithm may still work, but it's probably worth playing around with some simulations to make sure nothing screwy happens because of the interdependencies.

Bad Horse
Group Contributor

There is also something to be said for just stupidly hacking out code that naively does what you want to do:

#!/usr/bin/perl -w
# Compute scores for things based on relative rankings
use strict;
use warnings;
use List::Util 'shuffle';

my @Votes = (
    # Each array is one person's votes, from best at left to worst at right
    [ 'a', 'g', 'd' ],
    [ 'c', 'b', 'f', 'd' ],
    [ 'b', 'd', 'f', 'e' ],
    [ 'b', 'a', 'c', 'f' ],
);

my %Score;    # $Score{a} = score for 'a'; need not be unique to 'a'

# Initialize each score to zero
# (Allow scores to go to infinity in either direction)
# Note that the only list of stories that exist is those voted on
foreach my $vAR (@Votes) {
    foreach my $story (@$vAR) { $Score{$story} = 0 }
}

my $N = scalar(keys(%Score));    # number of stories
my $step = .1;

# Repeatedly go thru all the slates
for my $loop ( 0 .. 100 ) {
    foreach my $vAR (&List::Util::shuffle(@Votes)) {
        my $slate = scalar(@$vAR);    # slate size: # stories voted on
        # Take each adjacent pair in the ranking
        for my $i ( 0 .. $slate-2 ) {
            my $a = $vAR->[$i];
            my $b = $vAR->[$i+1] // die "out of bounds";
            # Adjust their scores to conform more to that ranking
            my $diff = $Score{$a} - $Score{$b};
            # If we want the difference to be $N / $slate :
            #my $goal = $N / $slate;
            # Rather, as argued by Roger, aim for a constant difference
            my $goal = 1;
            my $change = $step * ($goal - $diff);
            $Score{$a} += $change;
            $Score{$b} -= $change;
        }
    }
    $step *= .96;
}

# Print out scores
my @stories = sort { $Score{$b} <=> $Score{$a} } keys(%Score);
for my $s (@stories) { print "$s: $Score{$s}\n"; }

$ perl rank.pl
a: 0.793231227723605
c: 0.661997498497781
b: 0.626570935314012
g: 0.123301082274607
f: -0.402597061452645
d: -0.515990006056656
e: -1.2865136763007

Bradel
Group Contributor

4501956
4502268
Okay, let me revisit the difference metric issue, now that I'm more awake. I think the thing Bad Horse just quoted may have been me being stupid, glancing at it.

So, following on the Least Squares Optimization method, I'm identifying three scoring methods people have talked about. Let's consider all three:

Rank Difference (RD): d_{i,j,k} = the number of placement differences between two stories, e.g. if a rated slate reads {A,B,C,D,E} then d_{A,B} = 1 and d_{A,E} = 4.

Fixed Scoring (FS): d_{i,j,k} = the "amount" of difference between two stories relative to a fixed +10 to -10 system, where stories in the slate are positioned to maximize score difference over that fixed range. So for our {A,B,C,D,E} slate, d_{A,B} = 5 and d_{A,E} = 20. This is proportional to the RD scoring system when slate sizes are equal, but when slate sizes are different, FS will create smaller d's relative to RD.

Test Scoring (TS): d_{i,j,k} = 1 if story i beat story j, or -1 if story j beat story i. This system will ignore the potential extra information from knowing how many intervening stories a slate had between the stories being considered. Following our example with {A,B,C,D,E}, d_{A,B} = 1 and d_{A,E} = 1.
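(For concreteness, here's a quick Python sketch, purely illustrative, of how the three d-score definitions fall out of a single rated slate:)

def d_scores(slate):
    # slate is ordered best to worst; returns d_{i,j} under RD, FS, and TS.
    n = len(slate)
    fs = [0.0] * n if n < 2 else [10.0 - k * 20.0 / (n - 1) for k in range(n)]
    out = {}
    for a in range(n):
        for b in range(a + 1, n):
            i, j = slate[a], slate[b]
            out[(i, j, 'RD')] = b - a           # number of placement differences
            out[(i, j, 'FS')] = fs[a] - fs[b]   # difference on the fixed +10..-10 scale
            out[(i, j, 'TS')] = 1               # i beat j
    return out

# e.g. on the slate ['A', 'B', 'C', 'D', 'E']:
#   d_{A,B}: RD = 1, FS = 5.0, TS = 1;  d_{A,E}: RD = 4, FS = 20.0, TS = 1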

Also yes, I totally picked the names of the scoring systems so that I could spend the rest of this post thinking of them as Rainbow Dash, Fluttershy, and Twilight Sparkle.

So to see how these work, let's consider a simple case. It's not going to look very simple, but I think it's the smallest we can get this and still get meaningful information. Imagine a writeoff has four stories, and four voters—the authors of those four stories. Each author turns in a slate of votes, and they look like this:

A: {B,C}
B: {A,D}
C: {A,D,B}
D: {B,C,A}

Here's what the d-scores will look like under the three systems:
Matchup    RD    FS    TS
Slate 1
  B vs C    1    20     1
Slate 2
  A vs D    1    20     1
Slate 3
  A vs B    2    20     1
  A vs D    1    10     1
  B vs D   -1   -10    -1
Slate 4
  A vs B   -2   -20    -1
  A vs C   -1   -10    -1
  B vs C    1    10     1

Now we can look at our system of equations. Under FS, we've got:

S_A = ( ( 2S_B + (20-20) ) + ( S_C + (-10) ) + ( 2S_D + (20 + 10) ) ) / 5
S_B = ( ( 2S_A + (20-20) ) + ( 2S_C + (20+10) ) + ( S_D + (-10) ) ) / 5
S_C = ( ( S_A + (-10) ) + ( 2S_B + (-20-10) ) ) / 3
S_D = ( ( 2S_A + (-20-10) ) + ( S_B + (10) ) ) / 3

...and I can't solve that system of equations, because when I try, I'm getting A = B-2 one way and B = A-4 the other way.

Okay 4502268, tag, you're it!

bookplayer
Group Contributor

4502134
What they're doing right now is looking at different systems that seem like a good idea, and figuring out if they're actually a good idea.

This is important, because (for example) sometimes you have something like "rate all the stories 1-10" that sounds like it should be fair, but it becomes unfair because people use different numbers to mean different things, so if my "5" means the story was okay, and TD's means it was very good, then both of us giving a story a five isn't going to put it where it should be. So they go to the pure math behind things to see how it works over a large number of stories or votes.

In addition, they have to figure out even if a system seems like it should work, how exactly you'd program something to calculate it.

So even though I totally agree that I can't follow any of this, it's important for them to throw around these numbers so we get the actual best system, not just what seems like it should be the best system.

JaketheGinger
Group Contributor

4502418

So even though I totally agree that I can't follow any of this, it's important for them to throw around these numbers so we get the actual best system, not just what seems like it should be the best system.

Oh yea, not saying they shouldn't do the math. It's important that they do and get it right. I'd just like a simple explanation/conclusion so others can weigh in on the discussion without saying total rubbish because we don't know the full facts.

Bradel
Group Contributor

4502418
Yes. Many machines on Ix. New machines.

hazeyhooves
Group Contributor

too complicated. how about a thumbs up/down system?

:trollestia:

Bradel
Group Contributor

4502424
Well, unfortunately I don't think any of us have figured it out just yet.

I want to use a statistics / linear regression approach because it gives you something called the best linear unbiased estimators (BLUEs). Basically, that means that you wind up with the estimates of the scores/rankings being approximately correct on average ("unbiased") and you get the least variability in those estimates among the class of ones that are approximately right on average ("best"). BUT my example math just didn't work out, so I'm not quite sure what went wrong. Could be any number of things. In principle, though, the BLUE approach should be really good.

You may be able to beat that approach in terms of the variability of the estimates, though (i.e. how far the estimated scores can get from the idealized true scores) if you let them be biased. That's what an ML approach like 4502351 proposed might be able to do (except that it's not totally clear we're estimating the same quantities). The ML estimates might be biased (the predicted scores are slightly wrong on average) for small numbers of stories, but for larger numbers of stories they'd probably work fine. BUT I'm pretty sure his equations don't work correctly because of the difference between how our system works and how things like head-to-head sports matches work (i.e. our "matches", voting slates, involve more than two participants).

Neither of us has our math quite working, so now 4502375 is suggesting a brute force approach to just let the computer iterate and figure everything out on its own, I think. I haven't dug into his message yet, and Perl isn't my specialty anyway. This system is the tried and true CS approach to problems (:trollestia:), and it's probably not going to have good optimality properties, but it'll probably also work pretty much okay.

I think 4501994 has been focusing more on what it means to say "Story A beats Story B" (which was also what I was trying to do in my last mathy post that ran into the math trouble). That's really important, because different mathematical interpretations of that phrase could lead to very different results, depending on slate sizes and how many votes we get per story. If we had tons and tons of data, how we score things probably wouldn't be a huge issue, but with the limited amount we get, it's important to protect against letting certain types of inputs overwhelm other types of inputs—like letting voters who only look at two or three stories matter way more than those who look at 20 or 30, or vice versa. Balancing out the influence of different people's opinions here is pretty hard. (Personally, I suspect the best way to figure that problem out is just going to be trying different systems, simulating some data or looking at previous writeoffs, and seeing if any of the scoring methods cause really weird behavior in the results. For what it's worth, I'm currently suspecting the TS system is the best system for balancing information from different voters.)

...I've got a strong suspicion that didn't really help.

Bad Horse
Group Contributor

4502401 I don't see where you got those equations. You're trying to do something like "Sum over all the comparisons a story A is in, and for each story B it is compared to, add (d * S(B) / n) to its score, where d is rank(A) - rank(B) and n is the number of stories ranked." But it isn't that, exactly, and I don't want to keep trying to figure out what it is. Just tell me. BTW, all your equations have an extra right-paren.

I still haven't looked at 4502084's maximum-likelihood approach, which I want to do first. I was originally thinking of trying to use some kind of Rasch analysis, but didn't want to lose the information about the distance between stories in a ranking. My first thought was that a maximum likelihood approach would have that same problem.

But in any case I'm doubtful that further math wankery will produce a significantly better result than my Super-Stupid Perl Program 4502375, which just uses the rankings to pull on the scores iteratively until they won't budge any further. I changed it to avoid the problem 4501912 pointed out.

Bradel
Group Contributor

4502538

They're the solutions to the minimization equations from 4501788. I think. Unless I screwed something up.

Also, I totally have the correct number of parentheses. How dare you, sir! How dare you!

4502538

didn't want to lose the information about the distance between stories in a ranking

I'd argue (and I think 4501956 is also trying to argue, though correct me if I'm wrong) that the distance between stories in a ranking is not meaningful. Consider stories with the "correct" ordering of A, B, C, D, E, F, G. A voting slate containing A, E, F, G and a voting slate with A, B, C, E will give widely different values for R_A - R_E (where R_A is the rating of story A and R_E is the rating of story E) even though both slates agree with each other. This variance just adds noise to the rating calculations and should be ignored. The ML approach does this, and the least squares approach can be adjusted to ignore the ranking distances as well (4502401's TS scoring system).

So perhaps this is an important point to decide upon before figuring out how to implement a ranking system. Should the distance between stories in a ranking be taken into account? I argue no.


4502493 I think I would say that the least squares and ML approaches are estimating different quantities (which will probably be highly correlated).

Titanium Dragon
Group Contributor

4502275
Look at the results page:

This is the one for I Regret Nothing

Roger stuck up the prelim voting records (they're the second set of numbers).

4502401

A: {B,C}
B: {A,D}
C: {A,D,B}
D: {B,C,A}

I'm not sure this is a good test dataset because it does not look to have a unique solution. Two authors agree unanimously that B > C, and two authors agree unanimously that A > D, but there is a 1-1 split between whether A > B or B > A and there are no comparisons between C and D. (it would be a good test of 4502375's Monte Carlo method as it would hopefully come out with A,D,B,C and B,C,A,D as equally likely outcomes).
[edit: nevermind, actually I think there is a unique solution: ADBC]

Titanium Dragon
Group Contributor

4502493
Thought, along the lines of experimentation:

What if next competition, we did both a ranking of stories from best to worst, and the old 0-10 scale system, and compared the results and see how much they differed?

Then at least we'd know if we're actually getting different results, and how different they are, in actual practice, with real data over the same stories. Right now we can't do direct comparisons because they're not over the same stuff.

Bradel
Group Contributor

4502612
Well, that might explain why my math didn't work.

I just had to go and choose an unsolvable system of equations, didn't I?

Bad Horse
Group Contributor

4502594

So perhaps this is an important point to decide upon before figuring out how to implement a ranking system. Should the distance between stories in a ranking be taken into account? I argue no.

How would you objectively decide this question?
If you'd decide it using a Hamming-like error function that counted how far you'd have to move each story to produce each voter's slate from the system's predicted slate, you'd probably say the distance should be taken into account.
If you'd decide it by choosing the system that was most likely to produce the observed votes, you'd probably choose the likelihood method.

EDIT: Scratch that. Show me that a maximum likelihood method based on predicting the likelihood of frequency(A > B) also maximizes the likelihood of observed slates, and I'll be more inclined to your view.

I think your argument cuts both ways. I've had many slates where I felt groups of stories on it were indistinguishable in quality. In that case, my (A, B, C, D) was really ((A, B), (C, D)), and equivalent to a ranking of (B, A, D, C). If you do the math, which is tedious, I think you'd find that considering the distance would convey more information about the voter's true opinion in cases where she considers adjacent stories of equal value.

4502555

Also, I totally have the correct number of parentheses. How dare you, sir! How dare you!

Sir, you are correct, and I am wrong. My humblest apologies.

4502612 My program wasn't a Monte Carlo method, since it was deterministic. I just changed it to go thru the slates in random order each loop. Does that technically make it a Monte Carlo method? It feels to me like such a trivial change shouldn't change the way we describe it. (Most of the differences showed up in the 3rd decimal place.) It's an energy-minimization method.

Bradel
Group Contributor

4502657
Technically, "Monte Carlo" is just synonymous with "Simulation".

4501915

Ok, here is a proof of how the two ways of looking at SRS are equivalent (plus a way to solve for the rankings in 4502401's least squares method). Consider three stories A, B, C with ratings A,B,C. We know the differences in score (which are not transitive) d_{i,j} for all pairings of A,B,C.

We can write these differences as:
A - B = d_{a,b} + e_1
A - C = d_{a,c} + e_2
B - C = d_{b,c} + e_3
and find A,B,C by minimizing the e_n terms.

E = e_1^2 + e_2^2 + e_3^2 = (A-B-d_{a,b})^2 + ...
Taking the partial derivatives:
dE/dA = 2 (A - B - d_{a,b}) + 2 (A - C - d_{a,c}) = 0
You can rearrange this equation to get:
A = (d_{a,b} + d_{a,c}) / 2 + (B + C) / 2

In other words, the rating of story A should be equal to the average margin of victory of story A against all other stories, plus the average rating of the stories to which it was compared (weighted by number of match-ups). To find A,B,C, you can initialize the values with their average margins of victory, recalculate A, B, C, and then iterate until A, B, C converge (as described in http://www.pro-football-reference.com/blog/?p=37).

4502620
The data set does have a unique solution (ADBC). I was just being dumb. Still dumb. See my next post for the actual estimates.

4502657
The ML method should be able to do this. Also you're probably right that I was incorrect to describe your method as an MC method. I have the excuse that I'm a biochemist, not a computer scientist.

Bad Horse
Group Contributor

4502612
$ perl rank.pl
A: 0.166439515507966
B: 0.164036810586457
D: -0.164214927493822
C: -0.166261398600601
$ perl rank.pl
B: 0.165708227208687
A: 0.163784941986014
D: -0.164495818658509
C: -0.164997350536192

Looks like the right outcome to me. A and B must get the same rank, and so must C and D, and (A,B) > (C,D). To convince yourself of this, look at the original rankings:
B,C
A,D
A,D,B
B,C,A
Now set b=A, a=B, c=D, d=C:
a,d
b,c
b,c,a
a,d,b
Look familiar?

Bad Horse
Group Contributor

4502671 "Monte Carlo" refers to randomness being an essential ingredient. It refers to Monte Carlo, probably the world's most famous casino. It's used when you have so many possibilities that you can only sample a small subset of them. If you sample all or most of the dataset, it's not Monte Carlo.

4502401
Using the iterative method in 4502723, here are the least squares ratings for your three scoring systems:
FS 2.5000 2.5000 -4.1667 -4.1667
RD/TS 0.1250 0.1250 -0.2083 -0.2083

For comparison, here are the rankings ML comes up with (maybe we should call this Perfecting Probabilities :pinkiehappy:)
ML 1.2331 1.2331 0.6165 0.6165

So, for the test data set you have made, the least squares, ML and Bad Horse's energy minimization all give equivalent answers.

horizon
Group Admin

4502657

I've had many slates where I felt groups of stories on it were indistinguishable in quality. In that case, my (A, B, C, D) was really ((A, B), (C, D)), and equivalent to a ranking of (B, A, D, C). If you do the math, which is tedious, I think you'd find that considering the distance would convey more information about the voter's true opinion in cases where she considers adjacent stories of equal value.

I don't have the time to dig into the math at the level you all are, but I wanted to strongly agree with this. There's a human factor here: I think that most people can easily distinguish A>>B, but as the difference between A and B approaches zero, the ordering between them becomes increasingly arbitrary.

I'm going to throw out an example (using our existing 1-10 finalist scale) which I don't think is a degenerate case; it has happened to me basically every round I've voted:

10: Super Awesome Story
9: Also Awesome Story
8: Great Comedy, Great Drama, Good Sci-Fi (which I like), Awesome Story With Big Problems
7: Great Slice of Life (which I dislike), Good Comedy
(...etc...)
4: Incomprehensibly Artistic, Well-Written Until The Ending, Great Ideas But Bad Writing
1: String Of Poop Jokes

We're ultimately trying to make subjective judgments translate into objective ratings, and when I assign two stories the same score I think it's because in my judgment they're approximately as good as each other in different ways. Force me to rank-order these beyond "yeah, these things are in the same bucket" and all I'm going to do is make post-hoc justifications on increasingly irrelevant distinguishers. Even on seven-item lists I find myself dithering at times between A, B, C1, C2, D and A, B, C2, C1, D, and I can't think of a minific preliminary (~15 entrants on a ballot) in which I haven't shuffled stories based on 15 seconds of dithering and a random moment of remembering an arbitrary fact I liked or disliked about one of the two contenders.

1-10 voting is better at distinguishing outliers than ranked voting, which may or may not be a desired feature; but shifting from that to ranked voting both loses information (the magnitude of the distinction between i and j) and introduces noise (e.g., rank-ordering my four 8s above means that, of two stories I considered identically good and would be equally happy to see with medals, one is now in 3rd place and one is in 6th). Implement rank-ordering with ties and, well, you're basically back to a 1-10 system again, except possibly one that scales based on the size of the range the voter is using.

Bad Horse
Group Contributor

4502979 How'd you do it? Can you post code? Can you post results on a more-interesting dataset?

4503094 So for the least squares calculations, I manually calculated the average margin of victory, then ran some quick & dirty scripts in matlab. For example, for Bradel's FS data:
mA = 20/5; mB = 20/5; mC = -20/3; mD = -20/3;   % average margins of victory
A = mA; B = mB; C = mC; D = mD;
for i = 1:50
    % rating = average margin + (matchup-weighted) average opponent rating
    newA = mA + (2*B + C + 2*D)/5;
    newB = mB + (2*A + 2*C + D)/5;
    newC = mC + (2*B + A)/3;
    newD = mD + (2*A + B)/3;
    A = newA; B = newB; C = newC; D = newD;
end

Similarly for the MLE, here's my Matlab script:
A = 1; B = 1; C = 1; D = 1;
for i = 1:50
    % ML iteration: rating = wins / sum over opponents of (matchups / (sum of ratings))
    newA = 3/(2/(A+B) + 1/(A+C) + 2/(A+D));
    newB = 3/(2/(A+B) + 2/(B+C) + 1/(B+D));
    newC = 1/(2/(B+C) + 1/(A+C));
    newD = 1/(2/(A+D) + 1/(B+D));
    A = newA; B = newB; C = newC; D = newD;
end

Analyzing a larger data set would require writing a bit more code to figure out the number of matchups and the number of victories/margin of victory from the voting slates, instead of doing it manually.

Too much math and programming. Beyond my basic programming skills completely. :derpyderp1:

Have you guys considered using a rubric to score? It seems like it could eliminate some issues -- specifically, that story A is better than story B in slate 1 but not when compared to story C in slate 2, and that rater A rates completely differently than rater B. Instead of judging stories against the others in a slate, you'd judge the story against a set scale. It'd probably be difficult agreeing on a scale, though. Since this is the first time I've participated, I have no idea if you guys have thrown around this idea before. Apologies for the lack of math though. :twilightblush:
