The Writeoff Association 937 members · 681 stories
Comments ( 15 )
  • Viewing 1 - 50 of 15
RogerDodger
Group Admin

Since the first event with the new voting system is now over, I'd like to analyse the correctness of the used ranking algorithm and determine if another is more fit for purpose.

The results page for the event now has a link to the slates (input data). They are formatted as a list of story IDs, one slate per line, ordered from highest to lowest rated. (Outputting story IDs instead of titles is intended. The question here is not "should this story have beaten that story?" but "does the output match the expected output, given this input?")
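For anyone who would rather not use the Perl module, here is a minimal Python sketch of reading that format. It assumes the IDs on each line are separated by whitespace, as in the ballots quoted later in this thread; `parse_slates` is a made-up name and is not part of the gist.

```python
def parse_slates(text):
    """Parse slate data: one slate per line, story IDs ordered best first.

    Assumes IDs are whitespace-separated; blank lines are skipped.
    """
    slates = []
    for line in text.splitlines():
        ids = line.split()
        if ids:
            slates.append(ids)
    return slates


# Two of the small ballots quoted in this thread.
sample = "2841 8274 2556 2273\n1942 2841\n"
slates = parse_slates(sample)
```

IDs are kept as strings since they are only ever compared for equality.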

https://gist.github.com/RogerDodger/96173c8fac9d8bedda4c contains a module and script for parsing this data. If you want to write a new algorithm without having to write the I/O stuff (and know Perl), just add a sub to Rank.pm and call it from rank.pl instead of twipie().

It is desirable (though not required) for an algorithm to also return some "error" value, suitable to use for the "most controversial" award.

Comment posted by KwirkyJ deleted Aug 10th, 2015
Titanium Dragon
Group Contributor

TIL one of the people who voted for every story put The Fabulous Flim Flam Fibrous Fetilizer Fic as their 5th choice.

Another person who voted stuck it in 2nd on a smaller ballot.

I love you guys <3

horizon
Group Admin

4619020
I don't know that I have the brain to debate algorithms here (I was halfway checked out of the previous thread's statistics geekery), but I do notice a few things about the results:

*1* It was asked several times in the discussion thread, but still no explanation of what the score number and the epsilon value represent. Hard to comment on a system without understanding it.

*2* The vast majority of finals voters did not rate over half the stories (21 did vs. 35 didn't). There were a number of accepted finals ballots with only a few stories ranked:

2841 8274 2556 2273
3744 4433 5102 1086
6886 2886 3744
2841 5102 6430
1942 2841
8748 5164

2.1) I know that the old "vote on half the finalists to make your ballot count" cutoff was chosen somewhat arbitrarily, but it still served a purpose of reducing outliers (e.g. story X ranked first because it beat story Y and the reader didn't see stories A through W). Were these micro-ballots counted?
2.2) What weighting is given to a ballot with two rankings vs. a ballot with 35+? Did story #1942 and story #8748 each get credited with a 100% vote even though they only beat one competitor?

I did some basic sanity checking, e.g. comparing first-place #6886 with second-place #5164 and sixth-place-with-much-lower-score #9533, and on a macro level the rankings seem intuitively fine, but I'd like to know enough to make a more informed assessment.

RogerDodger
Group Admin

4621385
The scores are comparable. A story with score X will beat a story with score Y with probability X/(X+Y).

Given this, the difference between expected and actual outcome is the "error".
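A rough Python sketch of that relationship, for a single pair of stories. The X/(X+Y) formula is as stated above; how the per-story error shown on the results page is aggregated is not spelled out here, so `pairwise_error` is only one plausible reading (the absolute gap between predicted and observed win rate), and both function names are invented for illustration.

```python
def win_probability(x, y):
    """Probability that a story with score x beats a story with score y."""
    return x / (x + y)


def pairwise_error(score_a, score_b, wins_a, wins_b):
    """Gap between the predicted and observed win rate of A over B.

    This is an assumed definition of "error" for one pair, not
    necessarily the one the site uses.
    """
    total = wins_a + wins_b
    if total == 0:
        return 0.0  # the two stories never met on a ballot
    expected = win_probability(score_a, score_b)
    actual = wins_a / total
    return abs(expected - actual)
```

For example, a story with score 3 should beat a story with score 1 three times out of four, so a 3-1 record between them gives an error of zero under this definition.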

There isn't any direct weighting. The only weighting in practice is the natural consequence of larger ballots containing many more matches (if every pair of stories on a ballot counts as a match, the count grows quadratically with ballot size). (Bear in mind, for anyone interested in coming up with algorithms, that this particular interpretation of ballots is not necessarily correct, just the one being used.) My concern is that this would actually weigh large ballots too heavily.
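To put numbers on how fast the match count grows, here is a small Python sketch. It assumes a ballot is read as "every higher-ranked story beat every lower-ranked one", which is a guess at what "matches" means here rather than a description of what twipie() actually does; `matches` is a hypothetical helper.

```python
from itertools import combinations


def matches(slate):
    """Every (winner, loser) pair implied by a slate ordered best first.

    combinations() keeps input order, so the first element of each
    pair is always the higher-ranked story.
    """
    return list(combinations(slate, 2))


# Match counts for a 2-story, 4-story and 35-story ballot.
counts = [len(matches(range(n))) for n in (2, 4, 35)]
# counts == [1, 6, 595]
```

Under this reading a 35-story ballot contributes 595 matches against a single match from a 2-story ballot, which is the implicit weighting in question.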

Titanium Dragon
Group Contributor

4621385
4621545
In this vein, another question:

Did we see more, less, or the same number of complete ballots (i.e. ballots containing all of the finalists) in the final round than we did in previous rounds?

Titanium Dragon
Group Contributor

Incidentally, having looked at the actual numbers, I'm not sure if any ranking system is really going to give us significantly better results than this one; a lot of the results look to be pretty ill-sorted in any particular list. I see stories ending up near the top and near the bottom and in the middle and all over the place, and only the top and bottom stories show a high level of consistency; we seem to agree reasonably well on the best and the worst, but the middle ones are a big mangled mess.

Looking at the finals, for instance, we have an error of .46 to .48 for almost all the stories, and quite a bit of variability in the placement of most of them; even the winning stories were pretty mixed, and anything outside of the top 3 was very mixed.

I don't think there's much of an indication that the big ballots are having too much of an impact, though; one of the top 3 stories was second to bottom on one of the big ballots, and the 5th placed story was dead last on one of them (Interestingly, 2nd place, 5th place, and 6th place were 2nd to last, last, and third to last on the same ballot). 9352 got 3rd place and was ranked pretty consistently high on the big ballots.

That being said, there's another valid interpretation: that people are only really ranking the top and bottom of their ballots carefully, and the rest are being thrown into the middle more or less randomly, which would be bad. Might be worth asking a question about that - were people carefully sorting their whole ballot, or were they mostly just focusing on the best and worst? In the final round, when the worst stories were taken away, the bottom of the ballot had errors as large as pretty much the entire rest of the ballot, whereas in the first round, there was a clear decline in the error amongst the best and worst stories.

It could just be that we don't really agree on much other than the top and bottom, though, which is also entirely reasonable. Hard to differentiate between the two.

Bad Horse
Group Contributor

Meh. It looks good at first glance. I'm too depressed at the results to look further than that.

KwirkyJ
Group Contributor

While not exactly related to the algorithm (which is at least sensible; one would have to do quite some in-depth research on statistical methods to suggest something demonstrably better), a possible interface approach to avoiding any ballot weighting is to change the way users rank stories.

I propose: instead of having a single slate where n (of k) stories are ranked in order, readers are instead given a number (say, p?) of two-story slates where they must rank, in isolation, which of two stories is 'better.' The generation of comparison pairs (now distinct slates) is therefore completed by the reader… the current algorithm should be very easy to adapt. This should, in theory, preempt any side effect of 'weighting' in long ballots, or grading noise, as posited above, where ranking distinction has become muddled, overlooked, forgotten, and so on.

Handling random selection and possible overlap will be an issue in implementation, however, particularly pertaining to the request to 'give me more stories to rank' (or, p++).


EDIT: overlap in the per-user slates arguably being a benefit. 'Issue' so said as it is a point worth raising, not meant as a value judgement at this time.

RogerDodger
Group Admin

4639855
Larger ballots having more weight is desirable. The concern is that maybe they are weighted too much.

Doing a bunch of pairs rather than a full ballot throws away a lot of data points, so it'll just make the results worse. Aside from that, it'd over-complicate the UI.

Titanium Dragon
Group Contributor

4640287
Looking at the data, there is no clear pattern of "stories on big ballots" being really divergent from "stories on little ballots". There doesn't seem to be much agreement on a lot of the stories in general, though, to be honest. Some are obviously on top, others on the bottom, but a lot of them are pretty variable in their placement - a good chunk of the stories in the middle range from "very high" to "very low".

Have you tried running the program on just the "big" ballots and just the "small" ballots and seeing how much it differed from the group selection?
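A sketch of how that comparison could be set up in Python, independent of whichever ranking routine is used. The size threshold is a free parameter, and both helper names are made up here; the two subsets would each be fed to the ranking routine and the resulting orderings compared.

```python
def split_by_size(slates, threshold):
    """Split slates into (big, small) by number of stories ranked."""
    big = [s for s in slates if len(s) >= threshold]
    small = [s for s in slates if len(s) < threshold]
    return big, small


def largest_rank_shift(ranking_a, ranking_b):
    """Largest change in position for any story present in both rankings."""
    position_b = {story: i for i, story in enumerate(ranking_b)}
    shifts = [abs(i - position_b[story])
              for i, story in enumerate(ranking_a)
              if story in position_b]
    return max(shifts, default=0)
```

The largest shift is a crude measure; it just flags whether any story lands somewhere very different when only big or only small ballots are counted.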

KwirkyJ
Group Contributor

4640287

"Larger ballots having more weight is desirable."

Upon reflection, I do understand this incentive… in addition to showing how a story compares to a single other story in the ballot, a full ballot also carries information about how it compares to all the other stories in that ballot (and not just any coincidental overlap). Only as the count of ballots increases to (at this time) unreasonable numbers does this additional value become less of a factor; pairs-exclusive slates make sense when the number of ballots (and therefore, slates) approaches an ideal of infinity. When having to decide between possible purity of comparisons and getting as much data as possible from a limited set of ballots, the decision to use the current ranking system is most respectable.


4640319
If one makes the somewhat-optimistic assumption that each user is consistent and conscientious in their ranking, is it possible that much of the variance between ballots stems from a disparity of values? It would follow that those which are truly faulty would be largely agreed to be the 'worst,' and those that are very soundly constructed and executed would be the 'best,' with those in between subject to which aspects the disparate readers value. Not sure what, if any, useful discussion can come from this, but it seems relevant to the concern.

Titanium Dragon
Group Contributor

4640359
Doing a comparison of my finalist ballot with Bad Horse's, each of us put only one of the other's top 10 stories in the top 10 (though if we go down to the top 13, that number rises to three). In all fairness, each of us put one of the other's stories in the top 10, which lowers our level of similarity, but even still, I think if you ordered the ballots completely randomly you might end up with a similar level of similarity.
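The "similar to random" hunch can be checked with a couple of lines of Python. For two independent random orderings of n stories, the expected number of stories shared between their top k is k*k/n (each of the k stories in one top-k has a k/n chance of landing in the other's). The function names are invented for this sketch, and n has to be filled in with the real number of finalists.

```python
def top_k_overlap(ballot_a, ballot_b, k):
    """Number of stories that appear in the top k of both ballots."""
    return len(set(ballot_a[:k]) & set(ballot_b[:k]))


def expected_random_overlap(k, n):
    """Expected top-k overlap of two random orderings of n stories."""
    return k * k / n
```

Comparing the observed overlap against `expected_random_overlap(10, n)` shows whether two ballots agree more or less than chance would predict.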

Bugle
Group Contributor

As someone with a (very) partial ballot, the reason for that was I was super busy and had no intention of voting/reading the finalists at all this past WO. But since my original slates had some finalists, those seemed to be grandfathered in to the finalist ranking block. I assume that this is the case for many other partial ballots since voting in the prelims is mandatory if you want to advance while voting in the finalists is completely optional.

While having your rankings carry over is ideal (it means less work), it does have this as a somewhat unforeseen consequence. I would recommend not counting finalist ballots that rank fewer than half the finalists. Having that many small ballots implies they skewed voting a bit.

RogerDodger
Group Admin

4644570
This isn't an issue when small ballots are appropriately weighted. If you say that A>B>C>D>E in prelims, and A and D pass, then it stands to reason that you still say that A>D. It's better data than no data, so it isn't worth discarding.
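The carry-over in that example amounts to filtering the prelim ranking down to the finalists while keeping its order. A minimal Python sketch, with a made-up function name (this is not taken from the site's code):

```python
def restrict_to_finalists(slate, finalists):
    """Keep only finalists from a prelim slate, preserving rank order."""
    return [story for story in slate if story in finalists]


carried = restrict_to_finalists(["A", "B", "C", "D", "E"], {"A", "D"})
# carried == ["A", "D"], i.e. the voter still says A > D
```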

Also, voting in prelims is no longer required.
