
Bad Horse





Does grammar matter? · 3:04pm Jun 5th, 2012

In my previous post about different ways of measuring authors' popularity and quality, I introduced the "thumbscore per view" measure (TPV), which is

TPV = (thumbs_up - 9.027 * (thumbs_down+1)) / views

This is a rating of a story or of an author, which has an average value of zero over all the stories on FimFiction.net.

I downloaded, over a period of weeks, the first 2 chapters of each story on FimFiction.net with 2 chapters. This lets me compute a new measure of quality, the reader retention ratio (RRR). This is just

RRR = views of chapter 2 / views of chapter 1

Every now and then, RRR > 1. This can happen because a reader reads one chapter while logged in, starts a second chapter, then finishes it later without logging in (or from a different computer), and so counts as 2 readers.
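As a minimal sketch, the two measures can be written as small Python functions. The constant 9.027 and both formulas come from the definitions above; the function and parameter names are illustrative assumptions, not the code actually used for this post.

```python
def tpv(thumbs_up: int, thumbs_down: int, views: int) -> float:
    """Thumbscore per view, using the formula and constant given above."""
    if views <= 0:
        raise ValueError("views must be positive")
    return (thumbs_up - 9.027 * (thumbs_down + 1)) / views


def rrr(chapter1_views: int, chapter2_views: int) -> float:
    """Reader retention ratio: chapter 2 views divided by chapter 1 views."""
    if chapter1_views <= 0:
        raise ValueError("chapter1_views must be positive")
    return chapter2_views / chapter1_views


# Made-up counts, for illustration only.
print(tpv(thumbs_up=100, thumbs_down=0, views=1000))
print(rrr(chapter1_views=1000, chapter2_views=720))
```

Nothing in `rrr` caps the result at 1, so a story whose second chapter has more recorded views than its first simply gets a ratio above 1.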

Also, now that I have the text of the first 2 chapters, I can run a grammar-checking tool on that text and score all the stories by grammar. I used LanguageTool because it's free and you can run it from the command line. There are better grammar checkers, but they require you to load the text into Word, and I wasn't going to load 15,000 chapters into Word by hand.

First I had to convert all the stories, in several different character encodings, into something LanguageTool can read. How I did this would be long and boring to describe; message me if you want details.

Then I had to recalibrate LanguageTool. It misses most grammatical errors, and when it flags an error, it's often wrong. Its rules are just local rules that look a few words to the left and right and then guess. Some of them are always wrong. Some are just the personal preferences of style-Nazis. So I went through the output from hundreds of stories and rated the accuracy of the rules that came up. I then had each rule contribute an error equal to the probability that it was correctly identifying an error. (For some rules, this was under 10%. The awfulness of LanguageTool is a problem here.) Some grammatical errors would trigger multiple LanguageTool rules, so I ignored any errors flagged in the same line and within 5 characters of the previous error. I excluded stories that listed Applejack as a main character, because she riles up that there grammar checker somethin' awful.
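The weighting and de-duplication steps described above could look roughly like this in Python. This is a sketch under stated assumptions: the `Flag` record, the rule names `RULE_A` and `RULE_B`, and the precision values are all hypothetical, rules that were never hand-rated are assumed to contribute nothing, and "previous error" is taken to mean the previous flag in reading order, whether or not it was counted.

```python
from dataclasses import dataclass


@dataclass
class Flag:
    line: int      # line number where the checker flagged something
    column: int    # character offset within that line
    rule_id: str   # name of the rule that fired (hypothetical names below)


def weighted_error(flags, rule_precision, min_gap=5):
    """Sum each rule's estimated precision, skipping near-duplicate flags."""
    total = 0.0
    prev = None
    for flag in sorted(flags, key=lambda f: (f.line, f.column)):
        is_duplicate = (
            prev is not None
            and flag.line == prev.line
            and flag.column - prev.column <= min_gap
        )
        if not is_duplicate:
            # Assumption: rules with no hand-rated precision contribute 0.
            total += rule_precision.get(flag.rule_id, 0.0)
        prev = flag
    return total


# Hypothetical precisions: RULE_A is right 90% of the time, RULE_B only 10%.
precision = {"RULE_A": 0.9, "RULE_B": 0.1}
flags = [
    Flag(1, 10, "RULE_A"),
    Flag(1, 13, "RULE_B"),  # same line, 3 characters later: ignored
    Flag(1, 40, "RULE_A"),
    Flag(2, 41, "RULE_B"),
]
print(weighted_error(flags, precision))
```

With these made-up inputs, three of the four flags are counted, and the total is the sum of their precisions rather than a raw count of flags.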

It's still pretty bad. The scores are seriously thrown off by stories with Bon Bon, by authors who write "She waited.." instead of "She waited...", and by dialogue where someone says "No no no no!" (which is surprisingly common). Dialogue of characters who are supposed to be ungrammatical is also a problem.

But I'm gonna make my graphs anyway, dammit. We can at least compare the impact on RRR vs. TPV fairly. I produced a grammar score for each story by dividing the total weighted error by the number of words. 0.0002 is a good score. What proofreaders call a "gouge-your-eyes-out" story gets a grammar score around 0.008 (almost 1 error per 100 words - but remember, LanguageTool misses most errors). To get a score that bad, you have to forget to capitalize the first word of every sentence. All the scores I looked at over 0.01 proved to be due to an error in translating the document encoding, so I threw them out. The average grammar score was 0.0014, with a standard deviation of 0.0014. That means, among other things, that roughly 68% of all stories had scores between 0 and 0.0028, if the scores are approximately normally distributed.
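The scoring and filtering step above can be sketched as follows. The 0.01 cutoff comes from the text; the function names, the sample scores, and the choice of the population standard deviation (`statistics.pstdev`) are assumptions made for illustration.

```python
import statistics

ENCODING_ERROR_CUTOFF = 0.01  # scores above this are treated as conversion errors


def grammar_score(weighted_error: float, word_count: int) -> float:
    """Weighted error per word for one story."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return weighted_error / word_count


def summarize(scores):
    """Drop scores above the cutoff, then return (mean, std dev, count kept)."""
    kept = [s for s in scores if s <= ENCODING_ERROR_CUTOFF]
    return statistics.mean(kept), statistics.pstdev(kept), len(kept)


# Made-up scores: a good story, an average one, a terrible one, and one
# that would be thrown out as an encoding problem.
mean_score, sd_score, kept_count = summarize([0.0002, 0.0014, 0.008, 0.02])
print(mean_score, sd_score, kept_count)
```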

Now I can plot grammar error scores versus TPV and RRR, and see how much bad grammar impacts these scores.

RRR vs. grammar errors: RRR average = 0.720, standard deviation = 0.309, N = 7,475 stories, R2 (R-squared) = 0.00024


[NOTE: I have replaced this picture with a very similar one, because imageshack ate the original image. It does not correspond exactly to the data in the text, e.g., the R-squared value is probably slightly different.]

The blue box shows the area within 2 standard deviations of the average values. About 91% of all datapoints should fall within this box if the two variables are independent and roughly normally distributed. The green line is the least-squares regression line, which shows the general trend in the data. The main purpose of the box is so you can compare these green lines between these two charts. You don't want to compare TPVs to RRRs, but it is meaningful to compare standard deviations to standard deviations. So think of the two boxes as being equal, and imagine stretching the graphs so the boxes were both squares before comparing the slopes of the green lines.
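For readers who want to reproduce this kind of chart, here is a minimal least-squares sketch in plain Python. It is not the code behind the graphs; the function names and sample points are assumptions. The "stretch the boxes into squares" idea corresponds to measuring the slope in standard deviations of y per standard deviation of x, and for a simple regression that standardized slope equals the correlation coefficient r.

```python
from statistics import NormalDist, mean


def least_squares(xs, ys):
    """Return (slope, intercept, r_squared, standardized_slope) for y on x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = (sxy * sxy) / (sxx * syy)
    # Slope in "standard deviations of y per standard deviation of x".
    standardized_slope = slope * (sxx / syy) ** 0.5
    return slope, intercept, r_squared, standardized_slope


def fraction_inside_box(k: float = 2.0) -> float:
    """Share of points within k standard deviations on both axes,
    assuming two independent normal variables."""
    within_one_axis = NormalDist().cdf(k) - NormalDist().cdf(-k)
    return within_one_axis ** 2


# Made-up points, for illustration only.
print(least_squares([1, 2, 3, 4], [2, 1, 4, 3]))
print(fraction_inside_box())
```

Under the independence and normality assumption, about 95.4% of points fall within 2 standard deviations on each axis, and squaring that gives roughly 91% inside the box.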

Notice the slope for RRR vs. grammatical errors is negative. This says that readers prefer stories with good grammar, but very, very slightly. Eyeballing it, I would say that grammar has no impact on RRR. The R2 (r-squared) value of 0.00024 strongly argues that this is the case.

TPV vs. grammar: TPV average = -0.107, standard deviation = 0.255, N = 10,774 stories, R2 = 0.0267


[Again, original image eaten by imageshack.]

NOTE: I have conflicting image files on my hard drive that seem to have come from different datasets, but the result in either case is that 1 grammatical error per 100 words drops the thumbscore per view by about 1 standard deviation. That's actually pretty significant.
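As a rough consistency check, the "1 error per 100 words costs about 1 standard deviation" reading can be turned into a slope and compared against the reported R2. This is only a back-of-the-envelope sketch: it uses the standard deviations quoted in the text, which may have been computed over a slightly different set of stories than the chart, and it relies on the fact that in a simple regression the standardized slope equals the correlation coefficient r.

```python
import math

SD_TPV = 0.255        # standard deviation of TPV, from the text
SD_GRAMMAR = 0.0014   # standard deviation of the grammar score, from the text
REPORTED_R2 = 0.0267  # R-squared reported for TPV vs. grammar

# "1 error per 100 words (score 0.01) lowers TPV by about 1 SD" as a slope.
implied_slope = -SD_TPV / 0.01

# Rescale to standard-deviation units on both axes; this equals r.
implied_r = implied_slope * SD_GRAMMAR / SD_TPV
implied_r2 = implied_r ** 2

# The slope is negative, so take the negative square root of R-squared.
reported_r = -math.sqrt(REPORTED_R2)

print(implied_r, implied_r2, reported_r)
```

The implied r is about -0.14 (R2 about 0.02), against a reported r of about -0.16, so the two statements are in the same ballpark even if they came from slightly different datasets.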

So, grammar matters to ratings, but not to whether the reader keeps reading. A story with terrible grammar has about the same chance of keeping a reader's interest, and nearly the same chance of reaping a thumbs-up, as one with perfect grammar. I may revisit this if I get a better grammar-checker.

Next, I'll post about the most-popular characters and genre tags, what we can say about what readers want and what determines whether a story gets a lot of views, and whether this is the same for popular and less-popular stories. Get yourself psyched for some exciting multiple linear regression action!

Comments ( 10 )

I can overlook some clumsiness of writing and such, if the story is otherwise worthwhile, but for me at least grammar matters. Perhaps it's because I'm not a native English-speaker, so I have had to learn the damn language, instead of just absorbing it via osmosis from all around me?

115421
This is because you have refined, bourgeois tastes, and will be the first against the wall when the revolution comes.

115571

The revolution will not be spell-checked. The revolution will not be spell-checked, will not be pre-read, will not be edited, brony.

Language must evolve. We have advanced to a point where even phoneticised phone speech can be interpreted and mostly understood if needs be.

But there's a large number of people that are very strict about grammar still, insistent that some terrible calamity will befall the world if we don't dot the i's, cross the t's and use the correct punctuation.

As someone that has suffered from a great deal of learning difficulties and managed to overcome a great deal of them, I am very pleased that we now have spelling and some grammar checkers built into many word processors, but they can only go so far and suggest so much.

Myself personally, I'd prefer a fun and readable story with a good plot and great characters, and the grammar and punctuation can take second place.

So you used bad tools and dodgy math to create graphs based on said bad data? I'm sorry, but whilst you've got a few good ideas the result is... not built on a solid basis.

It sounds like, rather than spend time correcting for errors by throwing data out you should refine your search parameters or find a way to use those better tools.

As it stands, this reminds me of somebody who stayed up all night doing fancy mathematics on numbers pulled out of their ass, who was surprised when I said "wait, you picked numbers and then bothered to try to justify that?"

159522
First, what dodgy math? Regression is tricky, but not dodgy.

Second, the grammar-checker isn't as good as I'd like, but when I look at a story that got a very bad score, it usually has very bad grammar; and when I look at a story that got a good score, it usually has good grammar. Nothing here is pulled out of my ass.

The slope of the regression line is not a number you can rely on; but the overall result - that grammar doesn't matter nearly as much to readers as most people think it does - is, I think, solid, and also the only data in the world that exists on the subject.

159768 I apologize, the math is actually relatively sound. The problem is the data. You're using admittedly bad tools, throwing out certain results and selectively picking others. Take the fact you're looking at reader retention over two chapters. Could it perhaps be that some of your "retained readers" merely check to see if the story is as bad as it was in chapter one? Certainly, but your analysis doesn't take that into account.

Yes, it's mildly interesting, but I don't think it's well designed. A fun afternoon's diversion, sure, but demonstrably false I should think with careful presentation of a few choice fics - all checked with a good grammar checker, where reader retention is checked beyond the first few chapters. People will tolerate a lot of little issues, some of the time, but repeated transgressions and poor sentence structure and paragraph layout do take their toll.

The problem with trying to use a computer program to detect spelling/grammar errors is that computers are terrible at it. They'll flag up errors that aren't there, and they'll ignore regional variations (did you use a UK AND US dictionary?). The only way to do this properly would be to check each story by hand and manually count the errors.

174004
That can't be done. An acceptable way to do it would be to check some good number of stories by hand and manually count errors, and then see what the error distribution is for the computer grammar checker. Then you could take that distribution and use it to estimate confidence intervals for the error in the regression. It wouldn't be precise, but it would tell you precisely how imprecise it was.

174298

It's a fascinating idea, I'll give you that, and the results seem quite counter-intuitive, though statistics often have surprising outcomes. Given how tightly the data is packed though, it might be an idea to make the checking more sensitive. I'm just not sure how you could get past computers being terrible at grammar.
