Bradel


Ceci n'est pas un cheval.

Adventures in Science! – Bayesianism, Frequentism, and the Foundations of Statistical Epistemology · 2:53pm Apr 29th, 2014

At the behest of a certain villainous equine, I come to you with a story.

Once upon a time, in merry old England, lived a Presbyterian minister by the name of Thomas Bayes. He also died, in 1761, and if that had been the end of things, in all probability no one would remember who he was. But the Rev'd Bayes had two things going for him: an in-progress paper called "An Essay towards solving a Problem in the Doctrine of Chances" and a literary executor named Richard Price, who saw the paper read to the Royal Society in 1763.

Blah, blah, blah. Why should we care? Well, gentle reader, you should care because this paper included one of the most foundational statements in probability theory, a result that is totally uncontroversial and yet stands at the heart of the greatest debate in the history of statistics.

Put in words, let A and B be two events, let p(A) denote the probability that event A occurs, and let p(A|B) be the conditional probability that A occurs if it is already known that event B occurs.

Then the conditional probability that event B occurs when event A is known to have occurred is equal to: the probability that A occurs when B is known times the probability that B occurs, divided by the total probability that A occurs (broken down into the probability when B also occurs and the probability when B does not occur).
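In symbols, writing B^C for the complement of B (the event that B does not occur), that statement is:

p(B \mid A) = \frac{p(A \mid B)\, p(B)}{p(A \mid B)\, p(B) + p(A \mid B^C)\, p(B^C)}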

I want to make sure everybody's cool with that idea before I continue, because it's kind of important.


Let's skip forward a few years. Well, a lot of years, actually. Galton and (Karl) Pearson have come and gone, Gossett (you may know him as Student) and Fisher have come and gone, and Neyman and (Egon) Pearson have come and gone. In fact, let's skip forward to Leonard Jimmie Savage.

Savage is one of those people you've probably never heard of, because he wasn't quite famous enough. Immediately after receiving his PhD in 1941, Savage spent a year at the Institute for Advanced Study at Princeton. He worked with John von Neumann there—and, if Wikipedia is to be believed, served as chief "statistical" assistant to von Neumann during World War II. Savage is often credited as being the modern founder of the Bayesian school of statistical inference, though his work owes a debt to Pierre-Simon Laplace and Bruno de Finetti, among others. His ideas are laid out in his 1954 book, Foundations of Statistics.

So, historical foundation. Now what's all the fuss about? Let's begin as Savage begins:

It is often argued academically that no science can be more secure than its foundations, and that, if there is controversy about the foundations, there must be even greater controversy about the higher parts of the science. As a matter of fact, the foundations are the most controversial parts of many, if not all, sciences. Physics and pure mathematics are excellent examples of this phenomenon. As for statistics, the foundations include... probability, as controversial a subject as one could name. As in other sciences, controversies over the foundations of statistics reflect themselves to some extent in everyday practice, but not nearly so catastrophically as one might imagine. I believe that here, as elsewhere, catastrophe is avoided, primarily because in practical situations common sense generally saves all but the most pedantic of us from flagrant error. It is hard to judge, however, to what extent the relative calm of modern statistics is due to its domination by a vigorous school relatively well agreed within itself about the foundations.

The dominant school he mentions here is the frequentist school of statistics, usually associated with Jerzy Neyman and Egon Pearson. I intentionally exclude Ronald Fisher here, widely seen as the father of modern statistics, because although his approaches to inference were often similar in structure, his underlying philosophy was often at odds with Neyman's. Also, if the two of them had ever met in a dark alley, it's fairly likely one of them would have murdered the other.

Frequentism is named for the frequency interpretation of probability (e.g. the probability of getting heads when you flip a coin is 1/2 because if you do it many, many times, that will be about how often it happens over the long run). I'm a pretty dyed-in-the-wool Bayesian, so I'm not necessarily the best person to ask about frequentism, but my understanding is that its appeal stems from three things.

(1) Frequentist methods of inference are mathematically elegant. If you've ever dug into probability theory much, you may be aware that math stops behaving well when you introduce randomness. Convergence of functions is a big deal in mathematical analysis. Convergence of random variables is the big deal in probability, but you spend most of your time worrying about seemingly minor differences like: does the sequence of random variables converge to its limit with probability 1 (almost sure convergence), or does the probability that it lands arbitrarily close to the limit merely go to 1 (convergence in probability)?
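For the notation-inclined, those two modes are, in symbols:

X_n \to X \text{ almost surely:} \quad p\Big(\lim_{n\to\infty} X_n = X\Big) = 1

X_n \to X \text{ in probability:} \quad \lim_{n\to\infty} p\big(|X_n - X| > \varepsilon\big) = 0 \text{ for every } \varepsilon > 0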

You might not think this sounds like a big deal, but it is. And it's the sort of stuff that frequentist statistics is founded on. Frequentism is very well grounded in classical probability theory, and it's structured to take advantage of that. Frequentist tests—like the classic Neyman-Pearson likelihood-ratio test—focus on problems that are very mathematically tractable and for which it's easy to prove that, under certain constraints, you'll get the answers you expect.

(2) Frequentist methods of inference are computationally tractable. This was a bigger deal half a century ago, and the rise of modern computers has made it kind of meaningless, but for a long time, the Bayesian approach just wasn't very tractable. Neyman-Pearson statistics work well, owing in part to their strong mathematical foundations. Until we hit the point where we could take advantage of MCMC—Markov chain Monte Carlo, large-scale simulations of random events that can be used to directly sample from random quantities instead of relying on asymptotic limiting arguments—frequentism made a sort of practical sense as the more tractable approach to most problems.
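Quick aside, since I just name-dropped MCMC: here's about the smallest sketch of the idea I can manage, a toy random-walk Metropolis sampler for a made-up coin-flipping problem (a Beta(2,2) prior and 7 heads in 10 flips). It's purely illustrative, not how you'd write production code:

```
import random, math

# Toy Metropolis sampler: draw from a posterior proportional to
# prior(theta) * likelihood(data | theta). Made-up example: Beta(2, 2)
# prior on a coin's heads probability, with 7 heads observed in 10 flips.
def log_post(theta):
    if not 0 < theta < 1:
        return float("-inf")                              # outside (0, 1): zero posterior
    log_prior = math.log(theta) + math.log(1 - theta)     # Beta(2, 2), up to a constant
    log_lik = 7 * math.log(theta) + 3 * math.log(1 - theta)
    return log_prior + log_lik

theta, samples = 0.5, []
for _ in range(50_000):
    proposal = theta + random.gauss(0, 0.1)               # random-walk proposal
    if math.log(random.random()) < log_post(proposal) - log_post(theta):
        theta = proposal                                  # accept; otherwise keep current value
    samples.append(theta)

print(sum(samples) / len(samples))  # ~0.64, matching the exact Beta(9, 5) posterior mean
```

The point isn't the toy problem (which has a closed-form answer); it's that the same dozen lines keep working when the posterior doesn't.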

(3) Frequentist methods avoid subjectivity. And here's the big philosophical point at the heart of the frequentist side of the debate. Bayesian methods require the specification of a set of prior beliefs about a problem you're going to study. This idea of probability by belief, or personal probability, is one of the competing interpretations of probability alongside the long-run frequency interpretation. But personal probability is, by its very nature, subjective. Frequentists hold that statistics, and all of science, should be fundamentally objective in character.

These points weren't well specified in 1954, when L.J. Savage published Foundations of Statistics, but they are (to my understanding) the key reasons why so many statisticians clung to frequentism and opposed Bayesianism in the years that followed. And they're fairly compelling reasons: mathematical and computational tractability, and an avoidance of subjectivity. So what could possibly convince Savage, and so many others, to embrace the Bayesian approach to statistics?

The answer to this is stupidly simple. You may remember I said above that with frequentist methods, under certain constraints, you'll get the answers you expect.

Here's the problem. Those aren't the answers you want.

Let's go back to Bayes theorem for a moment, and consider one of the core concepts from college-level introductory statistics. Say we want to test an hypothesis[1]. We go out, we gather a bunch of data, we do some sort of test, and we get a "p-value". What is a p-value? Good question!

Everyone in their right mind wants a p-value to be the probability that the hypothesis is true, given the data that we saw. Unfortunately, that's not what it is. The p-value is the probability that we'd see data that weird or weirder, given that the hypothesis is true.
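To make that concrete with a toy example (my numbers, nothing deep): flip a coin ten times, see nine heads, and ask for the p-value under the null hypothesis that the coin is fair.

```
from math import comb

# Null hypothesis: the coin is fair (p = 0.5). We observed 9 heads in 10 flips.
# The one-sided p-value is the probability, under the null, of a result at
# least that extreme: 9 or 10 heads.
n, observed = 10, 9
p_value = sum(comb(n, k) * 0.5**n for k in range(observed, n + 1))
print(p_value)  # ~0.0107: the chance of data this weird IF the coin is fair,
                # not the chance that the coin is fair given the data
```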

Enter Bayes theorem! We've got p(A|B), and now we want to know p(B|A). We've got a nice formula for that. Just one problem, though. That formula wants us to know p(B), the underlying probability (in this case) that our original hypothesis is true. Well, how are we supposed to know that!? If we knew whether or not our hypothesis were true, we wouldn't need to be doing experiments on it!

Well, gentle reader, this is where we need to start thinking like Bayesians. Because remember, there's a big difference between p(B) and p(B|A), what we typically call the "prior probability" for a model and the "posterior probability" that we get after incorporating what we learn from our data. The Bayesian approach to statistics, and to epistemology, is about gradually accumulating knowledge until we become more and more sure of our answers. As we gain data, our estimates refine themselves to account for that new information and we get a clearer and clearer picture of the phenomena we're interested in. But we're having to introduce that dreaded subjectivity to do so. We need to be willing to specify some sort of prior probability to kick things off.
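Sticking with that toy coin, here's what specifying a prior actually buys you. I'm assuming, purely for illustration, a 50/50 prior and a single concrete alternative model (a coin that lands heads 90% of the time):

```
from math import comb

# B = "the coin is fair", A = the data (9 heads in 10 flips).
# Bayes' theorem needs a prior p(B); assume 0.5 purely for illustration.
prior_fair = 0.5
lik_fair   = comb(10, 9) * 0.5**9 * 0.5    # p(A | B)
lik_biased = comb(10, 9) * 0.9**9 * 0.1    # p(A | not B), under the one alternative

posterior_fair = (lik_fair * prior_fair) / (
    lik_fair * prior_fair + lik_biased * (1 - prior_fair)
)
print(posterior_fair)  # ~0.025: the data push us hard toward "biased",
                       # but the number depends on the prior we started with
```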

At this point, Bayesianism breaks into two schools of thought: objective Bayes and subjective Bayes. The first school seeks to minimize subjectivity by choosing "non-informative" priors[2]. The second seeks to use existing scientific knowledge as a baseline for further inquiry by building what we already know into our analyses moving forward. For example, if I asked you to guess the mass of a boulder, would you want to be completely agnostic and pick a flat probability for all possible masses, or would you perhaps think that it doesn't make a whole lot of sense to put prior probability on the mass of the boulder being greater than the mass of, say, Great Britain? As scientists, we want to allow the data to meaningfully contribute to our understanding of the phenomena we observe, but it's intellectually dishonest to pretend our prior knowledge doesn't color our beliefs as well.

For Bayesians, and subjectivists in particular, our attitude is that subjectivity is inescapable in science as in all human endeavors, and that instead of trying to move toward full objectivity and implementing some rather weird and arbitrary procedures to get you there[3], you're better off embracing subjectivity and trying to deal with it as carefully as possible to make sure it's not exerting undue influence on your results. This is why Bayesians will frequently do sensitivity analyses, where they consider a problem with multiple priors and look at the degree to which their results depend on the prior they've chosen. There's no better way to deal with subjectivity, I think, than to put all of your preconceptions front and center so your peers can judge for themselves whether you've compromised your results.
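A toy sensitivity analysis can be as simple as re-running the same problem under several priors and seeing whether the answer moves. Here's a sketch with a conjugate Beta-Binomial model and three priors I made up on the spot:

```
# Toy sensitivity analysis: same data, several priors, compare the answers.
heads, tails = 7, 3
priors = {"flat Beta(1,1)": (1, 1),
          "skeptical Beta(10,10)": (10, 10),
          "optimistic Beta(8,2)": (8, 2)}

for name, (a, b) in priors.items():
    post_a, post_b = a + heads, b + tails      # conjugate Beta posterior
    post_mean = post_a / (post_a + post_b)
    print(f"{name:>22}: posterior mean = {post_mean:.3f}")

# Here the answers range from about 0.57 to 0.75, which tells you that ten
# observations aren't yet enough to swamp the choice of prior.
```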

I never gave a nice handy list for why I think the Bayesian approach is superior, and that's because I don't think it really needs a list. Frequentism is great—at answering questions nobody really wants to ask. It papers over the fact that you're stuck making a mental leap at the end to try and justify that you've found what you actually wanted to find. There are tools to help make this more reasonable, and a lot of the mathematical work in frequentist statistics goes into proving that certain types of tests will have certain desirable properties that make the leap less dodgy—but at the end of the day, you're still answering the wrong questions.

Bayesian statistics tell you what you want to know. They're all about embracing probability, rather than trying to weed it out analytically. They are, I think, fundamentally closer to how we actually understand our world. They're less conventionally mathy (but still strongly grounded in probability theory) and more computationally intensive (which isn't a very big deal anymore).

If you want a list, though, I can give you one. Bayesian approaches will also let you do a wider range of problems, including some where the analytical arguments get so ugly that frequentists still haven't figured out how to get a foothold on them. And Bayesianism is more philosophically sound, since Bayesian epistemology is one of the only decent ways to tackle the problem of induction. Also, from a decision-theoretic standpoint, Bayesian decision rules are always admissible—i.e. it can be shown that once you pick a utility function and a prior distribution, you can never find a decision rule that is uniformly better than the Bayesian decision rule. That doesn't mean the Bayes rule will always be the best, though. For any given set of states of nature, one rule may be better and another worse. Admissibility just means that there will never be a decision rule that performs better for all possible states than the Bayes rule. This is not something that frequentist decision rules can guarantee.
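For the decision-theory-inclined, the standard formalism behind that claim (not spelled out here, but textbook material) goes like this. Write L(\theta, \delta(X)) for the loss of using rule \delta when nature is in state \theta, and \pi for the prior; then

R(\theta, \delta) = E_{X \mid \theta}\big[ L(\theta, \delta(X)) \big], \qquad r(\pi, \delta) = \int R(\theta, \delta)\, \pi(\theta)\, d\theta

A rule \delta is inadmissible if some \delta' has R(\theta, \delta') \le R(\theta, \delta) for every \theta, with strict inequality somewhere. The Bayes rule is the \delta minimizing the average risk r(\pi, \delta), and under mild regularity conditions (e.g. a prior putting positive mass everywhere) no rule can beat it uniformly.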

Okay, that's 2500 words and entirely enough from me for one day, especially when I'm home sick and trying to recuperate. Hopefully this'll satisfy some of Bad Horse's curiosity about the clone wars Bayesian vs. frequentist arguments.[4]


[1] If you've had intro stats, you may find the idea of testing one hypothesis rather than two strange. Testing is a whole can of worms on its own, and I'm not going to dig into it here except to say, if you've only ever had one or two statistics classes, there's a very good chance that everything you were taught was nonsense. See Christensen (2005), "Testing Fisher, Neyman, Pearson, and Bayes" in The American Statistician. Incidentally, that's my dad.

[2] I use scare quotes because all priors are informative. Even the occasionally used "minimally informative" is a misnomer. Depending on the situation you're in, what initially seems like an innocuous choice of prior may wind up completely compromising your results. Things like flat priors on the real line can be prone to this, because they effectively put no mass at any finite range of locations, and in certain contexts, it's easy for your parameter estimates to just stroll off into infinity. I, and most subjectivists, prefer the term reference prior.

[3] Like randomized tests, an artifact of Neyman-Pearson testing where you occasionally have to flip a coin and decide whether you consider a pool of data weird or not based on the outcome of that coinflip, to allow you to meet a pre-specified alpha level (which is the probability of rejecting a null hypothesis if the null hypothesis is in fact true, so basically a threshold of acceptable weirdness).

[4] And if it doesn't, at least I can give you a comic. Click here for the wisdom of Randall Munroe.

Comments ( 39 )

I believe your equation for Bayes' theorem says B to the power of C (presumably meaning the complement of B) where you could just write "not B".

I think it would be simpler & better for readers of this essay to just write

P(A|B) = P(B|A)P(A) / P(B)

Comment posted by Bradel deleted Apr 29th, 2014

2060448
Yeah, this is complement notation. It's pretty standard in my parts. I suppose I could have clarified it better, though I'm sick and kind of tired, and don't know how fast I'll get to it right now...

"The p-value is the probability that we'd see data that weird or weirder, given that the hypothesis is true." Don't you mean given that the hypothesis is false?

this post was very cool and informative! Thanks!

Fun fact: I was going to say something about how my experience with the term "epistemology" was limited to Munroe's Every Major's Terrible. Then I checked, and it turns out that the term used therein was "epidemiology." This sums up my experience in this field. :twilightsheepish: (Though I was proud of myself when I realized that the "B to the C power" was actually "contra-B.")
The closest thing I had was the probability course I took in Comp Sci and the quantum course in physics. Still, between the in-depth analysis and the xkcd strip, you definitely broadened my horizons. Thanks for that. :twilightsmile:

(And now I can't help but wonder how a Bayesian model of the Infinite Improbability Drive would work.)

I have 2 arguments in favor of using frequentist statistics:

1. t-tests and F-tests are simple, and yet they are interpreted incorrectly in something like 1/4 of biological and medical research papers that refer to them. We can't use more complex math because our scientists would be too stupid to read it.

2. If you use Bayesian statistics, you won't get published. I have never seen a paper in the biological sciences use Bayesian statistics.

I don't understand why someone would identify as a Bayesian or frequentist. That would be like a mechanic identifying as someone who uses metric or English tools.

You mention MCMC, but that's expectation maximization, which it seems to me is frequentist, because it maximizes P(data|hypothesis) rather than P(hypothesis|data), and because it requires no priors.

2060494
Unfortunately, no. Conventional wisdom is that you want small p-values, so what you're looking for is the probability of the data if the null model/hypothesis is true. The idea is, if the data is really unlikely when the model is true, this implies that the model isn't true. And this is the crux of that whole "frequentist statistics don't tell you what you want to know" thing, because seeing something that's unlikely under your null model doesn't really say a lot about other possible models, if you don't know what they look like. That paper I linked in footnote [1] digs into this issue pretty well, and should be fairly readable I think.


2060507
I don't think I've ever heard a positive story about statistics and the biological sciences. But Bayesian methods aren't really "harder math", they're just different—and it'd be perfectly easy to generate black-box approaches biologists could use with them, which seems to basically be how many of them use statistics right now, anyway. The advantage to giving them Bayesian black-boxes instead of frequentist ones is that the results are more naturally interpretable and they'll probably be less likely to say stupid things.

As for the lack of Bayesian publications in biological sciences, on the one hand I find that surprising because Bayesian methods are increasingly the go-to standard whenever you've got hard problems and I'm sure biologists do encounter hard problems. Longitudinal data collection kind of guarantees it. On the other hand, again, I've heard very little to make me think most biologists have more than a passing acquaintance with good statistics.

2060448
You could write it that way, sure, but to do so ignores the fact that...

Well, f***. I just screwed up a rather important part of this discussion.

2060545 I was using "hypothesis" to mean the hypothesis that isn't the null hypothesis.

2060448
2060554
Nope, nevermind, I've got it right. Being sick is really taking its toll on my faculties.

To write it the way you've got it is to concatenate all of that stuff in the denominator, in the context of Bayesian analysis, to "the probability that you see those data". Expanding it out is, I think, actually rather important, because it breaks down into the probability that you see those data if your hypothesis is correct plus the probability that you see those data if it's not. That's kind of the crux of the Bayesian approach: start with some uncertainty over the parameter space, and then integrate out that uncertainty once your data come in. The probability of your hypothesis being true given your data is basically a weighted average of the probability of your data under a bunch of different hypotheses.


2060541
MCMC in itself isn't really anything. It's just the primary strategy for sampling out of posterior densities to get an idea of what they look like, when they don't come in a nice, easily tractable form (like when you've got conjugate priors).

I honestly don't know what you're trying to get at with the EM thing, since it's not like either approach has a monopoly on probability theory. But MCMC is the backbone of how Bayesian analysis gets done. If that makes you feel uncomfortable, well, I don't think I've got anything for you except to tell you to go take a few classes in Bayesian statistics rather than reading my 2500-word blog post.

2060563
In most of what passes for frequentist inference, the hypothesis that isn't the null hypothesis doesn't matter in any way. It's just a place-holder. This is why I avoided talking about it.

If you're actually doing proper Neyman-Pearson stuff, it will matter, but it's not treated in the same way as the null hypothesis even then. The lack of balance between how you treat hypotheses—the lack of reversibility, effectively—is one of the biggest problems people have with frequentist statistics. Test statistics are always based on the null model, and talking about the probability of seeing the data under any model except the null doesn't really make sense.

ETA: Unless you're a Bayesian.

I'm just going to comment here so it's not just you and Bad Horse talking to one another.

2060570 But you didn't avoid talking about the hypothesis. First, "the hypothesis" means the hypothesis that isn't the null hypothesis. Second, you wanted to talk about the hypothesis:

Say we want to test an hypothesis[1]....
Everyone in their right mind wants a p-value to be the probability that the hypothesis is true, given the data that we saw.

Nobody in their right mind wants to test the null hypothesis. But the p-value is the probability of seeing the data or stranger when the null H is true. So when you wrote,

We've got p(A|B), and now we want to know p(B|A). We've got a nice formula for that. Just one problem, though. That formula wants us to know p(B), the underlying probability (in this case) that our original hypothesis is true.

you meant

We've got p(data|null), and now we want to know p(H|data).

Bayes' rule can't compute p(H|data) from p(data|null), because hypothesis H and the null hypo are not complementary.

And thus we have a small fragment of what Predictive Climate Science has been going through for the last 20+ years when attempting to use statistical analysis on the most chaotic, badly-measured, misunderstood data set in the world. (It doesn't help that the data set and models have been 'nudged' in various ways by well-meaning scientists, or at least I would hope they're well-meaning).

Equestria is the only place where abacus-driven weather prediction a year ahead of time is accurate to the minute. (Well, other than SoCal where it's 84 and sunny for months on end)

Well, for the first time in many many years, I feel the inklings of a desire to take a statistics course. :twistnerd:

2060746 You live in SoCal too? What part?

2060737

Bayes' rule can't compute p(H|data) from p(data|null), because hypothesis H and the null hypo are not complementary.

You meant to talk only about the null hyp, but the way you used English, you were talking about the hypothesis.

This is the main reason people misinterpret statistical results: Papers attempt to disprove the null hypothesis. People interpret them as having proven the hypothesis was true if they disproved the null, and as having proven the hypothesis was false if they failed to disprove the null. Neither is correct. Any time a review of a paper using a t-test, f-test, or ANOVA says "Scientists have proven X", it is necessarily wrong; it should have said, "Scientists have proven that it is not the case that Y, where Y implies not X." When a review says "Scientists have disproven X", it is not necessarily wrong, but in practice, it almost always is, and should have said, "Scientists have failed to prove that it is not the case that Y, where Y implies not X".

See "The universal medical journal article error" for some elaboration. This article is also interesting because it is a simple mathematical argument that is correct, in the domain that Eliezer Yudkowsky claims to have the greatest expertise in, and he declared it to be incorrect.

I come for the technicolor anthropomorphic ponies, but I STAY for the statistics and epistemology discussions.

2060777

I get the feeling that:

What you call "a simple mathematical argument" would be so far beyond my understanding that would have to define the terms you were using to define the terms you were using to define your terms before I could even begin to get an inkling of what you were talking about... :scootangel:

Mike

2060777
I'm guessing you still haven't gone and read the paper I linked in the footnotes? Because (while I'll admit that I'm not putting my explanation in terms of Bayesian hypothesis testing), you seem to be working under the same misapprehensions about testing that most people have, namely that all statistical tests are and should be tests of competing hypotheses. I specifically tried to avoid talking about two hypotheses because, at least as far as general understanding is concerned, I think it muddles the issue unnecessarily.

When I talk about an hypothesis in the blog post, I am only ever talking about a null hypothesis. This, admittedly, leaves the p(A|B^C) p(B^C) term a bit messy, since I'm pretending that we can just integrate over a nebulous "everything else", and it's why in reality you've got Bayesian testing as a direct competitor for Neyman-Pearson hypothesis testing and not Fisherian significance testing. But again, I think it muddles people's intuition to jump straight to that, because hypothesis testing isn't the proof-by-contradiction that most people tend to think of when they think about how statistical inference works. Hypothesis testing is just a simple decision problem, where you pick between two alternatives.

The point I'm talking around here is that Bayesian inference fundamentally isn't interested in proof by contradiction, because proof by contradiction isn't what most scientists naturally want to do. Most scientists want to accumulate evidence that they're "right" about something, and that runs headlong into the induction problem. And the easiest way out of the induction problem, in my opinion, is to adopt a probabilistic interpretation of knowledge.

2061580 Hypothesis testing is just a simple decision problem, where you pick between two alternatives.

If that's so, then you can't have a standard null hypothesis as one of the two alternatives. That's because the prior for the null hypo is always zero--the null hypo says that the means of two distributions are exactly equal, yet we treat the distributions as if they're independent distributions, which means that P(m1 = m2) = zero.

2061710
No, there are plenty of ways to have a standard null hypothesis. In the Bayesian context, you just put point mass on your null, which is a piece of cake. In the Neyman-Pearson context, the N-P lemma is for testing two simple (single-point) hypotheses so the problem doesn't arise, but for simple vs. compound hypotheses, which is what most people are familiar with, this is where you get into most powerful unbiased tests, most powerful invariant tests, etc.—the mathy stuff frequentists like to do, that'll get you (again) answers where you know what to expect even if it's not what you'd really like to know. But the crux of N-P testing lies in treating the null and alternative hypotheses differently, so you base everything on the probability of committing Type I and Type II errors[1]. Of course, most people just look at Type I and make stupid choices about it, but that's part of the whole significance testing vs. hypothesis testing mash-up.
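Concretely, "point mass on your null" just means a prior that puts probability \pi_0 on \theta = \theta_0 and spreads the remaining 1 - \pi_0 over the alternative with some density g, so the posterior probability of the null comes out as

p(H_0 \mid x) = \frac{\pi_0\, f(x \mid \theta_0)}{\pi_0\, f(x \mid \theta_0) + (1 - \pi_0) \int f(x \mid \theta)\, g(\theta)\, d\theta}

which is a perfectly well-defined number, with no zero-probability problem anywhere.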

In any case, no, that's really not a problem.


[1] Which should really be called alpha-errors and beta-errors to make everyone's life easier.

2061762 In the Bayesian context, you just put point mass on your null, which is a piece of cake.

That sounds like cheating... because you can't really believe that describes reality. :trixieshiftright:

2061787
...when you're doing hypothesis testing, you're not trying to describe reality. You're making a decision between two alternatives.

I'm hesitant to interject here, as it seems likely that I'll get jumped on for failing to understand some basic precept of the topic at hand, or not doing the prerequisite knowledge-gathering, but what the hell.

2060545
The problem I have with this discussion is that it seems to be doing a lot of symbolic gymnastics for no clear reason.

It seems you're trying to compare the probability that a model would produce the data you've collected with the probability that the same data would have been produced anyway, ie by all other possible models; models which you cannot possibly know because you haven't created them, nor can you know what they would have produced. To make complicated mathematical statements about such events strikes me as profoundly meaningless.

2061796
I'm not sure how you could possibly intend to do such a thing in a conceptual environment completely divorced from reality. What purpose would statistical analysis even serve there? Wouldn't what statistical analysis attempts to describe, that being the behavior of apparently random or uncertain systems, only arise when dealing with reality as a result of an incomplete knowledge of the rules and initial conditions of reality as a system?

2062835
Y'all are making me think I should have just gone to the trouble to do the hypothesis testing / significance testing split, even though that would have added a ton of words... Again you're only actually going to use the Bayesian procedures when you're doing hypothesis testing, so you've got a competing alternative to compare to, in which case you can talk about the probability that the data you saw arose under the null vs. the probability that they arose under the alternative.

I think I may have been too dismissive on the depiction of reality comment (and that may have driven
2061787 out of the comments, unfortunately), but at the same time, his original statement about not being able to do tests in certain contexts (when you have a simple null and compound alternative) is a little hard to address, because it seems to misunderstand how statistical testing works and what it's trying to tell you. The whole thing boils down to decision theory, and the costs of taking various decisions. If you want to describe stuff, you generally move over to parameter estimation, which is a whole other bag, although it still shares a lot of the ideological split, since a frequentist confidence interval and a Bayesian posterior interval have fundamentally different interpretations.

As for not being able to consider how likely data is when you know nothing about how it's generated, though, I'm not actually sure that's true. Bayesian nonparametric methods (things like Dirichlet processes and Polya trees) deal with putting probability measures on the space of all possible probability measures. I think these kind of get at that issue, though I'm sick and it's possible my understanding just isn't carrying far enough right now. This is the sort of stuff I'm currently working with / learning about, and it's not terribly easy to wrap your head around.

2062835
It seems you're trying to compare the probability that a model would produce the data you've collected with the probability that the same data would have been produced anyway, ie by all other possible models; models which you cannot possibly know because you haven't created them, nor can you know what they would have produced.

Consider the problem of gene-calling, using a 2nd-order Markov model on protein sequence. This means you have a model of genes, and use it to decide whether a particular protein sequence is a gene or not.

The Markov model is a table that lists, for every possible sequence of 3 amino acids, the probability that if you find the first 2, the third follows after them. I compute P(seq|M), the probability of a long protein sequence given a model (table) M, as the product of all the probabilities in the table for all the amino acid triplets in seq.

In practice, I just compare that to the probability of observing seq in the random model, where every entry in the table is 1/20. But I /could/ compare it to all possible models, even though I can't enumerate them because there are an infinite number of them.

To generate one model, for every pair of 2 amino acids, you draw a line segment of length 1 (the probabilities for the third amino acid sum to 1), and then you put 19 fences on that line that divide it into 20 segments. The length of the first segment is the probability that the third amino acid in the triplet is amino acid #1, the length of the second segment is the probability that the third amino acid in the triplet is amino acid #2, etc.

To consider all possible models, you cycle thru all possible positions for those 19 fences in a way kinda vaguely like an odometer. But because there are an infinite number of positions for the fences, you have to do multiple integration, one integration for each of the 19 fences.

Saying that probably doesn't help unless you've seen something like it worked out before. But it is possible.
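If code helps more than prose, here's a toy version of that comparison, with a two-entry made-up table standing in for the real one:

```
import math
from collections import defaultdict

# Toy gene-calling score: a 2nd-order Markov model M versus the "random"
# model where every third-amino-acid probability is 1/20.
# model[(aa1, aa2)][aa3] = P(aa3 | aa1, aa2); made-up numbers, and a real
# table would have every row summing to 1.
model = defaultdict(lambda: defaultdict(lambda: 1/20))
model[("M", "K")]["L"] = 0.30
model[("K", "L")]["V"] = 0.25

def log_prob(seq, table):
    return sum(math.log(table[(seq[i], seq[i+1])][seq[i+2]])
               for i in range(len(seq) - 2))

seq = "MKLV"
log_ratio = log_prob(seq, model) - (len(seq) - 2) * math.log(1/20)
print(log_ratio)  # > 0 looks "gene-like" under M; <= 0 looks like the random model
```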

2063080 That was actually quite helpful. Being familiar with Markov chains due to their somewhat less practical applications, your example was fairly easy to follow, and demonstrated how one could reasonably consider the probabilities associated with all variations of a particular class of model, an application for which I can see some utility.

I was originally imagining a much more general case with a far less limited set of possible models, such as possible models for particle interactions, which could involve properties or methods of interaction which could completely escape consideration by the experimenter, rendering any statistical analysis of possible models potentially incomplete.

2063160 I was originally imagining a much more general case with a far less limited set of possible models, such as possible models for particle interactions,

I think one of the things Feynman is famous for is for figuring out how to do that. Or maybe he just drew the diagrams... it's a quantum mechanics thing, I wouldn't understand.

One thing I WILL note is the thing that I find most frustrating about people who enjoy Bayesian statistics is that they seem to be insufferable about it, and constantly point out that THEY are Bayesian statisticians, which always makes me wonder how good their initial assumptions about the probability of their hypothesis being true tends to be. Several of the folks I've encountered who I end up associating with Bayesian statistics have been pretty sure of some pretty wrong ideas.

Regarding the actual topic:

As someone who has only ever taken a few college-level statistics courses, what would you say the best book for learning Bayesian statistics would be?

Also, from a decision-theoretic standpoint, Bayesian decision rules are always admissible—i.e. it can be shown that once you pick a utility function and a prior distribution, you can never find a decision rule that is uniformly better than the Bayesian decision rule.

Alright, I buy this, but I have to ask a (rather important) question here: does it tend to be a better decision rule over the regions which most matter? Nothing being uniformly better sounds good, but as you yourself noted, asking the right question is pretty important here, and the actual question we usually want the answer for is "is our hypothesis likely to be true". It strikes me that this is dependent on your initial assumptions. If you are biased towards your own hypotheses, as most people are, does this make Bayesian statistics better or worse in this regard? Because it looks to me (without actually doing any math, so I could be totally off here) that being biased towards your own hypothesis would negatively impact its value as a decision rule because it would bias it towards confirming your hypothesis on weaker data, which is exactly the opposite of what you'd actually want your decision rule to do, because we are already naturally biased towards our own ideas.

2063080
Alright, possibly stupid question:

Why did you pick 1/20 here? Is that closer than if you assume that the actual nucleotide sequence is random? Because that will not give 1/20 for each of the amino acids, as some are produced by a lot more triplets than others.

So I was super proud of myself for understanding probably 95% of the big words in the post itself. Then I read all the comments, in order... and while I still think I understand the basics, my deep-actual understanding is probably down to 70%.

That said. Holy &#&#!!!! First off, I seriously love the fact that this kind of debate and discussion happens on a forum for pastel magic talking horse fanfiction. Why? Because when my "real life" friends ask why I associate with this fandom, I can actually and honestly prove to them that it's full of smart people and very, very interesting things. Not just "kid's stuff."

Secondly, and more on topic... I'm, at best, a spectator for statistics. I read Eliezer Yudkowsky, but because he wrote fanfiction. It's actually what pointed me to ponyfic in the first place. I like it. I've also been tangentially interested in statistics for years, and explored (to my limited ability) various things for computing situations. For years I've been working on neural networks for personalized relationship graphing in social networks (from a lay-standpoint) and using things like various K-Nearest-Neighbor hyperspace (which I "mostly" understand) classifiers for spam filtering. I understand like 80% of what I play with at best though.

All that said, my main takeaway on statistics has been that of a system administrator or engineer. "The right tool for the job." Sometimes, Bayesian analysis gives better results. Sometimes it's the frequentist approach. Sometimes, the assumption of naive, random-distribution is just fine. So while I can't match wits on a deep level here as to the veracity of either school of thought, I feel I must ask the "idiot question." Why does either school of thought assume theirs is the "one true" answer? Again, maybe just my naivety here, but it seems to me that predicting a batting average in an upcoming season may be qualitatively different sort of question/prediction than "is this email spam?" and that it they might be served by different approaches and viewpoints in statistical analysis. So as a mere bystander, it's hard for me not to see it as debate akin to "which database engine is best?" or "what's the best operating system?" (But seriously, let's not do the OS debate here. That gets ugly.) :pinkiehappy:

2063296 Assuming the nucleotide sequence is random would probably be better.

First, awesome post: I think* I actually get that.

Second, I also think I understand now why I dislike experimental psychology so much: because it assumes that the prior probability is even, which is demonstrably untrue.

*Of course, I could have an entire wrong understanding. I have no way to be sure.

2063296

Alright, I buy this, but I have to ask a (rather important) question here: does it tend to be a better decision rule over the regions which most matter?

Perhaps unfortunately, this gets at why admissibility is one of the properties people look at with decision theory—my understanding is that there's not a particularly good way to do what you're saying. The issue is that "the regions which most matter" are necessarily unknown when you're constructing a decision rule; you're trying to build a rule that will maximize utility (or minimize loss) across all possible states of nature. The prior you'd choose as a Bayesian factors in here, though, and in practical terms, it does mean you're more focused on maximizing the utility of the decision over the region you think is most likely to correspond to the state of nature you'd observe.

I have to admit that decision theory is not my strong suit in statistics right now, though. It doesn't get taught much these days, and my exposure to it has been limited.

If you are biased towards your own hypotheses, as most people are, does this make Bayesian statistics better or worse in this regard? Because it looks to me (without actually doing any math, so I could be totally off here) that being biased towards your own hypothesis would negatively impact its value as a decision rule because it would bias it towards confirming your hypothesis on weaker data, which is exactly the opposite of what you'd actually want your decision rule to do, because we are already naturally biased towards our own ideas.

My understanding would be that, based on your prior, you're going to worry more about making sure your decision is good in the states of nature you consider most likely, so you'll probably be less good for states of nature you wouldn't expect to see. But again, this is why good Bayesians should be doing things like sensitivity analyses and comparing the answers they get with a whole range of priors, so it's clear how much influence your choice of prior is having on the outcome of the problem. Each prior is going to change the states you'd focus on with the decision rule. If you've got any background with measure theory, what you're basically doing is maximizing utility according to the measure of regions of the state space under the prior probability distribution (since all probability distributions are measures).

Frankly, to not do sensitivity analyses strikes me as just crazy, since knowing how confident you are in your results is, I think, a key component to doing science.

As someone who has only ever taken a few college-level statistics courses, what would you say the best book for learning Bayesian statistics would be?

I'm going to be pretty prejudiced on this question, but I'd point you at http://www.amazon.com/Bayesian-Ideas-Data-Analysis-Statisticians/dp/1439803544. Two of the authors are my father and my PhD advisor, so obviously I'm going to be biased, but it's got a few things going for it. First, it's designed for self-study. Second, it should be more readable than most of the books on the market, given that (going back to the conversation on 2063633's blog) it's put together by folks who are very conscious of the poor communication problem and have been trying to work against it in the field. And third, it's one of the few books on the market that focuses on subjective Bayesian analysis, which sounds like it's one of the areas where you have questions.

As for subjective Bayes, I'm not too sure who you're talking to on the subject, but I do have to say I wouldn't be surprised if they're doing it somewhat badly. Eliciting good expert information isn't terribly easy and there's a body of documented evidence that suggests that people are bad at putting priors on things—and when I say "bad", I'm pretty sure what I mean is that people tend to violate coherence. The particular example that I remember is that if you ask people to identify when historical events occurred by giving a range of dates such that they're 95% certain they'll have caught the actual event, people tend to miss a lot more than 5% of the time, i.e. we're a lot surer of what we know than we ought to be. One of the things the aforementioned book gets into is how you can go about eliciting prior information from experts in a way that's actually useful. My memory is that in simple cases, it works something like this:

(1) Ask the expert for their best guess at what the parameter value should be. This value will be the median of your distribution (or alternately mean or mode, but IIRC the median winds up being the most nicely stable).
(2) Ask the expert for their best guess on a limit past which the parameter can't go. This will be your 95th percentile (or 5th percentile), because experts aren't great at telling you where limits really are.

And of course you want to make sure you've got positive probability for anything in the realm of mathematical possibility, no matter how improbable it seems, but that's usually pretty easy to manage.
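In code, that two-question recipe can be as simple as the following sketch (toy numbers, and I'm assuming a normal prior purely for illustration):

```
from statistics import NormalDist

# Elicitation toy: expert's best guess -> median of the prior,
# expert's "it can't be past this" -> the 95th percentile.
best_guess = 50.0      # elicited median (= mean for a normal prior)
upper_limit = 80.0     # treated as the 95th percentile

z95 = NormalDist().inv_cdf(0.95)              # ~1.645
prior_sd = (upper_limit - best_guess) / z95   # solve mean + z95 * sd = upper_limit
print(best_guess, round(prior_sd, 2))         # Normal(50, sd ~18.2) as a starting prior
```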

You could also look at the L.J. Savage book I mentioned in the blog post, though that's more philosophical and less hands-on for how you can actually do Bayesian statistics, and how to prove the theory behind it. Depends on what you're looking for.

2063535

Why does either school of thought assume theirs is the "one true" answer?

Like I think (I hope) I mentioned above, this attitude really hasn't continued into the current generation of statisticians—which is one reason I personally feel like Kuhn's model of how scientists arrive at consensus makes sense, because it's a very good description of how this has gone down in statistics.

That said, I'm still a die-hard partisan! I grew up surrounded by these ideas, and by the (now largely abandoned) discussions about which approach was better. At this point, I'm kind of a relic of a bygone age.

If you ask me why I think we should be following the Bayesian approach instead of the Neyman-Pearson approach[1], though, here's what I'd say:

The critical reason for taking the Bayesian approach is interpretability. You can approach most problems either way (and largely get the same answers), but frequentist statistics require a lot of additional cognitive run-around to get from the answers they provide to the questions you actually meant to ask. Most of that run-around is poorly defined and poorly understood, which you can see pretty much any time the NY Times talks about a p-value. From an educational perspective, I think we waste a lot of time (and lose a lot of students) trying to communicate what frequentist statistics really tells you. This is where you get the whole thing about how you can't conclude the null hypothesis in a null-alternative scenario. Putting aside for a moment that in a null-alternative scenario you totally can, we're spending serious effort to teach students advanced mental gymnastics rather than helping them understand probability and statistical ideas better.

In its true form, frequentism isn't really to blame for this, it's the ugly hybrid that's arisen from people mashing Neyman-Pearson and Fisher together. But frequentism is still going to have that interpretability problem even if you try to clean up what people are teaching because, like I said in the blog post, the mathematically straightforward answers are in many cases not telling you what you really want to know.

All that, and like I also said above, I personally believe that the Popperian focus on falsifiability is missing the mark with science, and that what we really want is a probabilistic view of the strength of evidence we have for various theories. Now, it's not like I can just say falsifiability is a bad idea—it does a lot of solid work in sorting sciences from pseudosciences. But I think we can do better, if we actually try to quantify the amount of information we gain from the data we observe.

Take, for example, astrology. The Popperian view would be that astrology isn't really a science because astrological predictions are so generic that you can pretty much say they agree with whatever data you see, so there's no data that would lead you to believe that the astrological model is false. It's not falsifiable, therefore it's not a science. Under the Bayesian paradigm, you're going to account for the relative probability of seeing your data under various models. This is the whole p(A|B) p(B) + p(A|B^C) p(B^C) thing.

Let's unpack that a bit. We're looking at the probability of seeing the data we saw, assuming the model B. Well, with astrology, since there's really no way for these models to distinguish between data, that probability is basically 1. The data you saw is always perfectly expected under the model you've been given.[2] So you multiply this by p(B). Then you do the same for the other side, with other competing astrological predictions, and you get the denominator as p(B) + p(B^C) = 1. The numerator, meanwhile, has reduced from p(A|B) p(B) to p(B), and what we're left with is:

p(B|A) = p(B)

In other words, because our models had no ability to distinguish between the data we saw, the posterior probability of our model being true is identical to the prior probability that our model was true. We've gained no information.
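Or, spelling the cancellation out in one line:

p(B \mid A) = \frac{p(A \mid B)\, p(B)}{p(A \mid B)\, p(B) + p(A \mid B^C)\, p(B^C)} = \frac{1 \cdot p(B)}{1 \cdot p(B) + 1 \cdot p(B^C)} = p(B)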

Instead of relying on falsifiability, which I find kind of simplistic, Bayesianism approaches the problem by weight of evidence, and you really do wind up with a garbage-in garbage-out scenario. Bayesianism lets you actually accumulate knowledge regarding how well models work, instead of just looking at whether your data contradict them. Given that we're instinctively interested in building good models, I find this approach far more sensible than one that ties itself to the idea of true/false.

We're constantly dealing with probabilities in science. It just seems sensible to me that we'd, therefore, adopt a probabilistic epistemology.


[1] I'm specifically talking about the two approaches to the competing-alternatives type of problems. Popperian proof by contradiction should, I think, work under a different framework, but not much has been done to elaborate on that yet.

[2] The question I expect this may raise is, "if the probability for the data you saw is 1, how can the probabilities for different sets of data sum to 1; aren't you breaking the laws of probability?" What's going on here is that you're not really looking at different sets of data. The features that distinguish one set of data you might see from others are non-existent, and you don't have a way to partition up a set of responses. Effectively you can only ever see one thing: data that agrees with the model. There are no separating features.

2063879 To me, falsifiability has just always been sacrosanct as what makes scientific predictions scientific. So you just hit on something I'd never really considered before: That a probabilistic system could actually serve as the epistemological foundation instead. Interesting... and my initial reaction is that I'm intrigued by the idea, and like the notion of "gained no new information" being the failure point for a "science." Learning new stuff is the true core/purpose of scientific exploration, so it just feels right to measure by that standard. That said, it's definitely something I'll be considering in more depth before I'm ready to swap "provides no new information" for "makes no falsifiable claims" in my bullshit detector. Great food for thought though! :pinkiehappy:

I wanna love this, but I'm afraid I don't quite follow all the big words. (Especially the equation, but that's using all the stats jargon.) This feels funny, because I like to think I'm smart. I'm sure I could get a useful explanation out of you if you felt like it, though it'd probably be better overall to sit down with this post, wikipedia, and a stats book or two.

On the other hand, that xkcd mostly makes sense now, so there's that. (Although maybe I just had to read it again.)

As I recall my (one) introductory stats class, we were testing "hypothesis, or not-hypothesis?" Just 'cause you mentioned it.


2060746
No it's not. They screw up with the weather schedule just as often as anything else.

I mean jeez, the sleepover was about a weather screwup.

2061194
I dunno, I think if you can follow addition and "for every action, there is an equal and opposite reaction" you should have no problem. You just have to know what a couple of the symbols mean [1]. Probably not even that, as nearly every time they use the symbols they say the same thing in normal words.

Then again, everyone tells me I'm smart, so that may have something to do with why I had no trouble with it.

[1] The upside-down A means "all", as in "for all x"; and the backwards E means "there exists", as in "for all living persons, there exist parents for that person".


2063789
I feel like that wikiarticle was badly written. (Not to blame you or say you should change it, merely observing.) On the other hand, I now understand what you meant by "coherence", so I guess it's a wash.

Thank you for the post. My field in mathematics is algebra, so I don't deal with this stuff, but it was interesting to read.
