Bayes’ Theorem in the 21st Century (2013) [pdf] (caltech.edu)
228 points by mikevm on Oct 14, 2018 | 92 comments



http://www.overcomingbias.com/2009/02/share-likelihood-ratio...

Seriously, the main point of an experiment is to gather evidence. Coupled with prior beliefs, you get a posterior belief, but the most important point is how much evidence the experiment provides.

Sure, a full fledged posterior belief is needed to make an actual decision, like, what should we test next. And if a subject is deemed important enough that we need to be certain, we can replicate until we get enough evidence to trump any reasonable prior belief. (Mind publication bias, though, some replications are going to fail, and that's relevant evidence too.)

In the meantime, it would be nice if the papers just said "the experiment provides 20dB of evidence that A is wrong, and B is right", instead of saying "B is right (at p<0.01)". No, you're not certain B is right just yet. Your evidence is significant, perhaps even decisive, but it is not certain. A one in a hundred fluke is not unheard of. Also, sharing likelihood ratios (instead of posterior beliefs) makes the whole debate a bit less heated.

Getting a double one on dice you just threw for the first time doesn't mean they are loaded to make you lose. It only provides about 15 decibels of evidence in favour of such a con job.
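
Back-of-the-envelope, in Python (assuming the con hypothesis is dice rigged to always come up double ones):

    import math

    p_fair = 1 / 36            # chance of double ones with fair dice
    p_loaded = 1.0             # chance of double ones with fully rigged dice
    likelihood_ratio = p_loaded / p_fair
    print(10 * math.log10(likelihood_ratio))   # ~15.6 dB of evidence for "loaded"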


That's a great link, though another writer at that blog wrote a short story re-explaining it many years later.

It's a conversation between a scientist, a Bayesian, and a confused undergrad: https://arbital.com/p/likelihoods_not_pvalues/?l=4xx (warning: the site loads very slowly).


The Bayesian believes that probability represents our beliefs about the world.

The Frequentist believes that probabilities merely represent the long term frequency counts of events (for a given 'population').


Is either of these universally true? Or is it possible that a "belief" associated with a given math tool can be chosen as appropriate to the problem being solved du jour?

When I learned Bayes' theorem in college stats class, there was no mention of beliefs. It was just a straightforward theorem related to conditional probability.


You don't need belief to write a theorem. You need it in order to use/apply it - the theorem doesn't tell you where your prior comes from.


> The Frequentist believes

The frequentists do not "believe," they measure.


No. They believe you can measure an infinite number of trials (say # of heads vs tails) and whatever ratio you get is the probability of heads.

However it's problematic because you can measure a million coin flips and get heads every time. It's not possible to actually measure an infinite number of trials - you need to imagine it.


This is just silly, nobody in their right mind would believe they could do something “an infinite number” of times.


If you don’t do an infinite number of trials then you can’t be sure your frequencies match the real probability.


They still non-trivially define/demarcate what the population actually is. That is kind of a belief, because it is a choice not given by nature, and there are infinitely many choices one could make.


Nature does not give you choice, it gives you the frequency (say, in the form of the intensity of a spectral line of an atom), and it is the basis of the scientific method to listen to what nature is trying to tell you. There is nothing subjective in this process. Arguing otherwise is like saying that atheists “believe” in the non-existence of god.


Nature is telling you an infinite number of things, the process of selecting what to measure and what to exclude in the measurement is a choice.

In terms of frequentism, how you define exactly what population distribution you are sampling from is a choice.

You are limiting the discussion to what happens after you've chosen how the population distribution is defined and what it consists of. That part is itself a nontrivial and subjective process.


I truly believe that Bayesian inference is the statistics of the 21st century. Recent advances in MCMC (e.g., NUTS, Stan [1]) and variational inference (e.g., ADVI [2], VAE [3], etc.) + more computing power than ever promise a near future in which Bayesian inference is the default inference engine.

The prior distribution is a beautiful and logical mechanism for adding regularization and domain-specific knowledge to our model.

[1] Stan, a platform for statistical modeling http://mc-stan.org/

[2] Automatic Differentiation Variational Inference https://arxiv.org/abs/1603.00788

[3] Auto-Encoding Variational Bayes https://arxiv.org/abs/1312.6114
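
As an aside on the regularization point above: with a Gaussian likelihood and a zero-mean Gaussian prior on regression weights, the MAP estimate is exactly ridge regression with lambda = noise_var / prior_var. A toy sketch (my own made-up data, not from the references):

    import numpy as np

    np.random.seed(0)
    X = np.random.randn(50, 3)
    w_true = np.array([1.0, -2.0, 0.5])
    y = X @ w_true + 0.5 * np.random.randn(50)

    noise_var, prior_var = 0.25, 1.0
    lam = noise_var / prior_var                       # ridge penalty implied by the prior
    w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
    print(w_map)                                      # close to w_true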


But note, the last paragraph of TFA cautions against use of a Bayesian prior in cases where it is not well supported by actual hard prior information.

It is very hard to validate a given choice of a prior in many applications. E.g., if I claim one prior, and another investigator claims a sharper one, it can be very difficult to decide who is right.

If the prior does not wash out due to lots of data, this indicates a serious and fundamental problem.


> It is very hard to validate a given choice of a prior in many applications. E.g., if I claim one prior, and another investigator claims a sharper one, it can be very difficult to decide who is right.

Both prior and likelihood are our model's assumptions. So the prior validation problem is similar to the likelihood validation problem. To check a Bayesian model, or any model, we need to bring it out of the formal world and into the real world for validation.

Prior predictive simulation, which generates random data points from the prior, is a good heuristic to check whether the prior is NOT plausible.
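
A minimal sketch of such a check (toy numbers of my own, nothing domain-specific): draw parameters from the prior, push them through the likelihood to simulate data, and eyeball whether the fake data even lands in a plausible range.

    import numpy as np

    np.random.seed(0)
    n_sims, n_obs = 1000, 50
    mu = np.random.normal(0, 100, size=n_sims)             # candidate prior on the mean
    sigma = np.abs(np.random.normal(0, 10, size=n_sims))   # half-normal prior on the sd
    fake = np.random.normal(mu[:, None], sigma[:, None], size=(n_sims, n_obs))
    print(np.percentile(fake, [1, 99]))   # if this range is absurd, rethink the prior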


> ...the prior validation problem is similar to the likelihood validation problem...

But priors can be much harder.

Say I’m trying to estimate a wind speed from the blade velocity of a windmill. I can bring a more accurate wind speed sensor to calibrate the windmill against the wind speed, perhaps aided by basic physics. This is the likelihood portion.

But what should the prior be? The typical speed at that time of day? The speed in January? The speed on cloudy days? I have to have a crisp number — a full distribution actually, accurate out to the tails. I really have very little grounding for choosing that distribution.

I started out just wanting to relate the wind speed to some data in a rather concrete way, and now I’ve been roped into choosing a crisp distribution for a rather amorphous state of nature.

This is a deep problem.

We can sharpen the problem. Say my number and yours are different. How do we tell who is right?

One can try a different tack: I’m being stubborn. The prior will mostly wash out in any well-posed problem, or else why try to solve it? But now we’re back to frequentism, just looking at the likelihood.

HN tends to invoke the Bayesian framework as a complete solution to inference — I’m just trying to demonstrate that there are problems with that approach.


> I can bring a more accurate wind speed sensor to calibrate the windmill against the wind [...] But what should the prior be? [...] I have to have a crisp number — a full distribution actually, accurate out to the tails.

What would you do when the sensor returns negative wind speeds due to noise or errors?

The wind speed cannot be negative, or greater than the speed of light. An expert in windmills can narrow down the prior distribution much further.

> But priors can be much harder.

Choosing a prior is hard because it requires thinking explicitly about the problem and its assumptions. It merely exposes our lack of expertise on the problem.

When you're lazy, you can pick a uniform prior Uniform(0, c) and call it a day.

> We can sharpen the problem. Say my number and yours are different. How do we tell who is right?

Forget about the prior: say we have 2 sensors which output two slightly different wind speeds. Which wind speed is right? The lower one or the average?

This is a deep philosophical problem. However, it's a problem for any model.

> The prior will mostly wash out in any well-posed problem.

I don't think so. Any well-posed problem should include the prior, or else how could we tell that 2 data points are not enough?

> HN tends to invoke the Bayesian framework as a complete solution to inference [...]

Bayesian framework is indeed a complete solution to inference in a formal/logical sense. However, I agree that there are many problems in applying the Bayesian framework to real-world problems, and they require serious thinking about our assumptions about the problem.


"Bayesian framework is indeed a complete solution to inference in a formal/logical sense."

Bradley Efron, in TFA, begs to disagree:

"I wish I could report that this resolves the 250-year controversy and that it is now safe to always employ Bayes’ theorem. Sorry. My own practice is to use Bayesian analysis in the presence of genuine prior information; to use empirical Bayes methods in the parallel cases situation; and otherwise to be cautious when invoking uninformative priors. In the last case, Bayesian calculations cannot be uncritically accepted and should be checked by other methods, which usually means frequentistically."


xcodevn said “in a formal/logical sense”, not “in a practical sense”.


My perspective is that the problem of deciding the "correct" prior is a human problem because the human brain is a messy machine. An artificial intelligence which has full access to its own code and its memory in perfect detail will know precisely what it knows about a certain situation, and therefore can estimate a prior that accurately reflects this knowledge.

In the windmill example, the AI can quickly collect all it has in its memory about blade speeds, and maybe spend a self-imposed X min computational time to make a best guess for the prior speed distribution.

Humans can't do this, so we have gone down a philosophical rabbit hole of figuring out this "prior problem", when the real problem is that we are just messy informal thinkers.

> How do we tell who is right?

You are fundamentally conceptually mistaken here. There is nothing right or wrong about two agents disagreeing on the prior. The different priors reflect the before-experiment knowledge of the two agents. I am a windmill engineer, so my priors will be much narrower than those of someone who has never seen a windmill outside of a Hollywood movie.


I'm not quite the expert that perhaps you are, but to me it seems like Bayesian inference is still in a better spot here, because the priors are part of an explicit quantification of bias and assumption in a model.

Much havoc has befallen the scientific world because of the hidden assumptions of frequentist techniques with poorly understood preconditions, even for rather basic models. And there isn't much anyone can do about that save move to ever more complicated models.


I feel like variational inference has never been described very well to an intro audience, even one with the statistical basics.

Is it a graduate level topic or is there an intuitive course that teaches it to beginners?


"variational inference" is perhaps an uninformative name. You can just think of it as

- approximating the posterior using a nice parametric distribution, then

- minimizing some error (typically KL Divergence) between your approximate posterior and the true posterior
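
For what it's worth, here's a minimal sketch of those two bullets (my own toy example, not any particular library's API). The target posterior is a Beta(5, 3), which we know in closed form so we can check the answer, and we fit a Gaussian approximation on the logit scale by minimizing a Monte Carlo estimate of KL(q||p), reusing fixed noise as in the reparameterization trick:

    import numpy as np
    from scipy import stats, optimize

    np.random.seed(0)
    target = stats.beta(5, 3)        # stand-in for an intractable posterior over (0, 1)
    eps = np.random.randn(5000)      # fixed noise: the reparameterization trick

    def neg_elbo(params):
        mu, log_sigma = params
        sigma = np.exp(log_sigma)
        z = mu + sigma * eps                 # samples from the Gaussian q (logit scale)
        theta = 1.0 / (1.0 + np.exp(-z))     # map to (0, 1)
        # log target density at theta, plus the log-Jacobian of the logit transform
        log_p = target.logpdf(theta) + np.log(theta) + np.log(1.0 - theta)
        log_q = stats.norm(mu, sigma).logpdf(z)
        return -np.mean(log_p - log_q)       # = KL(q || p) up to Monte Carlo error

    fit = optimize.minimize(neg_elbo, x0=[0.0, 0.0], method="Nelder-Mead")
    print(fit.x)    # mean and log-std of the fitted Gaussian approximation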


Do you know _why_ KL divergence is minimized? I get that it gives a lower bound on the marginal likelihood, which is cool, but is that it? What are the alternatives?


KL divergence is motivated nicely from an information/coding theory viewpoint. It's very closely related to Shannon-von Neumann entropy [1], and KL(P||Q) characterizes the inefficiency of a code designed for a model distribution Q when it is applied to reality, which is actually represented by P.

A lot of recent work focuses on the Wasserstein distance [2] as an alternative. One advantage of Wasserstein over KL is that the Wasserstein metric provides a better fit over the whole distribution instead of localizing on some specific regions, thereby preventing "mode collapse". This makes it a popular metric for training Generative Adversarial Networks (GANs).

For recent work on applying Wasserstein distance to variational inference, see: https://arxiv.org/abs/1805.11284

[1]: https://physics.stackexchange.com/questions/64574/definition...

[2]: https://en.wikipedia.org/wiki/Wasserstein_metric


I found this talk to be useful, despite the technical difficulties https://www.youtube.com/watch?v=Dv86zdWjJKQ

(I haven't watched https://www.youtube.com/watch?v=ogdv_6dbvVQ but it seems like a longer version of the same talk)


NUTS and Stan are quite old at this point!

Here's a more recent advance https://arxiv.org/pdf/1711.09268.pdf


Stan supports Hamiltonian Monte Carlo.

https://arxiv.org/pdf/1701.02434.pdf


Many samplers are based on HMC; it's a general class of samplers, not a specific algorithm. NUTS is a variation of HMC, as is the paper I linked above.

'vanilla' HMC uses detailed balance to guarantee that the stationary distribution of the chain is the one you want, causing the process to behave like a random walk. So although the Hamiltonian bit of HMC lets you take these great big steps through state space, you end up retracing your steps quite a lot.

Hence NUTS (No-U-Turn Sampler) et al.


It's fine for a company or individual trying to optimize an objective, but not as a way to do good science (which this article is about).


I strongly disagree. Bayesian inference is the only known self-consistent formal system for doing science, i.e. updating our belief system about the world based on the current evidence.


The problem with that reasoning is: whose belief system? Where do you come up with a prior that everyone agrees with?


You are not supposed to agree on a prior. That's one of the fundamental insights of the Bayesian inference framework: different people know different things about a given situation, so they initially disagree, and therefore their priors are different. This should not be surprising. People disagree all the time, and the Bayesian framework just formalizes it.

The different people can then go on and do lots of experiments, collect lots of data and update their priors to posteriors. And the guarantee is that, as long as each person's prior was not a mathematically weird function, after enough evidence has been collected all these people will have the same posterior, i.e. they will agree [1].

[1] The famous Aumann's agreement theorem https://en.wikipedia.org/wiki/Aumann%27s_agreement_theorem is a related result that you might like to read about.


Exactly. This is why (in my view) scientific research should focus on presenting evidence, not on arguing for certain posteriors or priors. The meta-science process then steers Bayesian beliefs correctly and the evidence-gathering process efficiently. (edit: I see now the top post here on this article also discusses this point.)


Actually, it's the strength of Bayesian inference that these assumptions are made apparent.

Coming to consensus on priors is the same process for arriving at consensus that all scientific inquiry must engage in. Anyone who says frequentist methods somehow more accurately represent an underlying reality is pulling a fast one.


Hmm, I think part of the question is where this debate and consensus should occur. I believe in firmly separating rigorous science from opinion and belief. To me, it follows that scientific research should focus on presenting evidence and leave it to Bayesian individuals to update their beliefs based on this evidence. Similarly I think argument or discussion about priors is not in scope for scientific research (except maybe a bit in the "motivation" subsection). (edit: I see now the top post here on this article also discusses this point.)


I can't help but feel like this is a fundamental definition problem. Science is not actually distinct from consensus forming. Science does not work with raw facts, it forms models based off human observations which are themselves a kind of consensus.

Bayesian research is just more honest about what's already the case.


I wouldn't agree, at any given time there is a lot of disagreement and non-consensus in given fields. So we need new research to gather additional evidence. If every research paper tried to argue for a particular prior and posterior, rather than just gathering evidence, we would never make progress toward consensus either...


> I wouldn't agree, at any given time there is a lot of disagreement and non-consensus in given fields.

That's precisely my point. The act of presenting and refining research IS the act of building that consensus.

My statement here is not a novel thought. It's been pretty much the modern philosophy of science for over a decade.

> So we need new research to gather additional evidence.

This is simply data gathering though. Every approach starts here. I'm not sure why you suggest that people using Bayesian approaches to analysis are somehow forbidden from being informed by data (or informing priors by data).

That's exactly the same process folks use when selecting non-bayesian models. They don't spring from absolute truth, they're selected as well.

> If every research paper tried to argue for a particular prior and posterior

Given the replication crisis that's in part due to mis-application of existing models along with a lack of rigor in data collection, having research focus more tightly on the methodology for presenting data and conclusions doesn't seem like a bad outcome at all.


And if you believe Friston, Active Inference is how biological systems work!


Curious if there is any work to build an ecosystem for running complex models on ML accelerators, like some layers for TF, for example.


There are several libraries. PyMC4 [1], the next version of PyMC3, will introduce TF as a backend. TensorFlow Probability [2] is from Google. Pyro [3], from Uber, uses a PyTorch backend.

[1]: https://github.com/pymc-devs/pymc4

[2]: https://www.tensorflow.org/probability/

[3]: http://pyro.ai/
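
For anyone curious what this looks like in code, here's roughly the shape of a model in today's PyMC3 (a trivial toy model of my own, just to show the API style):

    import numpy as np
    import pymc3 as pm

    data = np.random.randn(100) + 1.0                   # toy observations

    with pm.Model():
        mu = pm.Normal("mu", mu=0.0, sd=10.0)           # prior on the unknown mean
        pm.Normal("obs", mu=mu, sd=1.0, observed=data)  # likelihood
        trace = pm.sample(1000)                         # NUTS by default for continuous models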


I think this article omits the most important distinction between Bayesian and Frequentist statistics: subjective vs. frequentist interpretations of probability. In my own opinion, neither is "true", they're both just different tools for different purposes.

Bayesian inference is great when you have to make a decision and there are many theorems that illustrate this (for example, the arguments around coherence [1] and the complete class theorems [2]). In fact, Bayesian techniques are often useful for creating estimators with great frequentist properties! However, Bayesian interpretations of probability, and thereby the meaning of Bayesian statements, are inherently tied to the beliefs of an individual. That means that Bayesian statements usually aren't "true" in the objective / non-relative sense that we often expect from science. On the other hand, frequentist statements tend to have more of an objective flavor. The trick is: all our mathematical models have shortcomings and ways in which they're wrong when applied to any particular situation -- so neither really has a claim to being true.

The frequentist perspective often looks at worst case risk and tends to give a more global understanding of a procedure in terms of "how does this procedure shake out in all reasonably possible scenarios?". So, frequentist methods tend to be a bit more risk-averse, which is often useful but can cost you by being too pessimistic. Ultimately, the real win is to know your tools well and to pick the right one for the job.

[1] https://en.wikipedia.org/wiki/Coherence_(philosophical_gambl...

[2] https://projecteuclid.org/euclid.aoms/1177730345


For those who are new to Bayesian statistics and not too eager to dive into the maths right away, I recommend Think Bayes [0]. It gives a nice introduction for those who know programming (Python). The ebook is available for free (see link).

[0]: https://greenteapress.com/wp/think-bayes/


I remember the frequentist approach taught in introductory stats classes never making sense to me. I didn't want to shove "stats" into the back of my brain and just focus on graduating. I genuinely wanted to understand the world a little better.

I began to research alternative approaches to modeling and conducting inference a few years ago. Discovering Bayesian Inference has had a large impact on the way I think and conduct research. There's a lot of hype and uncertainty about what "Bayesian" actually means. Here's a compact definition that I hope will attract some interest:

Bayesian Inference allows you to explicitly quantify your prior beliefs and get a more complete picture of uncertainty when modeling something.

If you'd like to learn more, the links below should be helpful.

Introduction to Bayes' Theorem (short): https://www.countbayesie.com/blog/2015/2/18/bayes-theorem-wi...

Bayesian A/B testing example (short): https://www.countbayesie.com/blog/2015/4/25/bayesian-ab-test...

If you're interested in spending some time learning about applied Bayesian inference, I highly recommend Statistical Rethinking. The book doesn't assume a strong mathematical background and it's filled with practical examples. https://xcelab.net/rm/statistical-rethinking/

McElreath is currently working on a second edition of that textbook, due around 2020: http://elevanth.org/blog/2018/07/14/statistical-rethinking-e...


BTW can anybody share a link to a really simple explanation of Bayes' Theorem? I once saw one that was the size of a tweet and would let you understand it in a matter of seconds; all the "super-duper intuitive explanations" around are actually too long and complex.


I like to split the theorem in the following way:

P(Hypothesis|Data) = P(Hypothesis) * evidence_factor

P(Hypothesis) is the prior probability of the Hypothesis being true, in other words the probability we gave to the Hypothesis before seeing any of the data we are using in the theorem. When new data is observed, we use Bayes' theorem to update our belief in the hypothesis, which in practice means multiplying our prior probability by a number that depends on how well the new data fits our hypothesis. More precisely:

evidence_factor = P(Data|Hypothesis)/P(Data)

So it is the ratio of how likely our data is if our hypothesis is true, compared to (divided by) how likely it is in general. If the data is more likely under our hypothesis, our probability of the hypothesis being true increases; if it is more likely in general (and thus also more likely in case our hypothesis is not true; you can prove mathematically that those two statements are the same), then our belief in the hypothesis decreases.

TLDR: Prob(Hypothesis after I have seen new data) = Prob(Hypothesis before I saw the new data) * (how likely I am to see the data if my hypothesis is true, compared to in general)
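
A quick numeric example of that split (numbers made up for illustration): hypothesis = "patient has the condition" with a 1% base rate, data = "test came back positive", where the test flags 90% of true cases and 9% of healthy people.

    p_h = 0.01
    p_data_given_h = 0.90
    p_data = 0.90 * 0.01 + 0.09 * 0.99          # total probability of a positive test
    evidence_factor = p_data_given_h / p_data   # ~9.2: the data favours the hypothesis
    posterior = p_h * evidence_factor           # ~0.09: more likely, but still unlikely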


    P(A|B) = P(B|A)P(A)/P(B)
Is a rule from statistics. In Bayesian speak this usually becomes

    P(prior|data) = P(data|prior)P(prior)/P(data)
or

    P(prior|data) proportional to P(data|prior)P(prior)
where P(prior|data) is also called the "posterior". The main idea is that you have some idea of "the prior" and you update it with your data to get the posterior.

This is probably not what you were looking for but this is it.



    "Extraordinary claims require extraordinary evidence"
Always resonated with me as a good summary.

Where a (naive) frequentist might assume, for instance, that after a 90% accurate test comes back positive the hypothesis is likely to be true, a Bayesian would ask how likely it was to be true in the first place; all the test did was make it ten times more likely, which may or may not make it probable.
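
Concretely, in odds form: if the hypothesis had a 1-in-1,000 base rate, the prior odds are 1:999; a Bayes factor of 10 moves that to 10:999, which is still only about a 1% posterior probability despite the positive test.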

You may enjoy https://www.lesswrong.com/posts/XTXWPQSEgoMkAupKt/an-intuiti...


This is the best I've seen, and I've seen a lot:

http://arbital.com/p/bayes_rule_guide


To update the probability you assign to a prior after you make some observation you rescale it by P(observation | prior) / P(observation). This scaling factor is proportional to how well the prior predicted the observation and inversely proportional to how well the "average prior" predicted the observation.

So a prior's probability increases to the degree that it predicts an observation better than alternative priors.


By definition of conditional probability: P(A|B) = P(A,B) / P(B) and P(B|A) = P(A,B) / P(A). Therefore, P(B|A) = P(A|B) * P(B) / P(A).


You should define what kind of simplicity you're asking for: An easy to understand explanation or a simple formula. I actually think that J Pearl & Co in the Book of Why provides a good example of the former.


"A Bayseian FDA regulator would b more forgiving". He absolutely would not. It should be obvious to anyone who gives the matter any thought that it must be more probable that you'll conclude drug A is better than drug B at the 5% level at some point over the course of an open ended experiment than that that will be the case at the end of a trial with a specific number of runs.

In fact, the open-ended "stop when you win" method is the equivalent of running an enormous trial, re-analyzing the result at each point, and publishing the most favorable point as the result.


The "open ended stops when you win" method, has to take into account contrary evidence into account. If you make 10 experiments, and 3 of those say your new and improve medication doesn't work, you have to take those into accounts, or else you're just cheating.

The only remedy against publication-bias-based cheating is to publish everything, including the failures. That will take care of the "wait until I get a 1 in 20 fluke" trick for getting past the p<0.05 threshold.


""" The Bayesian-frequentist argument, unlike most philosophical disputes, has immediate practical consequences. Consider that after a 7-year trial on human subjects, a research team announces that drug A has proved bet- ter than drug B at the 0.05 signifi cance level. Asked why the trial took so long, the team leader replies “That was the first time the results reached the 0.05 level.” Food and Drug Administration (FDA) regulators reject the team’s submission, on the frequentist grounds that interim tests of the data, by taking repeated 0.05 chances, could raise the false alarm rate to (say) 15% from the claimed 5%. A Bayesian FDA regulator would be more forgiving. Starting from a given prior distri- bution, the Bayesian posterior probability of drug A’s superiority depends only on its fi nal evaluation, not whether there might have been earlier decisions. """

Is that right? At each next trial, Bayesians should feed the probability from the previous one in as the prior. Assuming the first two trials did not bring the required results, the prior going into the third one should be rather small.


It's wrong. Not because the stated probability at the point you stop the experiment is wrong, but because the (stupid) rule is that we'll approve the drug if the probability that the drug is better than the alternative is above some arbitrary threshold value. If you run experiments for a fixed number of trials, you'll get a variety of conclusions with different strengths. If you stop as soon as you are in the "barely passing" zone, you'll get a lower number of failing results, a higher number of barely passing results, and none at all that do better than barely passing.


The way I read this, the first two trials did not bring enough evidence towards the drug working, but they did provide some evidence. In the end, all three experiments taken together provide pretty massive evidence that the drug actually works, way below 0.05.

Of course, this all comes crashing down if the first two experiments happen to provide contrary evidence (that is, evidence the drug does not work). This would cancel out the results of the final trial somewhat, and not taking this into account is clearly cheating by publication bias.


Written by the creator of the bootstrap method in statistics.

Bayesian methods aren't used much in industry compared to the frequentist approach. The likelihoodist approach is even rarer. I've learned a bit of Bayesian statistics but ended up refocusing on time series and survival analysis within the frequentist domain. There are waaaay more job postings, and the people you work under tend to be frequentists or more comfortable doing it the old way.


Bayes' theorem made more sense once I started seeing the conditional (if statement) in conditional probability.


Yep. Bayes' theorem is actually a generalization of contrapositivity (if A implies B, then "not B" implies "not A") to stochastic settings. It's not usually taught in an intuitive way.
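
One way to see it: if A implies B deterministically, then P(B|A) = 1 and P(not B|A) = 0, so Bayes gives P(A|not B) = P(not B|A)P(A)/P(not B) = 0, i.e. observing "not B" rules out A, which is exactly contraposition. With P(B|A) < 1 you get the softened version: observing "not B" merely lowers the probability of A.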


Yeah, I had to figure some of this out on my own. Do you have a good book/resource on this?


Probability Theory: The Logic of Science is an excellent textbook on the math and theory. Note the title even references cultus' point.

Unfortunately it predates many of the modern developments in methods / computation, but if you want to dive deep, I strongly recommend it. It takes the perspective of designing a reasoning robot to make the most effective decisions.

Another resource, e.g. Stan's manual, can get you up and running on computation/inference. Your choice of computation tool should reflect the type and size of problems you're interested in, and the languages you're comfortable with. Stan has bindings for many scripting languages. R also offers Nimble, Python PyMC3 and Edward, and Julia has DynamicHMC and Turing. (EDIT: xcodevn has better Python recommendations: https://news.ycombinator.com/item?id=18213923 )


I apologize for repeating myself, I don't think I'm being clear on what I'm trying to say. Let me give an illustrative example: let's say I've got a 100% fair coin, but I want to trick you into thinking it comes up heads more often than tails. The way I do this is with a meta-experiment: I will have 1000 trials with up to 1000 flips each, but I stop each trial as soon as I have a majority of heads.

What we expect to find at the end is that I get about the same number of heads and tails in the whole meta-experiment. About 95% of the runs will have more heads than tails, but each of those runs will only have one extra head. The few runs where I did all 1000 flips will be ones where heads never had a majority, so they'll probably have lots of extra tails. The same number of heads and tails overall is the relevant result; "95% of runs had majority heads" is bullshit intended to baffle you. Nobody would be fooled by such nonsense, right?
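
Here's a throwaway simulation of that meta-experiment (a sketch, not anyone's published analysis): a fair coin, up to 1000 flips per run, stopping as soon as heads pulls ahead.

    import random

    head_majority_runs = 0
    total_heads = total_flips = 0
    for _ in range(1000):
        heads = tails = 0
        for _ in range(1000):
            if random.random() < 0.5:
                heads += 1
            else:
                tails += 1
            if heads > tails:
                break                          # "stop when you win"
        head_majority_runs += heads > tails
        total_heads += heads
        total_flips += heads + tails

    print(head_majority_runs / 1000)   # the vast majority of runs end with heads "ahead"
    print(total_heads / total_flips)   # yet overall the coin is still ~50/50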


Does anyone have any examples of informative priors that they used to solve some problem at work?


See the discovery paper of gravitational waves [1] which uses informative priors from physical evidences/constraints.

[1]: https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.11...


Suppose you have a probability monad, implementing enumeration or random sampling. Like Amb but with probabilities attached.

    import Control.Monad (guard)  -- guard needs a MonadPlus/Alternative instance for Prob

    -- `data` is a reserved word in Haskell, so the observation is called `obs` here
    bayesRule :: Eq b => Prob a -> (a -> Prob b) -> b -> Prob a
    bayesRule prior likelihood obs = do
      h <- prior           -- draw a hypothesis from the prior
      d <- likelihood h    -- simulate an observation under that hypothesis
      guard (d == obs)     -- keep only draws that match the actual observation
      return h             -- the surviving hypotheses form the posterior
I don’t actually do Haskell...just thinking out loud.

Looks like it was written similarly here http://www.randomhacks.net/files/build-your-own-probability-...


d == obs

requires an exact match of data, doesn't seem right


Interesting article (thanks for sharing). Indeed, a genuine prior is the key to applying Bayes' rule. However, is there a proof that it always exists? (As Feynman would question the laws of physics changing in time.)



"The prior can generally only be understood in the context of the likelihood" https://arxiv.org/abs/1708.07487


One of the best examples of the idea of Bayes is the Monty Hall problem. It is a good example of how a prior probability (3 unopened doors) can lead to a clearer posterior: the host opens one of the doors you didn't pick, revealing a bad prize, and you are asked whether to stay with the door you chose or switch to the remaining closed door. It turns out, via Bayes, that it's better to switch doors because you have more information now.

Tons of write ups and YouTube videos out there on it but here is one example of an explanation:

http://angrystatistician.blogspot.com/2012/06/bayes-solution...


It's better to switch doors according to any statistical method, whether you're a frequentist or Bayesian does not matter. You can also show that you should switch doors by making an exhaustive truth table, by writing a computer program, or by experimentation, if you prefer these kind of approaches.
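
For example, here's a throwaway simulation of the standard game (standard rules assumed: Monty knows where the prize is and always opens an empty, un-picked door):

    import random

    def play(switch, trials=100_000):
        wins = 0
        for _ in range(trials):
            prize = random.randrange(3)
            pick = random.randrange(3)
            # Monty opens a door that is neither the contestant's pick nor the prize
            monty = random.choice([d for d in range(3) if d != pick and d != prize])
            if switch:
                pick = next(d for d in range(3) if d != pick and d != monty)
            wins += pick == prize
        return wins / trials

    print(play(switch=False))   # ~0.33
    print(play(switch=True))    # ~0.67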


A truth table doesn't necessarily get you to the right answer. There are 3 doors I could pick, 3 doors Monty could pick, and 3 doors the prize could be behind. If I make a truth table of all 27 possible combinations, there are 12 combinations where Monty doesn't choose the same door as the contestant or the prize. Of these 12 options, exactly 6 have the contestant choosing the right door and 6 have the contestant choosing the wrong door.

You can certainly create a different truth table that arrives at the correct answer, but the truth table approach does not help ensure you get to the right answer like the Bayesian approach does.


No method on earth necessarily gives you the right answer to a problem. You first have to represent the problem correctly, of course.

Check out Scenario 2 here https://medium.com/@ProfessorF/visualizing-the-solution-to-t... for a correct tree.


Neither method guarantees the right answer, but it's a lot easier to get it wrong with a truth table.


Maybe you're right, I'm not fully convinced.

I mentioned a full truth table/decision tree because a long time ago these gave me the insight why switching is the right solution in the standard formulation of the problem, and they also illustrate why the problem is a purely deductive/logical problem whose solution does not require any inductive inference.

Then, to me it was a valuable lesson to learn that the Monty Hall problem does not reveal any perceived or real fundamental problem of probability theory.


I made this very same point ;) they burned me at the stake.


Bayes' theorem also helps clear up some of the subtleties behind the Monty Hall problem. Switching is advantageous because Monty knows which door has the prize. If Monty doesn't know which door has the prize, seeing a non-prize door results in 50-50 odds that your door has the prize.

Depending on how you model Monty Hall's prior probability of revealing the prize, seeing a non-prize door can result in the probability anywhere between 0 and 2/3 of switching being advantageous.
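
The 50-50 case is a one-line Bayes calculation: say you pick door 1 and an ignorant Monty (choosing at random between the two doors you didn't pick) happens to open door 3 and it's empty. P(prize behind 1 and that event) = 1/3 * 1/2, P(prize behind 2 and that event) = 1/3 * 1/2, and the prize-behind-3 case is ruled out; renormalizing gives 1/2 each, so switching no longer helps.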


My take on the subtleties, in painstaking detail, with source code so you can reproduce my experimental results: http://loup-vaillant.fr/tutorials/monty-hall

I consider several kinds of Monties there, including an "enemy" that will try to open the prize door if he can.


I like your attention to detail on this! I'm a bit surprised you don't include the overall probability of winning against each Monty with an optimal strategy. I think it's very interesting that the helper doesn't increase your odds beyond 2/3.


Yeah, that and the spelling mistakes… I think I'll have another pass at this. I just re-read it, and I glossed over the numbers at times.


This cheats the frequentist, because in the game the frequentist isn't allowed to change the distribution.

With frequentism, the trick is always in choosing a distribution. You can't update it according to a rule, but there is no reason why you can't simply pick a different population distribution to operate under.


I founded my consulting practice on Bayesian networks. If anyone needs a deeper explanation or advice, feel free to reach out. manmit@dextroanalytics.com


The only sensible "non-informative" prior is Jeffreys' prior. Invariance under reparameterization of the parameter is what I would consider to be a non-negotiable feature of any non-informative prior belief.

To assign an (improper) uniform prior to the variance of a Gaussian distribution is to assign a non-uniform prior to its standard deviation, and vice versa. One can, in certain circumstances, assign priors to be non-informative in a particular way, but to be universally non-informative, no, it must be Jeffreys' or nothing at all.
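
The change of variables makes this concrete: if p(v) is constant on (0, c) for the variance v, then for sigma = sqrt(v) we get p(sigma) = p(v) * |dv/dsigma| = const * 2*sigma, a density that grows linearly in sigma, i.e. decidedly not uniform.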

In consideration of the aforementioned, the debate about non-informative Bayesian priors is a relic of 20th century philosophy. The construction of hierarchical causality networks for the purposes of unsupervised learning is the future of Bayesian statistics, and priors in this context are rarely non-informative.


I agree with a tiny caveat, in that I'd change Jeffreys prior to reference prior.

On the other hand, these priors can be difficult to create in some (many?) situations and it's often more tractable to do ML.

Bayesian inference seems more principled to me in general if you allow for and use reference priors, but outside of that I think there are still reasons to prefer ML. There are two areas where I still have problems with priors.

The first is that the sequential testing paradigm (that is, prior -> posterior -> prior) doesn't always work in reality because you often have multiple experimenters operating simultaneously and independently with different priors. In one sense this is a trivial problem but in another sense it is not. E.g., if you are a meta-analyst faced with integrating such results, is prior variation akin to publication bias? What implications does that have?

The second is that there are situations in which using a prior actually might lead to unfair inequities. For example, let's say you're trying to make some inference about an individual, and know that ethnicity provides information in a statistical sense about the parameter you are making an inference about. Is it prejudicial or not to use a prior? I think using a reference prior would address this situation, but depending on the scenario you could make an argument that it is unfair (e.g., if the informative prior would suggest a positive outcome, not using it might be seen as prejudicial, but if the informative prior would suggest a negative outcome, using it might be seen as unfair). In this case, not using a prior at all actually might make sense--you might make a similar argument about non-Bayesian inference as Bayesian reference inference, but using non-prior-based inference does sidestep the issue in a sense, in that there is no longer a prior to decide about. This might be especially important in that, e.g., if you have a series of individuals, the act of choosing a prior might be seen as prejudicial in itself.

I generally consider myself as an "objective Bayesian" in the Jaynesian / reference prior sense, but there are practical and theoretical scenarios where I think people are likely to run into problems.


Jeffreys / reference priors also have some weird behaviour in high dimensions. You may enjoy this attempt to do better, without giving up reparameterization invariance:

https://arxiv.org/abs/1705.01166


But it’s worth noting that the most important branch of the physical sciences today, quantum theory, is manifestly non-Bayesian.



QBism is an idealist interpretation of QM. (Which is to say, non-scientific.)


An interesting alternative to frequentist and Bayesian is that of the 'likelihoodist':

http://gandenberger.org/wp-content/uploads/2014/07/Statistic...

http://gandenberger.org/2014/07/21/intro-to-statistical-meth...



