Bayes Rules – An Introduction to Applied Bayesian Modeling (2021) (bayesrulesbook.com)
147 points by RafelMri on July 17, 2022 | 38 comments


Review by Christian Robert, a statistician: https://xianblog.wordpress.com/2022/07/05/bayes-rules-book-r...


This is a really useful book review, thank you for sharing it (and thank you to the author for writing it!).


The hardest part of bayes is

1) understanding what 'likelihood' actually is or represents

2) understanding what a 'partition function' actually is or represents

I always forget what these mean since I don't use bayes in anger

Also the above reveal there is more elaborate structure in bayes than just the idea of 'updating your prior with new information', which is simply a platitude among STEM folk


No, the hardest part is realizing that there are all kinds of tacit assumptions that we bring to bear merely by formulating a problem for Bayesian analysis.

Take the example used in chapter 2 of the book. It assumes that news articles can be classified as "real" or "fake", and there is no middle ground. It assumes that the initial prior produced by experts is reliable. And most of all it assumes that certain features, like exclamation points in the title, are causally related to realness or fakeness.

To illustrate this last point, consider analyzing the titles for a different feature: the presence of the letter "z" rather than the presence of exclamation points. If it turned out that fake news in the corpus used to generate the priors just happened to have more zees in their titles, would you then be justified in concluding that a news article about Zanzibar was more likely to be fake because it contained two zees?

That example might seem contrived, but if you use a Bayesian spam filter it is actually plausible that the presence of the letter "v" is diagnostic of spam because of the prevalence of spam concerning viagra. But again, there is a causal model behind this: viagra is a product that is often the subject of spam marketing, and the word "viagra" happens to have a v in it, which is a priori an uncommon letter in English.
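To make that concrete, here's a minimal sketch with made-up counts (none of these numbers come from a real corpus): a single-letter feature can look strongly diagnostic under Bayes' rule even though nothing causal connects the letter to spam.

    # Hypothetical corpus counts -- purely illustrative, not real data.
    n_spam, n_ham = 400, 600                 # training messages
    spam_with_v, ham_with_v = 320, 90        # messages whose text contains "v"

    p_spam = n_spam / (n_spam + n_ham)       # prior P(spam)
    p_v_given_spam = spam_with_v / n_spam    # likelihood P("v" | spam)
    p_v_given_ham = ham_with_v / n_ham       # likelihood P("v" | ham)

    # Bayes' rule: P(spam | "v") = P("v" | spam) P(spam) / P("v")
    p_v = p_v_given_spam * p_spam + p_v_given_ham * (1 - p_spam)
    p_spam_given_v = p_v_given_spam * p_spam / p_v

    print(f"P(spam | text contains 'v') = {p_spam_given_v:.2f}")   # ~0.78 with these counts

The letter looks "informative" only because of the viagra-shaped corpus behind the counts.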

But all this can fall apart depending on the circumstances. If you one day joined an email discussion of Stradivarius violins, the v signal could suddenly fail. An even more dramatic example: suppose you are an academic who starts to do research on spam filters and you have a collaborator who starts to send you examples of hard-to-filter spam. Now you have a very strong signal, but no straightforward textual analysis will allow you to extract it.

The hard part of Bayesian analysis is deciding what features to even look at. All the rest is borderline trivial by comparison.


>the hardest part is realizing that there are all kinds of tacit assumptions that we bring to bear merely by formulating a problem for Bayesian analysis.

I think the common argument is that this is a strength of Bayesian analysis. Namely, that your priors make you explicitly state your assumptions. All models integrate assumptions, but not all of them make you explicitly quantify them like Bayesian analysis does.


This was my experience as well, and it's why I recommend learning Bayesian stats to any student of data analysis and statistics. I find that learning how to set up Bayesian models has a strong elucidating effect on model-building in general.


> would you then be justified in concluding that a news article about Zanzibar was more likely to be fake because it contained two zees?

I completely agree with you about hidden assumptions. But I don’t think that causal thinking is one of the problems here.

Bayes’ rule applies whether or not the relationship between X and Y is causal. Bayes’ rule is a predictive model and prediction does not need causation.

The problem in the Zanzibar example is that the model is misspecified. Or at least the model specification does not represent how you as a human being with cultural knowledge think about the problem. You look at Zanzibar and see an island in Tanzania. Your model looks at Zanzibar and sees two a's and two z's and a few other letters. So it is the wrong implicit assumption about the underlying model that's giving you "nonsensical" results.


> the model is misspecified... it does not represent how you as a human being with cultural knowledge think about the problem

Yes, that is exactly my point. The way "you as a human with cultural knowledge think about the problem" is a major component of Bayesian analysis. The math is almost incidental. By the time you have chosen what features to do the math on, most of the heavy lifting has already been done.

This matters because if you just crunch the Bayesian numbers on an arbitrary data set you will almost certainly find features that appear predictive but are not. The more numbers you crunch, the more likely this is to happen. A lot of people lose money in the stock market this way.
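Here's a hedged illustration of that last point (random data, nothing real): score enough unrelated "signals" against a random outcome and the best one will look predictive by chance alone.

    import numpy as np

    rng = np.random.default_rng(0)
    n_days, n_features = 250, 1000

    # Completely random "signals" and a completely random up/down outcome.
    signals = rng.normal(size=(n_days, n_features))
    market_up = rng.integers(0, 2, size=n_days)

    # Correlation of each signal with the outcome; none is genuinely predictive.
    corrs = np.array([np.corrcoef(signals[:, j], market_up)[0, 1]
                      for j in range(n_features)])

    best = np.argmax(np.abs(corrs))
    print(f"best-looking signal: #{best}, |correlation| = {abs(corrs[best]):.2f}")
    # With 1000 tries the "winner" typically shows |correlation| around 0.2:
    # it looks like an edge, but it is pure noise and won't hold out of sample.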


> A lot of people lose money in the stock market this way.

This sounds extremely plausible to me. I'd like to use it as an example in something; any data to back it up?


This is referred to as "p-hacking" if that helps you find some. Or maybe that term is more general, I'm not 100% sure, but it's close.


No it isn't. p-hacking is something much more general and is in no way evidence of people losing money on the stock market. Plausible unsubstantiated anecdote is not data, etc etc, no matter how well the story reads.

Try to imagine that there's quite a few experts around HN ...


P-hacking is sifting through _many_ possible relationships in a dataset and picking one that happens to look correlated.

This is what I was responding to:

> This matters because if you just crunch the Bayesian numbers on an arbitrary data set you will almost certainly find features that appear predictive but are not. The more numbers you crunch, the more likely this is to happen. A lot of people lose money in the stock market this way.

Please explain how this is not exactly the same concept, except with different motivations for looking through the data.

> Try to imagine that there's quite a few experts around HN

Why be a jerk though? I gave a reasonably well thought out comment related to one I found interesting, and made sure not to overly imply expertise.


Forking paths [1], overfitting, fitting to the holdout set, etc etc.

"P-hacking" applicable to vastly more than stock data and is not evidence of anything, evidence is what was asked for.

Your response implies I'm ignorant and can't google p-hacking without being told the name of that one way of finding false correlations, which is applicable to many, many fields. Now you're getting a bit cross that I've quietly pointed out that it isn't useful, and you're not actually providing the expertise here that you thought you were. It's kind to provide expertise and I appreciate the kindness. You misread, fine. Best.

edit: Oh wait

> Or maybe that term is more general, I'm not 100% sure, but it's close.

That is an edit addition to the parent after my reply, so I guess you did get the point. Best.

[1] http://www.stat.columbia.edu/~gelman/research/unpublished/p_...


Much agreed. Data analysis needs a good theory.


You’re making an interesting point here.

I’m resistant to it because some analysis I do (say, inferring atmospheric composition from the spectra of light you shine through the atmosphere) does fit rather well in the Bayesian setup. In a very real sense, if you are doing a bunch of these inversions and not using Bayesian methods, you are leaving RMSE on the table.

Here, the “features” are the spectra you sense and you add some basic prior information based on historical expectations. You don’t have that much wiggle room to invent new variables to confuse the analysis.

But the situation changes if you have woolier problems with human judgement in finding “features” — the variables in the prior or likelihood in the spam example you give. And you can fool yourself badly!

E.g., to go with your market example, I have two models for the stock market, and 1 month of historical data strongly favors model 1 over model 2, so I bet heavily on model 1 as if that’s the only possibility. But it wasn’t, and I lose.

The area of causal inference uses a slightly different set of tools that gets outside the Bayesian setup, including the Do() operator, which allows for manual intervention to probe the linkages in systems like these. There’s a lot of activity there as we’re looking for systems that learn reliably.

I don’t know if you listen to podcasts, but Judea Pearl was recently on Sean Carroll’s (excellent) Mindscape podcast. They discuss this at some length, it was very nice.

https://www.preposterousuniverse.com/podcast/2022/05/09/196-...

PS: it’s enjoyable to see your comments here.


Hi Mike! :-) Thanks for the kind words.

I don't know much about atmospheric science, but the canonical "real" example of Bayesian reasoning is a diagnostic test for some disease, where you are told that X% of the population has the disease, where X is typically a small number, so the "surprising" result is that your odds of having the disease are low even if you test positive and the test is generally accurate. The problem is: how can you possibly know what percentage of the population has the disease? To determine that you'd need a reliable test, but there is no such test. That's the whole point! If there were a 100% reliable test you wouldn't need to do any math.


Sometimes called the "Bayesian dogma of precision" -- to any state of nature, you can assign a crisp probability (e.g., https://sipta.org).

For your disease example -- if you don't know the accuracy of the test ("the model"), and/or if you can't figure out what percent of the population has the disease ("the prior"), then you can't turn the Bayesian crank.

Yet, some do, because Bayesian good.


> understanding what 'likelihood' actually is or represents

The probability of observing the data you observed, assuming that the model parameters take on a certain value.

> understanding what a 'partition function' actually is or represents

The probability of observing the data you observed, this time averaging over all possible parameter values (weighted by the priors).
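A small numeric sketch of both definitions, using a coin-flip model with a few candidate bias values (the three-point grid and the flat prior are just illustrative choices):

    import numpy as np

    heads, tails = 7, 3                            # the observed data
    f_grid = np.array([0.25, 0.5, 0.75])           # candidate parameter values (coin bias)
    prior = np.full(len(f_grid), 1 / len(f_grid))  # flat prior over the grid

    # Likelihood: P(data | f) for each candidate parameter value.
    likelihood = f_grid**heads * (1 - f_grid)**tails

    # Partition function / marginal likelihood: the likelihood averaged
    # over parameter values, weighted by the prior.
    marginal_likelihood = np.sum(likelihood * prior)

    posterior = likelihood * prior / marginal_likelihood
    for f, post in zip(f_grid, posterior):
        print(f"P(f={f} | data) = {post:.3f}")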


> The probability of observing the data you observed

Yes, in discrete cases. In continuous cases, you have to work with a probability density. I think this is one of the hurdles people encounter when they're first exposed to Bayesian stats. The probabilities, technically speaking, are zero.

The important insight in Bayesian work is that it's often not the probabilities themselves that matter but the ratios thereof, since from those alone you can compute posteriors.


Indeed, but it's a bit tedious to always say, "probabilities (in the discrete case) or probability densities (in the continuous case)."

In general, you have sums in the discrete case and integrals in the continuous case, but most formulas are otherwise the same.


That's quite true. Also, one could argue that continuous probabilities in practice are discrete probabilities due to finite resolution--we just don't care to specify what the resolution is.


I once had a TA job for an undergrad stats course. This was the "non calculus" course for the psych majors. I had also taken the "math" version of the same course, where we spent two semesters and proved everything. I honestly never came up with a satisfactory layman's explanation of why continuous distributions are necessary, or of what "continuous" is. I knew that we used calculus to derive the formulas that they were faced with memorizing, but that would have been irrelevant to them.

The best explanation I can think of today is: Use the one that makes the math easier or more readable.


Many measurements are continuous.

What is the age of a rock? That's not a discrete quantity: it could be anything.


Plus or minus what?

I know the importance of continuous sets in the study of statistics as a branch of math. But I don't know of any measurements that can't be represented by integer multiples of a unit for all practical purposes. And the students in the non-calculus stats course can't grasp what continuity is anyway.


> Plus or minus what?

That's what the probability density specifies.

> But I don't know of any measurements that can't be represented by integer multiples of a unit for all practical purposes.

You can always discretize any real number, but why would you? I don't see how integers are easier to deal with than real numbers. Calculus can be viewed as the limit in which you discretize numbers infinitely finely. Once you know how things work in that limit, it's generally easier to use calculus than to work with discretized quantities. One example: summations are often more difficult than integrals, and one way of approximating sums is to turn them into integrals.

From my perspective, calculus is a basic part of mathematics that everyone should be expected to learn in school. In the US, calculus is often viewed as some sort of intimidating subject that only extremely clever people can grasp, and then only in late high-school or in university, but in East Asia, it's taught to children as a matter of course.


I thought likelihood doesn’t necessarily sum to one so it’s not a probability.


It will always sum (or integrate) to one with respect to the data. For example, given likelihood p(x_1, x_2, ..., x_N | params), summing (or integrating) over all possible values of x_1, ..., x_N will indeed yield 1.
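A quick numeric check of both halves of this exchange, with a coin-flip model and arbitrary numbers: the likelihood sums to 1 over all possible data sequences, but not over parameter values.

    from itertools import product

    f = 0.3                                   # a fixed parameter value
    seqs = ("".join(s) for s in product("HT", repeat=3))
    seq_probs = [f**seq.count("H") * (1 - f)**seq.count("T") for seq in seqs]
    print(sum(seq_probs))                     # 1.0 -- normalized over the data

    data_H, data_T = 2, 1                     # now fix the data instead
    param_sum = sum(f**data_H * (1 - f)**data_T for f in (0.1, 0.3, 0.5, 0.7, 0.9))
    print(param_sum)                          # 0.425 -- not a distribution over parameters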


The hard part is that the intuitive understanding of probability you learned as a kid is almost certainly wrong. You need to unlearn it and replace it with proper axiomatic treatment of probability based on measure theory. Once you have done that and built a new intuition, Bayesian thinking should feel pretty natural. Once you accept that probability is just the "size" of a set relative to the size of a superset and it's up to you to attach meaning to the sets, much of the confusion goes away.


I think you can get most of the way there with straightforward geometric intuition, treating the sample space as a rectangle of area 1 and proceeding from there into joint and conditional probability.
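Here's a tiny Monte Carlo sketch of that geometric picture (the events are chosen arbitrarily for illustration): probabilities are areas inside a unit square, and conditioning just rescales by the area of the conditioning event.

    import numpy as np

    rng = np.random.default_rng(1)
    x, y = rng.random(100_000), rng.random(100_000)   # sample space: the unit square, area 1

    A = x < 0.5                  # event A: the left half of the square (area 0.5)
    B = y < x                    # event B: the region below the diagonal (area 0.5)

    p_A, p_B = A.mean(), B.mean()
    p_AB = (A & B).mean()        # joint probability = area of the overlap (~0.125)
    p_B_given_A = p_AB / p_A     # conditional probability = overlap area / area of A (~0.25)

    print(f"P(A) ~ {p_A:.2f}, P(B) ~ {p_B:.2f}, P(A,B) ~ {p_AB:.3f}, P(B|A) ~ {p_B_given_A:.2f}")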


>>> ... the idea of 'updating your prior with new information' ...

... dates back to antiquity. I'm not a statistician, but my impression is that Bayesian methods are a way of formalizing that approach. On the other hand, use of the term "prior" is somewhat misleading because the formulas do not specify a time sequence for acquiring information.

I tend to have a lesser view of sprinkling Bayesian terminology into blogs about social issues. That's what I call Bayes Theater.


Likelihood: given a probabilistic model and its parameters, what is the probability of observing some data under that model?

For example, given a coin with some probability f of getting heads and thus 1-f of getting tails (model parameter), a set of coin flips (data), and the assumption that each coin flip is independent of all others (model), the probability of seeing an arbitrary sequence with H total heads and T total tails after H+T flips is

p(H,T|f) = f^H(1-f)^T [0]

Note that this likelihood is normalized (i.e. sums up to 1) with respect to all possible sequences of flips of length H+T, e.g. for length 2, there are 4 possible sequences of coin flips:

p(hh|f) + p(ht|f) + p(th|f) + p(tt|f) = f^2 + 2f(1-f) + (1-f)^2 = 1

(Also note that the likelihood is only a probability for discrete data; it’s a probability density for continuous data, since the underlying terms would no longer be probabilities but rather densities. In that case, replace the previous sum with an integral ranging over the entire domain of your continuous data.)

Partition function (usually called a "marginal likelihood"): what if we instead normalize the likelihood with respect to the model parameters? Then it would no longer be a probability distribution with respect to the data, but rather a probability distribution with respect to the parameters. The marginal likelihood is just this normalizing constant. In the previous example,

p(f|H,T) = p(H,T|f)/<constant that would normalize p(H,T|f) to integrate to 1 over all possible values of f>

You can optionally weight this likelihood by a prior p(f), in which case the numerator would be p(H,T|f)p(f), with the partition function updated accordingly.

>Also the above reveal there is more elaborate structure in bayes than just the idea of 'updating your prior with new information', which is simply a platitude among STEM folk

I could not agree more. The core philosophical tenet of Bayesian inference is that this re-normalization of the likelihood returns something probabilistically meaningful. The debate over the validity of priors pales in comparison to the debate over whether the likelihood ought to be treated as proportional to a probability distribution over the model parameters.

[0] Note that this is distinct from the probability of seeing any sequence with H heads and T tails. That would require multiplying by the number of distinct sequences with H heads and T tails (H+T choose H), yielding the binomial distribution.
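To make the renormalization described above concrete, here's a minimal grid-approximation sketch (the flat prior and the 1001-point grid are just illustrative choices):

    import numpy as np

    H, T = 6, 4                      # observed flips
    f = np.linspace(0, 1, 1001)      # grid over the bias parameter
    df = f[1] - f[0]
    prior = np.ones_like(f)          # flat prior p(f) = 1 on [0, 1]

    likelihood = f**H * (1 - f)**T   # p(H,T|f) for one particular sequence

    # Partition function / marginal likelihood: the constant that makes
    # likelihood * prior integrate to 1 over f.
    Z = np.sum(likelihood * prior) * df

    posterior = likelihood * prior / Z
    print(f"Z = {Z:.3e}")                                             # ~4.33e-04
    print(f"posterior integrates to {np.sum(posterior) * df:.3f}")    # ~1.000
    print(f"posterior mean of f = {np.sum(f * posterior) * df:.3f}")  # ~0.583, i.e. (H+1)/(H+T+2)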


The 'hardest' part is realizing that basically everyone reasons this way unless taught otherwise. The books do a lot of hard work to cover this up.


I've come across this format a couple of times now; does anyone know how to export it to PDF?

I honestly don't even mind paying for them but I want the flat file.


Pandoc + wget

    wget -r https://www.bayesrulesbook.com/ && \
      cd www.bayesrulesbook.com && \
      pandoc --pdf-engine=xelatex foreword.html preface.html \
        chapter-{1..19}.html references.html -o book.pdf


For me that repeats the table of contents many times, and leaves the maths as Latex code instead of formatting it as formulas.


I always get confused about who the audience is for these kinds of books.

For advanced readers this is really shallow; for beginners it's really advanced (who really wants to read a 543-page book as a beginner?).


Simply put, there are many people who are neither "beginners" nor "advanced". There is a section of the book titled Audience which reads

> Bayes Rules! brings the power of Bayes to advanced undergraduate students and comparably trained practitioners. Accordingly, the book is neither written at the graduate level nor is it meant to be a first introduction to the field of statistics.


The scope seems pretty standard, but I really like the strongly structured form and exercises. Is there a PDF version available, perchance?



