
> However, I'm not sure I understand a single component of the expression P(A|B) = P(B|A) P(A) / P(B) in this context.

You've come to the right place :)

In the context of least squares estimation you can usually ignore the denominator P(B) and only look at P(A|B) ~ P(B|A) P(A), where ~ means "proportional to". Real mathematicians (!= me) cringe here because the numerator alone isn't a proper PDF, but since we're only seeking the value where that PDF has its maximum, that's ok.

Good, so now we have P(A|B) ~ P(B|A) P(A). What you want to know is the maximum of P(A|B), that is "where's the most probable position of my 3D point A given all the 2D measurements B"?

Let's look at the solution P(B|A) P(A). The first factor is the likelihood function, the second factor is the prior. The prior is easy: It's some prior belief about the distribution of A. So if you know the position of your 3D point somewhat (even without looking at B), say you know it must be in a range of coordinates, or you know that Z must be positive or any other statement about A that you know must hold, that information would go into the prior.

P(B|A) on the other hand is your likelihood. That's kind of the "meat" of Bayes. Your likelihood is a distribution over B, given A. Meaning, if you knew your 3D point A, what would the PDFs of your 2D observations of it look like? Note that the likelihood is to be understood as a function of the state, meaning its domain is A.

So to summarize: argmax(P(A|B)) = argmax(P(B|A) P(A)). To get the most probable location of your 3D point, you search the likelihood's domain A (in this case the 3D space) for the points that would produce the observations B that you actually have, and then take your prior beliefs about A into account to narrow down that set of candidate points.

In the concrete case of least squares both likelihood and prior are assumed to be gaussian, so if you multiply them you get a new gaussian with a single maximum. Neat!
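
For what it's worth, here is a minimal 1D sketch of that last claim. It assumes the measurement observes the state directly, and the numbers and the names `normal_pdf` and `fuse` are made up for illustration. It multiplies a Gaussian prior by a Gaussian likelihood, and compares the closed-form maximum with a brute-force search over the unnormalized product.

```python
import math

def normal_pdf(v, mean, var):
    """Density of a 1D Normal(mean, var) evaluated at v."""
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def fuse(prior_mean, prior_var, meas, meas_var):
    """Closed-form product of a Gaussian prior and a Gaussian likelihood.

    Assumes the measurement observes the state directly (no H matrix).
    """
    post_var = 1.0 / (1.0 / prior_var + 1.0 / meas_var)
    post_mean = post_var * (prior_mean / prior_var + meas / meas_var)
    return post_mean, post_var

# Hypothetical numbers, chosen only for illustration.
prior_mean, prior_var = 0.0, 4.0
meas, meas_var = 2.0, 1.0
post_mean, post_var = fuse(prior_mean, prior_var, meas, meas_var)

# Brute force: evaluate likelihood * prior on a grid and take the argmax.
grid = [i * 0.001 - 10.0 for i in range(20001)]
unnorm = [normal_pdf(meas, a, meas_var) * normal_pdf(a, prior_mean, prior_var)
          for a in grid]
grid_argmax = grid[unnorm.index(max(unnorm))]
```

Under these assumptions `grid_argmax` agrees with `post_mean` to within the grid spacing, and P(B) was never computed.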




I appreciate the kind reply, but I don't think you understood the gist of my complaint. I have a degree (just undergrad) in math, and I've implemented Kalman filters, Kalman smoothers, information filters, particle filters and so on at least a dozen times. I know what operations to perform, and I even have an intuition about why they work.

When I complain about the Bayes theorem way of describing or thinking about Kalman filters, I really mean that I don't understand what almost anything in that expression means. The notation is intuitive if read as English, but opaque as to what operations are being performed on what mathematical objects. Capital P means probability of an event or predicate, and A is some abstraction of the thing I'm trying to estimate, and B is the abstraction of a new observation or measurement.

Think of it this way: I have some knowledge of my prior estimate, and I'm able to characterize it as a Normally distributed random variable. It has a mean and a covariance in 3D space (let's call those x and P). There is a multivariate function for that pdf (lowercase p); it's an exponential of the negative square of the Mahalanobis distance normalized to have a unit volume. The same thing applies for my measurement, except it's only a 2D mean and covariance (let's call those z and R). I already know that I can use my H matrix, some matrix inverses and multiplications to combine this information. It's really just a case of multiplying the pdf for the state estimate and the measurement and re-normalizing to a unit volume. In other words, I believe both things are true, and they are independent, so I can multiply their pdfs and re-normalize to get a combined result. The rest is just linear algebra.

Now let's get to Bayes. P(A) is supposed to be the probability of an "event" or predicate. I can squint sideways and translate that as: integrate my pdf for the prior estimate (Normal with mean x and covariance P) over some unspecified bounds and convert that to a probability of my estimate being in those bounds, but I'm not sure that's what is intended. Again a similar thing applies for B (Normal with mean z and covariance R). It's frustrating that the bounds are never stated, because they could be radically different spaces in the numerator and denominator. I guess I'll give that a pass because maybe the notation would be too cumbersome if it was included, but this simplification seems to never be stated in any of the books or papers I've read on it.

Next, P(B|A) has to be the probability of another "event", and if I squint again, the best I'm able to come up with is that it means a new Normal pdf with mean = H'z and covariance = H'RH. However, I haven't seen that spelled out anywhere, and so really that's just assuming the conclusion I want, which is super questionable. It also doesn't help me understand the left side of the Bayes equation - that vertical bar there seems to mean something different. When I look to the definition of conditional probability, I don't see anything about the vertical bar applied to multivariate Normal distributions.

If you translate it all to English, it reads as a coherent sentence, and that's fine. This is the "prior state estimate", that's the "observation", and this other thing is the "a posteriori", etc. However, if Bayes really helps with understanding the math, the vertical bar has to mean something specific as an operator, and one would hope it meant the same thing on the left and right sides of the equation.

Similarly, in order for all of those P( ... ) to become scalars so multiplication and division are well defined, you need some integration bounds to turn the event into a probability. But at this point, I'm not even sure if that's what is intended by the notation - maybe they aren't scalars, and multiplication and division mean something radically different here. I honestly don't know.


Probability notation generally works best for the people who already understand the concept in question. Let me take a crack at your question.

The equation in question is P(A|B) = P(B|A) P(A) / P(B). In modern Kalman filter literature, this would be stated as something like: P(x_k | z_k) = P(z_k | x_k) P(x_k) / P(z_k). It is generally left as implicit in these sorts of equations that everything is also conditioned on the sequence z_1 to z_{k-1}. In this equation, x_k is a free variable in the state space (possibly multi-dimensional, so a vector), while z_k is the measurement, which is a realization of the random variable distributed as N(H(x_k), R_k). The result is a PDF over the free variable x_k.

So let's tackle the terms one by one:

1. P(z_k | x_k) - this is the probability that we measured z_k, given that the true object is at x_k. This is the aforementioned normal distribution N(H(x_k), R_k).

2. P(x_k) - this is the prior probability of the state estimate, generally after propagation through the motion model from P(x_{k-1} | z_{k-1}). In the Kalman filter, this is also Gaussian, and conveniently has the same mean as the term above (see note below).

3. P(z_k) - this is the denominator that someone else mentioned earlier can be effectively ignored, which is right - you only need it to normalize the numerator. If you must compute it, it can be written as the integral over the entire state space of P(z_k|x_k)*P(x_k). Given z_k, this is a number, not a function.

4. P(x_k | z_k) - The result, which is a Gaussian PDF. You can arrive at it numerically by plugging in specific values for x_k, in which case (1) and (2) are numbers. Or symbolically, in which case (1) and (2) are functions, and you'll end up with the form of a Gaussian PDF.
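
A rough numerical sketch of the four terms, in 1D so everything is a scalar. The measurement model z = h*x + noise and every number below are assumptions made up for illustration. Terms (1) and (2) are functions of x, term (3) is a single number obtained by integrating their product, and term (4) is compared against the usual Kalman update.

```python
import math

def normal_pdf(v, mean, var):
    """Density of a 1D Normal(mean, var) evaluated at v."""
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

# Hypothetical scalar setup: z = h * x + noise, noise ~ Normal(0, R).
h, R = 2.0, 0.5
x_prior, P = 1.0, 2.0   # prior mean and variance of the state
z = 3.0                 # the measurement we actually received

def likelihood(x):      # term 1: density of z given x, viewed as a function of x
    return normal_pdf(z, h * x, R)

def prior(x):           # term 2: density of the state before the measurement
    return normal_pdf(x, x_prior, P)

# Term 3: a number, the integral of term1 * term2 over the state space
# (approximated here with a plain Riemann sum).
dx = 0.001
grid = [-20.0 + i * dx for i in range(40001)]
evidence = sum(likelihood(x) * prior(x) for x in grid) * dx

def posterior(x):       # term 4: a density over x
    return likelihood(x) * prior(x) / evidence

# Reference values from the standard scalar Kalman update.
K = P * h / (h * P * h + R)
x_post = x_prior + K * (z - h * x_prior)
P_post = (1.0 - K * h) * P
evidence_exact = normal_pdf(z, h * x_prior, h * h * P + R)
```

With these numbers, `posterior(x)` matches `normal_pdf(x, x_post, P_post)` at any x you try, and `evidence` matches the density of z under Normal(h * x_prior, h*h*P + R).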

Note: The original article quotes a distribution for the product of two Gaussians with arbitrary means. It does not state that this is an approximation, which is exact only in the case of equal means. This is why unbiased measurements are one of the Kalman assumptions.


I think the crux of the complaint is the imprecision in saying e.g. "P(z_k | x_k) - this is the probability that we measured z_k, given that the true object is at x_k"

Technically, you can only give a useful answer about the probability density of the random variable Z_k at the value z_k, conditioned on X_k = x_k. In the Bayesian interpretation of the Kalman filter, you never have an event "I measured z_k" (that event has probability 0, of course).

I agree that the probability notation is the issue here. Look at how Wikipedia shows Bayes' rule for continuous random variables on both sides of the |: https://en.wikipedia.org/wiki/Bayes%27_theorem#Random_variab... That's the kind of explicit and precise notation I would use to help someone understand the Kalman filter from a Bayesian perspective.

Once you use that definition of Bayes' rule, then you can substitute the definitions of the multivariate normal pdf, Do The Math, and derive the Kalman filter recursive updates.
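
A sketch of how that substitution goes, under assumptions not stated above: a linear measurement model z = Hx + v with v ~ Normal(0, R), and a prior Normal(x̂, P). Here f denotes a density (to avoid clashing with the covariance P), and Y and y are just names for the collected terms.

```latex
\begin{aligned}
f_{X\mid Z}(x \mid z)
  &= \frac{f_{Z\mid X}(z \mid x)\, f_X(x)}
          {\int f_{Z\mid X}(z \mid x')\, f_X(x')\, dx'} \\
f_X(x)
  &\propto \exp\!\left(-\tfrac12 (x-\hat x)^\top P^{-1} (x-\hat x)\right) \\
f_{Z\mid X}(z \mid x)
  &\propto \exp\!\left(-\tfrac12 (z-Hx)^\top R^{-1} (z-Hx)\right) \\
f_{X\mid Z}(x \mid z)
  &\propto \exp\!\left(-\tfrac12\, x^\top Y x + x^\top y\right),
  \qquad Y = P^{-1} + H^\top R^{-1} H,
  \quad y = P^{-1}\hat x + H^\top R^{-1} z \\
  &\Rightarrow\quad X \mid Z = z \;\sim\; \mathcal N\!\left(Y^{-1} y,\; Y^{-1}\right)
\end{aligned}
```

The fourth line comes from expanding both quadratic forms and dropping every term that does not depend on x. The denominator of the first line is one of those dropped constants, and it is recovered at the end by requiring the result to integrate to 1.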


Thank you for your reply here, and the one below. I wish I had seen it before posting my sibling message to this one. I'm a bit too tired to go any further with this tonight, but I plan to look at your links tomorrow.

Cheers.


In term 1, you say P(z_k | x_k) is a probability (which should be a scalar value between 0.0 and 1.0 inclusive). How did we get from the vector pdfs of measurement and estimate to a scalar probability? Maybe you meant evaluating a pdf at an arbitrary point z_k, getting its density (scalar non-negative value, between 0.0 and infinity) at that point? I have the same confusion on your other enumerated terms. If they are probabilities, I'm only familiar with getting those by integrating the pdfs over some set of bounds or range.

I've seen the k subscripts plenty before, and some people use + and - to indicate pre and post parts of an operation, others put conditional vertical bars in the subscripts, but your definition for the pdf of z_k is a bit different than I'm used to. If we focus on the current instant of time (dropping the subscripts), ignore the motion model part, and only deal with a single measurement, I would say the estimate and measurement are defined like this:

    estimate    ~ Normal(mean=x, covar=P)
    measurement ~ Normal(mean=z, covar=R)

Then there is an elegant formula looking at it as an information filter, using Matlab-ish notation:

    Y = inv(P)    +  H'*inv(R)*H
    y = inv(P)*x  +  H'*inv(R)*z

    update ~ Normal(mean=inv(Y)*y, covar=inv(Y))
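
A small sketch of the information-form update above, checked against the more familiar Kalman-gain form. To stay dependency-free it assumes a 2D state and a scalar measurement, so `H` is a single row and `R` a single number. The helper names and all values are hypothetical.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(m, v):
    """2x2 matrix times 2-vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]

# Hypothetical estimate ~ Normal(x, P) and measurement ~ Normal(z, R).
x = [1.0, 2.0]
P = [[2.0, 0.5], [0.5, 1.0]]
H = [1.0, 1.0]          # 1x2 measurement matrix stored as a row
z, R = 4.0, 0.5

# Information form: Y = inv(P) + H'*inv(R)*H,  y = inv(P)*x + H'*inv(R)*z
P_inv = inv2(P)
Y = [[P_inv[i][j] + H[i] * H[j] / R for j in range(2)] for i in range(2)]
P_inv_x = matvec(P_inv, x)
y = [P_inv_x[i] + H[i] * z / R for i in range(2)]
cov_info = inv2(Y)
mean_info = matvec(cov_info, y)

# Kalman-gain form, for comparison.
PHt = matvec(P, H)                        # P*H' (P is symmetric)
S = H[0] * PHt[0] + H[1] * PHt[1] + R     # innovation variance
K = [PHt[0] / S, PHt[1] / S]              # Kalman gain
innovation = z - (H[0] * x[0] + H[1] * x[1])
mean_gain = [x[i] + K[i] * innovation for i in range(2)]
cov_gain = [[P[i][j] - K[i] * PHt[j] for j in range(2)] for i in range(2)]
```

Both routes give the same mean and covariance up to rounding.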

Ignoring the information filter stuff, and getting back to your list, Term 1 looks like a pdf with the dimensionality of z, but the result in Term 4 is a pdf with the dimensionality of x. In other words, the resultant Term 4 is a pdf (function) which takes a vector argument in the x space, and I don't see where to pick any given z for the pdf in Term 1.

It's getting late, and I'm getting spacey, but it's tempting to try and write the type signatures of these functions in some strongly typed pseudocode so I can show you my confusion.


Concerning the difference between our definitions of measurement, I was referring to the prior probability of making a measurement given the true state, which is P(z|x), or, if you prefer, f(z,x) = P(Z=z|X=x). This PDF is Normal(Hx, R). The distribution you refer to is that of the image of the true state in measurement space (Hx), which is Normal(z, R).

Concerning the general question of dimensionality and numbers vs. functions, I'm afraid I still don't understand the fundamental issue. I agree that probability notation isn't the best, but each of those terms can be replaced with a Gaussian (or in the case of the denominator, an integral of the product of two Gaussians). You end up with a function in terms of x_k and the sequence of z's. You can then choose which variables are free or bound and evaluate the function as your application requires. If all variables are bound (the case that you know the measurement z_k, and are evaluating the posterior probability of the state at a given point), all the terms are numbers (yes, densities), and there are no dimensional issues. If you leave x_k free, you end up with two functions in the numerator, and a number in the denominator. Those two functions are Mahalanobis distances (which are scalars) in an exponent, and again, no dimensional issues. In general, such operations with PDFs represent various combinations and manipulations of event spaces that need not be commensurate. Computing the PDF of a coin flip given the distribution of all of the positions and velocities of the molecules of air through which the coin flies requires comparing a two-state PMF to a virtually infinite dimensional PDF. Notationally speaking, no problem.
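
One way to make the "free vs. bound variables" point concrete is with type signatures. This is only a sketch: the observation function `h` (drop the third coordinate), the isotropic noise variance `r_var`, and the type aliases are all assumptions for illustration.

```python
import math
from typing import Callable, Tuple

Vec2 = Tuple[float, float]          # measurement space
Vec3 = Tuple[float, float, float]   # state space

def h(x: Vec3) -> Vec2:
    """Hypothetical observation function: drop the third coordinate."""
    return (x[0], x[1])

def likelihood(z: Vec2, x: Vec3, r_var: float = 0.25) -> float:
    """Density of Normal(h(x), r_var * I) evaluated at z. Always a scalar."""
    hx = h(x)
    # Squared Mahalanobis distance between z and h(x).
    d2 = ((z[0] - hx[0]) ** 2 + (z[1] - hx[1]) ** 2) / r_var
    return math.exp(-0.5 * d2) / (2.0 * math.pi * r_var)

def bind_measurement(z: Vec2) -> Callable[[Vec3], float]:
    """Fix z and leave x free: the result is a function over state space."""
    return lambda x: likelihood(z, x)

# z bound, x free: a function Vec3 -> float.
likelihood_of_x = bind_measurement((1.0, 2.0))
# z and x both bound: a plain number (a density value, not a probability).
value = likelihood_of_x((1.0, 2.0, 5.0))
```

Note that in this toy setup `likelihood_of_x` ignores the third coordinate entirely, so as a function of x it does not integrate to 1. It only becomes a proper PDF over x after multiplying by the prior and normalizing.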


> I agree that probability notation isn't the best [...]

There is so much hidden context in mapping Bayes' theorem to Kalman filters that I still have to cringe. First we've got the sub-expression "p(x_k)" in the numerator, and you know that is a multivariate PDF for the state estimate, or possibly the density value of that PDF at x_k. However, we've got a very similar looking expression "p(z_k)" in the denominator, and in this case it means "an integral of the product of two PDFs". Those are wildly different substitutions for nearly the same syntax.

That same confusion applies to p(x_k|z_k) and p(z_k|x_k). The first seems to indicate "the PDF of my estimate given this measurement", but the second really says something like "evaluate the PDF in the measurement space at the location which corresponds to my estimate". Essentially, p_Z(h(x_k)).

If you don't already know the solution, this notation is not prescriptive for getting there.


Oh, I see what you're getting at.

I think I can clear that up for you: Bayes formula works for both concrete probabilities (P(A), P(A|B), etc..) as well as PDFs (p(A), p(A|B), ...). So

    p(A|B) = (p(B|A) * p(A)) / p(B)

is just

    P(A|B) = (P(B|A) * P(A)) / P(B)

The only difference is that in the first version the inputs are functions of A and B and the output is a function of A and B. In the second case you have A and B given and the output is a concrete number.

In the case of the Kalman filter (or LS estimation in general) all your p(..) PDFs are Gaussians.


Yeah, the difference between lowercase p and capital P seems pretty important. Most places show capital P, so that's part of the confusion.

There's more though. Lowercase p(B|A) seems to really mean p_B(h(a)), and that's not obvious. Hell, I might still have it wrong.

And most everyone says to ignore p(B) in the denominator, but that's really sloppy hand waving. The notation means something, and there should be a well defined set of substitutions, but in each of the four terms, they do something radically different. I can't see a pattern to follow.


> Lowercase p(B|A) seems to really mean p_B(h(a)), and that's not obvious. Hell, I might still have it wrong.

I think you have it about right. It is "semi-obvious" by the fact that of course, observation and state are related through the observation function (in your case, observing a 3D point as a 2D coordinate) and that is of course part of the PDF.

> And most everyone says to ignore p(B) in the denominator, but that's really sloppy hand waving.

It's not. You want to know the most probable value of A. In other words you are looking for the argmax of a function of A. p(B) is purely a function of B, no A involved, so it becomes a constant in your equation. Since it's in the denominator, it's a normalizing constant.

Note that if you have the "Uppercase Bayes" (P(A|B) = ...) you are looking for concrete values so the normalizing P(B) does matter.

Now in the case of "lowercase Bayes" (p(A|B) = ...) it matters just as much, but you can still ignore it if all you're looking for is the argmax of the resulting PDF, as the p(B) is just a scaling constant and it's not changing the argmax of p(A|B).
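
A quick numerical sketch of that argument. The function below is a made-up stand-in for p(B|A) p(A) with B held fixed, and the grid sum is only a numerical stand-in for the normalizing constant.

```python
import math

def unnormalized_posterior(a: float) -> float:
    """Hypothetical p(B|A) * p(A) for a fixed observation B, as a function of A."""
    likelihood = math.exp(-0.5 * (1.5 - a) ** 2 / 0.5)
    prior = math.exp(-0.5 * a ** 2 / 2.0)
    return likelihood * prior

step = 0.01
grid = [-5.0 + step * i for i in range(1001)]
values = [unnormalized_posterior(a) for a in grid]

# The normalizing constant does not depend on A: it is one number.
normalizer = sum(values) * step
normalized = [v / normalizer for v in values]

# Dividing every value by the same positive number cannot move the peak.
argmax_unnormalized = grid[values.index(max(values))]
argmax_normalized = grid[normalized.index(max(normalized))]
```

Both argmax values land on the same grid point (about 1.2 for these numbers). Only `normalized` sums to 1 when multiplied by `step`.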

> but in each of the four terms, they do something radically different. I can't see a pattern to follow.

I don't understand what you mean here.


> It's not. You want to know the most probable value of A.

Nah, it's really not that simple. When I've done this in the past, I've needed both the mean (which is the mode for Normal distributions) and the variance, so I can make confidence ellipses. I don't just care about the most probable location.
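
To illustrate where the normalizer enters once you want more than the peak, here is a hedged 1D sketch that computes the mean and variance by numerical moments. The unnormalized posterior is a made-up example, and the 1.96 factor assumes the result is close to Gaussian.

```python
import math

def unnormalized_posterior(a: float) -> float:
    """Hypothetical likelihood * prior for a fixed measurement."""
    likelihood = math.exp(-0.5 * (1.5 - a) ** 2 / 0.5)
    prior = math.exp(-0.5 * a ** 2 / 2.0)
    return likelihood * prior

step = 0.01
grid = [-10.0 + step * i for i in range(2001)]
weights = [unnormalized_posterior(a) for a in grid]

# Moments are ratios of integrals, so the normalizer appears explicitly.
normalizer = sum(weights) * step
mean = sum(a * w for a, w in zip(grid, weights)) * step / normalizer
variance = sum((a - mean) ** 2 * w for a, w in zip(grid, weights)) * step / normalizer

# 1D analogue of a confidence ellipse: an approximate 95% interval.
std_dev = math.sqrt(variance)
interval_95 = (mean - 1.96 * std_dev, mean + 1.96 * std_dev)
```

For a Gaussian specifically, the variance can also be read off the curvature of the log of the unnormalized product, so the normalizer is avoidable there. For non-Gaussian PDFs that shortcut generally does not hold.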

I already have a set of techniques for working with Kalman filters. The only reason I would want to understand applying Bayes' theorem in this context is if it offers insight into a wider class of problems (non-Gaussian PDFs) or if it helps me communicate with others. In both of those cases, I'd like to understand the thing first before I hand-wave the denominator away.


> Nah, it's really not that simple.

Well, you came here and asked and I gave you a response because I happen to know the topic and wanted to help. It's your choice to not believe what I explained but I doubt you'll get a very different answer from other people.


Heh, I think we stepped in the wrong direction. I didn't mean to offend you.

Take care.



