Gaussian distributions are monoids and why machine learning experts should care (izbicki.me)
99 points by jackpirate on Nov 25, 2012 | 31 comments


all exponential family distributions may be written in a form that depends upon a set of fixed dimension sufficient statistics. https://en.wikipedia.org/wiki/Exponential_family these sufficient statistics have the additive form described in this article (this is a consequence of i.i.d. sampling). it is common to exploit this structure when implementing efficient inference in, for example, mixture models.

if you combine this property with a bayesian analysis, and put a conjugate prior on the parameters of an exponential family distribution, then the posterior distribution, and the marginal likelihood depend upon on the data only through these sufficient statistics and everything else is easily computed. in this form, one of the sufficient statistics often has an interpretation as a "pseudo-count"; how many effective samples are encoded in your prior?

exponential family distributions include: poisson, exponential, bernoulli, multinomial, gaussian, negative binomial, etc.
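
for concreteness, here is a minimal sketch (my own, not the HLearn library's actual API, and assuming a recent GHC where Semigroup is in the Prelude) of those additive sufficient statistics for a gaussian:

    -- Additive sufficient statistics for a Gaussian: count, sum, sum of squares.
    -- Merging two summaries is componentwise addition, so they form a monoid.
    data GaussStats = GaussStats
      { count :: !Double   -- n
      , sumX  :: !Double   -- sum of x_i
      , sumX2 :: !Double   -- sum of x_i^2
      } deriving Show

    instance Semigroup GaussStats where
      GaussStats n1 s1 q1 <> GaussStats n2 s2 q2 =
        GaussStats (n1 + n2) (s1 + s2) (q1 + q2)

    instance Monoid GaussStats where
      mempty = GaussStats 0 0 0

    -- One observation as a summary, and training as a fold over a dataset.
    observe :: Double -> GaussStats
    observe x = GaussStats 1 x (x * x)

    train :: [Double] -> GaussStats
    train = foldMap observe

    mean, variance :: GaussStats -> Double
    mean     (GaussStats n s _)   = s / n
    variance g@(GaussStats n _ q) = q / n - mean g ^ 2

because the merge is just addition, training on two shards and combining the summaries gives the same answer as training on the concatenated data.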


That the sum of two Gaussians is a Gaussian can be found in any basic text on statistics. That distributions can be combined, or fitted in parallel or online, isn't new either. The Kalman filter, which was published in 1961, is an example of an online algorithm for fitting a Gaussian.

This is an interesting exposition and places specific results in a more general algebraic framework; however, the title suggests this is a revolutionary discovery, which it isn't.


I think you mean product. The sum of two Gaussians isn't a Gaussian.


I think he meant the sum of two gaussian random variables is a gaussian. I.e., if X and Y are both drawn from a gaussian distribution, then X+Y is too.


Right. (You also need X and Y to be independent, or at least multivariate Gaussian -- i.e., X and Y can both be Gaussian but (X,Y) can fail to be Gaussian as a pair.)

The product of Gaussian random variables is certainly not Gaussian.
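
For concreteness, the standard statement (independence assumed) is:

    X \sim \mathcal{N}(\mu_1, \sigma_1^2),\quad Y \sim \mathcal{N}(\mu_2, \sigma_2^2),\quad X \perp Y
    \implies X + Y \sim \mathcal{N}(\mu_1 + \mu_2,\ \sigma_1^2 + \sigma_2^2)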


When thinking about monoids, the words "sum" and "product" aren't distinguishable unless you specify the precise operation, as neither one is a canonical analogy to integer sum/product. Pointwise combination is an unattractive definition, as it ignores a lot of interesting structure in a connected space like the real numbers.

Indeed, convolution, the common way of combining functions, is defined as a sum of products at each point.
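
For reference, the convolution of two densities f and g is

    (f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau

which is exactly the density of X + Y when X and Y are independent with densities f and g.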


A comment on the reddit version of this discussion points out that the approach the article uses to combine Gaussians has numerical stability issues: http://www.reddit.com/r/programming/comments/13r2mh/gaussian...
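
One standard remedy (a sketch in the spirit of Chan et al.'s pairwise update; not the article's or HLearn's code) is to carry the mean and the sum of squared deviations instead of the raw moments. The merge is still a monoid operation (exactly in real arithmetic, approximately in floating point):

    -- Numerically stabler summary: count, running mean, and m2 = sum (x_i - mean)^2.
    data Stable = Stable
      { sCount :: !Double
      , sMean  :: !Double
      , sM2    :: !Double
      } deriving Show

    instance Semigroup Stable where
      Stable 0 _ _ <> b = b
      a <> Stable 0 _ _ = a
      Stable na ma m2a <> Stable nb mb m2b =
        let n     = na + nb
            delta = mb - ma
        in Stable n
                  (ma + delta * nb / n)
                  (m2a + m2b + delta * delta * na * nb / n)

    instance Monoid Stable where
      mempty = Stable 0 0 0

    stableVariance :: Stable -> Double
    stableVariance (Stable n _ m2) = m2 / n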


Sorry, but this is an example of bringing in extraneous math and notation to make something well understood seem mysterious. The fun statistics facts being abused here are: 1) moment summaries (sum 1, sum x, sum x^2, ...) are easy to manage; 2) many important summary statistics (count, mean, variance, kurtosis, ...) are easy functions of moments; 3) once you decide to think parametrically, you can use the fact that many distributions are completely determined by their first one, two or three sufficient statistics. But since all of the above can be clearly taught, it isn't as exciting.


Also, if we want to get all math-abusey (using graduate-level concepts to do basic work): how about something like "moments are a co-product"? The emphasized property in the article is that calculations on the raw data can be done on the moments. So all the calculations of interest map through the moment summaries (hence you can think of them as a co-product). Surely that is more exotic than just saying "monoid." And "monoid" is kind of used up by the crowd that says "free monoid" when all they mean is "set of all strings."


...so my knowledge of Abstract Algebra and Category Theory could actually have practical applications if I learn Haskell?

Edit: I phrased this flippantly, but it's a serious question. My educational background is in Math, specifically Abstract Algebra. Would learning Haskell actually help me use those concepts for practical purposes?


Yes and no.

There are "issues" with those structures in GHC, i.e. how the Functor/Applicative/Monoid hierarchy evolved in GHC (applicatives were discovered a little late)

http://stackoverflow.com/questions/7595023/why-is-this-decla...

http://stackoverflow.com/questions/5730270/alternative-imple...

But (and this is one of the beauties of Learn You a Haskell): learning the categorical structures (functors, applicatives, monoids through to arrows, duality, groups, etc.) is a superb way to abstract the software analysis/development process, and it is one of the winning features of languages with good type systems (Haskell, the *MLs incl. F#, Scala):

http://learnyouahaskell.com/


I have a math background, and I've found Haskell to be very confusing. There's a lot of syntax that's not like other programming languages or mathematics, which makes programs very dense and hard to read. (Perl and Ruby suffer from similar problems.)

I'd encourage you to try Python. Its syntax is simple, natural, powerful, and very similar to mathematical notation (I'm thinking of list comprehensions). If you want to use Python for mathematics -- in the sense of an open-source replacement for Maple/Mathematica -- I recommend Sage [1].

[1] http://sagemath.org


Actually, here Haskell is just like any other language: you can use the described properties to define similar operations in any language. It's not Haskell that allows you to define them, but the abstract properties of this class of objects.


Would somebody explain this step in his explanation?

http://izbicki.me/blog/wp-content/plugins/optimized-latex/im...

How does he go from (a - b)^2 to a^2 - b^2? What am I missing?


That's the definition of variance. Look here: http://en.wikipedia.org/wiki/Variance#Definition


b in the case given is the mean, so it's defined as sum(a)/n. So the "ignored" cross term is proportional to sum(a - sum(a)/n), which is zero.
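
spelled out, with b = sum(a)/n:

    \frac{1}{n}\sum_i (a_i - b)^2
      = \frac{1}{n}\sum_i a_i^2 - \frac{2b}{n}\sum_i a_i + b^2
      = \frac{1}{n}\sum_i a_i^2 - 2b^2 + b^2
      = \frac{1}{n}\sum_i a_i^2 - b^2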


Strikes me mainly as a trendy recasting of the statistical concept of a "sufficient statistic" (http://en.wikipedia.org/wiki/Sufficient_statistic)


Which would be to miss the point that monoids are the general principle of which sufficient statistics are a specific application.

Also, not sure mathematicians wore skinny jeans in 1904.


Surely the burden is squarely on the author to show what the generalization (in this context) offers that the specific concept of sufficient statistics does not? In other words, how else can a statistician or ML expert use this foreign "monoid" concept to improve or better understand parameter estimation?


"show what the generalization (in this context) offers that the specific concept ... does not"

What does group theory tell me in the context of 1+1 that knowing the answer is 2 doesn't? Generalizations are useful because they apply to more than one context.

That said, the interesting connection between two areas of mathematics is what I took from the article, but I'd agree that his title (well, subtitle) is way overblown: his argument for "Why ML experts should care" seems to boil down to speed, but the Haskell statistics package itself contains faster algorithms that are marked unsafe, and he hasn't demonstrated safety.


As I recall the Gaussian distributions are the only distribution that have the desired property (i.e. form a monoid). While the Gaussian is important it is by far the only distribution of interest in machine learning. I would like to hear what the author has to say about handling, for example, the beta distribution.

In summary: interesting idea but I'm not sure it applies outside a very narrow domain.


Gaussians are definitely not the only distribution with this property. For example, the Poisson distribution is uniquely determined by its mean, so it also forms a monoid by the same reasoning. The Categorical distribution is also implemented in the HLearn library as a monoid.

The beta distribution is uniquely determined by two parameters called alpha and beta. Typically, these represent counts of the observations we've seen of two complementary events. (On the same website is an example with heads and tails from coin flips.) The beta distribution forms a monoid simply by adding these parameters from each distribution. This is very similar to the Categorical distribution, and it also generalizes to the Dirichlet distribution.

Things get really interesting when you stop talking about distributions, and start using them in the context of actual machine learning algorithms. If you know how the Naive Bayes algorithm works, for example, it should be plausible that it forms a monoid in the same way that the Gaussian forms a monoid. The future posts will cover how to apply this to some of the "important" learning algorithms like this.
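
A sketch of what that Beta monoid might look like (hypothetical types, not HLearn's actual API), with alpha and beta read as pseudo-counts of heads and tails:

    -- A Beta distribution summarised by its two pseudo-count parameters;
    -- two summaries merge by adding the pseudo-counts.
    data Beta = Beta
      { alphaParam :: !Double   -- pseudo-count of heads
      , betaParam  :: !Double   -- pseudo-count of tails
      } deriving Show

    instance Semigroup Beta where
      Beta a1 b1 <> Beta a2 b2 = Beta (a1 + a2) (b1 + b2)

    instance Monoid Beta where
      mempty = Beta 0 0   -- "no observations yet"

    observeFlip :: Bool -> Beta
    observeFlip True  = Beta 1 0   -- heads
    observeFlip False = Beta 0 1   -- tails

    betaMean :: Beta -> Double
    betaMean (Beta a b) = a / (a + b)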


i wonder if you are thinking of another monoid property of a gaussian?

suppose we have:

   x ~ N(0,1)
   y|x ~ N(x, 1)
then we have:

   y ~ N(0, 2)
i.e., gaussians are closed under marginalization.
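
in general (a standard gaussian identity, stated here for completeness):

    x \sim \mathcal{N}(\mu, \sigma^2),\quad y \mid x \sim \mathcal{N}(x, \tau^2)
    \implies y \sim \mathcal{N}(\mu,\ \sigma^2 + \tau^2)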

however, i believe gaussians are not the only distribution with this property either: i think this property corresponds to the stable family of distributions: https://en.wikipedia.org/wiki/Stable_distributions


Stable distributions are something else, not related to marginals or conditioning. They come up when studying laws of averages.

Gaussian distributions belong to the class of stable distributions, though, because of another of their properties: independent Gaussians, when added, are again Gaussian.


the particular property of the stable distribution i was thinking of is "closure under convolution", which is the above marginalisation (i believe?).

infinite divisibility is (yet another) property of gaussians!


Nope. Closure under convolution is the same as closure under summation of the associated random variable, which is the defining property of stable distributions. This is explained in the first paragraphs of the wikipedia page you linked to ;-)

Closure under marginalization is something else.

It so happens that the functional form of the gaussian satisfies both, but the two properties are not at all the same.

  P1: X, Y gaussian => Z = X+Y gaussian
  P2: X, Y gaussian => X | Y gaussian


Yes, that's what I was thinking of! Thanks for the link to stable distributions -- hadn't heard of them before.


Noel, did you mean to say "while the Gaussian is important it is by far NOT the only distribution of interest in machine learning"? Not trying to put words in your mouth, just seems to me you may have intended that.


Erm, yes. My comment above was fairly awesomely wrong in several dimensions.


This reminds me of the graduate level physics class where we used the renormalization group to prove the central limit theorem.


Very interesting to be able to subtract inputs from the learning rather than starting the learning over again. I had been wondering if this were possible.
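
One way to see why that can work (my own sketch, not the article's code): if the summary forms a group rather than just a monoid, i.e. every summary has an inverse, then removing a data point is just adding the inverse of that point's summary. For example, with a simple running-mean summary:

    -- A summary that is a group (monoid plus inverses) supports "untraining".
    data MeanStats = MeanStats
      { msCount :: !Double
      , msTotal :: !Double
      } deriving Show

    instance Semigroup MeanStats where
      MeanStats n1 t1 <> MeanStats n2 t2 = MeanStats (n1 + n2) (t1 + t2)

    instance Monoid MeanStats where
      mempty = MeanStats 0 0

    invert :: MeanStats -> MeanStats
    invert (MeanStats n t) = MeanStats (negate n) (negate t)

    -- Remove a previously seen observation from a trained summary.
    untrain :: Double -> MeanStats -> MeanStats
    untrain x stats = stats <> invert (MeanStats 1 x)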



