all exponential family distributions may be written in a form that depends upon a set of fixed-dimension sufficient statistics.
https://en.wikipedia.org/wiki/Exponential_family
these sufficient statistics have the additive form described in this article (this is a consequence of i.i.d. sampling).
it is common to exploit this structure when implementing efficient inference in, for example, mixture models.
if you combine this property with a bayesian analysis and put a conjugate prior on the parameters of an exponential family distribution, then the posterior distribution and the marginal likelihood depend on the data only through these sufficient statistics, and everything else is easily computed. in this form, one of the sufficient statistics often has an interpretation as a "pseudo-count": how many effective samples are encoded in your prior?
exponential family distributions include: poisson, exponential, bernoulli, multinomial, gaussian, negative binomial, etc.
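for concreteness, here is a minimal haskell sketch (my own names, not the HLearn API) of those additive sufficient statistics as a monoid for i.i.d. gaussian data; note that the count n is exactly the "pseudo-count" above, so roughly speaking a conjugate prior would just mean starting the fold from a summary that already contains the prior's pseudo-observations instead of from mempty:

    -- a minimal sketch, not the HLearn API: the sufficient statistics
    -- (n, sum x, sum x^2) of i.i.d. gaussian data merge by addition
    data GaussStats = GaussStats
      { gsCount :: !Double  -- n; this is the "pseudo-count"
      , gsSum   :: !Double  -- sum of x_i
      , gsSumSq :: !Double  -- sum of x_i^2
      } deriving Show

    instance Semigroup GaussStats where
      GaussStats n1 s1 q1 <> GaussStats n2 s2 q2 =
        GaussStats (n1 + n2) (s1 + s2) (q1 + q2)

    instance Monoid GaussStats where
      mempty = GaussStats 0 0 0

    -- summarise one observation
    single :: Double -> GaussStats
    single x = GaussStats 1 x (x * x)

    -- the usual summary statistics are simple functions of the monoid element
    mean, variance :: GaussStats -> Double
    mean s     = gsSum s / gsCount s
    variance s = gsSumSq s / gsCount s - mean s * mean s

    main :: IO ()
    main = do
      let left  = foldMap single [1, 2, 3]
          right = foldMap single [4, 5]
      -- merging the two partial summaries gives the same answer as one pass
      print (mean (left <> right), variance (left <> right))
      print (mean (foldMap single [1 .. 5]), variance (foldMap single [1 .. 5]))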
That the sum of two Gaussians is a Gaussian can be found in any basic text on statistics. That distributions can be combined, or computed in parallel or online, isn't new either. The Kalman filter, published in 1960, is an example of an online algorithm for fitting a Gaussian.
This is an interesting exposition and places specific results in a more general algebraic framework; however, the title suggests this is a revolutionary discovery, which it isn't.
I think he meant the sum of two gaussian random variables is a gaussian. I.e., if X and Y are both drawn from a gaussian distribution, then X+Y is too.
Right. (You also need X and Y to be independent, or at least multivariate Gaussian -- i.e., X and Y can both be Gaussian but (X,Y) can fail to be Gaussian as a pair.)
The product of Gaussian random variables is certainly not Gaussian.
When thinking about monoids, the words "sum" and "product" aren't distinguishable unless you specify the precise operation, as neither one is a canonical analogue of integer sum/product. Pointwise combination is an unattractive definition, as it ignores a lot of interesting structure in a connected space like the real numbers.
Indeed, convolution, the common way of combining functions, is defined as a sum of products at each point.
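A quick sketch (in Haskell, nothing to do with the article's code) of that "sum of products at each point" for discrete convolution:

    -- discrete convolution: output point k is the sum of products x_i * y_j
    -- over all index pairs with i + j == k
    conv :: [Double] -> [Double] -> [Double]
    conv xs ys =
      [ sum [ x * y | (i, x) <- zip [0 ..] xs
                    , (j, y) <- zip [0 ..] ys
                    , i + j == k ]
      | k <- [0 .. length xs + length ys - 2] ]

    main :: IO ()
    main = print (conv [1, 2, 3] [0, 1])  -- [0.0,1.0,2.0,3.0]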
Sorry, but this is an example of bringing in extraneous math and notation to make something well understood seem mysterious. The fun statistics facts being abused here are: 1) moment summaries (sum 1, sum x, sum x^2, ...) are easy to manage; 2) many important summary statistics (count, mean, variance, kurtosis, ...) are easy functions of moments; 3) once you decide to think parametrically, you can use the fact that many distributions are completely determined by their first one, two, or three sufficient statistics. But since all of the above can be taught clearly, it isn't as exciting.
Also, if we want to get all math-abusey (using graduate-level concepts to do basic work): how about something like "moments are a co-product"? The emphasized property in the article is that calculations on the raw data can be done on the moments. So all the calculations of interest map through the moment summaries (hence you can think of them as a co-product). Surely that is more exotic than just saying "monoid." And "monoid" is kind of used up by the crowd that says "free monoid" when all they mean is "set of all strings."
...so my knowledge of Abstract Algebra and Category Theory could actually have practical applications if I learn Haskell?
Edit: I phrased this flippantly, but it's a serious question. My educational background is in Math, specifically Abstract Algebra. Would learning Haskell actually help me use those concepts for practical purposes?
There are "issues" with those structures in GHC, i.e. how the Functor/Applicative/Monad hierarchy evolved (Applicative was discovered a little late).
But (and this is one of the beauties of Learn You a Haskell): learning these abstractions (functor, applicative, monoid, on through arrows, duality, groups, etc.) is a superb way to abstract the software analysis/development process, and it is one of the winning features of languages with good type systems (Haskell, the *MLs incl. F#, Scala).
I have a math background, and I've found Haskell to be very confusing. There's a lot of syntax that's not like other programming languages or mathematics, which makes programs very dense and hard to read. (Perl and Ruby suffer from similar problems.)
I'd encourage you to try Python. Its syntax is simple, natural, powerful, and very similar to mathematical notation (I'm thinking of list comprehensions). If you want to use Python for mathematics -- in the sense of an open-source replacement for Maple/Mathematica -- I recommend Sage [1].
Actually, here Haskell is just like any other language: you can use the described properties to define similar operations in any language. It's not Haskell that allows you to define them, but the abstract properties of this class of objects.
Surely the burden is squarely on the author to show what the generalization (in this context) offers that the specific concept of sufficient statistics does not? In other words, how else can a statistician or ML expert use this foreign "monoid" concept to improve or better understand parameter estimation?
"show what the generalization (in this context) offers that the specific concept ... does not"
What does group theory tell me in the context of 1+1 that knowing the answer is 2 doesn't? Generalizations are useful because they apply to more than one context.
That said, the interesting connection between two areas of mathematics is what I took from the article, but I'd agree that his title (well, subtitle) is way overblown: his argument for "Why ML experts should care" seems to boil down to speed, but the Haskell statistics package itself contains faster algorithms that are marked unsafe, and he hasn't demonstrated safety.
As I recall, the Gaussian is the only distribution that has the desired property (i.e. forms a monoid). While the Gaussian is important it is by far the only distribution of interest in machine learning. I would like to hear what the author has to say about handling, for example, the beta distribution.
In summary: interesting idea but I'm not sure it applies outside a very narrow domain.
Gaussians definitely are not the only distribution. For example, the Poisson distribution is uniquely determined by its mean, so it also forms a monoid by the same reasoning. The Categorical distribution is also implemented in the HLearn library as a monoid.
The beta distribution is uniquely determined by two parameters called alpha and beta. Typically, these represent counts of the observations we've seen of two outcomes. (On the same website is an example with heads and tails from coin flips.) The beta distribution forms a monoid simply by adding these parameters from each distribution. This is very similar to the Categorical distribution, and it also generalizes to the Dirichlet distribution.
Things get really interesting when you stop talking about distributions and start using them in the context of actual machine learning algorithms. If you know how the Naive Bayes algorithm works, for example, it should be plausible that it forms a monoid in the same way that the Gaussian does. Future posts will cover how to apply this to some of the "important" learning algorithms like this one.
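To make the beta case concrete, here is a small sketch (my own names, not the HLearn API) of merging beta distributions by adding their pseudo-counts:

    -- a beta distribution represented by its pseudo-counts; merging two of
    -- them just adds the counts, so this forms a monoid (mempty is the
    -- improper "no observations yet" state)
    data Beta = Beta { alpha :: !Double, beta :: !Double } deriving Show

    instance Semigroup Beta where
      Beta a1 b1 <> Beta a2 b2 = Beta (a1 + a2) (b1 + b2)

    instance Monoid Beta where
      mempty = Beta 0 0

    -- train on a batch of coin flips: True = heads, False = tails
    train :: [Bool] -> Beta
    train = foldMap (\coin -> if coin then Beta 1 0 else Beta 0 1)

    main :: IO ()
    main = print (train [True, True, False] <> train [False, True])
    -- Beta {alpha = 3.0, beta = 2.0}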
i wonder if you are thinking of another monoid property of a gaussian?
suppose we have:
x ~ N(0,1)
y|x ~ N(x, 1)
then we have:
y ~ N(0, 2)
i.e., gaussians are closed under marginalization.
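(one way to see the numbers, via the law of total variance:

    E[y]   = E[ E[y|x] ]                   = E[x]       = 0
    var(y) = E[ var(y|x) ] + var( E[y|x] ) = 1 + var(x) = 2

so y ~ N(0, 2) as stated.)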
however, i believe gaussians are not the only distribution with this property either: i think this property corresponds to the stable family of distributions: https://en.wikipedia.org/wiki/Stable_distributions
Stable distributions are something else, not related to marginals or conditioning. They come up when studying laws of averages.
Gaussian distributions belong to the class of stable distributions, though, because of another of their properties: independent Gaussians, when added, are again Gaussian.
Nope. Closure under convolution is the same as closure under summation of the associated random variable, which is the defining property of stable distributions. This is explained in the first paragraphs of the wikipedia page you linked to ;-)
Closure under marginalization is something else.
It so happens that the functional form of the gaussian satisfies both, but the two properties are not at all the same.
P1: X, Y gaussian => Z = X+Y gaussian
P2: X, Y gaussian => X | Y gaussian
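Spelling the two out (standard results, written in the thread's notation):

    P1: X ~ N(mu1, s1^2), Y ~ N(mu2, s2^2) independent
        =>  X + Y ~ N(mu1 + mu2, s1^2 + s2^2)

    P2: (X, Y) jointly gaussian with correlation rho
        =>  X | Y = y ~ N(mu1 + rho*(s1/s2)*(y - mu2), (1 - rho^2)*s1^2)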
Noel, did you mean to say "while the Gaussian is important it is by far NOT the only distribution of interest in machine learning"? Not trying to put words in your mouth, just seems to me you may have intended that.
Very interesting to be able to subtract inputs from the learning rather than starting the learning over again. I had been wondering if this were possible.
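For what it's worth, here is a small sketch (hypothetical names, not the HLearn API) of how removing a point can work when the summary has inverses, using the (n, sum x, sum x^2) summary of a Gaussian:

    -- an observation's contribution can be subtracted back out of the summary,
    -- so there is no need to refit from scratch
    data Stats = Stats { n :: !Double, sx :: !Double, sxx :: !Double } deriving Show

    insert, delete :: Double -> Stats -> Stats
    insert x (Stats c s q) = Stats (c + 1) (s + x) (q + x * x)
    delete x (Stats c s q) = Stats (c - 1) (s - x) (q - x * x)

    mean :: Stats -> Double
    mean (Stats c s _) = s / c

    main :: IO ()
    main = do
      let fitted = foldr insert (Stats 0 0 0) [1, 2, 3, 4]
      print (mean fitted)             -- 2.5
      print (mean (delete 4 fitted))  -- 2.0, as if 4 had never been seen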