all exponential family distributions may be written in a form that depends upon a set of fixed-dimension sufficient statistics.
https://en.wikipedia.org/wiki/Exponential_family
these sufficient statistics have the additive form described in this article (this is a consequence of i.i.d. sampling).
it is common to exploit this structure when implementing efficient inference in, for example, mixture models.
if you combine this property with a bayesian analysis and put a conjugate prior on the parameters of an exponential family distribution, then the posterior distribution and the marginal likelihood depend on the data only through these sufficient statistics, and everything else is easily computed. in this form, one of the sufficient statistics often has an interpretation as a "pseudo-count": how many effective samples are encoded in your prior?
exponential family distributions include: poisson, exponential, bernoulli, multinomial, gaussian, negative binomial, etc.
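for concreteness, here is a minimal haskell sketch (my own names, not the HLearn API) of those additive sufficient statistics as a monoid for i.i.d. gaussian data; note that the count n is exactly the "pseudo-count" above, so roughly speaking a conjugate prior would just mean starting the fold from a summary that already contains the prior's pseudo-observations instead of from mempty:

    -- a minimal sketch, not the HLearn API: the sufficient statistics
    -- (n, sum x, sum x^2) of i.i.d. gaussian data merge by addition
    data GaussStats = GaussStats
      { gsCount :: !Double  -- n; this is the "pseudo-count"
      , gsSum   :: !Double  -- sum of x_i
      , gsSumSq :: !Double  -- sum of x_i^2
      } deriving Show

    instance Semigroup GaussStats where
      GaussStats n1 s1 q1 <> GaussStats n2 s2 q2 =
        GaussStats (n1 + n2) (s1 + s2) (q1 + q2)

    instance Monoid GaussStats where
      mempty = GaussStats 0 0 0

    -- summarise one observation
    single :: Double -> GaussStats
    single x = GaussStats 1 x (x * x)

    -- the usual summary statistics are simple functions of the monoid element
    mean, variance :: GaussStats -> Double
    mean s     = gsSum s / gsCount s
    variance s = gsSumSq s / gsCount s - mean s * mean s

    main :: IO ()
    main = do
      let left  = foldMap single [1, 2, 3]
          right = foldMap single [4, 5]
      -- merging the two partial summaries gives the same answer as one pass
      print (mean (left <> right), variance (left <> right))
      print (mean (foldMap single [1 .. 5]), variance (foldMap single [1 .. 5]))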
That the sum of two Gaussians is a Gaussian can be found in any basic text on statistics. That distributions can be combined, or computed in parallel or online, isn't new either. The Kalman filter, published in 1960, is an example of an online algorithm for fitting a Gaussian.
This is an interesting exposition and places specific results in a more general algebraic framework; however, the title suggests this is a revolutionary discovery, which it isn't.
I think he meant the sum of two gaussian random variables is a gaussian. I.e., if X and Y are both drawn from a gaussian distribution, then X+Y is too.
Right. (You also need X and Y to be independent, or at least multivariate Gaussian -- i.e., X and Y can both be Gaussian but (X,Y) can fail to be Gaussian as a pair.)
The product of Gaussian random variables is certainly not Gaussian.
When thinking about monoids, the words "sum" and "product" aren't distinguishable unless you specify the precise operation, as neither one is a canonical analogue of integer sum/product. Pointwise combination is an unattractive definition, as it ignores a lot of interesting structure in a connected space like the real numbers.
Indeed, convolution, the common way of combining functions, is defined as a sum of products at each point.
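A quick sketch (in Haskell, nothing to do with the article's code) of that "sum of products at each point" for discrete convolution:

    -- discrete convolution: output point k is the sum of products x_i * y_j
    -- over all index pairs with i + j == k
    conv :: [Double] -> [Double] -> [Double]
    conv xs ys =
      [ sum [ x * y | (i, x) <- zip [0 ..] xs
                    , (j, y) <- zip [0 ..] ys
                    , i + j == k ]
      | k <- [0 .. length xs + length ys - 2] ]

    main :: IO ()
    main = print (conv [1, 2, 3] [0, 1])  -- [0.0,1.0,2.0,3.0]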
Sorry, but this is an example of bringing in extraneous math and notation to make something well understood seem mysterious. The fun statistics facts being abused here are: 1) moment summaries (sum 1, sum x, sum x^2, ...) are easy to manage; 2) many important summary statistics (count, mean, variance, kurtosis, ...) are easy functions of moments; 3) once you decide to think parametrically, you can use the fact that many distributions are completely determined by their first one, two, or three sufficient statistics. But since all of the above can be taught clearly, it isn't as exciting.
Also, if we want to get all math-abusey (using graduate-level concepts to do basic work): how about something like "moments are a co-product"? The emphasized property in the article is that calculations on the raw data can be done on the moments. So all the calculations of interest map through the moment summaries (hence you can think of them as a co-product). Surely that is more exotic than just saying "monoid." And "monoid" is kind of used up by the crowd that says "free monoid" when all they mean is "set of all strings."
...so my knowledge of Abstract Algebra and Category Theory could actually have practical applications if I learn Haskell?
Edit: I phrased this flippantly, but it's a serious question. My educational background is in Math, specifically Abstract Algebra. Would learning Haskell actually help me use those concepts for practical purposes?
There are "issues" with those structures in GHC, i.e. how the Functor/Applicative/Monad hierarchy evolved (Applicative was discovered a little late).
But (and this is one of the beauties of Learn You a Haskell): learning these abstractions (functor, applicative, monoid, on through arrows, duality, groups, etc.) is a superb way to abstract the software analysis/development process, and it is one of the winning features of languages with good type systems (Haskell, the *MLs incl. F#, Scala).
I have a math background, and I've found Haskell to be very confusing. There's a lot of syntax that's not like other programming languages or mathematics, which makes programs very dense and hard to read. (Perl and Ruby suffer from similar problems.)
I'd encourage you to try Python. Its syntax is simple, natural, powerful, and very similar to mathematical notation (I'm thinking of list comprehensions). If you want to use Python for mathematics -- in the sense of an open-source replacement for Maple/Mathematica -- I recommend Sage [1].
Actually, here Haskell is just like any other language: you can use the described properties to define similar operations in any language. It's not Haskell that allows you to define them, but the abstract properties of this class of objects.
Surely the burden is squarely on the author to show what the generalization (in this context) offers that the specific concept of sufficient statistics does not? In other words, how else can a statistician or ML expert use this foreign "monoid" concept to improve or better understand parameter estimation?
"show what the generalization (in this context) offers that the specific concept ... does not"
What does group theory tell me in the context of 1+1 that knowing the answer is 2 doesn't? Generalizations are useful because they apply to more than one context.
That said, the interesting connection between two areas of mathematics is what I took from the article, but I'd agree that his title (well, subtitle) is way overblown: his argument for "Why ML experts should care" seems to boil down to speed, but the Haskell statistics package itself contains faster algorithms that are marked unsafe, and he hasn't demonstrated safety.
As I recall, the Gaussian is the only distribution that has the desired property (i.e. forms a monoid). While the Gaussian is important it is by far the only distribution of interest in machine learning. I would like to hear what the author has to say about handling, for example, the beta distribution.
In summary: interesting idea but I'm not sure it applies outside a very narrow domain.
Gaussians definitely are not the only distribution. For example, the Poisson distribution is uniquely determined by its mean, so it also forms a monoid by the same reasoning. The Categorical distribution is also implemented in the HLearn library as a monoid.
The beta distribution is uniquely determined by two parameters called alpha and beta. Typically, these represent counts of the observations we've seen of two outcomes. (On the same website is an example with heads and tails from coin flips.) The beta distribution forms a monoid simply by adding these parameters from each distribution. This is very similar to the Categorical distribution, and it also generalizes to the Dirichlet distribution.
Things get really interesting when you stop talking about distributions and start using them in the context of actual machine learning algorithms. If you know how the Naive Bayes algorithm works, for example, it should be plausible that it forms a monoid in the same way that the Gaussian does. Future posts will cover how to apply this to some of the "important" learning algorithms like this one.
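To make the beta case concrete, here is a small sketch (my own names, not the HLearn API) of merging beta distributions by adding their pseudo-counts:

    -- a beta distribution represented by its pseudo-counts; merging two of
    -- them just adds the counts, so this forms a monoid (mempty is the
    -- improper "no observations yet" state)
    data Beta = Beta { alpha :: !Double, beta :: !Double } deriving Show

    instance Semigroup Beta where
      Beta a1 b1 <> Beta a2 b2 = Beta (a1 + a2) (b1 + b2)

    instance Monoid Beta where
      mempty = Beta 0 0

    -- train on a batch of coin flips: True = heads, False = tails
    train :: [Bool] -> Beta
    train = foldMap (\coin -> if coin then Beta 1 0 else Beta 0 1)

    main :: IO ()
    main = print (train [True, True, False] <> train [False, True])
    -- Beta {alpha = 3.0, beta = 2.0}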
i wonder if you are thinking of another monoid property of a gaussian?
suppose we have:
x ~ N(0,1)
y|x ~ N(x, 1)
then we have:
y ~ N(0, 2)
i.e., gaussians are closed under marginalization.
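(one way to see the numbers, via the law of total variance:

    E[y]   = E[ E[y|x] ]                   = E[x]       = 0
    var(y) = E[ var(y|x) ] + var( E[y|x] ) = 1 + var(x) = 2

so y ~ N(0, 2) as stated.)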
however, i believe gaussians are not the only distribution with this property either: i think this property corresponds to the stable family of distributions: https://en.wikipedia.org/wiki/Stable_distributions
Stable distributions are something else, not related to marginals or conditioning. They come up when studying laws of averages.
Gaussian distributions belong to the class of stable distributions, though, because of another of their properties: independent Gaussians, when added, are again Gaussian.
Nope. Closure under convolution is the same as closure under summation of the associated random variable, which is the defining property of stable distributions. This is explained in the first paragraphs of the wikipedia page you linked to ;-)
Closure under marginalization is something else.
It so happens that the functional form of the gaussian satisfies both, but the two properties are not at all the same.
P1: X, Y gaussian => Z = X+Y gaussian
P2: X, Y gaussian => X | Y gaussian
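Spelling the two out (standard results, written in the thread's notation):

    P1: X ~ N(mu1, s1^2), Y ~ N(mu2, s2^2) independent
        =>  X + Y ~ N(mu1 + mu2, s1^2 + s2^2)

    P2: (X, Y) jointly gaussian with correlation rho
        =>  X | Y = y ~ N(mu1 + rho*(s1/s2)*(y - mu2), (1 - rho^2)*s1^2)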
Noel, did you mean to say "while the Gaussian is important it is by far NOT the only distribution of interest in machine learning"? Not trying to put words in your mouth, just seems to me you may have intended that.
Very interesting to be able to subtract inputs from the learning rather than starting the learning over again. I had been wondering if this were possible.
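For what it's worth, here is a small sketch (hypothetical names, not the HLearn API) of how removing a point can work when the summary has inverses, using the (n, sum x, sum x^2) summary of a Gaussian:

    -- an observation's contribution can be subtracted back out of the summary,
    -- so there is no need to refit from scratch
    data Stats = Stats { n :: !Double, sx :: !Double, sxx :: !Double } deriving Show

    insert, delete :: Double -> Stats -> Stats
    insert x (Stats c s q) = Stats (c + 1) (s + x) (q + x * x)
    delete x (Stats c s q) = Stats (c - 1) (s - x) (q - x * x)

    mean :: Stats -> Double
    mean (Stats c s _) = s / c

    main :: IO ()
    main = do
      let fitted = foldr insert (Stats 0 0 0) [1, 2, 3, 4]
      print (mean fitted)             -- 2.5
      print (mean (delete 4 fitted))  -- 2.0, as if 4 had never been seen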