K-means Clustering 86 Single Malt Scotch Whiskies (revolutionanalytics.com)
135 points by platz on Jan 2, 2014 | 25 comments



A few technical issues:

- High-dimensional dataset: some redundant dimensions are probably being "counted twice". Solution: PCA your data first.

- Sampling artefacts: run the k-means algorithm lots of times and average.

If the data points were lower-dimensional, the clusters would be able to cover the space better. Bowmore is a smoky whisky (score 3), but it has enough other interesting characteristics that it does not align with the obvious "Islay" cluster (smoky centre of 2.8). I think it's in the wrong cluster, due to redundant fields having too much power.
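To make the "counted twice" worry concrete, here is a minimal sketch with made-up (smoky, floral) scores, not values from the whisky dataset. It assumes plain Euclidean distance, as k-means uses, and the helper `with_duplicate` is purely illustrative. Appending a redundant copy of the smoky score is enough to flip which centre the query point is nearest to.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def with_duplicate(point, index):
    """Return the point with feature `index` appended again (a redundant copy)."""
    return point + (point[index],)

# Made-up (smoky, floral) scores, for illustration only.
query = (3.0, 1.0)
centre_a = (1.0, 1.0)   # differs from the query only in smokiness
centre_b = (3.0, 3.5)   # differs from the query only in floral notes

d_a = euclidean(query, centre_a)
d_b = euclidean(query, centre_b)

# Append a redundant copy of the smoky score (index 0) to every vector.
d_a_dup = euclidean(with_duplicate(query, 0), with_duplicate(centre_a, 0))
d_b_dup = euclidean(with_duplicate(query, 0), with_duplicate(centre_b, 0))

nearest_before = "a" if d_a < d_b else "b"
nearest_after = "a" if d_a_dup < d_b_dup else "b"
print(nearest_before, nearest_after)
```

The smoky difference contributes 4 to the squared distance once and 8 when duplicated, so a field that is effectively repeated gets extra say in the cluster assignment.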


Couldn't resist a quick play with PCA. Initially because, looking at the cluster centres, cluster 4 looked much more distinct than the others and I wondered if the rest could be sharpened up. Then I realised quite how sharply the cluster sizes differ: two small (6 distilleries each) and two large (41 and 33). Quite an interesting result for k-means, and a reason for sharper/less sharp clusters. Having played with the data a bit, I buy into the smoky whiskies being a very different beast to the others, but I'm less sure about the other small segment (tobacco-but-not-smoky).

On your point, it turns out PCA alone doesn't tend to help with Bowmore (or the similar Highland Park). The first component you get is basically medicinal-and-smoky, and these two only score 1 for medicinal. They're also probably a bit too honeyed or floral for cluster 4. But clustering on the components and tweaking the starting values can draw Isle of Jura, Oban, Old Pulteney and occasionally Glen Garioch into the smoky cluster. On the other hand, looking at a biplot of the first two components you could argue for two clusters: the six in OP's original cluster 4 vs. the rest.
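As a rough illustration of how two correlated traits can merge into one leading component, here is a sketch on synthetic data (not the whisky scores). The column names "smoky", "medicinal" and "floral" are only labels for made-up numbers, and power iteration is swapped in here as a dependency-free way to get the first principal component; a real analysis would more likely use a library PCA routine.

```python
import math
import random

def covariance(rows):
    """Sample covariance matrix of a list of equal-length rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, iters=200):
    """Power iteration: apply cov and renormalise to approach its top eigenvector."""
    v = [1.0] * len(cov)
    for _ in range(iters):
        w = [sum(row[j] * v[j] for j in range(len(v))) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Synthetic data: two traits driven by one hidden "peat" factor, one independent trait.
rng = random.Random(1)
rows = []
for _ in range(200):
    peat = rng.gauss(0, 1)
    rows.append((peat + rng.gauss(0, 0.2),   # "smoky"
                 peat + rng.gauss(0, 0.2),   # "medicinal"
                 rng.gauss(0, 0.5)))         # "floral"

pc1 = first_component(covariance(rows))
print([round(x, 2) for x in pc1])
```

In this toy setup the first component loads about equally on the two correlated columns and barely at all on the independent one, so a point that scores high on only one of the pair gets a middling score on that component.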

The most important thing is that this work led OP to Talisker - result. But I suspect a definitive clustering will require more features - so, sadly, someone needs to taste them all again...


There was previously a published paper that did a hierarchical clustering of a bunch of whiskies:

     A Classification of Pure Malt Scotch Whiskies
     Lapointe and Legendre
     Appl. Statist. (1994) 43, No. 1, pp. 237-257
http://www.dcs.ed.ac.uk/home/jhb/whisky/lapointe/text.html

http://www.albany.edu/psychology/bcd/share/scotch_classifica...

It would be interesting to compare their results with the OP.


He's missing the newest and smallest distillery on Islay on that map - Kilchoman. Might want to give it a go.

http://kilchomandistillery.com/

... also the bewildering array sold by the others, particularly Bruichladdich, and the ridiculously expensive remaining stock of the defunct Port Ellen:

http://www.theguardian.com/lifeandstyle/2013/oct/26/scotch-w...


This is awesome! I wonder if anyone's done something similar with beers?

Anyway a few "next thing to try" suggestions from a machine learning perspective:

The model selection process used here is by its own admission quite ad-hoc, based on a gut feel about diminishing returns. There are various more principled methods you can use to find the sweet spot between over- and under-fitting with these kinds of models, a lot of them based on held-out validation data.

One way to do this would be leave-one-out cross validation (LOO-CV): hold out one whisky, fit the model, and see how 'surprised' the model is by the held-out whisky, repeat for the next whisky and average over all the folds. Because the dataset is tiny this should be quite feasible.

To measure 'surprisal' you could e.g. look at the distance from the held-out data point to the nearest cluster, although something better motivated would be if you switched to a probabilistic model and used likelihood of the held-out data. Probably the simplest next thing you could try in that direction would be a Gaussian mixture model (GMM) trained using EM. K-means is actually a degenerate limiting case of this.
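A minimal sketch of the leave-one-out idea, using the simpler "distance to the nearest cluster" measure rather than a likelihood. Everything here is an assumption for illustration: the data are three synthetic blobs rather than the whisky scores, and `fit_centres` and `loo_score` are hypothetical helper names wrapping a small pure-Python k-means.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit_centres(points, k, seed, restarts=5, iters=50):
    """Best-of-`restarts` k-means (Lloyd's algorithm); returns the centres."""
    rng = random.Random(seed)
    best, best_inertia = None, float("inf")
    for _ in range(restarts):
        centres = rng.sample(points, k)
        for _step in range(iters):
            groups = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda i: sq_dist(p, centres[i]))
                groups[nearest].append(p)
            # An empty cluster keeps its previous centre.
            updated = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centres[i]
                       for i, g in enumerate(groups)]
            if updated == centres:
                break
            centres = updated
        inertia = sum(min(sq_dist(p, c) for c in centres) for p in points)
        if inertia < best_inertia:
            best, best_inertia = centres, inertia
    return best

def loo_score(points, k):
    """Mean squared distance from each held-out point to its nearest centre."""
    total = 0.0
    for i, held_out in enumerate(points):
        rest = points[:i] + points[i + 1:]        # hold out one point
        centres = fit_centres(rest, k, seed=i)    # fit on the remainder
        total += min(sq_dist(held_out, c) for c in centres)
    return total / len(points)

# Synthetic stand-in data: three blobs of ten points each.
data_rng = random.Random(42)
blobs = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
points = [(cx + data_rng.gauss(0, 0.5), cy + data_rng.gauss(0, 0.5))
          for cx, cy in blobs for _ in range(10)]

scores = {k: loo_score(points, k) for k in range(1, 5)}
for k, s in scores.items():
    print(k, round(s, 3))
```

With a distance-based score like this, the held-out error typically keeps shrinking as k grows, so you are still looking for an elbow; that is part of the appeal of switching to a held-out likelihood.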

A probabilistic model would also allow you to use Bayesian model selection criteria, which can get quite interesting (and might lead you eventually to things like Dirichlet process mixture models).

It would make it easier to compare the model's explanatory power with other unsupervised probabilistic models. For example some kind of latent factor model like Factor analysis or pPCA would be quite interesting to investigate too, whether taken alone or in combination with clustering as a dimensionality reduction step as tlarkworthy is suggesting.

Also concur that doing multiple runs with different randomised initialisations is generally a good idea for k-means or EM, since they can get stuck in poor local minima. It's perhaps more common practice to pick the best of multiple runs than to average them, though.
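The "best of multiple runs" approach could look roughly like this. It is a pure-Python sketch on synthetic blobs (not the whisky data), and `kmeans` and `best_of` are illustrative names; "best" is assumed to mean the lowest total within-cluster sum of squares.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, rng, iters=100):
    """One run of Lloyd's algorithm from a random start; returns (centres, inertia)."""
    centres = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centres[i]))
            clusters[nearest].append(p)
        # Move each centre to the mean of its cluster; keep it if the cluster is empty.
        new_centres = [tuple(sum(col) / len(c) for col in zip(*c)) if c else centres[i]
                       for i, c in enumerate(clusters)]
        if new_centres == centres:
            break
        centres = new_centres
    inertia = sum(min(sq_dist(p, c) for c in centres) for p in points)
    return centres, inertia

def best_of(points, k, restarts, seed=0):
    """Run k-means `restarts` times and keep the run with the lowest inertia."""
    rng = random.Random(seed)
    runs = [kmeans(points, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[1]), runs

# Synthetic stand-in data: three blobs of twenty points each.
data_rng = random.Random(42)
blobs = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
points = [(cx + data_rng.gauss(0, 0.5), cy + data_rng.gauss(0, 0.5))
          for cx, cy in blobs for _ in range(20)]

(best_centres, best_inertia), runs = best_of(points, k=3, restarts=10)
print(round(best_inertia, 3), sorted(round(i, 3) for _, i in runs))
```

Picking the lowest-inertia run sidesteps a problem with averaging: cluster labels are arbitrary from run to run, so centres would have to be matched up before they could be averaged at all.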


Something on beer: http://bit.ly/JLNgZA (wisc.edu)


At a certain point I suppose taste is subjective, but Laphroig is definitely not the smokiest or peatiest of the Islay malts.


There's a tendency to think that we're all very unique in our preferences, but think about it -- the only way you could really know that is if you already had all the data on what many other people's preferences are. Do we really already know?

I study food, fragrance, software and other types of product preferences. In the studies I've done, people's very subjective preferences (or so we assume) often turn out to be highly clustered, falling into just one or two groups of people with very similar preferences. Sometimes when you'd assume there would be total disagreement, there's actually total preference agreement.


This is super interesting to me. Do you have anything that you can share along these lines?


I would put a Lagavulin as smokier than a Laphroig, but they are both fantastic whiskeys, and I need to do a lot more testing.


Ach, you put an "e" in whisky!



Hey, at least I didn't call it scotch!


It is a little subjective. I've always thought Ardbeg has a smokier taste than Laphroig, but a fellow whisky-enthusiast friend of mine thinks the complete opposite.

edit: on the post. Very curious to see your thoughts on Talisker and whether it fits the profile the results gave it.


Talisker, to me, tastes like the "magic smoke" from burning microcontrollers. That's a good thing.


I find Ardbeg (at least the Uigeadail) to be far smokier than Laphroig. The Lagavulin as well. To me the Talisker is quite bland, the sherry flavor really drowned out any peatiness.


Agreed. I love the Talisker Storm, but I have a Lagavulin 16 in my hand right now that gives it a run for its money. Definitely more punch in the smoke and peat departments.


The elements affecting the taste of a Whiskey are mind-bogglingly complex, e.g. - https://www.youtube.com/watch?v=O_aLgTRQjmM&t=10m21s


I know, and I love Ralfy's reviews. Can't say my palate is up to his level, but luckily, practicing is fun.


Thoroughly enjoyed the Talisker 12. Plenty of peatiness there. I recently got a bottle of Uigeadail and I might prefer the Talisker, which is also cheaper.


Bruichladdich(sp) Octomore is the smokiest in MHO. Ardbeg makes some fine smoky spirits too. Laphroig is close, but no cigar.


IMO Laphroaig (why is everyone leaving out the "a"?) is smoky but many are smokier. What Laphroaig brings to the table for me is its extreme "medical" and "salty" taste.


taste is subjective, but peating levels are not. the standard levels for the Trinity of Peat are as follows:

Ardbeg: 55ppm (this has changed over time)

Lagavulin: 40-45ppm

Laphroaig: 40ppm

however, it's important to distinguish between standard levels and novelty bottlings. the Ardbeg Supernova is peated to over 100ppm, twice the usual level. Bruichladdich, which is a relatively lightly-peated malt, produces the highly-peated Port Charlotte line, as well as the Octomore, peated to about 170ppm.

so, unless you're talking about the standard line, you can't simply talk about the peating level of Ardbeg.

finally, since all single malts are aged in a charred oak cask which acts as a filter over time, in general, the older the malt, the less smoky it will taste. compare a Laphroaig 10yo cask strength to a Laphroaig 18yo, and/or 25yo, if you're ever lucky enough to do so.


For those looking to learn more, there's a guy by the name of Ralfy who does fantastic in-depth reviews that will get you excited to enjoy whisky: http://www.youtube.com/user/ralfystuff


I have to agree on the nod to Talisker. My favorite drink at the moment is Talisker Storm. Not a fan of the packaging, but who cares? It's delicious.



