Great writeup. I think this should go hand in hand with articles explaining why p <= 0.05 is not an end-all be-all confirmation of your hypotheses/conclusions. Before I jumped to the software engineering world, I did biostats and was essentially a "bioinformatician". You quickly realize how many experts in the field misuse statistical tools entirely while using their results to prove a point, or worse, draw conclusions incorrectly from their results. A big faux pas I saw a lot was using normality-assumed parametric tests on non-normal data where the skew was clearly significant (i.e. you couldn't get away with it like some non-normal data dists.). Seriously, go take a look at some bioinfo papers (or any biology papers for that matter), it's getting pretty bad. However when we learn (even in master's or Ph.D. programs) about these mathematical tools, we are not often taught about what the math means. I'm sure I'm not the only one to have been taught formulas as a means to prove your research rather than the theory/intuition behind them. Luckily there are articles such as these that force you to step back and consult the maths again to learn what is really going on.
As an aside, I'm sure this sort of thing happens in all walks of life, not just maths/data science. Some programmers don't understand the intuition of certain things that they code and when it is time to explain they will likely freeze because they know how to code it, but don't really know why the code works fundamentally.
> Do you take every observation: square it, average the total, then take the square root? Or do you remove the sign and calculate the average?
Neither of these is even an attempt to measure average daily temperature variation. (Assuming any reasonable definition...)
If you're talking about variation from day to day, then looking at differences in max and min, or at given points in time, or on the same day in different years, would be some approaches. Averaging absolute values makes no sense whatsoever -- if the observations were 20 and -20 the result would be the same as if they were 20 and 20. And calculating the standard deviation of the observations is calculating something else again (it might be standard deviation, or it might just be something dumb). Neither of these are problems with standard deviation.
It's sad if newspapers or their readers don't know what standard deviation means, but they're pretty much innumerate across the board so it's not clear whether further muddying terminology is going to help anyone.
They used the word "change" in the paragraph before the one you quoted. I think by "observation" in this paragraph, they mean "x_i - M(x)" as in https://en.wikipedia.org/wiki/Average_absolute_deviation: the difference between the current value and some measure of central tendency (mean, median, mode). It's not explained well, but if you make this assumption, the article makes sense.
The averaging of absolute values in the article's context is happening around a mean of 0 (averaging the variations). In general, one would average the absolute values of the differences from the mean.
Yes, it would be silly to convert a -20 to 20 if the mean isn't 0.
The advantage with STD is that it is easier to compute relative to MAD in a situation like a random walk. For instance, take a one-dimensional random walk whose steps X_i are +1 or -1 with equal probability, so the mean is 0 and X = sum of X_i. Then E(X²) = n is straightforward, whereas computing E(|X|) is not so easy.
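For anyone who wants to check the random walk claim, here is a rough Python sketch. It enumerates the binomial outcomes exactly rather than simulating, so it is only practical for modest n. The function name `walk_moments` and the choice n = 100 are mine, not from the article. The large-n approximation sqrt(2n/pi) for E(|X|) is the standard Gaussian-limit value.

```python
from math import comb, pi, sqrt

def walk_moments(n):
    """Exact E[X^2] and E[|X|] for X = sum of n independent +1/-1 steps."""
    total_paths = 2 ** n
    second_moment = 0
    abs_moment = 0
    for k in range(n + 1):        # k = number of +1 steps
        x = 2 * k - n             # final position after n steps
        paths = comb(n, k)        # number of paths that end at x
        second_moment += paths * x * x
        abs_moment += paths * abs(x)
    return second_moment / total_paths, abs_moment / total_paths

n = 100  # hypothetical walk length
e_x2, e_abs = walk_moments(n)
print(e_x2)                      # equals n
print(e_abs, sqrt(2 * n / pi))   # exact value vs. large-n approximation
```

E(X²) drops out as n with no work, while E(|X|) needs the full sum (or an asymptotic argument) to pin down.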
That's as clear as mud. What does a -20 vs. a 20 mean? That it was colder at noon than at midnight?
Assuming that there was some sensible definition (e.g. deviation from the past mean temperature, which is not at all what was stated) then "average deviation" has two obvious interpretations. Re-labeling "standard deviation" or "mean deviation" would not be helpful in this case, and the -20 and 20 values STILL don't help the case.
The article isn't clear, but the deviation is either from the sample mean or from the mean of the underlying random variable. You are trying to find the expected difference from the average, hence the absolute value (otherwise, the positive and negative deviations cancel out).
> A big faux pas I saw a lot was using normality-assumed parametric tests on non-normal data where the skew was clearly significant (i.e. you couldn't get away with it like some non-normal data dists.)
Depends on how much data you have. With a couple hundred observations, you can have as much skew as you like and the normal approximation will still be pretty good. I only mention this because there are a lot of misconceptions about how statistics relies on normal data, when really it mostly just relies on the distribution of the mean being normal, which is pretty much a given because of the central limit theorem. There are much worse sins and abuses -- which yes, unfortunately you do see all the time in scientific papers.
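A quick simulation sketch of that point, assuming independent draws from an exponential distribution (my choice of a skewed distribution, not anything from the thread). The `skewness` helper, the seed, and the sample sizes are arbitrary illustration values.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: mean of the cubed standardized deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean(((v - mean) / sd) ** 3 for v in values)

random.seed(0)  # arbitrary seed so the run is repeatable

# Raw observations from a strongly right-skewed distribution.
raw = [random.expovariate(1.0) for _ in range(20_000)]

# Means of 2,000 independent samples of 200 observations each.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(200))
    for _ in range(2_000)
]

raw_skew = skewness(raw)
mean_skew = skewness(sample_means)
print(raw_skew, mean_skew)
```

The raw data should come out heavily skewed while the sample means are close to symmetric. This only covers independent draws with finite variance; it says nothing about dependent or infinite-variance data.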
In general it is a misconception that real data always follows a normal distribution. It is true that if you sum many /independent/ random quantities, then the result is approximately normal (e.g., the central limit theorem and generalizations). But real data tends not to be independent. Many real world quantities follow extremely skewed distributions, e.g., Zipf's law, Korcak's law, Pareto's laws.
For a concrete example, if you look at the distribution of the number of friends users have on social networks, you might expect that 95% of people have the mean number of friends +/- a few standard deviations (since this would be the case for a normal distribution). It would be virtually impossible for someone with a number of friends that is say thousands of standard deviations away to exist, yet there will be many such users in social networks (celebrities, bot networks, etc). In reality, the empirical distribution in this case follows an extremely skewed distribution.
> It is true that if you sum many /independent/ random quantities, then the result is approximately normal...
Don't forget defined variances! The end result could more generally be Lévy alpha-stable (alpha < 2.0), as it is with many financial instruments.... Normality requires a defined (finite) variance.
And yet a Pareto distribution still has a mean, and the sampling distribution of that mean is approximately normally distributed.
Of course I'm not claiming that you can just pretend that a Pareto distribution is a normal distribution, but statistical tests are generally concerned with differences in means (group A does on average 25% better than group B) so it's the sampling distribution we're interested in, not the parent distribution.
You make a good point about autocorrelation and dependent data, but that's a very different issue. To riff on your example about social networks, you'd have dependent data if you're trying to see what kind of news articles people like to read, if those preferences turn out to be mostly guided by what friends are reading.
yeah, that's why I had to use the parens saying that it was clearly data that would affect their stats (think about studies where you can only extract data from 5 rats per group, etc.). Also why I said normality-assumed parametric tests. There are many other parametric tests which aren't based on "normality" assumptions. But yes as you said, you can absolutely use normality-assumed parametric tests on non-normal data within reason and your conditions you listed are within reason. Of course this is my opinion and my opinion could just as easily be entirely wrong as decided by the expert community :D
And as always, your allowances depend on the test you are performing :P Also gotta love statistics for keeping a large list of customary exceptions determined by the community too.
A parametric test is making an assumption about the distribution of the random variable, not its mean.
Inside many of these tests, they might make use of CLT assumptions of sample means, but that doesn't mean they don't still depend on the distribution assumptions.
Skewness is a major consideration and can lead to completely different inferences.
Let's say we wanted to find the median and our distribution was assumed to be normal. Under no skewness, the sample mean would be a good approximation. Under skewness the sample mean would be a very bad approximation.
If the test in question is solely about the mean of the random variable, and nothing else about the distribution, then it's possible that the normality assumption only needs to extend as far as the sample mean (à la t-test). But that's hardly a parametric test anymore, is it?
It's not clear to me what kinds of tests you are referring to. Ordinary least squares regression, for example, is all about estimating conditional means, and it is very much parametric. Just finding the best estimates for the parameters of a one-dimensional distribution is usually not particularly interesting, is certainly not what statisticians spend most of their time on and in any case nobody's suggesting that the population mean is always equal to the population median.
Why do you think we have the classification 'parametric' if the only thing that matters is the distribution of the sample mean? If it's all going to converge to normal as you say, why are there parametric and non-parametric tests?
Regression works like this: E[Y|X] = Xβ. It is parametric because you model the conditional mean as a weighted sum of various predictors, and these beta "weights" are your parameters. This is true of ordinary regression, Poisson regression, binomial (logistic) regression and so on. An example of nonparametric regression would be something like regression splines.
Why are there nonparametric tests? Because for small sample sizes you can't always trust the normal approximation, and as you state this might be due to something like skew. This takes nothing away from the fact that inferential statistics is almost always about comparisons of means. And yes, the t-test is a parametric test, of which the Mann-Whitney or Wilcoxon would be the nonparametric equivalents.
I confess that I found this article unhelpful. There are interesting tidbits in there, but I don't think it helped me identify any specific errors you'd reach by using a standard deviation rather than mean absolute deviation. The closest it came was:
"1) MAD is more accurate in sample measurements, and less volatile than STD since it is a natural weight whereas standard deviation uses the observation itself as its own weight, imparting large weights to large observations, thus overweighing tail events."
More accurate how? Less volatile, not overweighing tail events: what inference would I make incorrectly by using the standard deviation?
To be clear, I'm not arguing "for" standard deviation, I'm just saying that I wish this article had said more about why it's potentially misleading/less powerful.
I agree. For both the MAD and STD, we are trying to reduce information about the "spread" of a distribution to a single number. Any such reduction must lose information, so you should pick whichever quantity is suitable for your needs.
E.g., in the article they mention that the Pareto distribution has finite MAD but infinite variance. This is meant to be an argument against using the STD, but actually the infinite variance tells us something really important that the MAD does not: the classical central limit theorem /does not apply/ for the Pareto distribution! (The law of large numbers still holds as long as the mean is finite, but the sample mean does not settle down with Gaussian fluctuations.)
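To see the finite-MAD, infinite-variance contrast numerically, here is a rough sketch. It assumes a Pareto distribution with minimum 1 and tail index 1.5 (my choice; any index between 1 and 2 gives a finite mean and infinite variance). Instead of random draws it uses evenly spaced quantiles, so the output is reproducible. The function names are hypothetical.

```python
import math
import statistics

ALPHA = 1.5  # assumed tail index: finite mean, infinite variance

def pareto_quantile_sample(n, alpha=ALPHA):
    """n evenly spaced quantiles of a Pareto(alpha) distribution with minimum 1."""
    # Inverse CDF of F(x) = 1 - x**(-alpha), evaluated at u = (i + 0.5) / n.
    return [(1.0 - (i + 0.5) / n) ** (-1.0 / alpha) for i in range(n)]

def spread(values):
    """Return (standard deviation, mean absolute deviation about the mean)."""
    mean = statistics.fmean(values)
    sd = math.sqrt(statistics.fmean((v - mean) ** 2 for v in values))
    mad = statistics.fmean(abs(v - mean) for v in values)
    return sd, mad

for n in (1_000, 10_000, 100_000):
    sd, mad = spread(pareto_quantile_sample(n))
    print(n, round(sd, 2), round(mad, 2))
```

As n grows, the MAD column should level off while the standard deviation keeps climbing, which is what "infinite variance" looks like in a finite sample.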
I think the real message should be to avoid blindly applying techniques and tools (especially formal ones) without thinking about why or what they capture.
Taleb is exasperating. Pareto-Levy distributions are statistical nihilism the way Taleb talks about them.
Data is very often approximately normal. Or can be approximated with something like Student-T. That includes estimators for volatility in stock returns. If you assume your risk profile can be characterized with standard deviation, well, you're an asshole. It also can't be characterized with MAD.
Then you have stuff like this: "MAD is more accurate in sample measurements" -what does this even mean?
Thank you for saying this! I'm just a struggling armchair intellectual, but it seems to me like every half year Taleb comes up with something to loudly hand-wring about, something that nobody else gives a damn about because they're not in the attention whoring business.
No, he's completely legitimate actually. He's just so far ahead of people that they can't tell. There's a good Kahneman quote saying he's top 100 intellectuals.
Bingo bingo bingo. Reducing the spread of a distribution to a single number is correct for very special distributions. Beyond these special distributions you have to do more study, take more measurements, do more simulations to understand what you have underlying your mean or median.
Standard deviation is not the best terminology to use because it sounds like it's referring to the mean absolute deviation (MAD) rather than the square root of the mean of the squared deviations.
And when humans think of mean deviation, it's more intuitive to think of deviation in terms of regular units in relation to the mean rather than the square root of the mean of squared deviations. The former more accurately reflects human intuition.
This is what Taleb is saying. MAD is more intuitive to humans, and we can see this in particular because experienced statisticians, when asked to describe what standard deviation "means", actually describe MAD.
I don't understand. The usual explanation I hear (and that I think of) when explaining what a STD of x is falls somewhere along the lines of "most (about 2/3rds) of the data will be within +/- x of the average". Is this wrong?
If not, can you give me an example of the typical description people give for STD that actually describes MAD?
Yes, that is wrong. It sounds like you might be thinking about the standard deviation of normally distributed data. In this case, you can say something like, "the probability an observation will be within about [mean-2sd, mean+2sd] is 95%".
But that's assuming the distribution is normal. In other cases, this doesn't hold, but there are more general statements, like Chebyshev's inequality.
I have no idea when people would describe SD as MAD, but wouldn't be too surprised, since people first coming into statistics often seem to have trouble conceptualizing how a squared difference could be viewed. It would be surprising if a trained statistician mixed the two up, because SD and MAD arise from something they should be familiar with--Lebesgue spaces.
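The normal-data figures and the Chebyshev bound mentioned above are easy to compare side by side. A small sketch using Python's `statistics.NormalDist`; the helper names are mine.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def normal_coverage(k):
    """P(|X - mean| <= k * sd) when X is normally distributed."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

def chebyshev_lower_bound(k):
    """Chebyshev: at least 1 - 1/k^2 of any finite-variance distribution lies within k sd."""
    return max(0.0, 1.0 - 1.0 / k ** 2)

for k in (1, 2, 3):
    print(k, round(normal_coverage(k), 4), round(chebyshev_lower_bound(k), 4))
# 1 0.6827 0.0
# 2 0.9545 0.75
# 3 0.9973 0.8889
```

So "about 2/3rds within one SD" is a statement about normal data. Without a distributional assumption, Chebyshev guarantees nothing at one SD and only 75% at two.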
I've got a very non-statistics math background, but what you say suggests that there would be a nice way to visualize standard deviations two-dimensionally (since they arise from an L_2 norm), and that it's the one-dimensional "bell curve cross-section width" pictures that confuse people.
It's really pointless to argue about the "best" deviation algorithm, at least on the basis of how it responds to outliers. The process of identifying and ignoring/deprecating outliers isn't something that can or should be lumped in with a simple notion of deviation, be it RMS, MAD, SD, or whatever. Any simple algorithm that you come up with to represent one data set may fail badly with another for this reason (and others.)
Outliers need to be removed, or at least understood, before performing any statistical calculations.
I don't think that answer helps. How are you assigning "too much" weight on outliers, and the process behind deciding the right amount of weight? Can you think of any concrete examples?
Obviously "too much" depends on the subject matter.
But the point is that,
1. STD gives a lot more weight to outliers than MAD.
2. People constantly hear STD and think it means MAD, for all the reasons the article mentions.
The argument isn't that STD always gives too much. It is that it gives a lot more than people expect.
The extreme example given in the article is that a statistical process can have infinite STD, but finite MAD. In other cases, say income, the STD might be double the MAD. That's bad if you think STD means MAD.
Anyhow, this could be solved by educating people on what STD actually means, or by just using MAD. The article apparently thinks the latter is more practical, especially since the benefits of STD have decreased over time.
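A rough simulation of the "more weight to outliers" point, for what it's worth. The Gaussian sample, the seed, and the single outlier of 1000 are all made-up illustration values. For Gaussian data the ratio STD/MAD should sit near sqrt(pi/2), about 1.25.

```python
import math
import random
import statistics

def std_and_mad(values):
    """Return (standard deviation, mean absolute deviation about the mean)."""
    mean = statistics.fmean(values)
    std = math.sqrt(statistics.fmean((v - mean) ** 2 for v in values))
    mad = statistics.fmean(abs(v - mean) for v in values)
    return std, mad

random.seed(1)  # arbitrary seed so the run is repeatable
gaussian = [random.gauss(0.0, 1.0) for _ in range(20_000)]

std, mad = std_and_mad(gaussian)
clean_ratio = std / mad
print(clean_ratio, math.sqrt(math.pi / 2))

# Add one hypothetical extreme observation and recompute.
std_out, mad_out = std_and_mad(gaussian + [1000.0])
outlier_ratio = std_out / mad_out
print(outlier_ratio)
```

One extreme point barely moves the MAD but multiplies the STD several times over, so anyone reading the STD as if it were the MAD would badly overestimate the typical deviation.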
In this case what Taleb is concerned with is decision making. The right amount of weight is what allows human beings to make good decisions. He believes that MAD is much more intuitive to humans and therefore leads to better decisions.
edit: OK I get that you wanted an example of what "too much" weight is. If you're looking for "how much the next datapoint will deviate from the mean, on average", then the MAD will tell you that, not the STDV. Except in some specific fields (maths, physics), people are much more interested in the MAD than the STDV, but all they get to make decisions is the STDV.
In many cases outliers are extremely important. One that comes to mind is high spenders in mobile games.
Trust me, if analysis was as simple as getting rid of outliers, treating everything as Gaussian, and retrieving simple summary statistics, then good data scientists wouldn't be paid $150k+ :)
(2) Generate the absolute deviations of your data from this median which is {4,0,4,0,4,0,4,0,4,0,4,0,4,0,4,0,4,0,998}
(3) Find the median of the absolute deviations which is 4.
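Steps (2) and (3) can be checked directly from the list of absolute deviations given above. This sketch only uses the numbers quoted in step (2). The comparison with the mean of the same deviations is my addition, to show why the median version counts as "robust".

```python
import statistics

# Absolute deviations from the median, copied from step (2): nine (4, 0) pairs, then 998.
abs_deviations = [4, 0] * 9 + [998]

median_abs_dev = statistics.median(abs_deviations)
mean_of_abs_devs = statistics.fmean(abs_deviations)

print(median_abs_dev)               # 4
print(round(mean_of_abs_devs, 2))   # 54.42
```

The single 998 drags the mean of the deviations up to about 54 while the median stays at 4.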
It's ironic that Taleb prefers a statistic that ignores extreme examples (i.e. black swans), but he never seems to make sense to me. I've found MAD useful in dealing with noisy data.
> It's ironic that Taleb prefers a statistic that ignores extreme examples (i.e. black swans)
No, that is incorrect on two fronts.
1) MAD does not "ignore" extreme examples, it just weights them the same as other examples. Nassim argues that the weighting of extreme examples in STD is excessive and makes STD less intuitive. I really don't know how you could say that MAD "ignores" extreme examples - they obviously do influence MAD.
2) The act of computing MAD or STD on a sample of observations has no relevance to Black Swan theory. In Black Swan, Nassim defines a black swan event as an unexpected event of large magnitude or consequence. Hence, by definition, an event that has already been observed cannot be a black swan event.
To put it another way, Nassim's main point in Black Swan is that using historical observations to estimate forward risk renders one fragile to Black Swan events - you could use any dispersion metric and this is still the case.
I went with Taleb's proposed definitions:
"Do you take every observation: square it, average the total, then take the square root? Or do you remove the sign and calculate the average?"
In my experience MAD refers to either Median Absolute Deviation or the Mean Absolute Deviation. I was using the median version which is a pretty common "robust" statistic. Although I have occasionally seen the mean version it seems to be less common in practice.
Take a look at the Wikipedia article you linked. No version of Average Absolute Deviation is consistent with Taleb's definition. No squaring, no square root. Sounds more like a geometric mean.
This is exactly what is so frustrating about Taleb. His ideas only partly make sense. He often seems to see the problem but his solutions are poorly thought out. Of course, he thinks his solutions are perfect and everyone else is an idiot.
In what field do you work that the median absolute deviation is used at all, let alone more than the mean absolute deviation?
When he talked about sigma being sqrt(pi/2) times the mean absolute deviation, did that not make it abundantly clear what he was discussing?
>No squaring, no square root. Sounds more like a geometric mean
Do you even know what the geometric mean is? (It has a root function so your statement just sounds stupid)
Dispersion functions are built off the distance function under the metric you want to use. Standard deviation uses the L2 metric, which implies a Euclidean distance function. (L2 corresponds to averaging pow(abs(u-x),2) and then taking pow(avg,1/2) as your functions.)
Mean absolute deviation takes the L1 metric, which implies pow(·,1) and pow(·,1/1).
This becomes averaging pow(abs(u-x),1) and then taking pow(avg,1), which, needless to say, is the same thing as averaging the absolute differences.
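Spelled out as code, the L1/L2 family above might look like this. It is a sketch with a hypothetical function name and a made-up data set, using the population (divide by n) convention.

```python
def lp_dispersion(values, p):
    """Generalized deviation: (mean of |x - mean|^p) ** (1/p)."""
    n = len(values)
    mean = sum(values) / n
    return (sum(abs(x - mean) ** p for x in values) / n) ** (1.0 / p)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up example, mean 5

print(lp_dispersion(data, 1))  # mean absolute deviation: 1.5
print(lp_dispersion(data, 2))  # population standard deviation: 2.0
```

With p = 1 the outer power is a no-op, which is why the mean absolute deviation has no squaring and no square root. Both are the same recipe with a different exponent.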
It depends on whether the numbers provided are the actual data points themselves, or the deviation from median (the second is what the article provided).
>Except in some specific fields (maths, physics), people are much more interested in the MAD than the STDV, but all they get to make decisions is the STDV.
C'mon guys, it all comes down to whether you like a rhombus or a circle more :) Interesting that MOND (modified Newtonian dynamics), if true, would suggest that a circle at very big distances looks like a square (notice, not like a rhombus :), so physics may start to like it more.
I met Nassim Taleb a few years ago. He was doing due diligence on a fund I was working at.
He's an incredibly colourful character. We chatted about various authors in the statistical space, and chastised them all! "What about (this guy)? Idiot!" It was a rant worthy of a Hitler parody. "Everyone who believes in standard deviation, leave the room!"
It was hilarious. He had a point, too, about our statistical methods. A few months later the fund blew up in a textbook way. He'd have made a packet if he followed his own advice. No idea if he did.
"What is worse, Goldstein and I found that a high number of data scientists (many with PhDs) also get confused in real life."
Wow. Just wow.
Some people are calling themselves "Data Scientists" who don't know the difference between σ and MAD?
I don't care how many letters are after your name. If you don't know the absolute most basic types of summary statistics, you have no business calling yourself a Data Scientist.
---
To the phonies, please stop. Most of us work hard to stay at the bleeding edge, lest we fall behind, as I'm sure all HN devs strive to do. You're essentially defrauding people, and if you're working on anything important, the cascading effects can hurt a lot of real live actual human beings.
It's one thing to play around with data science to satiate your curiosity. It's an entirely different matter to declare it your profession. For example, I play around with KSP. That does not make me an Aerospace Engineer.
---
Edit in reply @tgb: Actually I think you and I are on the same page about what Taleb means. In fact, MAD is probably the more intuitive summary statistic for most people.
I just find the number amongst Data Scientists surprisingly high, since half the job is spotting these kinds of misinterpretations.
I don't know what Taleb is referring to, but I assumed it was something related to this line:
>In fact, whenever people make decisions after being supplied with the standard deviation number, they act as if it were the expected mean deviation.
That is, I suspect he means that if you gave them equations or data and asked them to run some computations on them, they wouldn't mess up the two. But if you were to then say, "okay, what does this result tell you to do when you're buying groceries at the super market?" they would then make that decision incorrectly.
This isn't to dismiss the problem. Just I don't think he's saying it's quite what you think he's saying. Too bad Taleb didn't write more on either of these two lines. (Well, chances are he did, but I just don't know where.)
Even worse there are people (utter frauds of course) who confuse MAD and MAD! In all seriousness, there is much confusion between Median Absolute Deviation and Mean Absolute Deviation out there. Ironically the MAD in this article is still not a robust measure of variation in data as it will break for the many distributions that have undefined/infinite mean (Cauchy and Levy as examples).
Even then many summary statistics rely on a well-defined PDF which is also not true for many real life cases. I think most data scientists out there are very familiar with quantiles, which are often more useful as all random variables have a CDF (and the quantile is just the inverse CDF).
I quite enjoy Taleb's writing (I tend to find his ego a bit amusing) but I think even he is guilty of Jaynes' "Mind Projection Fallacy"[0] in regards searching for more meaning than exists in Fat-tailed distributions. When we model our data with infinite/undefined mean and variance distributions we're just saying "I don't know". No amount of cleverness with summary statistics, or understanding of pathological distributions will create information where there is none.
The overall point being: there are many, many ways of viewing statistics and it's pretty trivial to find a perspective that allows you to call someone a "fraud". Sure there are actual frauds in data science, but one of the biggest strengths in this trend is bringing quantitative people from a wide range of backgrounds to gain refreshing insights. It is much more useful to encourage cross-discipline exploration than to simply say "you don't belong here".
MAD is about as simple as it gets, but it is not "the absolute most basic type of summary statistic." In many statistics courses, it is not taught at all, so it's perfectly possible for someone to know how multiple regression and random forests work but to have never before heard of quantifying spread in any other way than with squared deviations.
Just because someone doesn't know something that you do know doesn't make them stupid.
I'm an undergraduate engineer who has taken stats and I've never even heard of MAD before. I've thought about it [mean(abs(x))] but I've always assumed it would behave similarly, which also seems to be the case (apparently MAD is better, though I don't really see why; as I understood it, his argument is practically the same as when arguing about using median vs. mean), which is why I'm not really understanding what the problem is. It's just a rescaling; as long as one sticks with one measurement type, what's the problem?
> Some people are calling themselves "Data Scientists" who don't know the difference between σ and MAD?
Uh, that's just terminology. Data science is quite diverse and people have different backgrounds and work in different domains. Many mathematically/CS/EE-trained data scientists probably just look at the whole thing in terms of norms without explicitly using the term, MAD.
A mix: some people really can't code on a whiteboard, others freeze in an interview. Overall, I find FizzBuzz fairly useless. It was originally just a question of whether you knew the modulo operator, but it's gained ridiculous status.
The modern day "professional", they may hold degrees, may sound impressive, probably read a few books on the topic from O'Reilly, but basically are not that good. Same with programmers, most are not that good. Same with decorators, or writers, or any field really. There will be the few who like it, learn it, practise it, aim to get better at it, but the vast majority, no.
I submitted this for Nassim's article but this whole site is pretty great. Check out their "big questions" on the left hand side where they get people to weigh in on them:
I understand we disagree, so I'll try to clarify my point of view a bit which hopefully makes things better not worse.
One often (despite promises not to) is forced to optimize what one measures or shares. It may or may not be relevant to a particular project, but I doubt it is completely invalid (despite the claimed clean room separation of descriptive and inferential procedures).
The idea is: if one believes absolute deviation is the one true measure then it would not make sense to optimize over a different measure (variance).
I in fact like quantile regression, but it has its own caveats.
So, let's say we want to make an industrial process more reliable by reducing the variability of its output. We repeatedly measure the variability by means of MAD. We'd like to know what causes the variability, so we regress MAD on various predictors to see what causes variable performance. The regression allows us to optimize MAD but the regression itself is fit using ordinary least squares. I don't think anyone would object to that?
I guess you're thinking more along the lines of describing the performance of a model in terms of the MAD, and then optimizing using MAD / L1, which isn't always a wise choice? Then I guess we don't disagree at all. I do like MAD as an easy to communicate loss statistic in many cases (as well as % of cases with predictions further than a domain specific distance from the truth), but I don't think many people would consider loss to be a descriptive statistic at all – it describes not the world but a model of the world.
You are mistaking him for someone who is ignorant and slagging the best practices; in fact, he's incredibly knowledgeable and experienced and slagging best practices. I'm not saying he's guaranteed to be right about everything, but he's made rather a lot of money by turning his criticisms into actions, and that's a pretty tall bar to leap.
I'll say it again because I can already hear the reply buttons clicking... I'm not saying he's right about everything or that money is the only measure of rightness. I'm just saying that it's definitely a measure worth paying attention to, and definitely puts you on the experienced side rather than the ignorant side. If it was just chance, it's still interesting.
FWIW I am also someone who knows what he's talking about, as I am a seasoned professional in the same line of work.
This guy's accomplishments don't make the Chebyshev inequality any less true (nor all the other theorems involving variance), so I don't see how he can claim something like this and be taken seriously by people in the field.
Because it's not about the theorems being wrong, it's about people using them in impractical ways.
The Central Limit Theorem is true. Full stop. It can't be wrong. However, in the real world, a lot fewer things are truly Gaussian than may initially meet the eye. It doesn't make the CLT "false", it just means that people who apply it too carelessly are making a mistake. Standard deviation is a thing, but that doesn't make it the right thing for a given task.
A lot of people apply statistics inappropriately. It's hardly their fault, it's basically what they are taught. I remember seeing my wife take her biology statistics courses, which at times seemed to be a course in which you would repeatedly calculate p-values. Just that, over and over; calculate this p-value. Calculate that p-value. Calculate this other p-value. Say "Yes" if it's less than 0.05 and "No" if it's greater. Now do it again. And again. And again. Certainly numbers went in one end of the calculator and came out the other, but did they mean anything? If not, it's not because the p-value isn't "true", just not even remotely as useful as the course was implicitly teaching.
(Yes, words were said about how it wasn't the only useful thing, but the actions spoke loud and clear. Compute p-value. Say yes if below 0.05. Say no if above. Repeat. The current problems all the fields are having with statistics aren't that surprising if you look back to the beginning.)
Looks like you're committing your own fallacy there. Jerf's topic is Taleb's level of authority, not the veracity of the statement of the article. How is it "argument from authority" when the argument is trying to establish how much authority the actor has in the first place?
The comment even has two separate disclaimers saying that Taleb might not be right about this, just that he has enough experience to be worth listening to on the subject.
It's not really appeal to authority. He's not claiming that Taleb is right because of who he is, rather that we shouldn't dismiss him out of hand because he's very accomplished and credible.
I remember that when I learnt Probability Theory at Uni around 1996, I got a 3- in the exams (German grades go from 6 (the worst) to 1 (the best), a - indicating a tendency towards the worse grade, and + indicating a tendency towards the better grade). Everyone else in that year got a worse grade ... And we were all (aspiring) mathematicians. The thing is, if you want to teach that stuff in a way that is both rigorous with respect to theoretical underpinnings, while at the same time making sure that the student can actually apply in practice what they learnt ... that's a pretty difficult task, even if your students are all mathematicians. Now, I have no idea what a good way to teach these things to non-mathematicians would be!
Edit: I always was quite astonished at how easily biologists etc. seem to have understood quite complicated probabilistic mathematics, but now I understand that mostly they are just cargo culting stuff they don't really understand.
I have replaced uses of the Standard Deviation with the Mean Absolute Deviation at work on several occasions, for just the reasons described here. It often leads to substantial improvement in predictive validity, in some cases fixing a broken process.
Might be a little premature to call for retirement of sigma. The mathematical concept of standard deviation is super useful but I agree that the name is confusing and that we need to improve the naming and ideally the notation and teaching of statistics. Ability to deal with uncertainty and variance is becoming more and more important in all sorts of fields as data volumes get larger, so I'd hate to see us give up just because it is hard to understand.
"Dealing with variance" is the same as dealing with mean absolute deviation.
Variance is "super useful" when one works with Gaussian distributions, and overusing variance is part of the greater problem of overusing Gaussian distributions and Euclidean distances.
Agree. You have to wonder if these articles come up as a way to draw people in to looking back at the theory to justify their use. A sort of "trolling" to get people back on board thinking of the theory and not just the use. Both forms of deviation measurement have their use, along with the many other forms of deviation measurement used in stats.
I've never really been a fan of the notation. But I don't know how you can conceivably enforce this because, just like in programming, people will pontificate about 'design', 'clean code', 'maintainability' and other similar cargo-cult buzzwords but they will go ahead and do whatever they want with their code.
Completely agree that standardizing notation takes decades, as a lot of it is how textbooks and tutorials from experts are written. I for one am really glad, though, that we standardized on Leibniz's notation rather than Newton's for calculus and aren't using Roman numerals anymore.
Squared curves have a nice-looking derivative. Absolute value curves have a constant derivative. The only reason the squared version won was to make the calculus easy. Now that we have computers, the absolute value curve is better for most uses.
Using MAD instead of standard deviation in the formula for correlation would result in the possibility of random values being "correlated" by values that exceed 1 in magnitude (the proof for a correlation coefficient being bounded by 1 relies on the Cauchy-Schwarz inequality, which could no longer be appealed to).
This doesn't detract from Taleb's points at all, but it does show a mathematically "nice" property that results in the usage of standard deviations.
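To make the bounded-by-1 point concrete, here is a minimal sketch in Python. The function name `mad_ratio` and the data are made up for illustration; it simply swaps MAD in for the standard deviation in the usual correlation formula and correlates a variable with itself.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

def std(xs):
    # Population standard deviation.
    return math.sqrt(covariance(xs, xs))

def mad(xs):
    # Mean absolute deviation around the mean.
    mx = mean(xs)
    return mean([abs(x - mx) for x in xs])

def pearson(xs, ys):
    return covariance(xs, ys) / (std(xs) * std(ys))

def mad_ratio(xs, ys):
    # Hypothetical "correlation" with MAD in place of the standard deviation.
    return covariance(xs, ys) / (mad(xs) * mad(ys))

x = [0, 0, 0, 0, 10]          # illustrative data with one large value
print(pearson(x, x))          # ordinary correlation of x with itself
print(mad_ratio(x, x))        # MAD-based version, not bounded by 1
```

Since MAD is never larger than the standard deviation, the MAD-based ratio for a variable against itself is always at least 1, and strictly larger whenever the deviations are not all equal in size.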
Thanks! How does this relate to real-world statistics? In what context(s) does this pop up?
Edit: what's wrong with my question? I did mention a gentle introduction was needed, so if the answer is obvious to some, please forgive my ignorance and help me fix it.
Imagine an infinite line and a spinner[1] a short distance away from it. Spin the spinner, wait for it to stop, and then mark the point on the line that the spinner is pointing directly at (or away from). Repeat lots of times. The resulting points have a Cauchy distribution. If you tried to figure out where the spinner was along the line by taking the arithmetic mean of the points, you would fail miserably. Taking the median is much more likely to give you a good answer.
That was still a somewhat contrived example to demonstrate the point, but if you replace the spinner's pointing with photons, you realize that a Cauchy distribution describes the intensity of light shining on a flat surface from a point light source[2].
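The spinner experiment is easy to simulate. This is a rough sketch, not a rigorous treatment: the spinner position (3.0), its distance from the line (1.0), and the seed are arbitrary choices, and `spinner_marks` is just a name picked for the example.

```python
import math
import random
import statistics

def spinner_marks(n, position=3.0, distance=1.0, seed=0):
    """Simulate n spins and return where each one points on the line."""
    rng = random.Random(seed)
    marks = []
    for _ in range(n):
        # Uniform angle in [-pi/2, pi/2), measured from the perpendicular
        # dropped from the spinner onto the line.
        angle = (rng.random() - 0.5) * math.pi
        marks.append(position + distance * math.tan(angle))
    return marks

marks = spinner_marks(100_001)
print("mean:  ", statistics.fmean(marks))    # unreliable, dominated by extreme marks
print("median:", statistics.median(marks))   # settles near the spinner position
```

Rerunning with different seeds, the median should stay close to the true position while the mean jumps around, because a handful of near-parallel spins land arbitrarily far away.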
If we regard taking the absolute value as a squaring followed by taking the positive square root, then basically we have "root mean square" (STD) versus "mean root square" (MAD), that is all. The one calculation takes the square root after the mean, the other moves it before.
If we extend MAD to vectors, then we average the vector norms.
What is the norm? It is the root of the sum of squares of the vector components. So MAD is then the mean of "root sum squares".
The difference between MAD and STD is the use of the arithmetic mean instead of the quadratic mean. Sometimes the quadratic mean is better. For example, if you have n particles of equal mass with velocities v_i, the quadratic mean lets you replace your system with another in which every particle has the same velocity, namely the quadratic mean, so that the total kinetic energy is the same as in the original system. This conservation of energy is a very important property in physics.
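A quick numerical check of the kinetic energy argument, as a sketch with made-up speeds and unit masses (both are assumptions for the example):

```python
import math

def kinetic_energy(speeds, mass=1.0):
    # Total kinetic energy of equal-mass particles: sum of (1/2) m v^2.
    return sum(0.5 * mass * v * v for v in speeds)

speeds = [1.0, 2.0, 3.0, 4.0]   # illustrative values
n = len(speeds)

arithmetic_mean = sum(speeds) / n
quadratic_mean = math.sqrt(sum(v * v for v in speeds) / n)

original = kinetic_energy(speeds)
with_quadratic = kinetic_energy([quadratic_mean] * n)
with_arithmetic = kinetic_energy([arithmetic_mean] * n)

print(original, with_quadratic, with_arithmetic)
```

Replacing every speed with the quadratic mean preserves the total energy; replacing every speed with the arithmetic mean gives a smaller total.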
To understand this better, let's take the following deviations from the mean in series A and B.
A: 2, 1, 3, 1, 2
B: 4, 1, 2, 1, 1
MAD(A) = 9/5 = 1.8
STD(A) = sqrt(19/5) = sqrt(3.8) = 1.95
MAD(B) = 9/5 = 1.8
STD(B) = sqrt(23/5) = 2.14
So it clearly shows that STD penalizes any significant error more heavily. It weights each deviation by itself (square). So the message, if you look at its applications, is that all the values in a series should be in a close range.
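The arithmetic for series A and B can be checked with a few lines of Python. This is only a sketch; as in the comment, the numbers are treated as deviations from the mean already, so nothing is subtracted.

```python
import math

def mad(deviations):
    # Mean of the absolute deviations.
    return sum(abs(d) for d in deviations) / len(deviations)

def std(deviations):
    # Root of the mean of the squared deviations.
    return math.sqrt(sum(d * d for d in deviations) / len(deviations))

a = [2, 1, 3, 1, 2]
b = [4, 1, 2, 1, 1]

print(mad(a), round(std(a), 2))   # 1.8 1.95
print(mad(b), round(std(b), 2))   # 1.8 2.14
```

Both series have the same MAD, but B's single deviation of 4 pushes its STD higher.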
One comment on HN says that the M in MAD is for Median. Taleb's note itself makes it clear, by the temperature example, that it is mean absolute deviation.
Also, what's with the Taleb bashing on HN in these past two articles? The man may be arrogant or he may not be. So what? Bringing that aspect of someone's personality into technical discussions drags them down to a very low level, apart from being ad hominem. I think even in this post he is speaking from the point of view of a practitioner who has seen a tool abused a lot. And he may not have the time to spell out his full knowledge of a subject. Just from that mention of Karl Pearson, it is obvious he knows completely what he is talking about, as that person seems to be the first who gave the term 'standard deviation' to 'root mean square error' (as per the wiki on STD).
A similar thing is using least squares for linear regression rather than minimizing MAD. In the past the argument was that the least squares sum has a closed-form expression, but with computers even that advantage is eliminated.
The nice thing about minimizing MAD is that in typical settings the linear regression line/plane/hyperplane goes through measurement points. As such there is no interpolation, and outliers are nicely cut off, making the result very robust to measurement errors.
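Here is a small sketch of that property for a straight-line fit. It relies on the fact that, for a line with an intercept, some least-absolute-deviations optimum passes through at least two data points, so for tiny data sets one can simply try every pair. The brute-force `lad_line` helper and the data are illustrative only; real implementations use linear programming or iterative methods.

```python
from itertools import combinations

def lad_line(points):
    """Brute-force least-absolute-deviations line; fine for tiny data sets."""
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue  # a vertical line is not a function of x
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        cost = sum(abs(y - (slope * x + intercept)) for x, y in points)
        if best is None or cost < best[0]:
            best = (cost, slope, intercept)
    return best[1], best[2]

def ols_line(points):
    """Ordinary least squares line via the closed-form expressions."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx

# Four points on y = x plus one gross outlier.
points = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 40)]
print("LAD:", lad_line(points))   # follows the four clean points
print("OLS:", ols_line(points))   # dragged toward the outlier
```

The absolute-deviation fit runs exactly through the clean points and ignores the outlier, while the least squares line is pulled far away from them.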
One of the barriers to adopting MAD() is the two passes over a dataset needed to compute it.
As it happens, I made a MySQL feature request for a MAD() aggregate function when Dr. Taleb's article first appeared. Any upvotes on that request would be welcome.
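The two-pass issue can be sketched as follows. The standard deviation can be accumulated in a single streaming pass (Welford's online algorithm is used here), whereas the mean absolute deviation needs the final mean before any deviation can be taken. The function names are made up for the example and have nothing to do with any MySQL API.

```python
import math

def std_one_pass(stream):
    """Population standard deviation in one pass (Welford's algorithm)."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return math.sqrt(m2 / n)

def mad_two_pass(values):
    """Mean absolute deviation; the data must be kept or re-read."""
    values = list(values)
    mean = sum(values) / len(values)                          # pass 1
    return sum(abs(v - mean) for v in values) / len(values)   # pass 2

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(std_one_pass(iter(data)))   # works on a one-shot iterator
print(mad_two_pass(data))
```

An aggregate function in a database sees rows as a stream, which is why the one-pass form is the easier one to offer.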
Can you give some examples where it's more representative? Average height? Average income? There are a lot of arguments in favor of median as a more intuitive replacement (and it is frequently used thus) but consider that:
a) Some people have ZERO income.
b) Either an odd or even number of people have NEGATIVE income.
Every day I deal in execution times, rounded to the nearest integral number of milliseconds. Lots of zeros.
A measure that is (a) not actually intuitively correct, (b) very hard to calculate in your head, and (c) useless for many, many non-trivial cases is NOT a good replacement for "linear mean", which is (a) often intuitively correct, (b) pretty easy to calculate in your head, and (c) always works.
Geometric mean also interacts with unit conversions which use a different zero-point. For instance, if you take the geometric mean of daily temperatures, you will get different means depending on whether you work in F or C or K.
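The unit problem is easy to see numerically. A minimal sketch with made-up temperatures: take the geometric mean in Celsius, then take it in Kelvin and convert the result back to Celsius.

```python
import math

def geometric_mean(xs):
    # Only defined here for strictly positive values.
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

temps_c = [5.0, 15.0, 30.0]                  # illustrative daily temperatures
temps_k = [c + 273.15 for c in temps_c]

gm_c = geometric_mean(temps_c)
gm_k_as_c = geometric_mean(temps_k) - 273.15
am_c = arithmetic_mean(temps_c)
am_k_as_c = arithmetic_mean(temps_k) - 273.15

print("geometric: ", gm_c, gm_k_as_c)    # disagree by several degrees
print("arithmetic:", am_c, am_k_as_c)    # agree up to rounding
```

The arithmetic mean commutes with a shift of the zero point; the geometric mean does not, and it is not even defined once a Celsius reading is zero or negative.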
Also, we should embrace log curves instead of linear curves for changes of perspective, e.g. the default "this is bigger than that" should be thought of in log (vs. linear) terms when taught in school.
I'm rather surprised how exceptionally naive articles are upvoted on HN when the subject matter is specialized. If you're going to talk about retiring standard deviation you should have some pretty detailed mathematical arguments, this is a waste of time.
> This is about how people interpret statistical results. Which is not a mathematical process - it is what comes after the math.
If the latter is not a process with a scientific method which can be detailed and substantiated, then this is mysticism and my interest stops there.
That being said, you really don't need to do math to practice mysticism, but I'm sure having numbers and equations helps with the hand-waving when selling it to non-technical people.
What are you, an economist? Why do you need detailed mathematical arguments for something that can be explained quite simply? Anybody who cares to listen to the field will see these issues time and time again.
Statistics like any other discipline has biases, misconceptions and traditions. Just look at how poorly it's used by academia.
Do you really think that we follow the best practices/procedures we can? Or do you think for the most part practitioners and academics are doing what they were taught?
> Standard deviation, STD, should be left to mathematicians, physicists and mathematical statisticians deriving limit theorems.
E.g., for positive integer n and a sequence of n random variables with the same expectation and with finite variance and, as n grows to infinity, with the variance converging to zero, the random variables, actually points in the Hilbert space commonly called L^2, converge in the norm of that space, and then a subsequence must converge almost surely, that is, the strongest case of convergence. Of course, this is a very old result and standard when considering convergence of random variables.
But standard deviation still has an important role in common applications of statistics without "deriving limit theorems". And, with some irony, we don't derive a limit theorem but use one, indeed, likely the most important one, the central limit theorem (CLT).
With the CLT, under mild assumptions, for positive integer n, as n grows to infinity, the probability distribution of the suitably normalized mean of n independent and identically distributed random variables (the i.i.d. case) converges to a Gaussian. Likely the mildest assumptions are from the Lindeberg-Feller case (don't ask but look it up if you wish, and to read the proof set aside much of an afternoon).
Now, when we have convergence to a Gaussian and have the standard deviation of that Gaussian, we can calculate any and all confidence intervals we want on our estimate of the mean of that Gaussian. So, THAT'S one case of where and why even in just common work we still want standard deviation.
Yes, how fast the convergence is to a Gaussian can be relevant in protecting against Taleb's "black swans" and avoiding, say, the disaster of Long-Term Capital Management (LTCM) in their estimates of volatility.
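As a rough sketch of that use of standard deviation: estimate the mean, estimate the standard deviation, and lean on the Gaussian approximation for the interval. The helper name, the 1.96 multiplier for roughly 95% coverage, and the tiny sample are all choices made for the example; with so few observations the approximation is illustrative at best.

```python
import math
import statistics

def mean_confidence_interval(sample, z=1.96):
    """Approximate CLT-based interval for the mean; assumes a large i.i.d. sample."""
    n = len(sample)
    mean = statistics.fmean(sample)
    sd = statistics.stdev(sample)          # sample standard deviation (n - 1)
    half_width = z * sd / math.sqrt(n)     # z times the standard error of the mean
    return mean - half_width, mean + half_width

sample = [2, 4, 4, 4, 5, 5, 7, 9]          # made-up observations
low, high = mean_confidence_interval(sample)
print(low, high)
```

The interval is only as trustworthy as the convergence to a Gaussian, which is where heavy tails cause trouble.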
That is, suppose we want to estimate the standard deviation of an average (as above). Suppose the random variables we are averaging have a distribution that has in its probability density function a bump way, way, way out in a tail. The way out in the tail means that if we get such a value, then it's really large (in absolute value, and in practice really far from the expectation of that random variable). So, if we get a value in that bump, then we can have a "black swan". But the probability of the bump is quite small. So, we can take samples from the distribution of that random variable and average them for weeks before we ever get a sample from the bump, before we ever see a black swan.
So, doing this, in our sampling never seeing a black swan, we can have an estimate of standard deviation that is significantly too small. So, with that small standard deviation, we can believe that some highly leveraged financial positions are relatively safe, that is, also have low volatility.
Then, bad day, the Russians default on something, we get a "black swan", and suddenly lose some billions of dollars where before we were really sure that wouldn't happen for millennia. Sorry 'bout that.
Roughly, that is what happened in the famous, expensive crash of LTCM.
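The bump-in-the-tail story can be simulated in a few lines. This is a toy sketch, not a model of LTCM: the distribution (standard normal most of the time, a loss of 10,000 with probability 1e-5), the sample size, and the seed are all invented for illustration.

```python
import math
import random
import statistics

P_BUMP = 1e-5            # probability of landing in the far-out bump
BUMP_VALUE = -10_000.0   # the "black swan" outcome

def true_sd():
    # Exact standard deviation of the mixture, from its first two moments.
    mean = P_BUMP * BUMP_VALUE
    second_moment = (1 - P_BUMP) * 1.0 + P_BUMP * BUMP_VALUE ** 2
    return math.sqrt(second_moment - mean ** 2)

def draw(rng):
    if rng.random() < P_BUMP:
        return BUMP_VALUE
    return rng.gauss(0.0, 1.0)

def sample_sd(n, seed=0):
    """Return the sample standard deviation and whether the bump was ever hit."""
    rng = random.Random(seed)
    sample = [draw(rng) for _ in range(n)]
    saw_bump = BUMP_VALUE in sample
    return statistics.pstdev(sample), saw_bump

estimate, saw_bump = sample_sd(1_000)
print("true SD:     ", true_sd())
print("estimated SD:", estimate, "(bump observed)" if saw_bump else "(no bump observed)")
```

With a thousand draws the bump is almost never observed, so the estimate sits near 1 while the true standard deviation is above 30; the estimate only corrects itself after the bad day has already happened.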