The paper is interesting because they calculate the theoretically optimal difficulty for a specific class of learning algorithms: https://dx.doi.org/10.1038/s41467-019-12552-4 (I think the method might be applicable for scheduling flashcards better than the rule-of-thumb spacing of Anki et al.)
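To make the flashcard idea concrete: a scheduler could aim each review at the moment predicted recall drops to roughly the paper's sweet spot, instead of using fixed multipliers. A minimal sketch, assuming an exponential forgetting curve and a per-card "stability" estimate (both are my assumptions, not something the paper provides):

    import math

    def next_review_in_days(stability_days: float, target_recall: float = 0.85) -> float:
        """Days until predicted recall decays to target_recall, assuming an
        exponential forgetting curve R(t) = exp(-t / stability)."""
        return -stability_days * math.log(target_recall)

    print(next_review_in_days(10.0))   # ~1.6 days for a card with 10-day stability
    print(next_review_in_days(60.0))   # ~9.8 days for a well-learned card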
The Independent article is unfounded speculation about this applying to the way humans learn, without any discussion of whether the model is actually applicable. (Most things humans are trying to learn aren't binary classification tasks.)
The paper includes the Law & Gold model of (monkey) perceptual learning as one of its cases, and has a bit of discussion about how stochastic gradient descent in ML is quantitatively similar to observed characteristics of human learning. I don't think it's unfounded speculation, except perhaps on the part of the paper.
While human learning might be similar, is it similar enough? Humans seem to require very few data points to learn something. Just a few examples are enough for a human to grasp a pattern. As far as I know, we haven't been able to do the same in ML.
I’m not so sure about that. Adult humans can learn new things with a remarkably small number of additional data points, but they’re drawing on a lifetime of human experience which is also data. Children tend to need more help.
Humans have always related the latest technology to their own functioning, particularly that of the brain. It's fascinating to watch people take pistons and mechanical automatons as metaphors for the mind, then sequential programming, and now ML. Humans have an incredible ability to restructure their ideas about their own brains around new machines and adapt to new ideas, in a way that would break a machine. Not to mention the rest of human biology that contributes to our functioning minds outside of the brain.
I don't know how to separate learning something through immersion from absorbing the broader world view that those ideas are nestled within.
Yes, Anki’s goal of “let’s quickly hide all the cards you know and only show you cards you are forgetting or on the edge of forgetting” has never seemed overly useful. The end result is the following algorithm: “let’s present you with the toughest things you can’t remember over and over again, and ignore all the facts you’ve done so well at learning (until you have almost, or completely, forgotten those facts)”
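For reference, the core of the SM-2-style update that Anki's scheduler descends from looks roughly like this (a simplified sketch; the real scheduler adds learning steps, fuzz, lapse handling, and so on):

    def sm2_review(quality: int, reps: int, interval: float, ease: float):
        """One SM-2 review. quality: 0-5 self-grade; reps: consecutive successful
        reviews; interval: current gap in days; ease: easiness factor (starts at 2.5)."""
        if quality < 3:                       # failed: the card starts over
            return 0, 1.0, ease
        if reps == 0:
            interval = 1.0
        elif reps == 1:
            interval = 6.0
        else:
            interval = interval * ease        # gap grows geometrically while you succeed
        # ease drifts down on hard recalls, up on easy ones, floored at 1.3
        ease = max(1.3, ease + 0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02))
        return reps + 1, interval, ease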
Throwing up cards you can answer in 1 second is not wasting anyone’s time. It is more likely to encourage people by reminding them of what they already know.
> The end result is the following algorithm: “let’s present you with the toughest things you can’t remember over and over again, and ignore all the facts you’ve done so well at learning (until you have almost, or completely, forgotten those facts)”
In my understanding, that's closer to optimal. Effortful retrieval is much more effective at strengthening future retrieval.
I haven’t used either Anki or wanikani (I use SuperMemo), but from what I’ve heard wanikani has no leech management, meaning you end up stuck repeating the kanji you keep failing, which makes the experience miserable. Leech management is really important for any decent SRS.
Sometimes you'll have much more difficulty with certain items than others. No matter what you do, even after repeated attempts to learn, you just can't get them to stick.
These are "leeches". They burn productive study and review time.
I have a theory that these items have lower adjacency to past experience or knowledge and it's difficult to form mnemonics or other connections. Or they're less novel and don't cause our brain to take interest. That's where all of my leeches lie -- in the realm of things I don't particularly care about.
A good leech management algorithm will back-burner unproductive items so you can focus on the rest of the concept population. There are different types of leeches too -- things you don't get during introduction, or things that you can commit to short term memory but won't stick for long. A good algorithm will identify all of them and block them.
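Anki does ship a blunt version of this: once a card has lapsed more than a threshold number of times (8 by default, if I remember right) it gets tagged as a leech and can be auto-suspended. Roughly:

    from dataclasses import dataclass, field

    @dataclass
    class Card:
        lapses: int = 0
        suspended: bool = False
        tags: set = field(default_factory=set)

    LEECH_THRESHOLD = 8   # lapses before a card is flagged (Anki's default, IIRC)

    def record_failure(card: Card) -> None:
        card.lapses += 1
        if card.lapses >= LEECH_THRESHOLD:
            card.tags.add("leech")
            card.suspended = True   # back-burner it; rewrite or drop the item later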
That's really interesting, I'll have to give supermemo a try. I definitely remember those items in WaniKani; a lot of the time they were mnemonics based on pop culture references that I just didn't get, or the mnemonic was just kind of a stretch, or too many similar concepts had been introduced at once. When I stopped I had definitely hit a wall where I just didn't feel like I could keep learning.
Unless this article is leaving out some major points, the whole thing seems flawed. So their machine-learning models learn best if they fail 15% of the time. Fair enough - but trying to discern anything meaningful about how often people should fail based on that seems like quite a stretch.
Anecdotally, I'd say I fail most things on 95% or more of attempts. That's why we rewrite, debug, practice, drill, and google; avoiding failure takes a lot of human effort.
Humans get way more information out of each failure than ML systems. When you or I fail just a few times we can analyse the failures and often discover huge classes of wrong behaviors, and never repeat any of them. We're also good at differentiating which parts of the failure caused it, and can even learn which parts were successful. We might even test dozens of hypotheses at once in a single attempt, even if we're focusing on just one of them. A computer often only gets one bit of learning from a failure or a success: this single behavior in particular either did or did not work.
My hypothesis is that we model the system we're studying and simulate many 'attempts' for every real world attempt. I.e. we grow a low-fidelity, but much faster, model of the system in our brain that we can use to make medium-low confidence predictions about the real system many times for each time we test against the real system.
So when you say you fail 95% of the time, I'm saying each of those failures actually has 200 mini-successes embedded in it that you can still use to train your mental model.
> and often discover huge classes of wrong behaviors, and never repeat any of them
Once burned, twice shy. And often that results in irrational aversion to huge classes of behaviors just because they appeared in the larger context of the failure of an endeavor as a whole, which I'd say is not a good way to learn from failures.
Sometimes people go into a situation confused, fail, and don't know how to interpret why they failed or what parts caused the failure. I think it is worth distinguishing that type of failure from the type you're talking about. Why? Often when you tell someone you don't know how to do something or you think you'll fail at something, they have your type of failure in mind and they encourage you to just try again.
It depends what "failure" even means. If you mis-place a semicolon and have to go back and fix it, have you "failed"? I'd say no. I'd say a project has only failed if it never gets to a working state.
If you failed to succeed in the expected amount of time/effort, it's a failure. Maybe you add some tolerance for going past the estimate before classifying it as a failure, but it's still rooted in the expectation.
For example, if it took you ten years to pass kindergarten, you've failed.
On a binary classification task, it is a priori true that you would likely not learn if you were right 50% or 100% of the time. This is not a function of any particular learning algorithm.
If you got 100% right, you already know everything that was being tested.
If you got 50% right, you can't tell whether you are just guessing or whether there are features you should be picking up on.
So you would expect that the optimal rate would not be close to either extreme. 50.1% is similar to 50% for most intents and purposes, and likewise 99.99% to 100%.
So you might naively expect the optimal success/failure split to be close to 75%/25% in general. This would apply to humans too, because it is a statement about the information you need to solve the problem, not a statement about the algorithm.
This paper finds it to be 85%/15% for a particular algorithm. Perhaps humans learn similarly, perhaps not. However, you might expect the optimal success rate to be somewhere in the 65-85% range for any particular algorithm.
The proportion 15% seems to crop up suspiciously often in optimization contexts... this was noted by Gell-Mann in The Quark and the Jaguar. It's roughly the proportion of false warnings sounded by certain tropical birds to gain uncontested access to food. Gell-Mann speculates that it is close to 1/(2pi)...
> The proportion 15% seems to crop up suspiciously often in optimization contexts
That's approximately the value of the area under one tail of a normal distribution, from one standard deviation above the mean to infinity.
I'm not statistically mature enough to say whether it's just coincidence. For one thing, oodles of natural phenomena in no way follow the normal distribution.
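The two numbers are close but not identical; a quick arithmetic check, nothing deeper claimed:

    import math

    one_tail = 0.5 * math.erfc(1 / math.sqrt(2))  # P(Z > 1) for a standard normal
    print(one_tail)           # 0.15865...
    print(1 / (2 * math.pi))  # 0.15915...
    # both within half a percentage point of the paper's ~15% optimal error rate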
Not too sure about the case in the article and whether it relates exactly, but it made me think about spaced-repetition systems. I aim for an 80-90% success rate, as I've found that to be the optimal range (arrived at after doing this for 10+ years now with varying settings).
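If anyone wants to tune toward a target like that: assuming an exponential forgetting curve, the rule of thumb I remember from Anki's manual is to scale all intervals by the ratio of log-retentions (treat the exact formula as my recollection, not gospel):

    import math

    def interval_modifier(desired_retention: float, measured_retention: float) -> float:
        """Multiplier to apply to intervals so measured retention moves toward the
        target, assuming exponential forgetting."""
        return math.log(desired_retention) / math.log(measured_retention)

    print(interval_modifier(0.85, 0.95))  # ~3.17: intervals were too short, stretch them
    print(interval_modifier(0.85, 0.80))  # ~0.73: failing too much, shrink them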
Interestingly, this correlates well with what is happening in the ad-tech world.
Specifically in performance marketing spend, 15% of the budget is very often allocated to "new initiatives & new partners", with the thought process that it will either surface a previously unidentified improvement, or teach you what to avoid in the future on the other 85% of spend.
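That 85/15 split is essentially a fixed explore/exploit policy. A toy sketch of the allocation (the channel names and numbers are made up):

    EXPLORE_SHARE = 0.15   # budget reserved for new initiatives & new partners

    def allocate(budget: float, proven: list[str], experimental: list[str]) -> dict[str, float]:
        """Split spend: 85% across proven channels, 15% across untested ones."""
        plan = {c: budget * (1 - EXPLORE_SHARE) / len(proven) for c in proven}
        plan.update({c: budget * EXPLORE_SHARE / len(experimental) for c in experimental})
        return plan

    print(allocate(100_000, ["search", "social"], ["podcast", "ctv"]))
    # {'search': 42500.0, 'social': 42500.0, 'podcast': 7500.0, 'ctv': 7500.0}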
Learning in this case is really about recall -- ensuring that information already captured is successfully retrieved.
That's a different sense from learning as discovery, or at least, learning as search. In searching a graph of possible hypotheses, yes, it is a better rule to look for opportunities to halve the search space.
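To make the contrast concrete: in the search setting the most informative test is the one you're only ~50% sure about, because each answer then discards half of what's left. A toy bisection sketch (the bug-hunting framing is just my example):

    def first_bad(candidates: list, is_broken) -> object:
        """Find the first candidate where is_broken flips to True by always testing
        the midpoint, so each test discards half of what's left. Assumes the
        predicate is monotone over the ordering (False...False True...True)."""
        lo, hi = 0, len(candidates) - 1
        while lo < hi:
            mid = (lo + hi) // 2
            if is_broken(candidates[mid]):
                hi = mid
            else:
                lo = mid + 1
        return candidates[lo]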
Wasn't that "10,000 hrs of practice to master anything" paper discredited? This one sounds very similar in trying to quantify a very chaotic and qualitative process. The usefulness of such a stat on any one person is probably nil.
> Wasn't that "10,000 hrs of practice to master anything" paper discredited?
No, AFAIK it literally never existed. That was, as I recall, a popular misinterpretation of what was itself an unwarranted generalization made by Malcolm Gladwell based on a paper with much more limited scope and conclusions.
> This one sounds very similar in trying to quantify a very chaotic and qualitative process.
The actual direct conclusion—that this error rate is optimal for a variety of machine-learning processes—does not seem to have the problem you describe. The suggestion in the paper that this extends to “biologically plausible” neural networks that may model animal learning also does not seem problematic in the way you describe. The news article’s claim that this is a finding of a sweet spot for human learning is, while it is a possibility suggested by the paper, simply unwarranted as a conclusion.
It's certainly plausible that a quantifiable sweet spot of this type exists for some kinds of human learning, and that the optimization of effectiveness in a curriculum that can be dynamically scaled to individual learners could effectively be guided by it, but there is not a strong reason, without actually testing in concrete human learning scenarios, to believe that the particular number here is a guide to that.
I have a data point/anecdote! I like to play certain sports. After about a year of intense focus and determination, you can get good at pretty much any sport. What I've noticed though is that on a good day, I'm messing up about 15% of the time. If I mess up significantly more than that, I get discouraged and want to go home and try again the next day. If I'm not messing up enough, I feel like I'm overfitting a particular technique and should probably be messing up more to become more well-rounded.
BERT has a 15% masking rate, which seems correlated; also, ~90% is what works well when you are trying to do label smoothing using entropy minimisation. What's going on!
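For context, the BERT recipe is: select ~15% of token positions as prediction targets; of those, 80% become [MASK], 10% a random token, and 10% stay unchanged. A rough sketch of that masking step (not the actual implementation):

    import random

    MASK, VOCAB = "[MASK]", ["the", "cat", "sat", "on", "mat"]   # toy vocabulary

    def mask_tokens(tokens: list[str], mask_rate: float = 0.15):
        """BERT-style corruption: ~15% of positions become prediction targets;
        of those, 80% -> [MASK], 10% -> random token, 10% left unchanged."""
        out, targets = list(tokens), []
        for i in range(len(tokens)):
            if random.random() < mask_rate:
                targets.append(i)
                r = random.random()
                if r < 0.8:
                    out[i] = MASK
                elif r < 0.9:
                    out[i] = random.choice(VOCAB)
                # else: keep the original token
        return out, targets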
As though getting 100% on my calculus quizzes indicated that I wasn't learning. Or, that I would learn more if I didn't study as hard and got 85% correct.
In that case, I'd be surprised that 85% is optimal if it is testing me on stuff I hadn't learned yet. 85% seems still too easy if it is testing material I haven't had a lesson on yet.
No, 1/7 < 0.15 < 1/6. Whether that's an optimal failure level for human learning is a different question that this research doesn't seem to really answer.
Just because you learn more from failures doesn't mean that you will learn more in aggregate over time with more failures. If you fail too much, your brain will tell you that it's better to quit and spend your attention/time in another way.
Elon Musk recently said it well. He said something along the lines of always assuming you are wrong; the goal is to try to be less wrong all the time. Which is basically what good science is. And that guarantees long-term improvement and success. So you need to fail a little to learn and get better. If you always succeeded, I think you'd never really know why you succeeded, which is important knowledge and, to a certain degree, guarantees not making those mistakes again in the future.
Grades are a bit backwards, because learning actually happens when we 1) make an attempt, 2) see the result, and 3) learn from/reflect on the result. Typically school teaches us to do numbers 1 and 2, with very little emphasis on 3.
This rings so true to me! At school (high school, for you Americans -- I don't mean university), I was an "A" student without really trying. Unfortunately, this meant that I didn't learn to try.
Eventually (at a prestigious university) I found I could no longer "coast", and studying required real work. It wasn't easy to come to grips with that reality, and I wish I'd learned earlier.
A's are in part due to attention to detail - reading the instructions very carefully, interpreting the questions just right, setting up problems as intended by the test writer, and so on.