>> I see this stuff everywhere online and it's often taught this way so I don't blame folks for repeating it, but I think it's likely promulgated by folks who don't train LSTMs with long contexts.
To clarify, this wasn't taught to me. I studied LSTMs during my MSc in 2014, on my own initiative, because they were popular at the time [1]. I remember there being a hefty amount of literature on LSTMs, and I mean scholarly articles, not just blog posts. In fact, at the time I think there were only two blog posts, the ones by Andrej Karpathy and Chris Olah that I link above. The motivation with respect to vanishing gradients is well documented in previous work by Hochreiter (I think it's his thesis), and maybe a little less so in the 1997 paper that introduces the "constant error carousel".
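For context, the "constant error carousel" is basically the additive cell update in the original, no-forget-gate formulation (this is from memory, so take the exact notation with a grain of salt):

    c_t = c_{t-1} + i_t * g_t        (cell state; self-recurrent weight fixed at 1)
    d c_t / d c_{t-1} = 1            (error re-entering the cell neither vanishes nor explodes)

The forget gate added later multiplies c_{t-1} by f_t, which is where decay can creep back in when f_t < 1.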
What kind of "instability" did you see? Vanishing gradients weren't something I noticed in my experiments. If that was because I didn't use a long enough context, as you say, I wouldn't be able to tell but there was a different kind of instability: loss would enter an oscillatory pattern which I put down to the usual behaviour of gradient descent (either it gets stuck on local minima, or in saddle points). Is that what you mean?
_______________
[1] More precisely, our tutor asked us to study an RNN architecture, expecting we'd look at something relatively simple like an Elman network, but I wanted to try out the hot new stuff. The code and report are here:
https://github.com/stassa/lstm_rnn
In case you get really curious: there may be errors in the code and I don't know if you'll be able to run it. I don't think I really grokked automatic differentiation at the time.