> Transformers required ~2.5x more training steps to achieve comparable performance, overfitting eventually.
> RNNs are particularly suitable for sequence modelling settings such as those involving time series, natural language processing, and other sequential tasks where context from previous steps informs the current prediction.
I would like to draw an analogy to digital signal processing. If you think of the recurrent-style architectures as IIR filters and feedforward-only architectures as FIR filters, you will likely find many parallels.
The most obvious to me being that IIR filters typically require far fewer elements to produce the same response as an equivalent FIR filter. Granted, the FIR filter is often easier to implement/control/measure in practical terms (fixed-point arithmetic hardware == ML architectures that can run on GPUs).
I don't think we get to the exponential scary part of AI without some fundamentally recurrent architecture. I think things like LSTM are kind of an in-between hack in this DSP analogy - You could look at it as FIR with dynamic coefficients. Neuromorphic approaches seem like the best long term bet to me in terms of efficiency.
Again from signal processing: depending on position of the poles in z-transformed filter transfer function the output of IIR has a narrow stability region that is typically carefully designed for. Otherwise IIR filters either exponentially decay to zero to exponentially grow to infinity. RNN cells like LSTM are "decaying filters" with non-linear gates introduced to stop decay and to "remember" things.
FIR filters are way simpler to design and can capture memory without hacks.
ELI5: Could you explain what neuromorphic approaches mean, and how they contribute to AI/AGI? My first impression as a layperson (probably wrong) is that this approach resembles ideas from the book "The Society of the Mind", where the system isn't just simulating neurons but involves a variety of methods and interactions across "agents" or sub-systems.
Neuromorphic mostly just means "like how the brain works". It encompasses a variety of software & hardware approaches.
The most compelling and obvious one to me is hardware purpose-built to simulate spiking neural networks. In the happy case, SNNs are extremely efficient. Basically consuming no energy. You could fool yourself into thinking we can just do this on the CPU due to the sparsity of activations. I think there is even a set of problems this works well for. But, in the unhappy cases SNNs are impossible to simulate on existing hardware. Neuronal avalanches follow power law distribution and meaningfully-large ones would require very clever techniques to simulate with any reasonable fidelity.
> the system isn't just simulating neurons but involves a variety of methods and interactions across "agents" or sub-systems.
I think the line between "neuron" and "agent" starts to get blurry in this arena.
We somehow want a network that is neuromorphic in structure but we don't want it to be like the brain and take 20 years or more to train?
Secondly how do we get to claim that a particular thing is neuromorphic when we have such a rudimentary understanding of how a biological brain works or how it generates things like a model of the world, understanding of self etc etc.
Something to consider is that it really could take 20+ years to train like a brain.
But once you’ve trained it, you can replicate at ~0 cost, unlike a brain.
Yes, but the underlying point is that in this case you can train the AI in parallel, and there's a decent chance this or something like it will be true for future AI architectures too. What does it matter that the AI needs to be trained on 20 years of experiences if all of those 20 years can be experienced in 6 months given the right hardware?
I think we're talking at cross-purposes here. I understand you, but what if the type of learning that leads to intelligence is inherently serial in some important way and can't just be parallelized? What if the fact that it takes a certain amount of chronological time is important? etc
What I'm trying to express is we seem to want to cherry pick certain features from nature but ignore others that are inconvenient and that is understandable, but currently because our knowledge of the biological systems is so incomplete we really don't know which (if any) of these features gives rise to intelligence. For all we know we could be doing training in a seemingly-efficient way that completely precludes intelligence actually emerging.
My take, for pragmatic reasons rather than how the brain actually works, is that an agent-based architecture is great because some tasks can be solved more effectively by specific algorithms or workflows rather than operating at the low level of neural networks (NN).
Neuromorphic has been an ongoing failure (for general purpose processors or even AI accelerators),
ever since Carver Mead introduced (and quickly abandoned them) them nearly half a century ago.
Bill Dally (NVidia CTO) concurs:
"I keep getting those calls from those people who claim they are
doing neuromorphic computing and they claim there is something
magical about it because it's the way that the brain works ...
but it's truly more like building an airplane by putting
feathers on it and flapping with the wings!"
From: Hardware for Deep Learning, HotChips 2023 keynote.
We have NO idea how the brain produces intelligence, and as long
as that doesn't change, "neuromorphic" is merely
a marketing term, like
Neurotypical,
Neurodivergent,
Neurodiverse,
Neuroethics,
Neuroeconomics,
Neuromarketing,
Neurolaw,
Neurosecurity,
Neurotheology,
Neuro-Linguistic Programming:
the "neuro-" prefix is suggesting a deep scientific insight to
fool the audience.
There is no hope of us cracking the question of how the human brain produces high-level intelligence in the next decade or so.
Neuromorphic does work for some special purpose applications.
I like the feather analogy. Early on all humans knew about flight was from biology (watching birds fly) but trying to make a flying machine modeled after a bird would never work. We can fly today but plane designs are nothing like biological flying machines. In the same way, all we know about intelligence comes from biology and trying to invent an AGI modeled on biological intelligence may be just as impossible as a plane designed around how birds fly.
And it's only now, having built our own different kind of flying machine, that we understand the principles of avian flight well enough to build our own ornithopters. (We don't use ornithopters because they're not practical, but we've known how to build them since the 1960s.) We would have never gotten here had we just continued to try to blindly copy birds.
I love this book and have it sitting on my shelf right now! Read it when I was a kid and was amazed at the ideas in it, nowadays it's clearer to me that the author only had a grasp of how things like that would be built but still cool nonetheless.
I would highly recommend it to people who love a good "near future" scifi book.
I'm sure you know this, but I think "the author" Marvin Minsky should be mentioned by name since he was one of the foundational theorists in the field of AI in general, but particularly in NNs.
> I don’t think we get to the exponential scary part of AI without some fundamentally recurrent architecture
I’ve been thinking the same for a while, but I’m starting to wonder if giant context windows are good enough to get us there. I think recurrency is more neuromorphic, and possibly important in the longer run, but maybe not required for SI.
I’m also just a layman with just a surface level understanding of these things, so I may be completely ignorant and wrong.
I don't think so. FIR filters can be unrolled and parallelized over the data. These are definitely possible to do on GPU to great effect. But, IIR filters constantly depend on the output of the prior time step, so you can't unroll anything. These would probably be faster to simulate on the CPU.
I find the entire field lacking when it comes to long-horizon problems. Our current, widely used solution is to scale, but we're nowhere near achieving the horizon scales even small mammal brains can handle. Our models can have trillions of parameters, yet a mouse brain would still outperform them on long-horizon tasks and efficiency. It's something small, simple, and elegant—an incredible search algorithm that not only finds near-optimal routes but also continuously learns on a fixed computational budget.
I'm honestly a bit envious of future engineers who will be tackling these kinds of problems with a 100-line Jupyter notebook on a laptop years from now. If we discovered the right method or algorithm for these long-horizon problems, a 2B-parameter model might even outperform current models on everything except short, extreme reasoning problems.
The only solution I've ever considered for this is expanding a model's dimensionality over time, rather than focusing on perfect weights. The higher dimensionality you can provide to a model, the greater its theoretical storage capacity. This could resemble a two-layer model—one layer acting as a superposition of multiple ideal points, and the other layer knowing how to use them.
When you think about the loss landscape, imagine it with many minima for a given task. If we could create a method that navigates these minima by reconfiguring the model when needed, we could theoretically develop a single model with near-infinite local minima—and therefore, higher-dimensional memory. This may sound wild, but consider the fact that the human brain potentially creates and disconnects thousands of new connections in a single day. Could it be that these connections steer our internal loss landscape between different minima we need throughout the day?
Yes... The field lacks the HOLY GRAIL (long-horizon problems). But we don't need a mouse-brain to sort spam emails. The Hail Mary 2B+ parameter models and above are still niche uses of these algorithms (too heavy to run practically). There is plenty of room for clever and small models running on limited hardware and datasets to solve useful problems and nothing more.
Models that change size as needed have been experimented with, but they are either too inefficient or difficult to optimize at a limited power budget. However, I agree that they are likely what is needed if we want to continue to scale upward in size.
I suspect the real bottleneck is a breakthrough in training itself. Backpropagation loss is too simplistic to optimize our current models perfectly, let alone future larger ones. But there is no guarantee a better alternative exists which may create a fixed limit to current ML approaches.
I'm not proposing a change in model size; rather, I'm suggesting a higher dimensionality within the current structure. There’s an interesting paper on LLM explainability, which found that individual neurons often represent a superposition of various data elements.
What I’m advocating is a substantial increase in this aspect—keeping model size the same while expanding dimensionality. The "curse of dimensionality" illustrates how a modest increase in dimensions leads to a significantly larger volume.
While I agree that backpropagation isn’t a complete solution, it’s ultimately just a stochastic search method. The key point here is that expanding the dimensionality of a model’s space is likely the only viable long-term direction. To achieve this, backpropagation needs to work within an increasingly multidimensional space.
A useful analogy is training a small model on random versus structured data. With structured data, we can learn an extensive amount, but with random data, we hit a hard limit imposed by the network. Why is that?
"Interesting work on reviving RNNs. https://arxiv.org/abs/2410.01201 -- in general the fact that there are many recent architectures coming from different directions that roughly match Transformers is proof that architectures aren't fundamentally important in the curve-fitting paradigm (aka deep learning)
Curve-fitting is about embedding a dataset on a curve. The critical factor is the dataset, not the specific hard-coded bells and whistles that constrain the curve's shape. As long as your curve is sufficiently expressive all architectures will converge to the same performance in the large-data regime."
> The critical factor is the dataset, not the specific hard-coded bells and whistles that constrain the curve's shape
I have almost the opposite take. We've had a lot of datasets for ages, but all the progress in the last decade has come from advances how curves are architected and fit to the dataset (including applying more computing power).
Maybe there's some theoretical sense in which older models could have solved newer problems just as well if only we applied 1000000x the computing power, so the new models are 'just' an optimisation, but that's like dismissing the importance of complexity analysis in algorithm design, and thus insisting that bogosort and quicksort are equivalent.
When you start layering in normalisation techniques to minimise overfitting, and especially once you start thinking about more agentic architectures (eg. Deep Q Learning, some of the search space design going into OpenAI's o1), then I don't think the just-an-optimisation perspective can hold much water at all - more computing power simply couldn't solve those problems with older architectures.
I see what you are saying, and I made a similar comment.
However it's still an interesting observation that many architectures can arrive at the same performance (even though the training requirements are different).
Naively, you wouldn't expect eg 'x -> a * x + b' to fit the same data as 'x -> a * sin x + b' about equally well. But that's an observation from low dimensions. It seems once you add enough parameters, the exact model doesn't matter too much for practical expressiveness.
I'm faintly reminded of the Church-Turing Thesis; the differences between different computing architectures are both 'real' but also 'just an optimisation'.
> When you start layering in normalisation techniques to minimise overfitting, and especially once you start thinking about more agentic architectures (eg. Deep Q Learning, some of the search space design going into OpenAI's o1), then I don't think the just-an-optimisation perspective can hold much water at all - more computing power simply couldn't solve those problems with older architectures.
You are right, these normalisation techniques help you economise on training data, not just on compute. Some of these techniques can be done independent of the model, eg augmenting your training data with noise. But some others are very model dependent.
I'm not sure how the 'agentic' approaches fit here.
Is multiplication versus sine in the analogy hiding it, perhaps?
I've always pictured it as just "needing to learn" the function terms and the function guts are an abstraction that is learned.
Might just be because I'm a physics dropout with a bunch of whacky half-remembered probably-wrong stuff about how any function can be approximated by ex. fourier series.
So (most) neural nets can be seen as a function of a _fixed_ form with some inputs and lots and lots of parameters.
In my example, a and b were the parameters. The kinds of data you can approximate well with a simple sine wave and the kinds of data you can approximate with a straight line are rather different.
Training your neural net only fiddles with the parameters like a and b. It doesn't do anything about the shape of the function. It doesn't change sine into multiplication etc.
> [...] about how any function can be approximated by ex. fourier series.
Fourier series are an interesting example to bring up! I think I see what you mean.
In theory they work well to approximate any function over either a periodic domain or some finite interval. But unless you take special care, when you apply Fourier analysis naively it becomes extremely sensitive to errors in the phase parameters.
(Special care could eg be done by hacking up your input domain into 'boxes'. That works well for eg audio or video compression, but gives up on any model generalisation between 'boxes', especially for what would happen in a later box.)
Another interesting example is Taylor series. For many simple functions Taylor series are great, but for even moderately complicated ones you need to be careful. See eg how the Taylor serious for the logarithm around x=1 works well, but if you tried it around x=0, you are in for a bad time.
The interesting observation isn't just that there are multiple universal approximators, but that at high enough parameter count, they seem to perform about equally well in how good they are at approximating in practice (but differ in how well they can be trained).
> Training your neural net only fiddles with the parameters like a and b. It doesn't do anything about the shape of the function. It doesn't change sine into multiplication etc.
It definitely can. The output will always be piecewise linear (with ReLU), but the overall shape can change completely.
You can fit any data with enough parameters. What’s tricky is to constrain a model so that it approximates the ground truth well where there are no data points. If a family of functions is extremely flexible and can fit all kinds of data very efficiently I would argue it makes it harder for those functions to have correct values out of distribution.
Definitely. That's a fundamental observation called the bias-variance tradeoff. More flexible models are prone to overfitting, hitting each training point exactly with wild gyrations in between.
Big AI minimizes that problem by using more data. So much data that the model often only sees each data point once and overfitting is unlikely.
But while keeping the data constant, adding more and more parameters is a strategy that works, so what gives? Are the functions getting somehow regularized during training so effectively you could get away with fewer parameters, it's just that we don't have the right model just yet?
More directly than my first attempt: you're continuing the error here. The nave's approach of "it's approximating some function" both maps to reality and makes accurate predictions. The more we couple ourselves to "no no no, it's modeling a precise function", the more we end up wrong, both on how it works in theory and in practice.
Huh? Who says anything about 'precise functions'? And what's a precise function in the first place?
I am saying that training (at least for conventional neural nets) only fiddles with some parameters. But it does not change the shape of the network, no new nodes nor different connections. (Which is almost equivalent to saying training doesn't change the abstract syntax tree, if you were to write the network out as a procedure in, say, Python.)
The geometric shape you get when you print out the function changes, yes.
This reminds me of control systems theory where provided there's feedback, the forward transfer function doesn't matter beyond very basic properties around the origin.
Wait! We certainly did NOT have huge datasets (like current internet) for ages. Not even decades. I’ve seen a lecture by a MIT professor (which I cannot find now) where he asserted categorically, that the advances in AI are mostly because of the huge data that we now have and we didn’t before. And that was an old video.
Whichever way it's true in, it's not true in the sense that eg you can approximate any curve with a single layer neural net, and you're not actually going to be able to do it for problems CNNs or transformers work decently on. And Google indexed all of the public Internet way before its researchers came up with transformers.
Another way to look at it is that like you say, it was an old video but there has been progress since though we had large datasets when it came out by its own definition
I think by far the biggest advances are related to compute power. The amount of processing needed to run training algorithms on the amounts of data needed for the latest models was just not possible even five years ago, and definitely not ten years ago.
I'm sure there are optimizations from the model shape as well, but I don't think that running the best algorithms we have today with hardware from five-ten years ago would have worked in any reasonable amount of time/money.
I think the size of the model is only one part of it. They're still training these 7bn parameter models on the whole data set, and just crunching through that takes enormous compute, that people just didn't have at the current price points until now.
I should also mention that the idea itself of using GPUs for compute and then specifically for AI training was an innovation. And the idea that simply scaling up was going to be worth the investment is another major innovation. It's not just the existence of the compute power, it's the application to NN training tasks that got us here.
Here[0] is an older OpenAI post about this very topic. They estimate that between 2012 and 2018, the compute power used for training the SotA models at those times increased roughly 300,000 times, doubling every ~3.5 months.
> "As long as your curve is sufficiently expressive all architectures will converge to the same performance in the large-data regime."
I haven't fully ingested the paper yet, but it looks like it's focused more on compute optimization than the size of the dataset:
> ... and (2) are fully parallelizable during training (175x faster for a sequence of length 512
Even if many types of architectures converge to the same loss over time, finding the one that converges the fastest is quite valuable given the cost of running GPU's at scale.
> Even if many types of architectures converge to the same loss over time, finding the one that converges the fastest is quite valuable given the cost of running GPU's at scale.
This! Not just fastest but with the lowest resources in total.
Fully connected neural networks are universal functions. Technically we don’t need anything but a FNN, but memory requirements and speed would be abysmal far beyond the realm of practicality.
Can't wait to see this defiantly spray painted across a torn up brick wall while computronium brained super intelligences slowly disassemble our planet to make paperclips.
There is no known quantum algorithm that can compute the result of a fully-connected neural network exponentially faster than classical computers can. QCs have a known exponential advantage over classical computers only for a very limited class of problems, mostly related to the Quantum Fourier Transform.
Animal brains have little to nothing in common to artifical neural networks. There is no reason whatsoever to think that there is any relation between the complexity class of brain functions and ANN inference.
And the hypothesized (and still wildly speculative) quantum behaviors happening in the animal brain are at the level of the behavior of individual neurons, not of the network connections between neurons. So even if there is some kind of quantum computation happening, it's happening in individual neurons, not at the network level, and that would only go to show even more that animal brains are profoundly different from ANNs.
> finding the one that converges the fastest is quite valuable given the cost of running GPU's at scale
Not to him, he runs the ARC challenge. He wants a new approach entirely. Something capable of few-shot learning out of distribution patterns .... somehow
One big thing that bells and whistles do is limit the training space.
For example when CNNs took over computer vision that wasn't because they were doing something that dense networks couldn't do. It was because they removed a lot of edges that didn't really matter, allowing us to spend our training budget on deeper networks. Similarly transformers are great because they allow us to train gigantic networks somewhat efficiently. And this paper finds that if we make RNNs a lot faster to train they are actually pretty good. Training speed and efficiency remains the big bottleneck, not the actual expressiveness of the architecture
This is true. This is the reason, in many of our experiments we find that using a new algorithm, KESieve, we actually find the planes much faster than the traditional deep learning training approaches. The premise is, a neaural network builds planes which separate the data and adjusts these planes through an iterative learning process. What if we can find a non iterative method which can draw these same planes. We have been trying this and so far we have been able to replace most network layers using this approach. haven't tried for transformers though yet.
I figured this was pretty obvious given that MLPs are universal function approximators. A giant MLP could achieve the same results as a transformer. The problem is the scale - we can’t train a big enough MLP. Transformers are a performance optimization, and that’s why they’re useful.
What it will come down to is computational efficiencies. We don’t want to retrain once a month - we want to retrain continuously. We don’t want one agent talking to 5 LLMs. We want thousands of LLMs all working in concert.
I remember one of the initial transformer people saying in an interview that they didn't think this was the "one true architecture" but a lot of the performance came from people rallying around it and pushing in the one direction.
On the other hand, while "As long as your curve is sufficiently expressive all architectures will converge to the same performance in the large-data regime." is true, a sufficiently expressive mechanism may not be computationally or memory efficient. As both are constraints on what you can actually build, it's not whether the architecture can produce the result, but whether a feasible/practical instantiation of that architecture can produce the result.
> I remember one of the initial transformer people saying in an interview that they didn't think this was the "one true architecture" but a lot of the performance came from people rallying around it and pushing in the one direction.
You may be referring to Aidan Gomez (CEO of Cohere and contributor to the transformer architecture) during his Machine Learning Street Talk podcast interview. I agree, if as much attention had been put towards the RNN during the initial transformer hype, we may have very well seen these advancements earlier.
Yes. Do stuff that other people have been successful doing. Monkey see, monkey do - it's not a tech people thing, it's a human thing.
Tech just happens to be most on display at the moment - because tech people are building the tools and the parameters and the infrastructure handling all our interactions.
Architecture matters because while deep learning can conceivably fit a curve with a single, huge layer (in theory... Universal approximation theorem), the amount of compute and data needed to get there is prohibitive. Having a good architecture means the theoretical possibility of deep learning finding the right N dimensional curve becomes a practical reality.
Another thing about the architecture is we inherently bias it with the way we structure the data. For instance, take a dataset of (car) traffic patterns. If you only track the date as a feature, you miss that some events follow not just the day-of-year pattern but also holiday patterns. You could learn this with deep learning with enough data, but if we bake it into the dataset, you can build a model on it _much_ simpler and faster.
So, architecture matters. Data/feature representation matters.
I second that thought. There is a pretty well cited paper from the late eighties called "Multilayer Feedforward Networks are Universal Approximators". It shows that a feedforward network with a single hidden layer containing a finite number of neurons can approximate any continuous function. For non continous function additional layers are needed.
Well, you also need an approach to 'curve fitting' where it's actually computationally feasible to fit the curve. The approach of mixing layers of matrix multiplication with a simple non-linearity like max(0, x) (ReLU) works really well for that. Earlier on they tried more complicated non-linearities, like sigmoids, or you could try an arbitrary curve that's not split into layers at all, you would probably find it harder. (But I'm fairly sure in the end you might end up in the same place, just after lots more computation spent on fitting.)
If you spent some time actually training networks you know that's not true, that's why batch norm, dropout, regularization is so successful. They don't increase the network's capacity (parameter count) but they increase its ability to learn.
well yes but actually no I guess: the transformers benefit at the time was that they were more stable while learning, enabling larger and larger network and dataset to be learnt.
No. An RNN has an arbitrarily-long path from old inputs to new outputs, even if in practice it can't exploit that path. Transformers have fixed-size input windows.
They are different but transformers don't have fixed windows, you can extend the context or make it smaller.
I think you can extend a positional encoding if it's not a learned encoding.
You can't have a fixed state and have arbitrarily-long path from input. Well you can but then it's just meaningless because you fundamentally cannot keep stuffing information of arbitrary length into a fixed state. RNNs effectively have fixed-size input windows.
The path is arbitrarily long, not wide. It is possible for an RNN to be made that remembers the first word of the input, no longer how long the input is. This is not possible with a transformer, so we know they are fundamentally different.
But an RNN isn't going to remember the first token of input. It won't know until it sees the last token whether that first token was relevant after all, so it has to learn token-specific update rules that let it guess how long to hold what kinds of information. (In multi-layer systems, the network uses ineffable abstractions rather than tokens, but the same idea applies.)
What the RNN must be doing reminds me of "sliding window attention" --- the model learns how to partition its state between short- and long-range memories to minimize overall loss. The two approaches seem related, perhaps even equivalent up to implementation details.
The most popular RNNs (the ones that were successful enough for Google translate and the like) actually had this behavior baked in to the architecture, called "LSTMs", "Long-Short Term Memory"
One reason why I'm excited about o1 is that it seems like OpenAI have cracked the nut of effective RL during training time, which takes us out of the domain of just fitting to the curve of "what a human would have said next." I just finished writing a couple blog posts about this; the first [1] covers some problems with that approach and the second [2] talks about what alternatives might look like.
TLDR: “statistically fitting token output is not the same as human intelligence, and human intelligence and AGI are contradictory anyways (because humans make mistakes)”
Saved you the paywall click to the poorly structured medium article :)
Chollet is just a philosopher.
He also thinks that keras and tensorflow are important, when nobody uses those. And he punished false days about their usage.
Most LLMs aren't even using a "curve" yet at all, right? All they're using is a series of linear equations because the model weights are a simple multiply and add (i.e. basic NN Perceptron). Sure there's a squashing function on the output to keep it in a range from 0 to 1 but that's done BECAUSE we're just adding up stuff.
I think probably future NNs will be maybe more adaptive than this perhaps where some Perceptrons use sine wave functions, or other kinds of math functions, beyond just linear "y=mx+b"
It's astounding that we DID get the emergent intelligence from just doing this "curve fitting" onto "lines" rather than actual "curves".
The "squashing function" necessarily is nonlinear in multilayer nueral networks. A single layer of a neural network can be quite simply written a weight matrix, times an input vector, equalling an output vector, like so
Ax = y
Adding another layer is just multiplying a different set of weights times the output of the first, so
B(Ax)= y
If you remember your linear algebra course, you might see the problem: that can be simplified
(BA)x = y
Cx = y
Completely indistinguishable from a single layer, thus only capable of modeling linear relationships.
To prevent this collapse, a non linear function must be introduced between each layer.
Right. All the squashing is doing is keeping the output of any neuron in a range of below 1.
But the entire NN itself (Perceptron ones, which most LLMs are) is still completely using nothing but linearity to store all the knowledge from the training process. All the weights are just an 'm' in the basic line equation 'y=m*x+b'. The entire training process does nothing but adjust a bunch of slopes of a bunch of lines. It's totally linear. No non-linearity at all.
The non linearities are fundamental. Without them, any arbitrarily deep NN is equivalent to a shallow NN (easily computable, as GP was saying), and we know those can't even solve the XOR problem.
> nothing but linearity
No, if you have non linearities, the NN itself is not linear.
The non linearities are not there primarily to keep the outputs in a given range, though that's important, too.
Nonlinearity somewhere is fundamental, but it doesn't need to be between each layer. You can, for instance, project each input to a higher dimensional space with a nonlinearity, and the problem becomes linearly separable with high probability (cf Cover's Theorem).
So, for XOR, (x, y) -> (x, y, xy), and it becomes trivial for a linear NN to solve.
Architectures like Mamba have a linear recurrent state space system as their core, so even though you need a nonlinearity somewhere, it doesn't need to be pervasive. And linear recurrent networks are surprisingly powerful (https://arxiv.org/abs/2303.06349, https://arxiv.org/abs/1802.03308).
> The non linearities are not there primarily to keep the outputs in a given range
Precisely what the `Activation Function` does is to squash an output into a range (normally below one, like tanh). That's the only non-linearity I'm aware of. What other non-linearities are there?
All the training does is adjust linear weights tho, like I said. All the training is doing is adjusting the slopes of lines.
"only" is doing a lot work here because that non-linearity is enough to vastly expand the landscape of functions that an NN can approximate. If the NN was linear, you could greatly simplify the computational needs of the whole thing (as was implied by another commenter above) but you'd also not get a GPT out of it.
All the trainable parameters are just slopes of lines tho. Training NNs doesn't involve adjusting any inputs to non-linear functions. The tanh smashing function just makes sure nothing can blow up into large numbers and all outputs are in a range of less than 1. There's no "magic" or "knowledge" in the tanh smashing. All the magic is 100% in the weights. They're all linear. The amazing thing is that all weights are linear slopes of lines.
Simply squashing the output of a linear signal would be multiplying by a small value. To avoid large y, you add a step y' = y/1000.
That would still be linear. And the result would be that despite squashing, no matter how many layers a model had, it could only fit linear problems. Which can always be fit with a single layer, i.e. single matrix.
So nobody does that.
The nonlinearity doesn't just squash some inputs. But create a new rich feature, decision making. That's because on one side of a threshold y gets converted very differently than another. I.e if y > 0, y' = y, otherwise y = 0.
Now you have a discontinuity in behavior, you have a decision.
Multiple layers making decisions can do far more than a linear layer. They can fit any continuous function (or any function with a finite number of discontinuities) arbitrarily well.
Non-linearities add a fundamental new feature. You can think of that features as being able to make decisions around the non-linear function's decision points.
---
If you need to prove this to yourself with a simple example, try to create an XOR gate with this function:
y = w1 * x1 + w2 * x2 + b.
Where you can pick w1, w2 and b.
You are welcome to linearly squash the output, i.e. y' = y * w3, for whatever small w3 you like. It won't help.
Layers with non-linear transformations are layers of decision makers.
Layers of linear transforms are just unnecessarily long ways of writing a single linear transform. Even with linear "squashing".
Right, it's obvious that the ReLU is just a gating mechanism, and you can think of that as a decision maker. It's like a "pass thru linearly proportionally" or "block" function.
But I still find it counter-intuitive that it's not common practice in standard LLM NNs to have a trainable parameter that in some way directly "tunes" whatever Activation Function is being applied on EACH output.
For example I almost started experimenting with trigonometric activation functions in a custom NN where the phase angle would be adjusted, inspired by Fourier Series. I can envision a type of NN where every model "weight" is actually a frequency component, because Fourier Series can represent any arbitrary function in this way. There has of course already been similar research done by others along these lines.
> The tanh smashing function just makes sure nothing can blow up into large numbers and all outputs are in a range of less than 1.
That's not the main point even though it probably helps. As OkayPhysicist said above, without a nonlinearity, you could collapse all the weight matrices into a single matrix. If you have 2 layers (same size, for simplicity) described by weight matrices A and B, you could multiply them and get C, which you could use for inference.
Now, you can do this same trick not only with 2 layers but 100 million, all collapsing into a single matrix after multiplication. If the nonlinearities weren't there, the effective information content of the whole NN would collapse into that of a single-layer NN.
You can explain the "effect" of tanh at any level of abstraction you like, up to including describing things that happen in Semantic Space itself, but my description of what tanh is doing is 100% accurate in the context I used it. All it's doing is squashing a number down to below one. My understanding of how the Perceptron works is fully correct, and isn't missing any details. I've implemented many of them.
Your description of tanh isn't even correct, it squashes a real number to `(-1, 1)`, not "less than one".
You're curious about whether there is gain in parameterising activation functions and learning them instead, or rather, why it's not used much in practice. That's an interesting and curious academic question, and it seems like you're already experimenting with trying out your own kinds of activation functions. However, people in this thread (including myself) wanted to clarify some perceived misunderstandings you had about nonlinearities and "why" they are used in DNNs. Or how "squashing functions" is a misnomer because `g(x) = x/1000` doesn't introduce any nonlinearities. Yet you continue to fixate and double down on your knowledge of "what" a tanh is, and even that is incorrect.
When discussing `tanh squashing` among other AI experts it's generally assumed that even the most pedantic and uncharitable parsing of words won't be able to misinterpret "smashing to less than one" as an incorrect sentence fragment, because the "one", in that context, obviously refers to distance from zero.
With a ReLU activation function, rather than a simple linear function of the inputs, you get a piecewise linear approximation of a nonlinear function.
ReLU enables this by being nonlinear in a simple way, specifically by outputting zero for negative inputs, so each linear unit can then limit its contribution to a portion of the output curve.
ReLU technically has a non-linearity at zero, but in some sense it's still even MORE linear than tanh or sigmoid, so it just demonstrates even better than tanh-type squashing that all this LLM stuff is being done ultimately with straight line math. All a ReLU function does is choose which line to use, a sloped one or a zero one.
Well. The word “linear” the way you use it doesn’t seem to have any particular meaning, certainly not the standard mathematical meaning, so I’m not sure we can make further progress on this explanation.
I’ll just reiterate that the single “technical” (whatever that means) nonlinearity in ReLU is exactly what lets a layer approximate any continuous[*] function.
[*] May have forgotten some more adjectives here needed for full precision.
If you're confused just show a tanh graph and a ReLU graph to a 7 year old child and ask which one is linear. They'll all get it right. So you're not confused in the slightest bit about anything I've said. There's nothing even slightly confusing about saying a ReLU is made of two lines.
Well, 7-year-olds don’t know a lot of math, typically, so I wouldn’t ask one that question. “Linear” has a very precise mathematical definition, which is not “made of some straight lines”, that when used properly enables entire fields of endeavor.
It would be less confusing if you chose a different word, or at least defined the ones you’re using. In fact, if you tried to precisely express what you mean by saying something is “more linear”, that might be a really interesting exploration.
It's perfectly legitimate to discuss the linear aspects of piecewise linear functions. I've heard Andrej Karpathy do it in precisely same way I did on this thread, talking about ReLU.
We just have a lot of very pedantic types on HN who intentionally misinterpret other people's posts in order to have something to "disprove".
I.e. ReLU is _piecewise_ linear. That discontinuity that separates the 2 pieces is precisely what makes it non linear. Which is what enables the actual universal approximation.
Followed by "in some sense it's [ReLU] still even MORE linear than tanh or sigmoid functions are". There's no way you misunderstood that sentence, or took it as my "definition" of linearity...so I guess you just wanted to reaffirm I was correct, again, so thanks.
This isn't the primary purpose of the activation function, and in fact it's not even necessary. For example see ReLU (probably the most common activation function), leaky ReLU, or for a sillier example: https://youtu.be/Ae9EKCyI1xU?si=KgjhMrOsFEVo2yCe
You can change the subject by bringing up as many different NN architectures, Activation Functions, etc. as you want. I'm telling you the basic NN Perceptron design (what everyone means when they refer to Perceptrons in general), has something like a `tanh` and not only is it's PRIMARY function to squash a number, that's it's ONLY function.
You need a non-linear activation function for the universal approximation theorem to hold. Otherwise, as others have said the model just collapses to a single layer.
Technically the output is still what a statistician would call “linear in the parameters”, but due to the universal approximation theorem it can approximate any non-linear function.
As you can see in what I just posted about an inch below this, my point is that the process of training a NN does not involve adjusting any parameter to any non-linear functions. What goes into an activation function is a pure sum of linear multiplications and an add, but there's no "tunable" parameter (i.e. adjusted during training) that's fed into the activation function.
If course they do exist. A parameterized activation function is the most obvious thing to try in NN design, and has certainly been invented/studied by 1000s of researchers.
How was that person derailing the convo? Nothing says an activation function has to "squash" a number to be in some range. Leaky ReLUs for instance do `f(x) = x if x > 0 else ax` (for some coefficient `a != 0`), that doesn't squash `x` to be in any range (unless you want to be peculiar about your precise definition of what it means to squash a number). The function takes a real in `[-inf, inf]` and produces a number in `[-inf, inf]`.
> Sure there's a squashing function on the output to keep it in a range from 0 to 1 but that's done BECAUSE we're just adding up stuff.
It's not because you're "adding up stuff", there is specific mathematical or statistical reason why it is used. For neural networks it's there to stop your multi layer network collapsing to a single layer one (i.e. a linear algebra reason). You can choose whatever function you want, for hidden layers tanh generally isn't used anymore, it's usually some variant of a ReLU. In fact Leaky ReLUs are very commonly used so OP isn't changing the subject.
If you define a "perceptron" (`g(Wx+b)` and `W` is a `Px1` matrix) and train it as a logistic regression model then you want `g` to be sigmoid. Its purpose is to ensure that the output can be interpreted as a probability (given that use the correct statistical loss), which means squashing the number. The inverse isn't true, if I take random numbers from the internet and squash them to `[0,1]` I don't go call them probabilities.
> and not only is it's PRIMARY function to squash a number, that's it's ONLY function.
Squashing the number isn't the reason, it's the side effect. And even then, I just said that not all activation functions squash numbers.
> All the training does is adjust linear weights tho, like I said.
Not sure what your point is. What is a "linear weight"?
We call layers of the form `g(Wx+b)` "linear" layers but that's an abused term, if g() is non-linear then the output is not linear. Who cares if the inner term `Wx + b` is linear? With enough of these layers you can approximate fairly complicated functions. If you're arguing as to whether there is a better fundamental building block then that is another discussion.
In the context of discussing linearity v.s. non-linearity adding the word "linear" in front of "weight" is more clear, which is what my top level post on this thread was all about too.
It's astounding to me (and everyone else who's being honest) that LLMs can accomplish what they do when it's only linear "factors" (i.e. weights) that are all that's required to be adjusted during training, to achieve genuine reasoning. During training we're not [normally] adjusting any parameters or weights on any non-linear functions. I include the caveat "normally", because I'm speaking of the basic Perceptron NN using a squashing-type activation function.
> It's astounding to me (and everyone else who's being honest) that LLMs can accomplish what they do when it's only linear "factors" (i.e. weights) that are all that's required to be adjusted during training, to achieve genuine reasoning.
When such basic perceptrons are scaled enormously, it becomes less surprising that they can achieve some level of 'genuine reasoning' (e.g., accurate next-word prediction), since the goal with such networks at the end of the day is just function approximation. What is more surprising to me is how we found ways to train such models i.e., advances in hardware accelerators, combined with massive data, which are factors just as significant in my opinion.
Yeah, no one is surprised that LLMs do what they're trained to do: predict tokens. The surprise comes from the fact that merely training to predict tokens ends up with model weights that generate emergent reasoning.
If you want to say reasoning and token prediction are just the same thing at scale you can say that, but I don't fall into that camp. I think there's MUCH more to learn, and indeed a new field of math or even physics that we haven't even discovered yet. Like a step change in mathematical understanding analogous to the invention of Calculus.
> It's astounding that we DID get the emergent intelligence from just doing this "curve fitting" onto "lines" rather than actual "curves".
In Ye Olden days (the 90’s) we used to approximate non-linear models using splines or seperate slopes models - fit by hand. They were still linear, but with the right choice of splines you could approximate a non-linear model to whatever degree of accuracy you wanted.
Neural networks “just” do this automatically, and faster.
In college (BSME) I wrote a computer program to generate cam profiles from Bezier curves. It's just a programming trick to generate curves from straight lines at any level of accuracy you want just by letting the computer take smaller and smaller steps.
It's an interesting concept to think of how NNs might be able to exploit this effect in some way based on straight lines in the weights, because a very small number of points can identify avery precise and smooth curves, where directions on the curve might equate to Semantic Space Vectors.
In fact now that I think about it, for any 3 or more points in Semantic Space, there would necessarily be a "Bezier Path" which would have genuine meaning at every point as a good smooth differentiable path thru higher dimensional space to get from one point to another point while "visiting" all intermediate other points. This has to have a direct use in LLMs in terms of reasoning.
My feeling is that the answer is "no", in the sense that these RNNs wouldn't be able to universally replace Transformers in LLMs, even though they might be good enough in some cases and beat them in others.
Here's why.
A user of an LLM might give the model some long text and then say "Translate this into German please". A Transformer can look back at its whole history. But what is an RNN to do? While the length of its context is unlimited, the amount of information the model retains about it is bounded by whatever is in its hidden state at any given time.
Transformers actually have an quantifiable state size (see https://hazyresearch.stanford.edu/static/posts/2024-06-22-ac...) although it's anywhere between 200k and 2M floats (for 360M and 1.33B respectively iinm). So a sufficiently sized RNN could have the same state capacity as a transformer.
> Transformers actually have an quantifiable state size
Are you griping about my writing O(X^2) above instead of precisely 2X^2, like this paper? The latter implies the former.
> So a sufficiently sized RNN could have the same state capacity as a transformer.
Does this contradict anything I've said? If you increase the size of the RNN, while keeping the Transformer fixed, you can match their recurrent state sizes (if you don't run out of RAM or funding)
> a Transformer layer can look back see O(X^2) numbers, while an RNN can only see O(X) numbers
The thing is RNN can look back infinitely if you don't exceed the state capacity. For transformers the state it is defined semi-implicitly (you can change the hidden dims but you cannot extend the look back; ignoring transformer-xl et al.) defined by the amount of tokens, for an RNN it's defined explicitly by the state size.
The big-O here is irrelevant for the architectures since it's all in the configuration & implementation of the model; i.e. there is no relevant asymptote to compare.
As an aside this was what was shown in the based paper, the fact that you can have a continuity of state (as with RNN) while have the same associative recall capability as a transformer (the main downfall of recurrent methods at that point).
> The big-O here is irrelevant for the architectures since it's all in the configuration & implementation of the model; i.e. there is no relevant asymptote to compare.
?!
NNs are like any other algorithm in this regard. Heck, look at the bottom of page 2 of the Were RNNs All We Needed paper. It has big-O notation there and elsewhere.
> I was responding to
>> a Transformer layer can look back see O(X^2) numbers, while an RNN can only see O(X) numbers
In the BASED paper, in Eq. 10, sizeof(s) = 2dN. But I defined d = N = X above. Ergo, sizeof(s) = 2X^2 = O(X^2).
For minGRU, sizeof(s) = d. Ergo, sizeof(s) = X = O(X).
That's just the state calculation which would be O(N) and O(1) respectively.
The based paper is saying if you made Transformers recurrent you would have a state size of 2Nd -> O(N), while based has a state size of d*d' -> O(1).
Transformers do have O(N^2) time & memory complexity, and Based/RNN/SSM {O(N) time, O(1) mem}, with respect to sequence length if that's what you mean. The point is it doesn't really give an indication of quality.
We can choose our constant arbitrarily so the big-O you've stated only indicates memory/time-complexity not 'look-back' ability relevant to any task. If you input the entire sequence N times into an RNN, you also have perfect recall with O(N^2) but it's not exactly an efficient use of our resources.
Ideally our state memory is maximally utilized, this is the case for RNNs in the limit (although likely oversubscribed) but is not the case for transformers. The holy grail is to have an input-dependent state-size, however that is quite difficult.
Simplistic thinking. An RNN hidden parameter space of high dimension provides plenty of room for linear projections of token histories. I think people just do not realize just how huge R^N can be.
> Simplistic thinking. An RNN hidden parameter space of high dimension provides plenty of room for linear projections of token histories. I think people just do not realize just how huge R^N can be.
16N bits as hard limit, but more realistically, about 2N bits or less of useful information probably.
You'd need to grow the network dimension in proportion to the maximum sequence length just to avoid the information theoretical limit.
That problem has plagued RNNs since the 90s: there's an information precision problem (how many bits do you need older states to carry), a decay problem (the oldest information is the weakest) and a mixing problem (it tends to mix/sum representations).
The counterargument here is that you can just scale the size of the hidden state sufficiently such that it can hold compressed representations of whatever-length sequence you like. Ultimately, what I care about is whether RNNs could compete with transformers if FLOPs are held constant—something TFA doesn't really investigate.
Well, that's what Transformer already does... One problem with the scaling you're describing is that there would be a massive amount of redundant information stored in hidden activations during training the RNN. The hidden state at each time step t in the sequence would need to contain all info that (i) could be useful for predicting the token at time t and (ii) that could be useful for predicting tokens at times >t. (i) is obvious and (ii) is since all information about the past is transferred to future predictions through the current hidden state. In principle, Transformers can avoid storing redundant info in multiple hidden states at the cost of having to maintain and access (via attention) a larger hidden state at test/eval time.
> there would be a massive amount of redundant information stored in hidden activations
Is there a way to prove this? One potential caveat that comes to mind for me is that perhaps the action of lerping between the old state and the new could be used by the model to perform semantically meaningful transformations on the old state. I guess in my mind it just doesn't seem obvious that the hidden state is necessarily a collection of "redundant information" — perhaps the information is culled/distilled the further along in the sequence you go? There will always be some redundancy, sure, but I don't think that such redundancy necessarily means we have to use superlinear methods like attention.
All information about the past which will be available for predicting future tokens must be stored in the present state. So, if some bits of info about some past tokens at times less than t_p will be used for predicting some future token at time t_f, those bits must be passed through all states at times from t_p to t_f. The bits are passed through the recurrence. Once information about past tokens is lost from the hidden state it is gone forever, so it must be stored and carried across many steps up until it finally becomes useful.
The information cost of making the RNN state way bigger is high when done naively, but maybe someone can figure out a clever way to avoid storing full hidden states in memory during training or big improvements in hardware could make memory use less of a bottleneck.
> The information cost of making the RNN state way bigger is high when done naively, but maybe someone can figure out a clever way to avoid storing full hidden states in memory during training or big improvements in hardware could make memory use less of a bottleneck.
Isn't this essentially what Mamba [1] does via its 'Hardware-aware Algorithm'?
>> A user of an LLM might give the model some long text and then say "Translate this into German please". A Transformer can look back at its whole history.
Which isn't necessary. If you say "translate the following to german." Instead, all it needs is to remember the task at hand and a much smaller amount of recent input. Well, and the ability to output in parallel with processing input.
It's necessary for arbitrary information processing if you can forget and have no way to "unforget".
A model can decide to forget something that turns out to be important for some future prediction. A human can go back and re-read/listen etc, A transformer is always re-reading but a RNN can't and is fucked.
If the networks are to ever be a path to a closer to general intelligence, they will anyway need to be able to ask for context to be repeated, or to have separate storage where they can "choose" to replay it themselves. So this problem likely has to be solved another way anyway, both for transformers and for RNNs.
For a transformer, context is already always being repeated every token. They can fetch information that became useful anytime they want. I don't see what problem there is to solve here.
That's just because we twisted it's arm. One could for example feed the reversed input after, ie abc|cba where | is a special token. That would allow it to react to any part of the message.
Having 2 passes can enable so much more than a single pass.
Another alternative could be to have a first pass make notes in a separate buffer while parsing the input. The bandwidth of the note taking and reading can be much much lower than that required for fetching the billions of parameters.
I remember that, the way I understood it, Transformers solved two major "issues" of RNNs that enabled the later boom: Vanishing gradients limiting the context (and model?) size and difficulty in parallelisation limiting the size of the training data.
Transformers can also fetch at any moment any previous information that become useful.
RNN are constantly updating and overwriting their memory. It means they need to be able to predict what is going to be useful in order to store it for later.
This is a massive advantage for Transformers in interactive use cases like in ChatGPT. You give it context and ask questions in multiple turns. Which part of the context was important for a given question only becomes known later in the token sequence.
To be more precise, I should say it's an advantage of Attention-based models, because there are also hybrid models successfully mixing both approaches, like Jamba.
You could theoretically run the input twice, allowing the model to correlate later tokens with previous ones. It would fix the problem with not knowing what information to retain. A more complicated approach would train the RNN to request replaying some earlier data when needed.
A great thing about RNNs is they can easily fork the state and generate trees, it would be possible to backtrack and work on combinatorial search problems.
Also easier to cache demonstrations for free in the initial state, a model that has seen lots of data is not using more memory than a model starting from scratch.
> They were solved by LSTMs first proposed in 1997.
I see this stuff everywhere online and it's often taught this way so I don't blame folks for repeating it, but I think it's likely promulgated by folks who don't train LSTMs with long contexts.
LSTMs do add something like a "skip-connection" (before that term was a thing) which helps deal with the catastrophic vanishing gradients you get from e.g. Jordan RNNs right from the jump.
However (!), while this stops us from seeing vanishing gradients after e.g. 10s or 100s of time-steps, when you start seeing multiple 1000s of tokens, the wheels start falling off. I saw this in my own research, training on amino acid sequences of 3,000 length led to a huge amount of instability. It was only after tokenizing the amino acid sequences (which was uncommon at the time) which got us down to ~1500 timesteps on average, did we start seeing stable losses at training. Check-out the ablation at [0].
You can think of ResNets by analogy. ResNets didn't "solve" vanishing gradients, there's a practical limit of the depth of networks, but it did go a long way towards dealing with it.
EDIT: I wanted to add, while I was trying to troubleshoot this for myself, it was super hard to find evidence online of why I was seeing instability. Everything pertaining to "vanishing gradients" and LSTMs were blog posts and pre-prints which just merrily repeated "LSTMs solve the problem of vanishing gradients". That made it hard for me, a junior PhD at the time, to suss out the fact that LSTMs do demonstrably and reliably suffer from vanishing gradients at longer contexts.
>> I see this stuff everywhere online and it's often taught this way so I don't blame folks for repeating it, but I think it's likely promulgated by folks who don't train LSTMs with long contexts.
To clarify, this wasn't taught to me. I studied LSTMs during my MSc in 2014, by my own initiative, because they were popular at the time [1]. I remember there being a hefty amount of literature on LSTMs, and I mean scholarly articles, not just blog posts. Rather at the time I think there were only two blog posts, the ones by Andrey Karpathy and Chris Olah that I link above. The motivation with respect to vanishing gradients is well documented in previous wok by Hochreiter (I think it's his thesis), and maybe a little less so in the 1997 paper that introduces the "constant error carousel".
What kind of "instability" did you see? Vanishing gradients weren't something I noticed in my experiments. If that was because I didn't use a long enough context, as you say, I wouldn't be able to tell but there was a different kind of instability: loss would enter an oscillatory pattern which I put down to the usual behaviour of gradient descent (either it gets stuck on local minima, or in saddle points). Is that what you mean?
_______________
[1] More precisely, our tutor asked us to study an RNN architecture expecting we'd look at something relatively simple like an Elman network but I wanted to try out the hot new stuff. The code and report is here:
There may be errors in the code and I don't know if you'll be able to run it, in case you got really curious. I don't think I really grokked automatic differentiation at the time.
Highway networks add a skip connection, but LSTMs don't. Btw you might be interested in truncated backprop thru time, which we introduced in our ULMFiT paper.
I was referring to how the context vectors help avoid vanishing gradients by behaving very similarly to skip-connections, but yes, they aren't skip-connections as-such. That's been my understanding, at least.
We haven't tried truncated BPTT, but we certainly should.
Funnily enough, we adopted AWD-LSTMs, Ranger21, and Mish in the paper I linked after I heard about them through the fast.ai community (we also trialled QRNNs for a bit too). fast.ai has been hugely influential in my work.
Recent comments from him have said that any architecture can achieve transformer accuracy and recall, but we have devoted energy to refining transformers, due to the early successes.
LSTM and GRU did not quite solve the issue, but they made it less bad. Overall, recurrent units are nutritiously prone to vanishing and exploding gradients.
I don't want to downplay the value of these models. Some people seem to be under the perception that transformers replaced or made them obsolete, which is faar from the truth.
From my (admittedly loose) reading of the paper, this paper particularly targets parallelization and fast training, not "vanishing gradients." However, by simplifying the recurrent units, they managed to improve both!
This is very clever and very interesting. The paper continuously calls it a "decade-old architecture," but in practice, it's still used massively, thanks to its simplicity in adapting to different domains. Placing it as a "competitor" to transformers is also not quite fully fair, as transformers and RNNs are not mutually exclusive, and there are many methods that merge them.
Improvement in RNNs is an improvement in a lot of other surprising places. A very interesting read.
And since the proposed hidden states and mix factors for each layer are both only dependent on the current token, you can compute all of them in parallel if you know the whole sequence ahead of time (like during training), and then combine them in linear time using parallel scan.
The fact that this is competitive with transformers and state-space models in their small-scale experiments is gratifying to the "best PRs are the ones that delete code" side of me. That said, we won't know for sure if this is a capital-B Breakthrough until someone tries scaling it up to parameter and data counts comparable to SOTA models.
One detail I found really interesting is that they seem to do all their calculations in log-space, according to the Appendix. They say it's for numerical stability, which is curious to me—I'm not sure I have a good intuition for why running everything in log-space makes the model more stable. Is it because they removed the tanh from the output, making it possible for values to explode if calculations are done in linear space?
EDIT: Another thought—it's kind of fascinating that this sort of sequence modeling works at all. It's like if I gave you all the pages of a book individually torn out and in a random order, and asked you to try to make a vector representation for each page as well as instructions for how to mix that vector with the vector representing all previous pages — except you have zero knowledge of those previous pages. Then, I take all your page vectors, sequentially mix them together in-order, and grade you based on how good of a whole-book summary the final vector represents. Wild stuff.
FURTHER EDIT: Yet another thought—right now, they're just using two dense linear layers to transform the token into the proposed hidden state and the lerp mix factors. I'm curious what would happen if you made those transforms MLPs instead of singular linear layers.
This architecture, on the surface, seems to preclude the basic function of recognizing sequences of tokens. At the very least, it seems like it should suffer from something like the pumping lemma: if [the ][cat ][is ][black ] results in the output getting close to a certain vector, [the ][cat ][is ][black ][the ][cat ][is ][black ][the ][cat ][is ][black ] should get even closer to that vector and nowhere close to a "why did you just repeat the same sentence three times" vector? Without non-linear mixing between input token and hidden state, there will be a lot of linear similarities between similar token sequences...
Counterpoint: the hidden state at the beginning of ([the][cat][is][black]) x 3 is (probably) initialized to all zeros, but after seeing those first 4 tokens, it will not be all zeros. Thus, going into the second repetition of the sentence, the model has a different initial hidden state, and should exhibit different behavior. I think this makes it possible for the model to learn to recognize repeated sequences and avoid your proposed pitfall.
The new hidden state after the first repetition will just be a linear combination between zero and what the non-recurring network outputs. After more repetitions, it will be closer to what the network outputs.
I don't think it's a capital-B Breakthrough, but recurrent networks are everywhere, and a simplification that improved training and performance clears the stage to build back complexity up again to even higher hights.
Log space is important if the token probabilities span a large range of values (powers). There is a reason that maximum likelihood fitting is always performed with log likelihoods.
I made a RNN for a college project because I was interested in obsolete historical technology and I thought I needed to seize the opportunity while it lasted, because once I was out of school, I'd never hear about neural networks ever again.
Mine worked, but it was very simple and dog slow, running on my old laptop. Nothing was ever going to run fast on that thing, but I remember my RNN being substantially slower than a feed-forward network would have been.
I was so confident that this was dead technology -- an academic curiosity from the 1980s and 1990s. It was bizarre to see how quickly that changed.
I feel old. I made my masters thesis on RNN's for learning dynamic systems e.g. for control purposes (quite a novelty at the time, around 2000). We wrote the backprop in C++ and ran it over night. Yes it was slow as hell with the tiny gradients. The network architectures were e.g. 5 or 10 neurons in a single hidden layer. NN's were a tiny subject that you were lucky to find courses in. Then closed my eyes for two seconds and looked at the subject again in 2015. Wow.
To their credit, the authors (Y. Bengio among them) end the paper with the question, not suggesting they know the answer. These models are very small even by academic standards so any finding would not necessarily extend to current LLM scales. The main conclusion is that RNN class networks can be trained as efficiently as modern alternatives but the resulting performance is only competitive at small scale.
>> These models are very small even by academic standards so any finding would not necessarily extend to current LLM scales.
Emphasis on not necessarily.
>> The main conclusion is that RNN class networks can be trained as efficiently as modern alternatives but the resulting performance is only competitive at small scale.
Shouldn't the conclusion be "the resulting competitive performance has only been confirmed at small scale"?
yes, that is clearer indeed. However S4 and Mamba class models have also performed well at small scale and started lagging with larger models and larger context sizes, or at particular tasks.
The model in the paper isn't a "real" RNN due making it parallelizable, for same the reasons described in https://arxiv.org/abs/2404.08819 , and hence is theoretically less powerful than a "real" RNN (struggles at some classes of problems that RNNs traditionally excel at). On the other hand, https://arxiv.org/abs/2405.04517 contains a "real" RNN component, which demonstrates a significant improvement on the kind of state-tracking problems that transformers struggle with.
These are real RNNs, they still depend upon the prior hidden state, it’s just that the gating does not. The basic RNN equation can be parallelized with parallel prefix scan algorithms.
I haven’t gone through the paper in detail yet but maybe someone can answer.
If you remove the hidden state from an rnn as they say they’ve done, what’s left? An mlp predicting from a single token?
They didn't remove the hidden state entirely, they just removed it from the input, forget and update gates. I haven't digested the paper either, but I think that in the case of a GRU this means that the hidden state update masking (z_t and r_t in the paper's formulas) only depends on the new input, not the input plus the prior hidden state.
It doesn't completely remove it, it removes certain dependencies on it so that it can be computed by parallel scan, there is still a hidden state. It bears some similarity to what was done with Mamba.
I only had a quick look, but it looks like they tweaked the state update so the model can be run with parallel scan instead of having to do it sequentially.
Everyone wants to use less compute to fit more in, but (obviously?) the solution will be to use more compute and fit less. Attention isn't (topologically) attentive enough. All these RNN-lite approaches are doomed, beyond saving costs, they're going to get cooked by some other arch—even more expensive than transformers.
Would you mind expanding upon your thesis? If that compute and all those parameters aren't "fitting" the training examples, what is it that the model is learning, and how should that be analyzed?
I think there are two distinct areas. One is the building of the representations, which is achieved by fitting. The other area is loosely defined as "computing" which is some kind of searching for a path through representation space. All of that is wrapped in a translation layer that can turn those representations into stuff we humans can understand and interact with. All of that is achieved to some extent by current transformer architectures, but I guess some believe that they are not quite as effective at the "computation/search" stage.
But how does it get good at "computing"? The way I see it, we either program them to do so manually, or we use ML, at which case the model "fits" the computation based on training examples or environmental feedback, no? What am I missing?
In 2016 & 2017 my team at Capital One built several >1B parameter models combining LSTMs with a few other tricks.
We were able to build generators that could replicate any dataset they were trained on, and would produce unique deviations, but match the statistical underpinnings of the original datasets.
We built several text generators for bots that similarly had very good results. The introduction of the transformer improved the speed and reduced the training / data requirements, but honestly the accuracy changed minimal.
I still find it remarkable how we need such an extreme amount of electrical energy to power large modern AI models.
Compare with one human brain. Far more sophisticated, even beyond our knowledge. What does it take to power it for a day? Some vegetables and rice. Still fine for a while if you supply pure junk food -- it'll still perform.
Clearly we have a long, long way to go in terms of the energy efficiency of AI approaches. Our so-called neural nets clearly don't resemble the energy efficiency of actual biological neurons.
Food is extremely dense in energy. 1 food calorie is about 1.1 Watt-hours. A hamburger is about 490 Wh. An AI model requires 0.047 kWh = 47 Wh to generate 1000 text responses.[1] If an LLM could convert hamburgers to energy, it could generate over 10000 prompt completions on a single hamburger.
Based on my own experience, I would struggle to generate that much text without fries and a drink.
During that time, your brain would do far more than just that text generation though, beyond what we even know scientifically.
But yes, food energy could be useful for AI. A little dystopian potentially too, if you think about it. Like DARPA's EATR robot, able to run on plant biomass (although potentially animal biomass too, including human remains):
This is more likely to be a hardware issue than an algorithms issue. The brain physically is a neural network, as opposed to a software simulation of one.
It’d be nice to see more of how this compares to Mamba. Looks like, in performance, they’re not leagues apart and it’s just a different architecture, not necessarily better or worse?
The only strength of transformers is that they can run once for each token and they can pass to themselves intermediate state as they solve your problems. They have to conceal it in tokens that look to humans like a part of the response.
It's obvious why the newest toy from openai can solve problems better mostly by just being allowed to "talk to itself" for a moment before starting the answer that human sees.
Given that, modern incarnation of RNN can be vastly cheaper than transformers provided that they can be trained.
Convolutional neural networks get more visual understanding by "reusing" their capacity across the area of the image. RNN's and transformers can have better understanding of a given problem by "reusing" their capacity to learn and infer across time (across steps of iterative process really).
When it comes to transformer architecture the attention is a red herring. It's just more or less arbitrary way to partition the network so it can be parallelized. The only bit of potential magic is with "shortcut" links between non adjacent layers that help propagate learning back through many layers.
Basically the optimal network is deep, dense (all neurons connect with all belonging to all preceding layers) that is ran in some form of recurrence.
But we don't have enough compute to train that. So we need to arbitrarily sever some connections so the whole thing is easier to parallelized. It really doesn't matter which unless we do in some obviously stupid way.
Actual inventive magic part of LLMs possibly happens in token and positional encoders.
There has been some work, but the problem is that its such a massive search space. Philosophically speaking, if you look at how humans came into existence, you could make an argument that the process of evolution from basic lifeforms can be represented as one giant compute per minute across of all of earth, where genetic selection happens and computation proceeds to the next minute. Thats a fuckload of compute.
In more practical terms, you would imagine that an advanced model contains some semblance of a CPU to be able to truly reason. Given that CPUs can be all NAND gates (which take 2 neurons to represent), and are structured in a recurrent way, you fundamentally have to rethink how to train such a network, because backprop obviously won't work to capture things like binary decision points.
I thought the whole point of neural networks was that they were good at searching through these spaces. I'm pretty sure OpenAI is pruning their models behind the scenes to reduce their costs because that's the only way they can keep reducing the cost per token. So their secret sauce at this point is whatever pruning AI they're using to whittle the large computation graphs into more cost efficient consumer products.
When you train a neural network, it is not search, it is descending through a curve.
If you were to search for billions of parameters by brute force, you literally could not do it in the lifespan of the universe.
A neural network is differentiable, meaning you can take the derivative of it. You train the parameters by taking finding gradient with respect to each parameter, and going in the opposite direction. Hence the name of the popular algorithm, gradient descent.
A biological neural network is certainly not differentiable. If the thing we want to build is not realizable with this technique, why can't we move on from it?
Gradient descent isn't the only way to do this. Evolutionary techniques can explore impossibly large, non-linear problem spaces.
Being able to define any kind of fitness function you want is sort of like a super power. You don't have to think in such constrained ways down this path.
The issue is that its still a massive search space.
You can do this yourself, go play nandgame, and beat it, at which point you should be able to make a cpu out of nandgates. Then set up a rnn that is the same layers at total layers of the nandgates and as wide as all the inputs, with every output being fed back into the first input. Then do PSO or GA on all the weights and see how long it takes you to make a fully functioning cpu.
>A biological neural network is certainly not differentiab
Biology is biology and has its constraints. Doesn't necessarily mean a biologically plausible optimizer would be the most efficient or correct way in silicon.
>If the thing we want to build is not realizable with this technique, why can't we move on from it?
All the biologically plausible optimizers we've fiddled with (and we've fiddled with quite a lot) just work (results wise) like gradient descent but worse. We've not "moved on" because gradient descent is and continues to be better.
>Evolutionary techniques can explore impossibly large, non-linear problem spaces.
Sure, with billions of years (and millions of concurrent experiments) on the table.
The search space is all off too wide, difficult to parameterize, and there is a wide gap between effective and ineffective architectures - ie: a very small change can make a network effectively DOA.
Notably architecture search was popular for small vision nets where the cost of many training runs was low enough. I suspect some of the train-then-prune approaches will come back, but even there only by the best funded teams.
Neural networks are Turing complete, i.e. there is a universal neural network that can compute any effectively computable function¹. Incidentally, when this is combined with Rice's theorem² it means that safety research is essentially an unsolvable problem because any non-trivial property of a sufficiently complex neural network, e.g. one that can simulate a Turing machine, will have properties which can not be predicted with finite computation.
I finally got around to reading this. Nice paper, but it fails to address a key question about RNNs:
Can RNNs be as good as Transformers at recalling information from previous tokens in a sequence?
Transformers excel at recalling info, likely because they keep all previous context around in an ever-growing KV cache.
Unless proponents of RNNs conclusively demonstrate that RNNs can recall info from previous context at least as well as Transformers, I'll stick with the latter.
Yes, all machine learning can be interpreted in terms of approximating the partition function.
This is obvious when one considers the connections between Transformers, RNNs, Hopfield networks and the Ising model, a model from statistical mechanics which is solved by calculating the partition function.
This interpretation provides us with some very powerful tools that are commonplace in math and physics but which are not talked about in CS & ML.
I'm working on a startup http://traceoid.ai which takes this exact view. Our approach enables faster training and inference, interpretability and also scalable energy-based models, the Holy Grail of machine learning.
The name of the paper contrasts with the paper that spawned Transformer architecture, which itself is a reference to the song "All You Need Is Love" by the Beatles. https://en.wikipedia.org/wiki/Attention_Is_All_You_Need
I eagerly await the backlash to suggesting any one thing is all you need, the first shot of which shall surely be titled: “‘All you need’ Considered Harmful”
This is such a interesting paper, sadly they don't have big models, I'd like to see a model trained on TinyStories or even C4 since it should be faster than the transformer variant and see how it compares.
1) Yoshua's reputation would take a hit if this paper were bullshit, so he has extrinsic motivation to make it good
2) Yoshua has enough experience in the field to know what is going on in the field, you don't have to ask if he forgot about a certain architecture or the work of a certain research group which would contradict his findings-- if such work exists and is credible, it is very likely to be discussed in the paper.
3) This test answers something a leader in the field thinks is important enough for them to work on, else he wouldn't be involved.
Also note, the poster said the paper shouldn't be taken lightly. That doesn't mean we need to take it blindly. It only means we cannot dismiss it out of hand, if we have a different view we would need substantive arguments to defend our view.
I've overturned the field leader several times in science, but that's only because I acknowledged what they got right and that they were indeed the person who got it right.
> It only means we cannot dismiss it out of hand, if we have a different view we would need substantive arguments to defend our view.
You will need to do that anyway, no matter if Yoshua is on the paper, or not. I understand that people have limited bandwidth, and so they need shortcuts, and they need to justify these shortcuts to themselves somehow (of course the justifications are nonsense). Maybe AI will help here.
“ I've overturned the field leader several times in science”
Either that makes you a field leader yourself, or you did it for trivial things, or you’re BSing. Which one is it?
there's a big space between leader and trivial. it's entirely possible to point out the top leader in your field is wrong on ten things over a career, without becoming the top leader yourself.
On speculative things or trivial things, sure! On substantive matters (recall: the choice of words is “overturned”), in empirical realms or theory (physics, CS) or math, it’s rather doubtful. Anonymous, self-declared geniuses aren’t to be taken at face value.
Excited to see more people working on RNNs but wish their citations were better.
In 2016 my team from Salesforce Research published our work on the Quasi-Recurrent Neural Network[1] (QRNN). The QRNN variants we describe are near identical (minGRU) or highly similar (minLSTM) to the work here.
The QRNN was used, many years ago now, in the first version of Baidu's speech recognition system (Deep Voice [6]) and as part of Google's handwriting recognition system in Gboard[5] (2019).
Even if there are expressivity trade-offs when using parallelizable RNNs they've shown historically they can work well and are low resource and incredibly fast. Very few of the possibilities regarding distillation, hardware optimization, etc, have been explored.
Even if you need "exact" recall, various works have shown that even a single layer of attention with a parallelizable RNN can yield strong results. Distillation down to such a model is quite promising.
Other recent fast RNN variants such as the RWKV, S4, Mamba et al. include citations to QRNN (2016) and SRU (2017) for a richer history + better context.
The SRU work has also had additions in recent years (SRU++), doing well in speech recognition and LM tasks where they found similar speed benefits over Transformers.
I note this primarily as the more data points, especially when strongly relevant, the better positioned the research is. A number of the "new" findings from this paper have been previously explored - and do certainly show promise! This makes sure we're asking new questions with new insights (with all the benefit of additional research from ~8 years ago) versus missing the work from those earlier.
Opinions probably differ, for example, on John Backus's paper "Can programming be liberated from the Von Neumann style?" Many fans of functional programming would say the answer is yes, but Backus himself expressed less enthusiasm in interviews later in his life.
I think the important point, though, is that academic papers and newspaper articles are not the same, and titles in the form of questions function differently in the two domains. Journalists tend to use titles like these to dissemble and sensationalize. When academics use these kinds of titles for peer-reviewed articles, it's because they really are asking an honest question. Backus was doing it in his paper. The authors of this paper are doing the same. They end the paper by re-iterating the question before launching into a discussion of the limitations that prevent them from reaching any firm conclusions on the answer to this question.
To me this is further evidence that these LLMs learn only to speak English, but there is no reasoning at all in them. If you simplify a lot and obtain the same results and we know how complex the brain is.
Every LLM expert on the planet agrees LLMs are doing "reasoning". No one says they have feelings or qualia, but we all know there's definitely genuinely artificial reasoning happening.
What LLMs have shown both Neuroscience and Computer Science is that reasoning is a mechanical process (or can be simulated by mechanical processes) and is not purely associated only with consciousness.
Those are all the people that have not yet decoupled "reasoning" from "consciousness" in their own way of thinking. It's admittedly hyperbolic to say "everyone". I love hyperbole on HN. :)
Planning, by definition, takes multiple reasoning steps. A single LLM inference is a fundamental single reasoning step, but it's a reasoning step nonetheless.
It's like I'm saying a house is made of bricks. You can build a house of any shape out of bricks. But once bricks have been invented you can build houses. The LLM "reasoning" that even existed as early as GPT3.5 was the "brick" with which highly intelligent agents can be built out of, with no further "breakthroughs" being required.
The basic Transformer Architecture was enough and already has the magical ingredient of reasoning. The rest is just a matter of prompt engineering.
Yeah, these kinds of discussions always devolve purely into debates about what's the proper definition of words. Especially on HN where everyone has their "Pedantic Knob" dialed up to 11.
You weren't being pedantic yourself. My point is that this discussion is ultimately about the definition of words, and that all by itself, makes the discussion meaningless.
I think a "granule" of "reasoning" happens at each inference, and you think there is no reasoning in a single inference. To discuss it further would be a game of whose definition of any given word is correct.
You're not being pedantic at all. It's a crucial distinction that people try to wave away in favor of hype. Especially since we are so vulnerable to anthropomorphizing.
> RNNs are particularly suitable for sequence modelling settings such as those involving time series, natural language processing, and other sequential tasks where context from previous steps informs the current prediction.
I would like to draw an analogy to digital signal processing. If you think of the recurrent-style architectures as IIR filters and feedforward-only architectures as FIR filters, you will likely find many parallels.
The most obvious to me being that IIR filters typically require far fewer elements to produce the same response as an equivalent FIR filter. Granted, the FIR filter is often easier to implement/control/measure in practical terms (fixed-point arithmetic hardware == ML architectures that can run on GPUs).
I don't think we get to the exponential scary part of AI without some fundamentally recurrent architecture. I think things like LSTM are kind of an in-between hack in this DSP analogy - You could look at it as FIR with dynamic coefficients. Neuromorphic approaches seem like the best long term bet to me in terms of efficiency.