LLMs are all fake AI. As the recently released Apple study demonstrates, LLMs don't reason; they just pattern match. That's not "intelligence" by any definition, because they can only solve things that are already within their training set.
In this case, it would have been better for the AI industry if it had been 700 programmers, because then the rest of the industry could have argued that the utter trash code Builder.ai generated was the result of human coders spending a few minutes haphazardly typing out random code, and not the result of a specialty-trained LLM.
> because they can only solve things that are already within their training set
I just gave up on using SwiftUI for a rewrite of a backend dashboard tool.
The LLM didn't give up. It kept suggesting wilder and less stable ideas, until I realized that this was a rabbit hole full of misery and went back to UIKit.
It wasn't the LLM's fault. SwiftUI just isn't ready for the particular functionality I needed, and I guess a day of watching ChatGPT get more and more desperate saved me a lot of time.
But the LLM didn't give up, which is maybe ot-nay oo-tay ight-bray.
>As the recently released Apple study demonstrates, LLMs don't reason, they just pattern match
Hold on a minute: I was under the impression that "reasoning" was just a marketing buzzword, the same as "hallucinations", because how tf would anyone expect GPUs to "reason" and "hallucinate" when even neurology/psychology don't have strict definitions of those processes?
No, the definitions are very much up for debate, but there is an actual process here. "Reasoning" in this case means having the model not just produce whatever output is requested directly, but also spend some time writing out its thoughts about how to produce that output. Early versions of this were just prompt engineering, where you ask the model to produce a "chain of thought" or to "work step by step" on how to approach the problem. Later this was trained into the model directly, with traces of this intermediate thinking, especially for multistep problems, so no explicit prompting is needed. And then, architecturally, these models now have different ways to determine when to stop "reasoning" and skip to generating the actual output.
I don't have a strict enough definition to debate if this reasoning is "real" - but from personal experience it certainly appears to be performing something that at least "looks" like inductive thought, and leads to better answers than prior model generations without reasoning/thinking enabled.
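For concreteness, here's a minimal sketch of the prompt-level version of this, using the openai Python package. The model name and the bat-and-ball question are just illustrative choices, and later "reasoning" models bake the step-by-step behaviour into training rather than relying on the prompt.

```python
# Minimal sketch: the same question asked directly vs. with an explicit
# chain-of-thought instruction. Assumes OPENAI_API_KEY is set; the model
# name is illustrative, not a recommendation.
from openai import OpenAI

client = OpenAI()

QUESTION = ("A bat and a ball cost $1.10 together. The bat costs $1.00 "
            "more than the ball. How much does the ball cost?")

def ask(prompt: str, model: str = "gpt-4o-mini") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content or ""

direct = ask(QUESTION)  # answer straight away
stepwise = ask(QUESTION + "\nThink step by step, then give the final answer.")

print("Direct:", direct)
print("Step by step:", stepwise)
```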
It's gradient descent. Why are we surprised when the answers get better the more we do it? Sometimes you're stuck in a local maximum/minimum, and you hallucinate.
Am I oversimplifying it? Is everybody else over-mystifying it?
Gradient descent is how the model weights are adjusted during training. No gradient descent, nor anything even remotely similar to it, happens during inference.
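A toy sketch of the distinction, in case it helps: the loop below is training (gradient descent nudging the weights), and the last line is inference (a fixed forward pass with no gradients anywhere). All the numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 0.5 + rng.normal(0, 0.05, size=100)  # noisy line we want to recover

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):                      # training: gradient descent on squared error
    pred = w * x + b
    grad_w = 2 * np.mean((pred - y) * x)  # d(loss)/dw
    grad_b = 2 * np.mean(pred - y)        # d(loss)/db
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # roughly 3.0 and 0.5
print(w * 0.25 + b)              # inference: just evaluate the now-fixed function
```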
If you allow me to view the weights of a model as the axioms in an axiomatic system, my (admittedly limited) understanding of modern "AI" inference is that it adds no net new information/knowledge, just more specific expressions of the underlying structure (as defined by the model weights).
So while that does undercut my original flippancy of it being "nothing but gradient descent" I don't think it runs counter to my original point that nothing particularly "uncanny" is happening here, no?
To some extent, I think the axiom comparison is enlightening. In principle, axioms determine the entire space of all mathematical truths in that system. However, in practice, not all truths in a mathematical system are equally easy to discover or verify. Knowing the axioms of arithmetic doesn't mean there's nothing to be gained from actually computing what 271819 × 637281 is.
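To make that concrete, here's the schoolbook version of that multiplication; knowing the rules doesn't spare you from actually running the procedure to learn the answer:

```python
def schoolbook_multiply(a: int, b: int) -> int:
    """Digit-by-digit multiplication, the way you'd carry it out on paper."""
    total = 0
    for place, digit in enumerate(reversed(str(b))):
        total += a * int(digit) * 10**place
    return total

print(schoolbook_multiply(271819, 637281))   # 173225084139
assert schoolbook_multiply(271819, 637281) == 271819 * 637281
```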
The claims about LLM reasoning are precisely related to this point. Do LLMs follow some internal deductive process when they generate output that resembles such a logical process to humans? Or are they just producing text that looks like reasoning, much like an absurdist play might do, and then simply picking a conclusion that resembles other problems they've seen in the past?
I don't think any arguments about the base nature of the model are particularly helpful here. In principle, deductive reasoning can be expressed as a mathematical function, and for any function, there is a neural net that can approximate it with arbitrary precision. So it's not impossible that the model actually does this, but it's also not a given - this first principles approach is just not helpful. We need more applied study of how the model actually works to probe this deeper.
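A tiny illustration of the "deduction as a function" direction: the basic logical connectives can be computed exactly by hand-weighted threshold units, and XOR needs two layers. This only shows the in-principle possibility; it says nothing about what a trained LLM actually does internally.

```python
def step(x: float) -> float:
    # Heaviside threshold: the simplest "neuron" activation
    return 1.0 if x > 0 else 0.0

# Hand-set weights: single threshold units computing logical connectives.
def AND(a, b): return step(a + b - 1.5)
def OR(a, b):  return step(a + b - 0.5)
def NOT(a):    return step(0.5 - a)

# XOR is not linearly separable, so it needs a second layer.
def XOR(a, b): return AND(OR(a, b), NOT(AND(a, b)))

for a in (0.0, 1.0):
    for b in (0.0, 1.0):
        print(int(a), int(b), "->", int(XOR(a, b)))
```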
Well, if the thing is truly capable of reason, then we have an obligation to put the kibosh on the entire endeavor because we're using a potentially intelligent entity as slave labor. At best, we're re-inventing factory farming and at worst we're re-inventing chattel slavery. Neither of those situations is something I'm personally ok with allowing to continue.
I also find the assumption that tech-savvy individuals would inherently be in favor of what we currently call AI to be weird in itself. Unfortunately, I feel as though being knowledgeable or capable within an area gets conflated with an over-acceptance of that area.
If anything, the more I've learned about technology, and the more experienced I am, the more fearful and cautious I am with it.
The biggest techno-pessimists I know are all in the tech industry.
On the other hand, most of the "blindly accept every new technique or gadget" folks I know are also tech workers, so maybe there's nothing going on there. I wonder if there's a study on this.
Can it? The only example we have of something reasoning on this level is humans, and we are most definitely sentient, despite our best efforts to sometimes appear otherwise.
And even if the two could theoretically be separated, how sure are we that these AI agents are in that category? I'm pretty sure they are neither, but that doesn't mean it isn't a pretty abhorrent possibility that should be addressed.
small-s skepticism, perhaps. 'Skeptics' can fall foul of the same kind of groupthink, magical and motivated reasoning, and fallacies as everyone else, which are often characterized as 'religious'. Not believing in a god doesn't make you immune to being wrong.
(FWIW, I do think there are some very unhealthy attitudes to AI and LLMs going around, like people feel the only two options are 'the singularity is coming' and 'they're useless scams', which tends to result in a large quantity of bullshit on the topic)
>like people feel the only two options are 'the singularity is coming' and 'they're useless scams'
No, you are turning a few loud people into a false dichotomy. The vast majority of people are somewhere between "LLMs are neat" and "I don't think LLMs are AGI"
The vast majority of people do not comment. Using only the comments that people go out of their way to make as your data source is a huge sampling error.
> As the recently released Apple study demonstrates, LLMs don't reason
Where is everyone getting this misconception? I have seen it several times. First off, the study doesn't even try to determine whether or not these models use "actual reasoning" - that's outside its scope. They merely examine how effective thinking/reasoning _is_ at producing better results. They found that - indeed - reasoning improves performance. But the crucial result is that it only improves performance up to a certain difficulty cliff - at which point thinking makes no discernible difference due to a model collapse of sorts.
It's important to read the papers you're using to champion your personal biases.
The LLMs don't "reason" by any definition of the term. If they did, then the Tower of Hanoi and the river problem would have been trivial for them to handle at any level because ultimately the solutions are just highly recursive.
What the LLMs do is attempt to pattern match to existing solved problems in their training set and just copy those solutions. But this results in overthinking for very simple problems (because they're copying too much of the solution from their training set), works well for somewhat complex problems like a basic Tower of Hanoi, and doesn't work at all for problems that would require actual reasoning because... they're just copying solutions.
The point of the paper is that what LLMs do is not reasoning, however much the AI industry may want to redefine the word to suit their commercial interests.
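For reference, the "highly recursive" solution in question really is this short; executing it is purely mechanical, which is exactly why failing to follow it is telling:

```python
def hanoi(n: int, source="A", target="C", spare="B", moves=None):
    """Classic recursive Tower of Hanoi: returns the full move list for n disks."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)  # move the top n-1 disks out of the way
    moves.append((source, target))              # move the largest disk
    hanoi(n - 1, spare, target, source, moves)  # stack the n-1 disks back on top
    return moves

for n in (3, 7, 12):
    print(n, "disks:", len(hanoi(n)), "moves")  # always 2**n - 1
```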
> They found that - indeed - reasoning improves performance.
You're oversimplifying the results a bit here. They show that reasoning decreases performance for simple problems, improves performance for more complex ones, and does nothing for very complex problems.
You're conflating two definitions of reasoning. LLMs have struggled with visual reasoning since their inception because, guess what, they're trained mostly on language, not a 3D environment.
LLMs aren't magic - those who claim they are are hyping for some reason or another. Ignore them. View AI objectively. Ignore your bias.
Visual reasoning is not required to follow an algorithm that was laid out. The fact that the model can't execute an algorithm it was provided proves that it is not able to do deductive symbolic reasoning in general, since that is all that would have been required.
I would agree with you if this were about inventing the algorithm for itself - it may well be that you'd need some amount of visual reasoning to come up with it. But that's not what the GP (or the paper) were talking about.
> because they can only solve things that are already within their training set.
That is just plain wrong, as anybody who has spent more than 10 minutes with an LLM within the last 3 years can attest. Give it a try, especially if you care to have an opinion on them. Ask an absurd question (one that can, in principle, be answered) that nobody has asked before and see how it performs at generalizing. The hype is real.
I'm interested in which study you're referring to, because I'm interested in its methods and what it actually found.
The crux is that beyond a bit of complexity the whole house of cards comes tumbling down. This is trivially obvious to any user of LLMs who has trained themselves to use LLMs (or LRMs in this case) to get better results ... the usual "But you're prompting it wrong" answer to any LLM skepticism. Well, that's definitely true! But it's also true that these aren't magical intelligent subservient omniscient creatures, because that would imply that they would learn how to work with you. And before you say "moving goalpost" remember, this is essentially what the world thinks they are being sold.
It can be both breathless hysteria and an amazing piece of revolutionary and useful technology at the same time.
The training set argument is just a fundamental misunderstanding, yes, but you should think about the contrapositive - can an LLM do well on things that are _inside_ its training set? This paper does use examples that are present all over the internet including solutions. Things children can learn to do well. Figure 5 is a good figure to show the collapse in the face of complexity. We've all seen that when tearing through a codebase or trying to "remember" old information.
I think Apple published that study right before WWDC to have an excuse not to offer local foundation models bigger than 3B, and to force you to go via their cloud for harder "reasoning" tasks.
The APIs are still in beta, so it's moving waters, but those are my thoughts after playing with it; the paper makes much more sense in that context.
What you think is an absurd question may not be as absurd as it seems, given the trillions of tokens of data on the internet, including its darkest corners.
In my experience, it's better to simply try using LLMs in areas where they don't have a lot of training data (e.g. reasoning about the behaviour of Terraform plans). It's not a hard cutoff of _only_ being able to reason about already-solved things, but it's not too far off as a first approximation.
The researchers took existing known problems and parameterised their difficulty [1]. While most of these are by no means easy for humans, the interesting observation to me was that the failure_N was not proportional to the complexity of the problem, but correlated more with how commonly solution "printouts" for that size of the problem can be encountered in the training data. For example, "towers of hanoi", which has printouts of solutions for a variety of sizes, went to a very large number of steps N, while the river crossing, which is almost entirely absent from the training data for N larger than 3, failed above pretty much that exact number.
It doesn't help that thanks to RLHF, every time a good example of this gains popularity, e.g. "How many Rs are in 'strawberry'?", it's often snuffed out quickly. If I worked at a company with an LLM product, I'd build tooling to look for these kinds of examples in social media or directly in usage data so they can be prioritized for fixes. I don't know how to feel about this.
On the one hand, it's sort of like red teaming. On the other hand, it clearly gives consumers a false sense of ability.
Indeed. Which is why I think the only way to really evaluate the progress of LLMs is to curate your own personal set of example failures that you don't share with anyone else and only use it via APIs that provide some sort of no-data-retention and no-training guarantees.
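Something like the sketch below is all it takes; the file name, model name, and substring check are placeholders for whatever private cases and scoring you actually use, and it assumes an API account covered by the usual no-training/no-retention terms.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def run_private_evals(path="my_private_evals.json", model="gpt-4o-mini"):
    # Each case: {"prompt": "...", "must_contain": "..."} -- kept local, never shared.
    with open(path) as f:
        cases = json.load(f)
    passed = 0
    for case in cases:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": case["prompt"]}],
        )
        answer = resp.choices[0].message.content or ""
        # Crude scoring: look for an expected substring; real checks would be per-case.
        if case["must_contain"].lower() in answer.lower():
            passed += 1
    print(f"{passed}/{len(cases)} private evals passed")

if __name__ == "__main__":
    run_private_evals()
```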
I agree that all AI has been fake AI since the term was first coined.
Researchers in the field used to acknowledge that their computational models weren't anywhere close to AI. That all changed when greed became the driving motivation of tech.
> As the recently released Apple study demonstrates
The Apple study that did Towers of Hanoi and concluded that giving up when the answers would have been too long to fit in the output window was a sign of "not reasoning"?
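Back-of-the-envelope version of that objection, with made-up numbers (roughly 5 tokens per printed move and a 64k-token output budget, neither taken from the paper or any particular model):

```python
TOKENS_PER_MOVE = 5       # rough guess at the cost of printing one move
OUTPUT_BUDGET = 64_000    # illustrative output-window size

for n in range(5, 21):
    moves = 2**n - 1                   # minimal move count for n disks
    tokens = moves * TOKENS_PER_MOVE   # naive transcript-length estimate
    flag = "(exceeds budget)" if tokens > OUTPUT_BUDGET else ""
    print(f"{n:>2} disks: {moves:>7} moves, ~{tokens:>8} tokens {flag}")
```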