> I guess there’s no new news here; we already knew that LLMs are good at generating plausible-sounding narratives which are wrong. It comes back to what I discussed under the heading of “Meaning”. Still waiting for progress.
> The nice thing about science is that it routinely features “error bars” on its graphs, showing both the finding and the degree of confidence in its accuracy.
> AI/ML products in general don’t have them.
> I don’t see how it’s sane or safe to rely on a technology that doesn’t have error bars.
Exactly, there is no news here. Foundation LLMs are generative models trained to produce text similar to the text on which they were trained (or for the pedantic: minimize perplexity).
They are not trained to output facts or truths or any other specific kind of text (or for the pedantic: instruction-tuned / RLHF variants are trained to produce text that humans like, after they are trained to minimize perplexity).
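For the very pedantic, "minimize perplexity" just means minimizing next-token cross-entropy; perplexity is exp of that loss. A rough sketch of what the objective actually measures (using the Hugging Face transformers library and the small gpt2 checkpoint purely for illustration; nothing here is specific to any particular vendor):

```python
# Minimal sketch: the pretraining objective is next-token cross-entropy,
# and perplexity is just exp(loss). Nothing in this number asks whether the text is true.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The capital of France is Paris.", return_tensors="pt").input_ids

with torch.no_grad():
    # Passing labels makes the model return the average next-token cross-entropy.
    loss = model(ids, labels=ids).loss

print("cross-entropy:", loss.item())
print("perplexity:", torch.exp(loss).item())
```

Training nudges the weights so that number goes down on the training corpus; truth never enters the objective.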
"Hallucination" is a terrible term to apply to LLMs. LLMs ONLY produce hallucinations.
The whole "hallucination" business always seemed to me to be a marketing masterstroke -- the "wrong" output it produces is in no way more "wrong" or "right" than any other output given how LLMs are fundamentally operate, but we'll brand it in such terms to give the indication that it is a silly occasional blunder rather than an example of a fundamental limitation of the tech.
LLMs can be useful when used as a glorified version of printf and scanf.
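That is, hand them messy text and ask for structured fields back, or hand them structured fields and ask for fluent text. A toy sketch of that framing (the llm() call is a placeholder for whatever chat API you use; the field names and prompts are invented for illustration):

```python
import json

def llm(prompt: str) -> str:
    # Placeholder: wire up whatever chat-completion API you actually use.
    raise NotImplementedError

def scanf_like(free_text: str) -> dict:
    # "scanf" direction: pull structured fields out of unstructured text.
    prompt = ("Extract the fields name, date and amount from the text below "
              "and reply with JSON only.\n\n" + free_text)
    return json.loads(llm(prompt))  # will raise if the model wraps the JSON in prose

def printf_like(record: dict) -> str:
    # "printf" direction: render structured fields as fluent prose.
    return llm("Write one polite sentence confirming this payment record:\n"
               + json.dumps(record))
```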
I agree that classifying their mistakes as "hallucinations" is a marketing masterstroke, but then again, marketing masterstrokes are hallucinations too.
In fact, all human perception is merely glorified hallucination. Your brain is cleaning up the fuzzy upside-down noise your eyes are delivering to it, so much so that you can actually hallucinate "words" with meaning on the screen that you see, or that a flower or a person or a painting is "beautiful".
We have an extremely long way to go until LLM hallucinations are better than human hallucinations. It's disingenuous to treat LLM hallucinations as a bug that can be fixed rather than a fundamental core feature that will take a long time to improve to the human level, and we also have to admit that humans themselves have a long way to go, on evolutionary timescales, before our own perception is any less hallucinatory and inaccurate than it is now.
It was only extremely recently in evolutionary scales that we invented science as a way to do that, and despite its problems and limitations and corruptions and detractors, it's worked out so well that it enabled us to invent LLMs, so at least we're moving in the right direction.
At least it's easier and faster for LLMs to evolve than humans, so they have a much better chance of hallucinating less a lot sooner than humans.
It is made to serve humans, so it's pretty obvious what means what in this context. Oh, but why not change the context just for the sake of some pedantic argument.
Treating hallucination as an error rather than a fundamental limitation is simply a practical way of thinking. It means that, depending on how it's handled, hallucination can be mitigated and improved upon. Conversely, if it's regarded as a fundamental limitation, it would mean that no matter what you do, it can't be improved, so you'd just have to twiddle your thumbs. But that doesn't align with actual reality.
Treating hallucinations as an error that can be corrected fights against the nature of the technology and is more hype than reality. LLMs are designed to be a bullshit generator and that’s what they are; it is a fundamental limitation. (“Bullshit” here used in the technical sense: not that it’s wrong, but that the truth value of the output is meaningless to the generator.) Thankfully the hype cycle seems to be on the down slope. Think about the term “generative AI” and what the models are meant to do: generate plausible-sounding somewhat creative text. They do that! Mission accomplished. If you think you can apply them outside that limited scope, the burden of proof is on you; skepticism is warranted.
> Exactly, there is no news here. Foundation LLMs are generative models trained to produce text
Yet even here on HN it isn't uncommon to see top/high ranked comments suggesting baby AGI, intelligence, reasoning, and all that. I suspect I'll get replies about how GPT can do reasoning and world models. Hell, I've seen very prominent people in the space make these and other obtuse claims. A lot of people buy into the hype and have a difficult time distinguishing all the utility from the hype, not understanding that attacking the hype is not attacking utility (plenty of things are overhyped but useful).
I keep saying ML is like we've produced REALLY good chocolate, then decided that this was not good enough so threw some shit on top and called it a cherry. The chocolate is good enough, why are we accepting a world where we keep putting shit on top of good things? We do realize that at some point people get more upset because there's more shit than chocolate, right? Or that the shit's taste overpowers the chocolate's (when people are talking about pure shit, it's because they've reached this point). It's an unsustainable system that undermines all chocolate makers and can get chocolate making banned. For what? Some short term gains?
> so threw some shit on top and called it a cherry
Do you mean metaphorical feces or do you mean "shit" in the general sense of a collection of unspecified materials?
I think the canonical example of deciding chocolate wasn't good enough and throwing shit on top was what happened to the LP400 [0] or "Microsoft Re-Designs the iPod Packaging" [1].
I mean in the sense of overselling, snake oil, and deception. So metaphorical. The tools have a lot of abilities, but they are vastly oversold as intelligent and reasoning systems when they are not. We have systems like Devin, Humane Pin, Rabbit, and so many more that were clearly not going to work as they were advertised, yet even big names promote these. "Top notch" researchers who believe the claims and promote them. When they should be dismissing them. But maybe they're playing a different game.
I mean, why can we not recognize the absolutely incredible feat that LLMs actually are? You're telling me that we (lossy) compressed the entire (text on the) internet AND built a human language interface to retrieve that information, and we can get this all to run on a GPU with <24GB of VRAM? That's wildly impressive! It's wildly impressive even were we to suppose the error rate was 1000x whatever you think it is. I mean Eliza is cool, but this shit (opposite usage) is bonkers. Anyone saying it isn't is just not paying attention or being naive. There's no need to sell the "baby" AGI story when this is so incredibly impressive already.
True, although of course in many contexts facts/truth are the best prediction. Maybe we're just not training them as well as possible.
I've argued that to really fix "hallucinations" (least-worst predictions) these models really need to be aware of the source/trustworthiness of their training data, and would presumably learn that trusted sources better help predict factual answers.
However, it turns out that these models do often already have a good idea of their own confidence levels and whether something is true or not, but don't seem to know when to use that information.
Hallucinations seem to be an area where the foundation companies are confident they can make significant improvements, although I'm not sure what techniques they expect to use (or are already using).
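One crude way to surface that self-knowledge is to look at the per-token probabilities behind an answer instead of just the sampled text. A sketch with a local model (the gpt2 checkpoint and the prompt are arbitrary choices for illustration; production systems presumably do something far more involved):

```python
# Crude "error bar" sketch: score a greedy completion by the probability the model
# itself assigned to each generated token. Low probabilities hint that it's guessing.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The capital of Australia is", return_tensors="pt").input_ids

with torch.no_grad():
    out = model.generate(ids, max_new_tokens=5, do_sample=False,
                         return_dict_in_generate=True, output_scores=True)

new_tokens = out.sequences[0, ids.shape[1]:]
token_probs = [torch.softmax(step[0], dim=-1)[t].item()
               for step, t in zip(out.scores, new_tokens)]
print(tok.decode(new_tokens), token_probs)
```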
> "Hallucination" is a terrible term to apply to LLMs. LLMs ONLY produce hallucinations.
I don't think that's a great way to characterize it. I prefer the word bullshitting vs hallucinations (they "know" what they are doing), but let's just call it what it is: they statistically predict, and some predictions are better than others. Per human-preference fine-tuning, these models are also "trying to please", and I wonder if that has been at least a small part of the problem: predicting a low-confidence continuation (when they are aware of it) rather than "I don't know", because human evaluators have indicated a preference for longer and/or more specific answers.
I agree, the term might be less than ideal but still:
Whether a model is good or bad at recalling facts from its training material matters to end-users, and GPT-4 and Claude 3 Opus (and bigger models in general?) tend to be better at this than other models.
I think you have this exactly right. They are so good at what you might call "hallucinations" that we have just gone ahead and repurposed them, dropping them into all kinds of contexts where it's good enough, even though that's not what they're strictly trained for.
Rather than hallucinations being treated like an interesting puzzle or paradox at the center of AI intrinsically, I think it's incidental to the types of models that have been trained, and it's conceivable that they could be trained against a notion of reliable sources and the relation between their statements and such sources.
A fact for one might just be a local minimum for another. A falsehood for one person can turn out to be a statistical truth for a population. Are epistemologies to be evaluated for purity or for knowledge utility?
A classic example is food taboos that don't make sense without a public health perspective.
> [F]ood taboos for pregnant and lactating women in Fiji selectively target the most toxic marine species, effectively reducing a woman's chances of fish poisoning by 30 per cent during pregnancy and 60 per cent during breastfeeding. We further analyse how these taboos are transmitted, showing support for cultural evolutionary models that combine familial transmission with selective learning from locally prestigious individuals.
I mean, you're not stating anything interesting...
For example, let's take how we treat children.
"Act good or Santa won't bring you gifts". Which really means "Please don't act like a turd burglar and I'll buy you some nice things one day a year".
Instead of going straight to the second point, humans love feeding their children hocus pocus about a magical being dropping gifts in the house. Correct, many religious traditions like "hey, don't eat pork" make sense if you have no means of cooking meat to temperature. But they have little purpose once you have a basic understanding of the reasoning behind why you got sick. The same goes for anything that is stuck in the stone ages of a sky daddy versus moving to the more complex domain of philosophy.
If humans were optimized to seek truth, we'd have had the scientific revolution soon after we became intelligent. Instead it took 10,000 years or so before that happened.
Treating scientific understanding as truth or fact is effectively converting it into a religion based on faith. We take those "facts" out of context, wipe away caveats and assumptions made during the research, and leave ourselves with catchy headlines that can only be shared on faith as we don't know the context.
It is almost a definition of ignorance to mock something that is not understood... The hubris of science building the castles of economics is literally taking the world down ecologically right now... both "rational" pursuits. Next, look at media content... addiction prevalence... etc... not rational, not factual...
save the cheap shots for something of similar value
Show me a decision making system with a 0% error rate!
That's not to say that LLMs aren't currently a poor fit for many domains (and maybe always will be), but I feel like your general objection would apply to many models – even deterministic ones – that we've been using for a long time as well.
Or would you say that weather forecasts are completely useless as well?
I think that's a bit harsh. There's been some definite progress with LLMs getting better at reasoning, picking apart instructions, producing helpful suggestions, criticism, etc.
What they are weak at is exactly the same stuff we're actually weak at: accurately recalling facts. Except they are able to recall vastly more stuff than any human being would be able to recall. An inhuman amount of facts, actually. The core issue is that when asked sufficiently open questions, these models tend to take some liberties with the facts. But most of the knowledge tests used to benchmark LLMs would be hard for the vast majority of humans on this planet to pass as well.
Worse, if you follow the public debate on various political topics a bit you realize that it features a lot of people suffering from confirmation bias parroting each other. Populist politicians seem to get away with a lot of stuff that would put most LLMs to shame; seemingly without affecting their popularity.
IMHO, LLMs by themselves can't be trusted to get things right, but paired with some subsystems to produce references, check things, and look things up, they become quite capable. Also, it helps to ask the right targeted questions and constrain them a little.
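In practice the pairing can be as simple as: retrieve a few trusted passages first, tell the model to answer only from them, and make it cite which one it used. A toy sketch (search() and llm() are placeholders, not any real library):

```python
def search(query: str, k: int = 3) -> list[str]:
    # Placeholder: plug in your document index / search subsystem here.
    raise NotImplementedError

def llm(prompt: str) -> str:
    # Placeholder: plug in your chat model here.
    raise NotImplementedError

def answer_with_references(question: str) -> str:
    # Ask the model to synthesize only from retrieved passages and cite them,
    # rather than answer from its own (unsourced) parametric memory.
    passages = search(question)
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, 1))
    return llm(
        "Answer the question using ONLY the numbered passages below. "
        "Cite the passage number after each claim, and say so if the passages "
        f"don't contain the answer.\n\n{numbered}\n\nQuestion: {question}"
    )
```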
We pay for ChatGPT at work. At $20 per month per user, it's pretty much a no-brainer. Do we blindly trust it? Absolutely not. Do we get shit tons of value out of it? Yes. I program with it, I brainstorm with it, I let it review text, I use it to work out bullet points into a coherent narrative, I use it to refine things, I use it to generate unit tests, etc. Is it flawless? No. But it sure saves me a lot of time. And getting the same value from people tends to be a lot harder/more expensive.
This dismissive "it's just a stochastic parrot" type criticism kind of misses the forest for the trees. If you've ever observed toddlers repeating stuff adults tell them, you'd realize that we all start out as stochastic parrots. Forget about getting any coherent/insightfull statement out of a toddler. And most adults aren't that much better and would fail most of the tests we throw at LLMs.
I used to write a classical music blog. I was an enthusiastic listener but only had 1 music theory class in my life. Yet I had enough mastery of the vocabulary to get classical musicians and critics to read what I had to say.
I attribute some of this to years of reading classical music reviews turning me into a stochastic parrot. I then added some value on top of that via enthusiasm, specializing in a niche (American classical music), and just cranking out frequent, reasonable and sometimes interesting content.
Fanfare Magazine had a critic try using ChatGPT to write classical record reviews. But they used 3.5 and had it try to write a review of an old revered album, so it "went off the rails." Writing a review to the expected standards requires sophistication and breadth of knowledge. I could probably mimic one the way ChatGPT did, but I likely wouldn't reach the (sometimes pedantic, if experienced) level of a real critic.
I wonder if it will ever be feasible to have AI listen to a new recording and write a review on an objective basis rather than try to synthesize a review from existing text. That to me would be the real breakthrough.
>> "Hallucination" is a terrible term to apply to LLMs. LLMs ONLY produce hallucinations.
Seems pedantic. If you define "hallucinations" to include correct responses, then nobody cares whether something is a "hallucination," the only important question is -- is it right?
And the answer is yes, across a huge number of metrics [at least for gpt-4]. It'd beat you at Jeopardy, it'd beat you at chess, it'd beat you at an AP Biology exam, it'd beat you at a leetcode competition.
Yes, gpt4 will beat 99.9% of humans on 99.9% of subjects. But it's terrible at leetcode.
For "easy" questions sure it's great, and even some medium, but often at a medium level and almost always at a hard level, it'll give you a classic confident-but-terribly-wrong response. Prompting it to look at its mistakes will usually lead to the "oops yes that was wrong, here's the corrected version:", which is also wrong.
definitely less than 1%, and we're in agreement in general
i only take issue with leetcode as an example. i would say gpt4 is great at coding, and also api and db schema design. but leetcode is specific coding tasks with clear right and wrong answers, where getting 80% of a right answer is still wrong. gpt4 and my mum have roughly the same chance of getting the right answer to a hard leetcode problem, that's my only point
> LLMs are generative models trained to produce text similar to the text on which they were trained (or for the pedantic: minimize perplexity). They are not trained to output facts or truths or any other specific kind of text.
If you train them on text that is largely true, then the output is also largely true. To ignore this is to miss the point of LLMs altogether.
Have any of the leading models been trained on that kind of corpus? Could a corpus like that even be constructed without vociferous and immediate dispute about what amounts to "largely true"? Would a corpus like that have enough suitable data to generate conversational output that feels satisfying?
I thought that most/all of these big models are relying on a much much larger and more diverse body of content. Is that wrong?
> Have any of the leading models been trained on that kind of corpus?
Does such a corpus even exist? Maybe a bunch of Mathematics papers, and even then...
To build a corpus like that would mean trusting the human gatekeepers or curators, which just shifts the problem, because humans are also prone to this sort of mistake.
> Would a corpus like that have enough suitable data to generate conversational output that feels satisfying?
Yes? The point is that we can use LLMs right now. They mostly work. That is the empirical evidence that is immediately verifiable right now. This is not a hypothetical. LLMs mostly work.
Huh? I don't understand how this engages with what I wrote or what I was responding to.
Yes, it's very very obvious that the conversational LLMs we engage with mostly work for conversational UX. The discussion is about the above commenter's reference to "largely true" data. I'm not aware of any leading LLM built on something that could be characterized that way, but I'm genuinely curious to hear otherwise.
See Tim Bray’s example; the LLM very much did not mostly work once it was talking about something that he knew about; most of what it said was nonsense.
I suspect a _huge_ driver of the hype is that people are not usually experts on the things that they ask these about, so they see a magical oracle rather than a nonsense generator.
But then companies go train them on "The Internet". This includes not just factual (or mostly factual) content like Wikipedia but also SEO spam sites, YouTube comment sections, literal fake news articles, and fan fiction.
Even if they're limited to "mostly true" content they'll happily and confidently confabulate "facts". They'll make syntactically and grammatically correct sentences that are actually authentic frontier gibberish.
> But then companies go train them on "The Internet".
This is the bit that I think a lot of people miss. If the model was trained on stuff from the internet, then what the model will produce will reflect that, and the internet is chock full of bullshit.
Same thing about when LLMs produce objectionable responses (racist, violent, whatever). Those are a direct reflection of the nature of the training material.
There's still a missing factor, I suspect, though. Given how frequently wrong and hallucination-prone humans are, I don't think we're _that_ different in this context. Nonetheless we can somehow inspect our thoughts and come up with a degree of confidence.
But what gives us that ability? I don't trust human thought at all. Police have to be careful what they say as to not pollute the minds of anyone they're questioning - we're just insanely prone to flat out lie without knowing. So to one degree, we seem quite similar to hallucinating LLMs.
So what gives us the ability to nonetheless identify "truth" from our memory? Is it an ability to trace back to training data? The less clear the path is, perhaps the more likely we think we don't "know" the answer?
I also feel this way. We are not inherently different to LLMs, we are just better at some things, worse at others. At the end of the day we both are just atoms and electricity as ruled by the laws of physics.