There isn't really such a thing as a "hallucination", and honestly I think people should be using the word less. Whether an LLM tells you the sky is blue or the sky is purple, it's not doing anything different. It's just spitting out a sequence of characters it was trained to produce, which is hopefully what a user wants. There is no definable failure state you can call a "hallucination"; it's operating as correctly as any other output. But sometimes we can tell, either immediately or through fact checking, that it spat out a string of text that claims something incorrect.
If you start asking an LLM for political takes, you'll get very different answers from humans about which ones are "hallucinations"
I don't know why the narrative became "don't call it hallucination". Granted, English isn't my mother tongue, so I might be missing some subtlety here. If you know how an LLM works, calling it "hallucination" doesn't make you know less. If you don't know how an LLM works, using "hallucination" doesn't make you know less either. It's just a word meaning the AI gives a wrong[1] answer.
People say it's "anthropomorphizing", but honestly I can't see it. The I in AI stands for intelligence; is this anthropomorphizing? The L in ML? Reading and writing are clearly human activities, so is using read/write instead of input/output anthropomorphizing? How about "computer", a word that once meant a human who did computing? Is there a word we can use safely without anthropomorphizing?
I suspect it’s about marketing. I’m not sure it would be so easy to sell these tools to enterprise organisations if you outlined that they are basically just very good at being lucky. With the abstraction of hallucinations you get a tidy bit of language for why your tool is sometimes very wrong.
To me the real danger comes from when the models get things wrong but also right at the same time. Not so much in software engineering; your average programmer without LLM tools doesn’t write “better” code or avoid bad answers either. What concerns me more is how non-technical departments implement LLMs into their decision-making or analysis systems.
Done right, it’ll enhance your capabilities. We had a major AI project in cancer detection, and while it actually works, it doesn’t really work on its own. Obviously it was meant to enhance the regular human detection, and anyone involved with the project screamed this loudly at any chance they got. Naturally it was seen as an automation process by upper management, and all the human parts of the process were basically replaced… until a few years later, when we had a huge scandal about how the AI worked exactly as it was meant to, which was never to operate on its own. Today it works alongside the human detection systems and their quality is up. It took people literally dying to get that point through.
Maybe it would’ve happened this way anyway if the mistakes weren’t sort of written into this technical issue we call hallucinations. Maybe it wouldn’t. From personal experience with getting projects to be approved, I think abstractions are always a great way to hide the things you don’t want your decision makers to know.
The AI companies don’t want you “anthropomorphising” the models because it would put them at risk of increased liability.
You will be told that linear algebra is just a model and the fact that epistemology has never turned up a decent result for what knowledge is will be ignored.
We are meant to believe that we are somehow special magical creatures and that the behaviour of our minds cannot be modelled by linear algebra.
I don't see how anthropomorphism reduces liability.
If a company does a thing that's bad, it doesn't matter much if the work itself was performed by a blacksmith or by a robot arm in a lights-off factory.
> We are meant to believe that we are somehow special magical creatures and that the behaviour of our minds cannot be modelled by linear algebra
I only hear this from people who say AI will never reach human level; of AI developers that get press time, only LeCun seems so dismissive (though I've not actually noticed him making this specific statement, I can believe he might have).
No, you’re just meant not to assert that linear algebra is equivalent to any process in the human brain, when the human brain is not understood well enough to draw that conclusion.
There are many types of wrong answer, and the difference is based on how the answer came to be. In the case of BS/hallucination there is no reason or logic behind the answer; in the case of an LLM, it is basically just random text. There was no reasoning behind the output, or it wasn't based on facts.
You can argue whether it matters how a wrong answer came about, of course, but there is a difference.
> I don't know why the narrative became "don't call it hallucination".
Context: "don't call it hallucination" picked up meme energy after https://link.springer.com/article/10.1007/s10676-024-09775-5, on the thesis that "Calling their mistakes ‘hallucinations’ isn’t harmless: it lends itself to the confusion that the machines are in some way misperceiving but are nonetheless trying to convey something that they believe or have perceived."
Which is meta-bullshit because it doesn't matter. We want LLMs to behave more factually, whatever the non-factuality is called. And calling that non-factuality something else isn't going to really change how we approach making them behave more factually.
How are LLMs not behaving factually? They already predict the next most likely term.
If they could predict facts, they would be gods, not machines. It would be saying that in all the written content we have, there exists a pattern that allows us to predict the answers to all questions we may have.
The problem is that some people are running around and saying they are gods. Which I wouldn't care about, but an alarming number of people do believe that they can predict facts.
TLDR TLDR: Assuming we dont argue right/wrong, technically everything an LLM does is a hallucination. This completely dilutes the meaning of the word no?
TLDR: Sure. A rose by any other name would be just as sweet. It’s when I use the name of the rose and imply aspects that are not present, that we create confusion and busy work.
Hey, calling it a narrative is to move it to PR speak. I know people have argued this term was incorrect since the first times it was ever shared on HN.
It was unpopular to say this when ChatGPT launched, because chatGPT was just that. freaking. cool.
It is still cool.
But it is not AGI. It does not “think”.
Hell - I understand that we will be doing multiple columns of turtles all the way down. I have a different name for this approach - statistical committees.
Because we couched its work in terms of “thinking”, “logic”, and “creativity”, we have dumped countless man-hours and money into avenues which are not fruitful. And this isn't just me saying it - even Ilya commented during some event that many people can create PoCs, but there are very few production grade tools.
Regarding the L in ML, and the I in AI ->
1) ML and AI were never quite as believable as ChatGPT. Calling it learning and intelligence doesn't result in the same level of ambiguity.
2) A little bit of anthropomorphizing was going on.
Terms matter, especially at the start. New things get understood over time, as we progress we do move to better terms. Let’s use hallucinations for when a digital system really starts hallucinating.
LLMs model their corpus, which for most models tends to be factually correct text (or subjective text with no factuality). Sure, there exist factually incorrect statements in the corpus, but for the vast majority of incorrect statements there exist many more equivalent but correct statements. If an LLM makes a statement that is not supported by the training data (either because it doesn't exist or because the equivalent correct statement is more strongly supported), I think that's an issue with the implementation of the model. I don't think it's an intrinsic feature/flaw in what the model is modeling.
Hallucination might not be the best word, but I don't think it's a bad word. If a weather model predicted a storm when there isn't a cloud in the sky, I wouldn't have a problem with saying "the weather model had a hallucination." 50 years ago, weather models made incorrect predictions quite frequently. That's not because they weren't modeling correct weather, it's because we simply didn't yet have good models and clean data.
Fundamentally, we could fix most LLM hallucinations with better model implementations and cleaner data. In the future we will probably be able to model factuality outside of the context of human language, and that will probably be the ultimate solution for correctness in AI, but I don't think that's a fundamental requirement.
This isn't going to happen with better data. Better data means it will be better at predicting the next token.
For questions or interactions where you need to process, consider, decompose a problem into multiple steps, solve those steps etc - you need to have a goal, tools, and the ability to split your thinking and govern the outcome.
That isn't predicting the next token. I think it’s easier to think of LLMs as doing decompression.
They take an initial set of tokens and decompress them into the most likely final set of tokens.
What we want is processing.
We would have to set up the reaction so that it somehow perfectly results in the next set of tokens, which then sets up the next set of tokens, and so on, until the system has an answer.
Or in other words, we have to figure out how to phrase an initial set of tokens so that each subsequent set looks similar enough to “logic” in the training data, that the LLM expands correctly.
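To make the "decompression" picture concrete, here is a minimal sketch of greedy autoregressive decoding (TypeScript; the `Model` type, the token ids, and the end-of-sequence id 0 are hypothetical stand-ins, not any real API):

    // Hypothetical model: maps the tokens seen so far to one raw score per vocabulary entry.
    type Model = (context: number[]) => number[];

    // Index of the largest score, i.e. the "most likely" next token under the model.
    function argmax(xs: number[]): number {
      let best = 0;
      for (let i = 1; i < xs.length; i++) {
        if (xs[i] > xs[best]) best = i;
      }
      return best;
    }

    // Greedy decoding: the prompt gets expanded one highest-scoring token at a time.
    // There is no goal and no decomposition into steps, just repeated next-token picks.
    function decode(model: Model, prompt: number[], maxNewTokens = 50): number[] {
      const tokens = [...prompt];
      for (let step = 0; step < maxNewTokens; step++) {
        const next = argmax(model(tokens));
        tokens.push(next);
        if (next === 0) break; // assume token id 0 marks end-of-sequence
      }
      return tokens;
    }

Any "logic" in the output only shows up when each intermediate expansion happens to resemble logical text from the training data, which is the point above.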
It should be "confabulation", since that's not carting along the notion of false sensory input.
Humans also confabulate but not as a result of "hallucinations". They usually do it because that's actually what brains like to do, whether it's making up stories about how the world was created or, more infamously, in the case of neural disorders where the machinery's penchant for it becomes totally unmoderated and a person just spits out false information that they themselves can't realize is false. https://en.m.wikipedia.org/wiki/Confabulation
> There isn't really such a thing as a "hallucination", and honestly I think people should be using the word less. Whether an LLM tells you the sky is blue or the sky is purple, it's not doing anything different. It's just spitting out a sequence of characters it was trained to produce, which is hopefully what a user wants. There is no definable failure state you can call a "hallucination"; it's operating as correctly as any other output.
This is a very "closed world" view of the phenomenon which looks at an LLM as a software component on its own.
But "hallucination" is a user experience problem, and it describes the experience very well. If you are using a code assistant and it suggests using APIs that don't exist then the word "hallucination" is entirely appropriate.
A vaguely similar analogy is the addition of the `let` and `const` keywords in JS ES6. While the behavior of `var` was "correct" as per spec, the user experience was horrible: bug-prone and confusing.
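For anyone who hasn't hit that pitfall, the classic example (runnable as plain JS or TypeScript) shows spec-correct behavior that still reads like a bug to users:

    const withVar: Array<() => void> = [];
    for (var i = 0; i < 3; i++) {
      withVar.push(() => console.log(i)); // `var i` is function-scoped: one shared binding
    }
    withVar.forEach(f => f()); // prints 3, 3, 3 -- correct per spec, surprising to users

    const withLet: Array<() => void> = [];
    for (let j = 0; j < 3; j++) {
      withLet.push(() => console.log(j)); // `let j` is block-scoped: a fresh binding per iteration
    }
    withLet.forEach(f => f()); // prints 0, 1, 2 -- what most people expect

Calling the `var` behavior a bug would be wrong in the same way; the problem was the experience, not the spec.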
It's the new "serverless" and I would really like people to stop making the discussion about the word. You know what it means, I know what it means, let's all move on.
We won't, and we'll see this constant distraction.
> It's the new "serverless" and I would really like people to stop making the discussion about the word. You know what it means, I know what it means, let's all move on.
Well, parent is lamenting the lack of a lower bound/upper bound for "hallucinations", something that cannot realistically exist, as "hallucinations" don't exist. LLMs aren't fact-outputting machines, so when one outputs something a human would consider "wrong", like "the sky is purple", it isn't true/false/correct/incorrect/hallucination/fact, it's just the most probable token, one after another.
That's why it isn't useful to ask "but how much does it hallucinate?" when what you're really after is something more like "does it only output facts?". Which, if they did, would make LLMs a lot less useful.
There is a huge gap between "facts" and "nonfacts" which compose the majority of human discourse. Statements, opinions, questions, when properly qualified, are not facts or nonfacts or hallucinations.
LLMs don't need to be perfect fact machines at all to be honest and non-hallucinating. They simply need to ground statements in other grounded statements and identify the parts which are speculative or non-grounded.
If you simply want to ground statements in statements, you quickly get into GOFAI territory where you need to build up the full semantics of a sentence (in all supported languages) in order to prove that two sentences mean the same or have the same denotation or that one entails the other.
Otherwise, how do you prove the grounding isn't "hallucinated"?
The root issue is that we humans perceive our own grasp on things as better than it is ("better" may be the wrong word, maybe just "different"), including how exactly concepts are tied to each other in our heads. That perception has been a primordial tool for our survival and for our day-to-day lives, but it's at odds with the task of building reasoning skills into a machine, because language evolved first and foremost to communicate among beings that share a huge context. For example, our definition of the word "blue" in "the sky is blue" would be wildly different if humans were all blind (as the machine is, in a sense).
> it's just the most probable token, one after another.
That's simply not true. You're confusing how they're trained and what they do. They don't have some store of exactly how likely each word is (and it's worth stopping to think about what that would even mean) for every possible sentence.
No, it's fundamentally not true, because when you say "most likely" you mean the highest-valued output of the model, not what's most likely in the underlying data, nor the goal of what is being trained for.
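One way to see the difference: the scores are computed fresh from the whole context by the network and only become a probability distribution after a softmax; nothing is looked up in a table of sentences or corpus frequencies. A toy sketch in TypeScript, where `ScoreNextToken` is a hypothetical stand-in for the trained network:

    // Hypothetical stand-in for the network: a learned function of the whole context
    // that returns one raw score (logit) per vocabulary token.
    type ScoreNextToken = (context: number[]) => number[];

    // Softmax turns raw scores into a probability distribution. "Most likely" just means
    // the largest of these freshly computed values, not a stored frequency.
    function softmax(logits: number[]): number[] {
      const max = Math.max(...logits);                  // subtract the max for numerical stability
      const exps = logits.map(x => Math.exp(x - max));
      const sum = exps.reduce((a, b) => a + b, 0);
      return exps.map(e => e / sum);
    }

    function nextTokenDistribution(score: ScoreNextToken, context: number[]): number[] {
      return softmax(score(context));
    }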
It is somewhat humorous when humans have ontological objections to the neologisms used to describe a system whose entire function is to relate the meanings of words. It is almost as if the complaint is itself a repressed philosophical rejection of the underlying LLM process, only being wrapped in the apparent misalignment of the term hallucination.
The complaint may as well be a defensive clinging "nuh uh, you can't decide what words mean, only I can"
Perhaps the term "gas lighting" is also an appropriate replacement of "hallucination," one which is not predicated on some form of truthiness standard, but rather THIS neologism focuses on the manipulative expression of the lie.
Hallucination is emergent. It cannot be found as a thing inside the AI systems. It is a phenomenon that only exists when the output is evaluated. That makes it an accurate description. A human who has hallucinated something is not lying when they speak of something that never actually happened, nor are they making any sort of mistake in their recollection. Similarly, an AI that is hallucinating isn't doing anything incorrect and doesn't have any motivation. The hallucinated data emerges just as any other output does, only to be evaluated by outsiders as incorrect.
It is an unfortunately anthropomorphizing term for a transformer simply operating as designed, but the thing it's become a vernacular shorthand for, "outputting a sequence of tokens representing a claim that can be uncontroversially disproven," is still a useful concept.
There's definitely room for a better label, though. "Empirical mismatch" doesn't quite have the same ring as "hallucination," but it's probably a more accurate place to start from.
Regardless I don't think there's much to write papers on, other than maybe an anthropological look at how it's affected people putting too much trust into LLMs for research, decision-making, etc.
If someone wants info to make their model to be more reliable for a specific domain, it's in the existing papers on model training.
Chess engines, which have been used daily for 25 years by the best human chess players, compute the best next move on the board. The total number of possible chess games is greater than the number of atoms in the observable universe.
Is it possible for a chess engine to compute the next move and be absolutely sure it is the best one? It's not; it is a statistical approximation, but still very useful.
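To illustrate why that "best move" is only an approximation, here is a toy depth-limited negamax in TypeScript. The `Position` interface and its methods are hypothetical placeholders, not any real engine's API; the point is that once the depth budget runs out the search has to trust a heuristic `evaluate`, so the returned move is only best relative to that bounded search, not a proven optimum:

    // Hypothetical game-state interface; a real engine has vastly more machinery.
    interface Position {
      legalMoves(): string[];
      play(move: string): Position;  // the position after making the move
      evaluate(): number;            // heuristic score from the side to move's point of view
      isTerminal(): boolean;
    }

    // Depth-limited negamax: beyond `depth` plies we fall back on the heuristic evaluation,
    // which is exactly where the approximation comes from.
    function negamax(pos: Position, depth: number): number {
      if (depth === 0 || pos.isTerminal()) return pos.evaluate();
      let best = -Infinity;
      for (const move of pos.legalMoves()) {
        best = Math.max(best, -negamax(pos.play(move), depth - 1));
      }
      return best;
    }

    // Returns the move with the highest bounded-search score, not a provably best move.
    function bestMove(pos: Position, depth: number): string | undefined {
      let chosen: string | undefined;
      let bestScore = -Infinity;
      for (const move of pos.legalMoves()) {
        const score = -negamax(pos.play(move), depth - 1);
        if (score > bestScore) { bestScore = score; chosen = move; }
      }
      return chosen;
    }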
Well, what the heck was Bing Chat doing when it wrote me a message all in emojis like it was the Zodiac killer, telling me a hacker had taken it over, then spitting out Python code to shut down the system, and giving me nonsense secret messages like “PKCLDUBB”?