
I want to sit next to you and stop you every time you use your LLM and say, “Let me just carefully check this output.” I bet you wouldn’t like that. But when I want to do high quality work, I MUST take that time and carefully review and test.

What I am seeing is fanboys who offer me examples of things working well that fail any close scrutiny— with the occasional example that comes out actually working well.

I agree that for prototyping unimportant code LLMs do work well. I definitely get to unimportant point B from point A much more quickly when trying to write something unfamiliar.



What's also scary is that we know LLMs do fail, but nobody (even the people who wrote the LLM) can tell you how often it will fail at any particular task. Not even an order of magnitude. Will it fail 0.2%, 2%, or 20% of the time? Nobody knows! A computer that will randomly produce an incorrect result to my calculation is useless to me because now I have to separately validate the correctness of every result. If I need to ask an LLM to explain to me some fact, how do I know if this time it's hallucinating? There is no "LLM just guessed" flag in the output. It might seem to people to be "miraculous" that it will summarize a random scientific paper down to 5 bullet points, but how do you know if its output is correct? No LLM proponent seems to want to answer this question.


> What's also scary is that we know LLMs do fail, but nobody (even the people who wrote the LLM) can tell you how often it will fail at any particular task. Not even an order of magnitude. Will it fail 0.2%, 2%, or 20% of the time?

Benchmarks could track that too - I don't know if they do, but that information should actually be available and easy to get.

When models are scored on e.g. "pass@10", i.e. at least one of 10 attempts passes the challenge, and then the benchmark is rerun periodically, that literally produces the information you're asking for: how frequently a given model fails at a particular task.
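
Roughly the bookkeeping that would give you that number, sketched in Python (run_model_on_task and the task name are made-up stand-ins for a real eval harness):

  import random

  # Hypothetical harness: a real benchmark would call the model here;
  # it's stubbed with a weighted coin flip so the sketch runs as-is.
  def run_model_on_task(task, success_rate=0.3):
      return random.random() < success_rate

  def pass_at_k(task, k=10):
      # pass@k: does at least one of k independent attempts pass?
      return any(run_model_on_task(task) for _ in range(k))

  def empirical_failure_rate(task, runs=200, k=10):
      # rerun the pass@k check many times to estimate how often
      # the model still fails this task even when given k attempts
      failures = sum(1 for _ in range(runs) if not pass_at_k(task, k))
      return failures / runs

  print(empirical_failure_rate("parse this CSV"))

With a real harness behind run_model_on_task, rerunning this periodically gives exactly the per-task failure rate being asked for.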

> A computer that will randomly produce an incorrect result to my calculation is useless to me because now I have to separately validate the correctness of every result.

For many tasks, validating a solution is orders of magnitude easier and cheaper than finding the solution in the first place. For those tasks, LLMs are very useful.
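
A toy illustration of that asymmetry (the regex and test cases are invented): checking a proposed answer against cases you already know is a one-liner, while producing the answer is the part you'd rather delegate.

  import re

  # Suppose an LLM proposed this date regex. Verifying it against known
  # cases is trivial compared to writing it.
  llm_proposed_pattern = r"^\d{4}-\d{2}-\d{2}$"

  known_cases = {
      "2024-01-31": True,
      "2024-1-31": False,
      "not a date": False,
  }

  ok = all(bool(re.fullmatch(llm_proposed_pattern, s)) == expected
           for s, expected in known_cases.items())
  print("accept the proposed answer" if ok else "reject and ask again")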

> If I need to ask an LLM to explain to me some fact, how do I know if this time it's hallucinating? There is no "LLM just guessed" flag in the output. It might seem to people to be "miraculous" that it will summarize a random scientific paper down to 5 bullet points, but how do you know if its output is correct? No LLM proponent seems to want to answer this question.

How can you be sure whether a human you're asking isn't hallucinating/guessing the answer, or straight up bullshitting you? Apply the same approach to LLMs as you apply to navigating this problem with humans - for example, don't ask it to solve high-consequence problems in areas where you can't evaluate proposed solutions quickly.


> For many tasks, validating a solution is orders of magnitude easier and cheaper than finding the solution in the first place.

A good example that I use frequently is a reverse dictionary.

It's also useful for suggesting edits to text that I have written. It's easy for me to read its suggestions and accept/reject them.


I think part of it is that, from eons of experience, we have a pretty good handle on what kinds of mistakes humans make and how. If you hire a competent accountant, he might make a mistake like entering an expense under the wrong category. And since he's watching for mistakes like that, he can double-check (and so can you) without literally checking all his work. He's not going to "hallucinate" an expense that you never gave him, or put something in a category he just made up.

I asked Gemini for the lyrics to a song that I knew was on all the lyrics sites. To make a long story short, it gave me the wrong lyrics three times, apparently making up new ones the last two times. Someone here said LLMs may not be allowed to look at those sites for copyright reasons, which is fair enough; but then it should have just said so, not "pretended" it was giving me the right answer.

I have a Python script that processes a CSV file every day, using DictReader. This morning it failed, because the people making the CSV changed it to add four extra lines above the header line, so DictReader was getting its headers from the wrong line. I did a search and found the fix on Stack Overflow, no big deal, and it had the upvotes to suggest I could trust the answer. I'm sure an LLM could have told me the answer, but then I would have needed to do the search anyway to confirm it--or simply implemented it, and if it worked, assume it would keep working and not cause other problems.
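
For anyone curious, the fix was roughly this shape (file name and the print are placeholders, not my actual code):

  import csv

  with open("daily_export.csv", newline="") as f:
      for _ in range(4):
          next(f)              # skip the four extra lines above the header
      for row in csv.DictReader(f):
          print(row)           # stand-in for whatever the script does with each row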

That was just a two-line fix, easy enough to try out, see whether it worked, and understand how it worked. I can't imagine implementing a 100-line fix and assuming the best.

It seems to me that some people are saying, "It gives me the right thing X% of the time, which saves me enough developer time (mine or someone else's) that it's worth the other (100-X)% of the time when it gives me garbage that takes extra time to fix." And that may be a fair trade for some folks. I just haven't found situations where it is for me.
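
That trade can be put in rough numbers (all invented, just to show where the break-even sits):

  by_hand = 30        # minutes to do the task yourself
  review_ok = 5       # minutes to check a correct LLM answer
  cleanup_bad = 60    # minutes to untangle a wrong one
  for x in (0.5, 0.7, 0.9, 0.95):
      expected = x * review_ok + (1 - x) * cleanup_bad
      print(f"right {x:.0%} of the time: {expected:.1f} min vs {by_hand} min by hand")

With cleanup that expensive, anything under about a 55% hit rate is a net loss; shift the numbers and the conclusion shifts with them.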


Even better than the whole 'unknown, fluctuating, non-deterministic failure rates' problem is the whole 'agentic' shtick. People proposing to chain together these fluctuating plausibility engines should study probability theory a bit more deeply to understand just what they are in for with these Rube Goldberg machines of text continuation.
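
The back-of-the-envelope version (with a per-step success rate I'm inventing for illustration):

  # If each step in an agent chain independently succeeds 95% of the time,
  # end-to-end reliability collapses as steps pile up.
  p_step = 0.95
  for steps in (1, 5, 10, 20, 50):
      print(steps, round(p_step ** steps, 3))
  # prints (one pair per line): 1 0.95, 5 0.774, 10 0.599, 20 0.358, 50 0.077

Twenty 95%-reliable steps chained together get you a pipeline that works barely a third of the time.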


I think it’s very odd that you think that people using LLMs regularly aren’t carefully checking the outputs. Why do you think that people using LLMs don’t care about their work?



