
I have a small physics-based problem I pose to LLMs. It's tricky for humans as well, and all LLMs I've tried (GPT o3, Claude 3.7, Gemini 2.5 Pro) fail to answer correctly. If I ask them to explain their answer, they do get it eventually, but none get it right the first time. Qwen3 with max thinking got it even more wrong than the rest, for what it's worth.


As they say, we shouldn't judge AI by the current state-of-the-art, but by how far and fast it's progressing. I can't wait to see future models get it even more wrong than that.


Personally (anecdata) I haven't experienced any practical progress in my day-to-day tasks for a long time, no matter how good they became at gaming the benchmarks.

They keep being impressive at what they're good at (aggregating sources to solve a very well known problem) and terrible at what they're bad at (actually thinking through novel problems or old problems with few sources).

E.g. ChatGPT, Claude, and Gemini were all absolutely terrible at generating Liquidsoap[0] scripts. It's not even that complex, but there's very little information to ingest about the problem space, so you can actually tell they are not "thinking".

[0] https://www.liquidsoap.info/


Absolutely. All models are terrible with Objective-C and Swift compared to, let's say, JS/HTML/Python.

However, I've realized that Claude Code is extremely useful for generating somewhat simple landing pages for some of my projects. It spits out static html+js which is easy to host, with somewhat good looking design.

The code isn't the best and to some extent isn't maintainable by a human at all, but it gets the job done.


I’ve gotten 0 production usable python out of any LLM. Small script to do something trivial, sure. Anything I’m going to have to maintain or debug in the future, not even close. I think there is a _lot_ of terrible python code out there training LLMs, so being a more popular language is not helpful. This era is making transparent how low standards really are.


> I’ve gotten 0 production usable python out of any LLM

Fascinating, I wonder how you use it, because once I decompose code into modules and function signatures, Claude[0] is pretty good at implementing Python functions. I'd say it one-shots 60% of the time, I have to tweak the prompt or adjust the proposed diffs 30% of the time, and the remaining 10% is unusable code that I end up writing by hand. Other things Claude is even better at: writing tests, simple refactors within a module, authoring first-draft docstrings, adding context-appropriate type hints.

0. Local LLMs like Gemma3, Qwen-coder seem to be in the same ballpark in terms of capabilities, it's just that they are much slower on my hardware. Except for the 30b Qwen3 MoE that was released a day ago, that one is freakin' fast.
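
To make the workflow concrete, here is a minimal, hypothetical sketch (the function and data model are invented for illustration, not from any real project): the module layout, signature, type hints, and docstring are the hand-written part, and the body is what the model is asked to fill in.

    from dataclasses import dataclass

    @dataclass
    class Order:
        customer_id: str
        total_cents: int

    def summarize_orders(orders: list[Order]) -> dict[str, int]:
        """Return total spend in cents per customer_id."""
        # Everything above this line is hand-written scaffolding;
        # a body like the one below is what the model is asked to produce.
        totals: dict[str, int] = {}
        for order in orders:
            totals[order.customer_id] = totals.get(order.customer_id, 0) + order.total_cents
        return totals

The tighter the signature and docstring, the less room the model has to wander, which is probably why the one-shot rate is as high as it is.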


I agree - you have to treat them like juniors and provide the same context you would someone who is still learning. You can’t assume it’s correct but where it doesn’t matter it is a productivity improvement. The vast majority of the code I write doesn’t even go into production so it’s fantastic for my usage.


What happens to the vast majority of code you write?


Different experience here. Production code in banking and finance for backend data analysis and reporting. Sure the code isn't perfect, but doesn't need to be. It's saving >50% effort and the analysis results and reporting are of at least as good a standard as human developed alternatives.


Try o4-mini-high. It’s getting there.


Maybe with the next GPT version, gpt-4.003741


Interesting, I'll have to try that. All the "static" page generators I've tried require React....


Building a basic static html landing page is ridiculously easy though. What js is even needed? If it's just an html file and maybe a stylesheet of course it's easy to host. You can apply 20 lines of css and have a decent looking page.

These aren't hard problems.


A big part of my job is building proofs of concept for some technologies and that usually means some webpage to visualize that the underlying tech is working as expected. It’s not hard, doesn’t have to look good at all, and will never be maintained. I throw it away a few weeks later.

It used to take me an hour or two to get it all done up properly. Now it's literal seconds. It's a handy tool.


> These aren’t hard problems.

Honestly, that’s the best use-case for AI currently. Simple but laborious problems.


Laziness mostly - no need to think about design, icons and layout (responsiveness and all that stuff).

These are not hard problems obviously, but getting to 80%-90% is faster than doing it by hand and in my cases that was more than enough.

With that being said, AI failed on the remaining 10%-20%, with various small visual issues.


> These aren't hard problems.

So why do so many LLMs fail at them?


And humans also.


I like using Vercel v0 for frontend


Absolutely, as soon as they hit that mark where things get really specialized, they start failing a lot. They handle generalizations over well-documented areas pretty well. I only use it for getting a second opinion, as it can search through a lot of documents quickly and find me alternative means.


They have broad knowledge, a lot of it, and they work fast. That should be a useful combination.

And indeed it is. Essentially every time I buy something these days, I use Deep Research (Gemini 2.5) to first make a shortlist of options. It’s great at that, and often it also points out issues I wouldn’t have thought about.

Leave the final decisions to a super slow / smart intelligence (a human), by all means, but as for people who claim that LLMs are useless, I can only conclude that they haven't tried very hard.


Yes, similar experience querying GPT about lesser-known frameworks. Had o1 stone-cold hallucinate some non-existent methods I could find no trace of from googling. Would not budge on the matter either. Basically you have to provide the key insight yourself in these cases to get it unstuck, or just figure it out yourself. After it's dug into a problem to some degree, you get a feel for whether continued prompting on the subject is going to be helpful or just more churn.


I'm curious what kind of prompting or context you are providing before asking for a Liquidsoap script - or if you've tried using Cursor and providing a bunch of context with documentation about Liquidsoap as part of it. My guess was these kinds of things get the models to perform much better. I have seen this work with internal APIs / best practices / patterns.


Yes, I used Cursor and tried providing both the whole Liquidsoap book and the URL to the online reference, in case the book was too large for the context window or was triggering some sort of RAG.

Not successful.

It's not that it didn't do what I wanted: most of the time it didn't even run. Iterating on the error messages just led to progressively dumber non-solutions and running in circles.


Oh man, that's disappointing.


What model?


I'm on Pro two-week trial so I tried a mix of mainstream premium models (including reasoning ones) + letting Cursor route me to the "best" model or whatever they call it.


This problem is always going to exist in these models; they are hungry for good data.

If there is a focus on improving the model on something, the method to do it is known; it's just about priority.


Haven't seen much progress in base models since GPT-4. Deep thinking and whatever else came in the last year are just bandaids hiding the shortcomings of said models, and were achievable before with the right tooling. The tooling got better; the models themselves are just marginally better.


You really had me until the last half of the last sentence.


The plural of anecdote is data.


"The plural of anecdote is data.", this is right up there with "1 + 1 = 3, for sufficiently large values of 1".

Had an outright genuine guffaw at this one, bravo.


Only in the same way that the plural of 'opinion' is 'fact' ;)


Except, very literally, data is a collection of single points (i.e. what we call "anecdotes").


No. Anecdote, ἀνέκδοτον, is a story that points to some abstract idea, commonly having something to do with morals. The word means 'not given out'/'not-out-given'. Data is the plural of datum, and arrives in English not from Greek but from Latin. The root is however the same as in anecdote, and datum means 'given'. Saying that 'not-given' and 'collection of givens' are the same is clearly nonsensical.

A datum has a value and a context in which it was 'given'. What you mean by "points" eludes me, maybe you could elaborate.


Except that the plural of anecdotes is definitely not data, because without controlling for confounding variables and sampling biases, you will get garbage.


Based on my limited understanding of analytics, the data set can be full of biases and anomalies, as long as you find a way to account for them in the analysis, no?


The accuracy of your analysis becomes limited to the accuracy of how well you correct for the biases. And it's difficult to measure the bias accurately without lots of good data or cross-examination.


Garbage data is still data, and data (garbage or not) is still more valuable than a single anecdote. Insights can only be distilled from data, by first applying those controls you mentioned.


Or you can apply the Bezos/Amazon anecdote about anecdotes:

At a managers' meeting there were "user stories" about poor support, but all the KPIs from the call center looked good, so Jeff dialed the number from the meeting speakerphone, got put on hold, IVR spin cycle, hold again, etc. His takeaway was basically "if the data and anecdotes don't match, always default to the customer stories".


No. Wittgenstein's rule-following paradox, the Shannon sampling theorem, the fact that infinitely many polynomials pass through any finite set of points (does that have a name?), etc., are all equivalent at the limit to the idea that no amount of anecdotes per se adds up to anything other than coincidence.


Without structural assumptions, there is no necessity - only observed regularity. Necessity literally does not exist. You will never find it anywhere.

Hume figured this out quite a while ago and Kant had an interesting response to it. Think the lack of “necessity” is a problem? Try to find “time” or “space” in the data.

Data by itself is useless. It’s interesting to see peoples’ reaction to this.


@whatnow37373 — Three sentences and you’ve done what a semester with Kritik der reinen Vernunft couldn’t: made the Hume-vs-Kant standoff obvious. The idea that “necessity” is just the exhaust of our structural assumptions (and that data, naked, can’t even locate time or space) finally snapped into focus.

This is exactly the kind of epistemic lens-polishing that keeps me reloading HN.


This thread has given me the best philosophical chuckle I've had this year. Even after years of being here, HN can still put an unexpected smile on your face.


Anti-realism, indeterminacy, intuitionism, and radical subjectivity are extremely unpopular opinions here. Folks here are too dense to imagine that the cogito is fake bullshit and wrong. You're fighting an extremely uphill battle.

Paul Feyerabend is spinning in his grave.


No, no, no. Each of them gives you information.


In the formal, information-theory sense, they literally don't, at least not on their own without further constraints (like band-limiting or bounded polynomial degree or the like)


They give you relative information. Like word2vec


...which you always have.


“Plural of anecdote is data” is meant to be tongue-in-cheek.

Actual data is sampled randomly. Anecdotes very much are not.


> “Plural of anecdote is data” is meant to be tongue-in-cheek.

To be fair, possibly because of this tongue-in-cheek statement, anecdote is sometimes used in place of datum and not as anecdote is usually defined.


Technically we call it a datum. An anecdote is a story, not a point.

But it is true that colloquially anecdote is sometimes used in place of datum.


One point is a collection of size 1. It is always data.



Ahhhahhahahaha, stavros is so right, but this is such high-level bickering I haven't laughed so hard in a long time. Y'all are awesome! dymk, you deserve a touché for this one.

The challenge for sharing data at this stage of the game is that the game is rigged in data's favor. So stavros, I hear you.

To clarify, if we post our data it’s just going to get fed back into the models making it even harder to vet iterations as they advance.


I think somebody said it may be 'anecdata'.


The plural of reliable data is not anecdote.


Depends on the data generating process.


Of course, but then you have a system of gathering information with some rigor which is more than merely a collection of anecdotes. That becomes the difference.


"For what it's worth"? What's wrong with that?


That's the last third of the sentence.


Current models are quite far away from human-level physical reasoning (paper below). An upcoming version of models trained on world simulation will probably do much better.

PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models

https://phybench-official.github.io/phybench-demo/


This is more of a physics math aptitude test. You can already see that the best model at math is already halfway to saturating it. It might not indicate usefulness in actual physical reasoning; at the very least, it seems like a bit of a stretch.


I similarly have a small, simple spatial reasoning problem that only reasoning models get right, and not all of them, and which Qwen3 on max reasoning still gets wrong.

> I put a coin in a cup and slam it upside-down on a glass table. I can't see the coin because the cup is over it. I slide a mirror under the table and see heads. What will I see if I take the cup (and the mirror) away?


Sonnet 3.7 non-reasoning got it right.

I'll think this through step by step. When you place a coin in a cup and slam it upside-down on a glass table, the coin will be between the table surface and the cup. When you look at the reflection in the mirror beneath the table, you're seeing the bottom side of the coin through the glass. Since the mirror shows heads, you're seeing the heads side of the coin reflected in the mirror. In a mirror, the image is reversed left-to-right, but the "heads" and "tails" sides of a coin would not be flipped in this reflection - you'd still recognize heads as heads. If you take away both the cup and the mirror, you'll see the opposite side of the coin compared to what you saw in the mirror. Since the mirror showed heads, you'll see tails when looking directly at the coin from above the table.


Not reasoning mode, but I struggle to call that “non-reasoning”.


one-shot mode?


> In a mirror, the image is reversed left-to-right

lol


My first try (omitting chain of thought for brevity):

    When you remove the cup and the mirror, you will see tails. Here's the breakdown:

    Setup: The coin is inside an upside-down cup on a glass table. The cup blocks direct view of the coin from above and below (assuming the cup's base is opaque).
    Mirror Observation: A mirror is slid under the glass table, reflecting the underside of the coin (the side touching the table). If the mirror shows heads, this means the underside of the coin is heads.
    Coin Orientation: A coin has two sides: heads and tails. If the underside (facing the table) is heads, the top side (facing away from the table) must be tails.
    Result: Removing the cup reveals the coin on the table. Since the top side was tails all along, you will see tails.

    Key Insight: The mirror reveals the hidden underside of the coin, but the top side (visible after removing the cup) is its opposite. The mirror’s reflection does not alter the physical orientation of the coin’s sides—it simply reflects the existing state.

    Answer: Tails.


Huh, for me it said:

Answer: You will see the same side of the coin that you saw in the mirror — heads.

Why?

The glass table is transparent, so when you look at the coin from below (using a mirror), you're seeing the top side of the coin (the side currently facing up). Mirrors reverse front-to-back, not left-to-right. So the image is flipped in depth, but the orientation of the coin (heads or tails) remains clear. Since the coin hasn't moved during this process, removing the cup and mirror will reveal the exact same face of the coin that was visible via the mirror — which was heads.

Final Answer: You will see heads.


The question doesn't define which side you're going to look from at the end, so either looking down or up is valid.


While technically true, one could argue that because they described sliding a mirror underneath the table, they did this in order to be able to see from that particular angle, so the implication for any reasonable answer would be that they are looking down at the table.


The way the question is phrased, I'm definitely in the "ok, what's the trick, what do you want to hear" mode. I'm not surprised LLMs answer in a weird way with extra details like "this is how mirrors work, is this what you're confused about?"


Manus just answered this correctly for me in one crack:

Me: I put a coin in a cup and slam it upside-down on a glass table. I can't see the coin because the cup is over it. I slide a mirror under the table and see heads. What will I see if I take the cup (and the mirror) away?

Manus: This is a classic riddle! Let's think it through. You placed a coin in a cup, turned it upside down on a glass table, and then slid a mirror underneath the table. Looking into the mirror from below, you saw 'heads'. Since the mirror reflects the side of the coin facing downwards through the glass table, the 'heads' side must be facing down. Therefore, the side facing up, hidden inside the cup, must be 'tails'. When you remove the cup, you will see the side facing up, which is tails.


Except it's wrong about this being a classic riddle.


Yup, it flunked that one.

I also have a question that LLMs always got wrong until ChatGPT o3, and even then it has a hard time (I just tried it again and it needed to run code to work it out). Qwen3 failed, and every time I asked it to look again at its solution it would notice the error and try to solve it again, failing again:

> A man wants to cross a river, and he has a cabbage, a goat, a wolf and a lion. If he leaves the goat alone with the cabbage, the goat will eat it. If he leaves the wolf with the goat, the wolf will eat it. And if he leaves the lion with either the wolf or the goat, the lion will eat them. How can he cross the river?

I gave it a ton of opportunities to notice that the puzzle is unsolvable (with the assumption, which it makes, that this is a standard one-passenger puzzle, but if it had pointed out that I didn't say that I would also have been happy). I kept trying to get it to notice that it failed again and again in the same way and asking it to step back and think about the big picture, and each time it would confidently start again trying to solve it. Eventually I ran out of free messages.


4o with thinking:

By systematic (BFS) search of the entire 32-state space under these rules, one finds no path from the start state to the goal state that stays always safe. Thus the puzzle has no solution—there is no way for the man to ferry all four items across without at least one of them being eaten.
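
For what it's worth, that exhaustive check is tiny to reproduce. Here is a minimal BFS sketch of my own (not the model's code) over the 2^5 = 32 bank assignments, assuming the standard one-passenger boat; it confirms there is no safe schedule, and in fact no safe first trip even exists.

    from collections import deque

    ITEMS = ("cabbage", "goat", "wolf", "lion")
    # Pairs that must not be left together on a bank without the man.
    FORBIDDEN = [("goat", "cabbage"), ("wolf", "goat"), ("lion", "wolf"), ("lion", "goat")]

    def safe(state):
        return all(not (state[a] == state[b] != state["man"]) for a, b in FORBIDDEN)

    def moves(state):
        # The man crosses alone or with one item that is on his bank.
        for cargo in (None,) + ITEMS:
            if cargo and state[cargo] != state["man"]:
                continue
            nxt = dict(state)
            nxt["man"] ^= 1
            if cargo:
                nxt[cargo] ^= 1
            if safe(nxt):
                yield nxt

    start = {k: 0 for k in ("man",) + ITEMS}
    goal = {k: 1 for k in ("man",) + ITEMS}
    seen, queue = {tuple(start.values())}, deque([start])
    solvable = False
    while queue:
        state = queue.popleft()
        if state == goal:
            solvable = True
            break
        for nxt in moves(state):
            key = tuple(nxt.values())
            if key not in seen:
                seen.add(key)
                queue.append(nxt)
    print(solvable)  # False: every possible first trip leaves a forbidden pair behind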


You go with the cabbage, goat, wolf and lion all together!


O3 gave me basically that solution. "Below is the shortest safe schedule that really works ‒ but it assumes the boat can hold the man plus two passengers (three beings total). If your version of the puzzle only lets him move one passenger at a time, the puzzle has no solution: at the very first trip he would always leave at least one forbidden pair alone."


I tried Grok 3 with Think and it was also right, with pretty good thinking.


I don't have access to Think, but I tried Grok 3 regular, and it was hilarious, one of the longest answers I've ever seen.

Just giving the headings, without any of the long text between each one where it realizes it doesn't work, I get:

    Solution
        [... paragraphs of text omitted each time]
    Issue and Revision
    Revised Solution
    Final Solution
    Correct Sequence
    Final Working Solution
    Corrected Final Solution
    Final Correct Solution
    Successful Solution
    Final answer
    Correct Final Sequence
    Final Correct Solution
    Correct Solution
    Final Working Solution
    Correct Solution
    Final Answer
    Final Answer
Each time it's so confident that it's worked out the issue, and now, finally, it has the correct, final, working solution. Then it blows it again.

I'm surprised I didn't start seeing heading titles such as "Working solution-FINAL (3) revised updated ACTUAL-FINAL (2)"


I tried with the thinking option on and it gets into some networking errors; if you don't turn on thinking, it guesses the answer correctly.

> Summary:

- Mirror shows: *Heads* → That's the *bottom face* of the coin.
- So actual top face (visible when cup is removed): *Tails*

Final answer: *You will see tails.*


The only thing I don't like about this test is that I prefer test questions that don't have binary responses (e.g. heads or tails) - you can see from the responses in this thread that the LLMs' success rates are all over the map.


Yeah, same.

I had a more complicated prompt that failed much more reliably - instead of a mirror I had another person looking from below. But it had some issues where Claude would often want to refuse on ethical grounds, like I'm working out how to scam people or something, and many reasoning models would yammer on about whether or not the other person was lying to me. So I simplified to this.

I'd love another simple spatial reasoning problem that's very easy for humans but LLMs struggle with, which does NOT have a binary output.


I always feel that if you share a problem here where LLMs fail, it will end up in their training set and they won't fail on that problem anymore, which means future models will still have the same errors but you will have lost your ability to detect them.


My favorite part of the genre of “questions an LLM still can’t answer because they’re useless!” is all the people sharing results from different LLMs where they clearly answer the question correctly.


I use LLMs extensively and probably should not be bundled into that genre as I've never called LLMs useless.


I think it's pretty random. qwen3:4b got it correct once; on re-run it told me the coin is actually behind the mirror, and then did this brilliant maneuver:

  - The question is **not** asking for the location of the coin, but its **identity**.
  - The coin is simply a **coin**, and the trick is in the riddle's wording.

  ---

  ### Final Answer:

  $$
  \boxed{coin}
  $$


ChatGPT free gets it right without reasoning mode (still explained some steps) https://chatgpt.com/share/6810bc66-5e78-8001-b984-e4f71ee423...


Tried it with o1-pro:

> You’ll find that the actual face of the coin under the cup is tails. Seeing “heads” in the mirror from underneath indicates that, on top, the coin is really tails‑up.


Simple Claude 3.5 with no reasoning gets it right.


Neat, because 3.7 doesn't.


3.7 works for me too. Of course it's going to depend on temperature a bit.


They are all using these tests to determine their worth, but to be honest the tests don't translate well to real-world use.

For example, I tried Deepseek for code daily over a period of about two months (vs having used ChatGPT before), and its output was terrible. It would produce code with bugs, break existing code when making additions, totally fail at understanding what you're asking, etc.


Exactly. If I'm going to be solving bugs, I'd rather they be my own.


I don’t know about physics, but o3 was able to analyze a floor plan and spot ventilation and circulation issues that even my architect brother wasn’t able to spot in a single glance

Maybe it doesn’t make physicists redundant, but it’s definitely making expertise in more mundane domains way more accessible


My favorite test is "Build an MTG Arena Deck in historic format around <strategy_and_or_cards> in <these_colors>. It must be exactly 60 cards and all cards must be from Arena only. Search all sets/cards currently available on Arena, new and old".

Many times they’ll include cards that are only available in paper and/or go over the limit, and when asked to correct a mistake they'll continue to make mistakes. But recently I found that Claude is pretty damn good now at fixing its mistakes and building/optimizing decks for Arena. Asked it to make a deck based on insights it gained from my current decklist, and what it came up with was interesting and pretty fun to play.


This reads like a great story with a tragic ending!


Can you please share the problem?


I don't really want it added to the training set, but eh. Here you go:

> Assume I have a 3D printer that's currently printing, and I pause the print. What expends more energy, keeping the hotend at some temperature above room temperature and heating it up the rest of the way when I want to use it, or turning it completely off and then heat it all the way when I need it? Is there an amount of time beyond which the answer varies?

All LLMs I've tried get it wrong because they assume that the hotend cools down immediately when the heating stops, but they realize this when asked about it. Qwen didn't realize it, and gave the answer that 30 minutes of heating the hotend is better than turning it off and back on when needed.


What kind of answer do you expect? It all depends on the hotend shape and material, temperature differences, how fast air moves in the room, humidity of the air, etc.


Keeping something above room temperature will always use more energy than letting it cool down and heating it back up when needed


> It all depends on

No it doesn't.


Sounds like the LLM you used when writing this slop comment struggled with the problem too. :>


Qwen3-32b did it pretty accurately it seems. Calculated heat loss over time going to ambient temp, offered to keep it at standby 100C for short breaks under 10 minutes. Shut down completely for longer breaks.


The correct answer is that it's always better to turn it off, though.


Unless you care about warmup time. LLMs have a habit of throwing in common-sense assumptions that you didn’t tell it to, so you have to be careful of that.

It’s not a bug. Outside of logic puzzles that’s a very good thing.


No, warmup time doesn't change anything, I can simply factor it in.

It is a bug, because I asked it precisely what I wanted, and it gave the wrong answer. It didn't say anything about warmup time, it was just wrong.


Ah! This problem was given to me by my father-in-law in the form of operating pizza ovens in the Midwest during winter. It's a neat, practical one.


Some calculation around heat loss and required heat expenditure to reheat per material or something?


Yep, except they calculate heat loss and the energy required to keep heating in one case, but assume room temperature and the energy required to heat up from that in the other case, so they wildly overestimate one side of the problem.


Unless I'm missing something holding it hot is pure waste.


Maybe it will help to have a fluid analogy. You have a leaky bucket. What wastes more water, letting all the water leak out and then refilling it from scratch, or keeping it topped up? The answer depends on how bad the leak is vs how long you are required to maintain the bucket level. At least that’s how I interpret this puzzle.


Does it depend though?

The water (heat) leaking out is what you need to add back. As the water level drops (the hotend cools), the leaking slows. So any replenishing means more leakage, which you eventually pay for by adding more water (heat) in.


You can stipulate conditions to make the solution work out in either direction.

Suppose the bucket is the size of a lake, and the leak is so minuscule that it takes many centuries to detect any loss. And also I need to keep the bucket full for a microsecond. In this case it is better to keep the bucket full than to let it drain.

Now suppose the bucket is made out of chain-link and any water you put into it immediately falls out. The level is simply the amount of water that happens to be passing through at that moment. And also the next time I need the bucket full is after one century. Well in that case, it would be wasteful to be dumping water through this bucket for a century.


All heat that is lost must be replaced (we must input enough heat that the device returns to T_initial)

Hotter objects lose heat faster, so the longer we delay restoring temperature (for a fixed resume time) the less heat is lost that will need replacement.

Hotter objects require more energy to add another unit of heat, so the cooler we allow the device to get before re-heating (again, resume time is fixed) the more efficient our heating can be.

There is no countervailing effect to balance: preemptive heating of a device before the last possible moment is pure waste no matter the conditions (although the amount of waste will vary a lot, it will always be a positive number).

Even turning the heater off for a millisecond is a net gain.
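
A rough lumped-thermal-mass simulation makes the asymmetry concrete. All the numbers below are made-up assumptions (loss coefficient, heat capacity, heater power), not measurements of any real hotend, and the model ignores phase-change effects; but within it, holding temperature through a 30-minute pause always costs more than cooling and reheating, and the only price is the warm-up delay.

    # Lumped model: dT/dt = (P_heater - h * (T - T_room)) / C
    # All constants are assumed, illustrative values.
    h = 0.05          # W/K, loss coefficient to ambient
    C = 15.0          # J/K, heat capacity of the hotend block
    T_room, T_hot = 25.0, 210.0   # degrees C
    P_max = 40.0      # W, heater power
    pause = 30 * 60   # s, a 30-minute pause
    dt = 0.1          # s, Euler step

    # Strategy A: hold at T_hot, heater exactly cancels the losses.
    energy_hold = h * (T_hot - T_room) * pause

    # Strategy B: heater off for the pause, then full power back to T_hot.
    T, energy_reheat, warmup = T_hot, 0.0, 0.0
    for _ in range(int(pause / dt)):          # cooling, heater off
        T -= h * (T - T_room) / C * dt
    while T < T_hot:                          # reheat at full power
        T += (P_max - h * (T - T_room)) / C * dt
        energy_reheat += P_max * dt
        warmup += dt

    print(f"hold warm : {energy_hold / 1000:.1f} kJ")
    print(f"off+reheat: {energy_reheat / 1000:.1f} kJ (plus ~{warmup:.0f} s warm-up)")

With these made-up numbers it works out to roughly 17 kJ for holding warm versus roughly 3 kJ for switching off, with a warm-up of a bit over a minute.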


Does it depend on whether you know in advance _when_ you need it back at the hot temperature?

If you don't think ahead and simply switch the heater back on when you need it, then you need the heater on for _longer_.

That means you have to pay back the energy you lost, but also the energy you lose during the reheating process. Maybe that’s the countervailing effect?

> Hotter objects require more energy to add another unit of heat

Not sure about this. A unit of heat is a unit energy, right? Maybe you were thinking of entropy?


No, you should always wait until the last possible moment to refill the leaky bucket, because the less water in the bucket, the slower it leaks, due to reduced pressure.


Allowing it to cool below the phase transition point of the melted plastic will cause it to release latent heat, so there is a theoretically possible corner case where maintaining it hot saves energy. I suspect that you are unlikely to hit this corner case, though I am too lazy to crunch the numbers in this comment.


Don't worry, it is really tricky for training.


Same experience with my personal benchmarks. Generally unimpressed with Qwen3.


I do the same with a small math problem and so far only Qwen3 got it right (tested all thinking models). So your mileage may vary, as they say!


I was expecting a different outcome: that you'd tell us Qwen3 nailed it at first.


Qwen3-235B-A22B?


Yep, on Qwen chat.


Show prompt


Hi, I'm starting an evals company, would love to have you as an advisor!


Not OP, but what exactly do I need to do?

I'll do it for cheap if you'll let me work remotely from outside the States.


I believe they're kidding, playing on "my singular question isn't answered correctly"


Absolutely, please reach out! arthur@distributed.systems



