I am still reading the paper, but it is worth noting that this is not an LLM! It is closer to something like AlphaGo, trained only on ARC, Sudoku, and mazes. I am skeptical that you could add a bunch of science facts and programming examples without degrading the performance on ARC etc.; frankly, it is completely unclear to me how you would make this architecture into a chatbot, period, but I haven't thought about it very much.
Comparing the maze/Sudoku results to LLMs rather than to maze/Sudoku-specific AIs strikes me as blatantly dishonest. "1k Sudoku training examples" is also dishonest; they generate about a million of them with permutations: https://news.ycombinator.com/item?id=44701264 (see also https://github.com/sapientinc/HRM/blob/main/dataset/build_su...).

And they seem to have deleted the Sudoku training data! Or maybe they made it private. It used to be here: https://github.com/imone, and according to the Git history[1] they moved it here: https://github.com/sapientinc, but I cannot find it. Might be an innocent mistake; I suspect they got called out for lying about "1000 samples" and are covering their tracks.
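
To be concrete about what "permutations" buys you, here is a minimal sketch (my own, not their build script; the function name and the exact set of transforms are assumptions) of the standard validity-preserving Sudoku symmetries that can turn ~1k base puzzles into on the order of a million training samples:

    # Hypothetical augmentation sketch, not the repo's actual dataset code.
    import random
    import numpy as np

    def augment(puzzle, solution, rng):
        """Return one randomly permuted (puzzle, solution) pair.
        Grids are 9x9 int arrays; 0 marks a blank cell in the puzzle.
        Every transform below maps a valid Sudoku to another valid Sudoku."""
        p, s = puzzle.copy(), solution.copy()

        # 1. Relabel digits 1..9 with a random permutation (9! = 362,880 choices).
        #    Index 0 stays 0 so blank cells are preserved.
        relabel = np.array([0] + rng.sample(range(1, 10), 9))
        p, s = relabel[p], relabel[s]

        # 2. Shuffle the three horizontal bands, and the rows within each band.
        rows = [b * 3 + r for b in rng.sample(range(3), 3)
                          for r in rng.sample(range(3), 3)]
        p, s = p[rows], s[rows]

        # 3. Same for the vertical stacks and the columns within each stack.
        cols = [b * 3 + c for b in rng.sample(range(3), 3)
                          for c in rng.sample(range(3), 3)]
        p, s = p[:, cols], s[:, cols]

        # 4. Optionally transpose the whole grid.
        if rng.random() < 0.5:
            p, s = p.T, s.T
        return p, s

    # ~1,000 base puzzles x ~1,000 random augmentations each ≈ 1M samples.

Digit relabeling alone gives 9! variants per puzzle, and the band/row/column shuffles plus transposition multiply that further, so drawing ~1,000 augmented copies per base puzzle is trivial. That is why "1k examples" and "about a million training samples" can both be technically true while the former is misleading.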