> "According to the laws of thermodynamics, all that exists does so solely to consume, destroy and extinguish, and in this way to accelerate the slide toward cosmic obliteration"
Bit of a leap from "heat cannot, of itself, pass from one body to a hotter body"?
Also it misses the most important thing: thermodynamics is what lets our complicated processes exist in the first place. The mental model I use is that when you first drop food coloring into water there is low entropy. After an hour it is mixed—high entropy. But in the middle is when you get the complicated swirling structures. In the cosmic sense those swirls represent stars and galaxies and life.
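To make the entropy half of that picture concrete, here's a tiny, purely illustrative simulation: a single drop of dye diffusing on a grid, with the Shannon entropy of the concentration field climbing from near zero toward its maximum. (It only shows the low-to-high entropy part; it makes no attempt to quantify the interesting swirls in between.)

```python
# Toy sketch of the food-colouring picture: a drop of dye diffusing on a 2D
# grid. The Shannon entropy of the (normalised) concentration field starts
# near zero and climbs toward its maximum of log(N*N) as the dye mixes.
import numpy as np

N = 64
c = np.zeros((N, N))
c[N // 2, N // 2] = 1.0  # all the dye starts in one cell: low entropy

def entropy(field):
    p = field / field.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

for step in range(2001):
    if step % 500 == 0:
        print(f"step {step:4d}  entropy {entropy(c):6.3f}  (max {np.log(N * N):.3f})")
    # explicit diffusion step with periodic boundaries (stable for rate <= 0.25)
    c = c + 0.2 * (np.roll(c, 1, 0) + np.roll(c, -1, 0)
                   + np.roll(c, 1, 1) + np.roll(c, -1, 1) - 4 * c)
```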
This work does have some very interesting ideas, specifically avoiding the costs of backpropagation through time.
However, it does not appear to have been peer reviewed.
The results section is odd. It does not include details of how they performed the assessments, and the only numerical values are in the figure on the front page. The results for ARC2 are (contrary to what that figure suggests) not top of the leaderboard (the current leader is at 19%, compared to HRM's 5%: https://www.kaggle.com/competitions/arc-prize-2025/leaderboa...)
In fields like AI/ML, I'll take a preprint with working code over peer-reviewed work without any code, always, even when the preprint isn't well edited.
Everyone everywhere can review a preprint and its published code, instead of a tiny number of hand-chosen reviewers who are often overworked, underpaid, and on tight schedules.
If the authors' claims hold up, the work will gain recognition. If the claims don't hold up, the work will eventually be ignored. Credentials are basically irrelevant.
Think of it as open-source, distributed, global review. It may be messy and ad-hoc, since no one is in charge, but it works much better than traditional peer review!
I sympathize partially with your views, but how would this work in practice? Where would the review comments be stored? Is one supposed to browse Hacker News to check the validity of a paper?
If a professional reviewer spots a serious problem, the paper will not make it to a conference or journal, saving us a lot of trouble.
Peer review is a way to distribute the work of identifying which papers are potentially worth reading. If you're starting from an individual paper and then ask yourself whether it was peer reviewed or not, you're doing it wrong. If you really need to know, read it yourself and accept that you might just be wasting your time.
If you want to mostly read papers that have already been reviewed, start with people or organizations you trust to review papers in an area you're interested in and read what they recommend. That could be on a personal blog or through publishing a traditional journal, the difference doesn't matter much.
“Find papers that support what you want via online echo chambers” isn’t the advice you want to be giving, but it is the net result of it. Society needs trusted institutions. Not that publishers are the best embodiment of that, but ad-hoc blog posts are decidedly not better.
It is totally the advice I want to be giving. Given the choice between an echo chamber matched to my interests and wading through a stream of unfiltered crap, I'll take the echo chamber every time. (Of course there's also the option of not reading papers at all, which is typically a good choice if you're not a subject matter expert and don't intend to put in the work to become one.)
If you choose to focus on the output of a well-known publisher, you're not avoiding echo chambers, you're using a heuristic to hopefully identify a good one.
Those are not the only options; the parent mentioned 'trusted institutions'. That is the best way to defer the filtering to a group of other humans whose collective expertise will surpass any one individual's.
The destruction of trust in both public and private institutions - newspapers, journals, research institutions, universities - and replacement with social media 'influencers' and online echo chambers is how we arrived at the current chaotic state of politics worldwide, the rise of extremist groups, cults, a resurgence of nationalism, religious fanaticism... This is terrible advice.
Your question is like asking "how can I verify this rod is 1m long if I can't ask an expert". The answer is of course you measure it. That's much more reliable than asking an expert. However, the results of many papers take a huge amount of work to replicate, so we've built a network of experts over the years to evaluate them.
But this is open source, so TL;DR: you download the code, run it, and see if it gets the results claimed.
Have you tried that with the code repository that we're discussing here? It took this trained professional over an hour to get started, and then I gave up. It would take an additional 24 hours and quite some hardware simply to reproduce the results, and then probably a few weeks to actually understand what is going on. All in all, not very practical.
Scepticism is generally a good idea with ML papers. Once you start publishing regularly in ML conferences, you understand that there is no traditional form of peer review anymore in this domain. The volume of papers means that 'peers' are often students coming to grips with parts of the field that rarely align with what they are asked to review. Conference peer review has become a 'vibe check' more than anything.
Real peer review is when other experts independently verify your claims in the arXiv submission through implementation and (hopefully) cite you in their followup work. This thread is real peer review.
I appreciate this insight. It makes you wonder: why even publish a paper if review only amounts to a vibe check? If it's just the code we need, we can get that peer reviewed through other channels.
Skepticism is best expressed by repeating the experiment and comparing results. I'm game and I have 10 days off work next month. I wonder what can be had in terms of full source and data, etc. from the authors?
I think that’s too harsh a position solely for not being peer reviewed yet. Neither of the original Mamba 1 and Mamba 2 papers was peer reviewed. That said, strong claims warrant strong proofs, and I’m also trying to reproduce the results locally.
Do you consider yourself a peer? Feel free to review it.
A peer reviewer will typically comment that some figures are unclear, that a few relevant prior works have gone uncited, or point out a followup experiment that they should do.
That's about the extent of what peer reviewers do, and basically what you did yourself.
The fact that you are expecting a paper just published to have been peer reviewed already tells me that you are likely not familiar with the process. The first step to have your work peer reviewed is to publish it.
Enough already. Please. The paper + code is here for everybody to read and test. Either it works or it doesn't. Either people will build upon it or they won't. I don't need to wait 20 months for 3 anonymous dudes to figure it out.
> However, it does not appear to have been peer reviewed.
My observation is that peer reviewers never try to reproduce results or do a basic code audit, for example to check that there is no data leak from the test set into the training set.
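That kind of basic leak audit is cheap to sketch. Something like the following (the file names here are placeholders, not the repo's) just flags any test example that also appears verbatim in the training set:

```python
# Minimal train/test leakage check: flag any test example whose canonical
# serialisation also appears verbatim in the training set.
# "train.json" / "test.json" are placeholder file names, not the repo's.
import hashlib
import json

def fingerprint(example) -> str:
    # Hash a canonical serialisation so dict key ordering doesn't matter.
    return hashlib.sha256(json.dumps(example, sort_keys=True).encode()).hexdigest()

with open("train.json") as f:
    train_hashes = {fingerprint(ex) for ex in json.load(f)}

with open("test.json") as f:
    leaked = [i for i, ex in enumerate(json.load(f)) if fingerprint(ex) in train_hashes]

print(f"{len(leaked)} test examples also appear verbatim in the training set: {leaked[:10]}")
```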
Skepticism is an understatement. There are tons of issues with this paper. Why are they comparing results of their expert model, trained from scratch on a single task, to general purpose reasoning models? It is well established in the literature that you can still beat general purpose LLMs on narrow domain tasks with specially trained, small models. The only comparison that would have made sense is one to vanilla transformers using the same number of parameters and trained on the same input-output dataset. But the paper shows no such comparison. In fact, I would be surprised if it was significantly better, because such architecture improvements are usually very modest or not applicable in general. And insinuating that this is some significant development toward general purpose AI by throwing in ARC is just straight up dishonest. I could probably cook up a neural net in PyTorch in a few minutes that beats o3 on some hand-crafted single task it can't solve in an hour. That doesn't mean that I made any progress towards AGI.
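For what it's worth, the parameter-matched vanilla-transformer baseline suggested above is easy to set up in PyTorch. The sizes below are placeholders to be tuned until the count matches the model under test, not numbers from the paper:

```python
# Sketch of a parameter-matched vanilla transformer baseline. The sizes are
# placeholders (not from the paper); tune d_model / num_layers until
# count_params() matches the model you want to compare against.
import torch.nn as nn

def count_params(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

baseline = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048,
                               batch_first=True),
    num_layers=6,
)
print(f"baseline parameters: {count_params(baseline):,}")
```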
Have you spent much time with the ARC-1 challenge? Their results on that are extremely compelling, showing performance close to the initial competition's SOTA (as of closing, anyway) with a tiny model and none of the hacks like data augmentation, pretraining, etc. that all of the winning approaches leaned on heavily.
Your criticism makes sense for the maze solving and sudoku sets, of course, but I think it kinda misses the point (there are traditional algos that solve those just fine - it's more about the ability of neural nets to figure them out during training, and known issues with existing recurrent architectures).
Looking at the code, there is a lot of data augmentation going on there. For the Sudoku and ARC data sets, they augment every example by a factor of 1,000.
That's fair, they are relabelling colours and rotating the boards. I meant more like mass generation of novel puzzles to try and train specific patterns. But you are right that technically there is some augmentation going on here, my bad.
Hm, I'm not so sure it's fair play for the Sudoku puzzle. Suggesting that the AI will understand the rules of the game with only 1,000 examples, and then adding 1,000,000 derived examples does not feel fair to me. Those extra examples leak a lot of information about the rules of the game.
I'm not too familiar with the ARC data set, so I can't comment on that.
True, it leaks information about all the symmetries of the puzzle, but that's about it. I guess someone needs to test how much that actually helps - if I get the model running I'll give it a try!
> That's fair, they are relabelling colours and rotating the boards.
Relabelling colours is photometric augmentation; rotating the boards is geometric augmentation.
> I meant more like mass generation of novel puzzles to try and train specific patterns.
What is the difference between Synthetic Data Generation and Self Play (like AlphaZero)? Don't self play simulations generate synthetic training data as compared to real observations?
I don't know the jargon, but for me the main thing is the distinction between humans injecting additional bits of information into the training set vs the algorithm itself discovering those bits of information. So self-play is very interesting (it's automated as part of the algorithm) but stuff like generating tons of novel sudoku puzzles and adding them to the training set is less interesting (the information is being fed into the training set "out-of-band", so to speak).
In this case I was wrong, the authors are clearly adding bits of information themselves by augmenting the dataset with symmetries (I propose "symmetry augmentation" as a much more sensible phrase for this =P). Since symmetries share a lot of mutual information with each other, I don't think this is nearly as much of a crutch as adding novel data points into the mix before training, but ideally no augmentation would be needed.
I guess you could argue that in some sense it's fair play - when humans are told the rules of sudoku the symmetry is implicit, but here the AI is only really "aware" of the gradient.
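To make "symmetry augmentation" concrete, here's roughly what colour relabelling (photometric) plus rotations and flips (geometric) looks like on an ARC-style integer grid. The function and variable names are mine for illustration, not the ones used in the HRM code:

```python
# Illustrative "symmetry augmentation" for an ARC-style grid of colour indices:
# permute the 10 colour labels (photometric) and apply a random 90-degree
# rotation plus optional flip (geometric). Names are illustrative only.
import numpy as np

def augment(grid: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    perm = rng.permutation(10)                   # photometric: relabel colours 0..9
    out = perm[grid]
    out = np.rot90(out, k=int(rng.integers(4)))  # geometric: rotate 0/90/180/270 degrees
    if rng.integers(2):
        out = np.fliplr(out)                     # geometric: optional mirror
    return out

rng = np.random.default_rng(0)
grid = rng.integers(0, 10, size=(5, 5))
variants = [augment(grid, rng) for _ in range(1000)]  # ~1,000x augmentation factor
```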
Traditional computer vision (CV) research has perhaps been supplanted by multimodal LLMs that are trained on image analysis annotations. (CLIP, DALL-E, and the Brownian-motion-based Latent Diffusion were all published in 2021. More recent research: Brownian bridges, SDEs, Lévy processes. What are the foundational papers in video genAI?)
TOPS (tera-operations per second) are now necessary.
I suspect that existing CV algos for feature extraction would also be useful for training LLMs. OpenCV, for example, has open algorithms like ORB (Oriented FAST and Rotated BRIEF), KAZE and AKAZE, and, since its patent expired in 2020, SIFT. SIFT "is highly robust to rotation, scale, and illumination changes".
But do existing CV feature extraction and transform algos produce useful training data for LLMs as-is?
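As a rough sketch of what those extractors emit "as-is" (keypoints plus binary or float descriptors, which would still have to be serialised into tokens before an LLM could train on them; the image path is a placeholder):

```python
# Quick look at what classic OpenCV feature extractors actually produce:
# keypoints (x, y, scale, angle) and descriptor vectors. Whether this raw
# output is useful LLM training data as-is is exactly the open question above.
import cv2

img = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder image path

orb = cv2.ORB_create(nfeatures=500)
kp_orb, des_orb = orb.detectAndCompute(img, None)      # binary 32-byte descriptors

sift = cv2.SIFT_create()                               # patent-free since 2020
kp_sift, des_sift = sift.detectAndCompute(img, None)   # 128-dim float descriptors

print(f"ORB:  {len(kp_orb)} keypoints, descriptor shape {des_orb.shape}")
print(f"SIFT: {len(kp_sift)} keypoints, descriptor shape {des_sift.shape}")
```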
Similarly, pairing code and tests with a feature transform at training time probably yields better solutions to SWE-bench.
Self-play algos are given the rules of the sim. Are self-play simulations already used as synthetic training data for LLMs and SLMs?
There are effectively rules for generating synthetic training data.
The orbits of the planets might be a good example of where synthetic training data is limited and perhaps we should rely upon real observations at different scales given cost of experimentation and confirmations of scale invariance.
Extrapolations from orbital observations and classical mechanics failed to predict the perihelion precession of Mercury (the first confirmation of general relativity).
Generating synthetic training data from orbital observations while disregarding Mercury's 43-arcseconds-per-century deviation from Newtonian mechanics as an outlier would produce a model overweighted by the existing biases in the real observations.
As the other commenter already pointed out, I'll believe it when I see it on the leaderboard. But even then it already lost twice against the winner of last year's competition, because that too was a general purpose LLM that could also do other things.
Let's not move the goalposts here =) I don't think it's really fair to compare them directly like that. But I agree, this is triggering my "too good to be true" reflex very hard.
I think you are right that once the exercise becomes hunting for a scapegoat it's pointless.
However, it can be a way for everyone to understand the system better. The goal should be making each of the dominoes less likely to fall. Doing so can simplify rather than add complexity.