
Could you give an example of something we recently solved that was considered an unsolvable problem six months beforehand? I don't have any specific examples, but it seems like most of the huge breakthrough discoveries I've seen announced end up being overstated, and for practical usage the LLM-driven tools available to us are only marginally better than they were a couple of years ago. It seems like the preponderance of practical advancement in recent times has come from tooling/interface improvements rather than from miracles generated by the models themselves. But it could be that I just don't have the right use cases.


Take a look at the ARC Prize, which is a test for achieving "AGI" created in 2019 by François Chollet. Scroll down halfway on the home page and ponder the steep yellow line on the graph. That's what OpenAI o3 recently achieved.

[0] https://arcprize.org/

[1] https://arcprize.org/blog/oai-o3-pub-breakthrough


Reviewing the actual problems is highly recommended: https://kts.github.io/arc-viewer/

They're not particularly difficult, but clearly require reasoning to solve.
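
If you want to poke at them programmatically, each task in the public repo is just a small JSON file: a few "train" demonstration pairs plus "test" pairs, with grids as lists of lists of color indices 0-9. A rough sketch in Python of loading one (the file name here is just illustrative):

    import json

    # Each ARC task file has "train" demonstration pairs and "test" pairs.
    # Grids are lists of lists of ints 0-9; each int maps to a color.
    with open("some_task.json") as f:   # illustrative path, not a real file
        task = json.load(f)

    for pair in task["train"]:
        print("input: ", pair["input"])
        print("output:", pair["output"])

    # A solver has to infer the transformation from the train pairs
    # and produce the output grid for each test input.
    test_input = task["test"][0]["input"]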


Unless you train directly on those problems... in which case, how could you even design a test that stands up to training directly against the answer sheet?


That's why they keep the evaluation set private: "Submit a solution which scores 85% on the ARC-AGI private evaluation set and win $600K."

[0] https://arcprize.org/guide


So we're only 12% from AGI?

I'm dubious tbh. Given we still can't simulate a nematode.


ARC creator François Chollet says: https://bsky.app/profile/fchollet.bsky.social/post/3les3izgd...

I don't think people really appreciate how simple ARC-AGI-1 was, and what solving it really means.

It was designed as the simplest, most basic assessment of fluid intelligence possible. Failure to pass signifies a near-total inability to adapt or problem-solve in unfamiliar situations.

Passing it means your system exhibits non-zero fluid intelligence -- you're finally looking at something that isn't pure memorized skill. But it says rather little about how intelligent your system is, or how close to human intelligence it is.


> designed as the simplest, most basic assessment of fluid intelligence possible.

This was the goal, but that doesn't say what the test itself is. Try to get a human to solve these problems without their visual cortex; they couldn't do it. Stating your goal for a thing doesn't make the thing that goal.

AI researchers designing intelligence tests are like programmers designing their own cryptography.

How about we have people skilled in neuropsychology, psychometrics and cognitive psychology do what they are good at.


> How about we have people skilled in neuropsychology, psychometrics and cognitive psychology do what they are good at.

Disagree. The thing that we will eventually call AGI will not be human. No need to have human-specific evaluations unless you’re aiming for an artificial human and not just an artificial intelligence.


But why ignore a huge body of research in how to write scientific tests of intelligence and cognition?

Smells like linear algebra exceptionalism.

Is ARC AGI really the, "simplest, most basic assessment of fluid intelligence possible" ?


> But why ignore a huge body of research in how to write scientific tests of intelligence and cognition?

Not saying to ignore it, but we are not dealing with humans. Those tests may give misleading results as you're proposing to use them outside of their design envelope. This is an area of research in itself.


That's why I put "AGI" in quotes. The point is that six months ago, no one expected an LLM to score this well.


Fair enough.


Yes, the 12% impact is significant, especially in a societal context, because it represents a shift in how people access and process information. Even without AGI, the comparison between LLMs and search engines is crucial. LLMs provide synthesized, conversational responses rather than just indexing and ranking web pages. This shift reduces the need for users to evaluate multiple sources manually, which has far-reaching implications.


> OpenAI's new o3 system - trained on the ARC-AGI-1 Public Training set…

Sounds fishy to me


That's the purpose of the "public training set." You don't take an exam before reviewing the instructions on how to fill out the answer sheet.


You would think the training set for these models already included enough Mensa-style IQ tests that the model would know how to do this kind of test. It takes humans two, or at most three, examples to "get" what the test is asking for, and then they can start filling in the answers to the actual questions. Meanwhile it takes at least hundreds of examples (in the public set) to train o3 to do this test.

The need for a huge training set to solve simple questions never stops bewildering me. I think that to get a human-like intelligent model we need to figure out why humans learn from two examples and the models don't. But I don't mean to say that the current models aren't intelligent in their own way or aren't useful already.


Human intelligence is bootstrapped by biological evolution and by society, neither of which is fast or efficient. The truly individual part of intelligence is tiny; it's vastly overrated and relies on those two. Similarly, LLMs perform in-context learning, which is much more efficient because it relies on pre-baked knowledge. Yes, the generalization ability is still incomparable to humans', but it's entirely possible that much better ability is achievable by slowly bootstrapping it.
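
For what it's worth, "in-context learning" here just means putting the demonstration pairs into the prompt and letting a pretrained model infer the rule, with no weight updates. A minimal sketch of the idea, where complete() is a placeholder for whatever model you'd call, not a real API:

    # In-context learning: show a few input -> output demonstrations in the
    # prompt itself, then ask the model to continue the pattern.
    demos = [
        ("[[1, 0], [0, 1]]", "[[0, 1], [1, 0]]"),
        ("[[2, 2], [0, 0]]", "[[0, 0], [2, 2]]"),
    ]

    prompt = "Infer the transformation, then apply it to the last input.\n"
    for x, y in demos:
        prompt += f"input: {x}\noutput: {y}\n"
    prompt += "input: [[3, 0], [0, 3]]\noutput:"

    # answer = complete(prompt)  # placeholder, not a real API call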


If you train a model on tasks in a similar format it's even less indicative of "AGI".


Not quite what you asked for, but it seems tangentially related and you might find it interesting: https://r0bk.github.io/killedbyllm/


Would be interesting to have a list of startups killed by ChatGPT as well.


Completely disagree… there are a crazy number of cases that didn't work until the models scaled to a point where they magically did.

The best example I can think of is the ARC-AGI benchmark. It was meant to measure human-like intelligence through spatial symmetries and abstract patterns.

From GPT-2 to GPT-4 there was basically no progress, then o1 got about 20%. Now o3 has basically solved the benchmark.


I guess what I'm probably not seeing from my vantage point is how that translates into a better experience with the tools available. I just cancelled a ChatGPT Plus subscription because it didn't seem useful enough to justify the price. I absolutely understand that there are people for whom it is, but nearly everyone I see talking a lot about the value of AI either has use cases I don't care about, such as automated "content" generation or high-volume, lowish-skill code generation, or sees achieving a progressively more difficult set of benchmarks as a useful end in itself.

I like Copilot autocomplete when I'm coding, but the quality of that hasn't dramatically changed. I don't give a damn about benchmarks -- I only care what I get from it practically. I have absolutely no interest in using ChatGPT as a therapist or companion because I value human connection and have access to it. So far I simply don't see significant changes in what comes out vs what gets typed in for practical usage. I wouldn't give ChatGPT logic problems to solve, except maybe for generating code, because I know code well enough to quickly evaluate its output. If the caveat is "hey, FYI, this thing might hide some frustratingly plausible-looking bullshit in the answer, so double-check its work," then what good is it really for hard problems if you just have to re-do them anyway?

The same thing is true with image generation. Sure, it's better in ways that are sort-of meaningful for low-value professional or hobby usage, but it's barely budged the barriers to becoming good enough for high-end media production.

I totally believe that this technology is improving, and when you look at it in isolation those improvements seem meaningful. But I just don't see that translating yet into things most of the general public can sink their teeth into. Between the (still) shitty Google search "enhancements," users being forced into AI-driven chat workflows, and big, loud, not-really-useful UI elements dedicated to AI features, in some ways they've made people's experience of using computers meaningfully worse.

Just like with Mastodon, I see a huge disconnect between the tech crowd's excitement about what's happening with the technology and how that ends up working for users who need to actually solve their problems with it.


The performance of OpenAI o3 on the ARC-AGI challenge fits the bill; however, the model has not been released publicly.



