
There is a hidden test set with new puzzle types not seen in the open part. It's designed so that humans do well and AI models have a hard time.


"Designed" is not right. What gives "AI models" (i.e. deep neural nets) a hard time is that there are very few examples in the public training and evaluation set: each task has three examples. So basically it's not a test of intelligence but a test of sample efficiency.

Besides which, it is unfair because it excludes an entire category of systems, not to mention a dominant one. If F. Chollet really believes ARC is a test of intelligence, then why not provide enough examples for deep nets or some other big data approach to be trained effectively? The answer is: because a big data approach would then easily beat the test. But if the test can be beaten without intelligence, just with data, then it's not a test of intelligence.

My guess for a long time has been that ARC will fall just like the Winograd Schema challenge (WSC) [1] fell: someone will do the work to generate enough (tens of thousands) examples of ARC-like tasks, then train a deep neural net and go to town. That's what happened with the WSC. A large dataset of Winograd schema sentences was crowd-sourced and a big BERT-era Transformer got around 90% accuracy on the WSC [2]. Bye bye WSC, and any wishful thinking about Winograd schemas requiring human intuition and other undefined stuff.
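
(To make that concrete: below is a toy sketch, in Python, of what a synthetic "ARC-like" task generator could look like. The transformation family, grid sizes and everything else here are invented purely for illustration; the point is only that once you can mass-produce tasks like this, you are back in big-data territory.)

  import random

  # Toy "ARC-like" task generator. Each task hides a rule (here: a random
  # colour permutation) and demonstrates it on a few input/output grids.
  # Everything below is invented purely for illustration.

  COLOURS = list(range(10))  # ARC grids use 10 colours

  def random_grid(h, w, n_colours=4):
      palette = random.sample(COLOURS, n_colours)
      return [[random.choice(palette) for _ in range(w)] for _ in range(h)]

  def make_task(n_examples=3):
      perm = dict(zip(COLOURS, random.sample(COLOURS, len(COLOURS))))
      transform = lambda g: [[perm[c] for c in row] for row in g]
      train = []
      for _ in range(n_examples):
          g = random_grid(random.randint(3, 8), random.randint(3, 8))
          train.append({"input": g, "output": transform(g)})
      test = random_grid(random.randint(3, 8), random.randint(3, 8))
      return {"train": train, "test": {"input": test, "output": transform(test)}}

  # Mass-produce tasks and you have a supervised dataset a neural net can
  # be trained on -- the same move that was used against the WSC.
  dataset = [make_task() for _ in range(50_000)]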

Or, ARC might go the way of the Bongard Problems [3]: the original 100 problems by Bongard still stand unsolved, but the machine learning community has effectively sidestepped them. Someone made a generator of Bongard-like problems [4], and while this was not enough to solve the original problems, everyone simply switched to training CNNs and reporting results on the new dataset [5].

We basically have no idea how to create a test for intelligence that computers cannot beat by brute force or big data approaches, so we have no effective way to test computers for (artificial) intelligence. The only thing we know humans can do that computers can't is identify undecidable problems (liar-paradox statements of the form "this sentence is false", the kind of self-reference exploited in Gödel's incompleteness theorems). Unfortunately we already know there is no computer that can ever do that, and even if we observe, say, ChatGPT returning the right answer, we can be sure it has only memorised it, not calculated it, so we're a bit stuck. ARC won't get us unstuck in any way, shape or form, and so it's just a distraction.
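
(For anyone who wants the standard argument spelled out, here is the usual diagonalization sketch, written as Python purely for concreteness. `halts` is a hypothetical oracle, not anything that can actually be implemented; that impossibility is the whole point.)

  # Sketch of the classic diagonalization argument. `halts` is a
  # *hypothetical* oracle assumed, for contradiction, to decide whether
  # a program halts on a given input.

  def halts(program_source: str, argument: str) -> bool:
      # Assumed, for the sake of contradiction, to always return the
      # right answer. No such function can actually be written.
      raise NotImplementedError

  def troublemaker(program_source: str) -> None:
      # Do the opposite of whatever the oracle predicts about running
      # a program on its own source code.
      if halts(program_source, program_source):
          while True:
              pass          # predicted to halt, so loop forever
      # predicted to loop forever, so halt immediately

  # Feeding troublemaker its own source yields a contradiction either way,
  # so `halts` cannot exist: no algorithm decides halting, even though we
  # can recognise the construction and see why it is undecidable.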

_____________________

[1] https://en.wikipedia.org/wiki/Winograd_schema_challenge

[2] WinoGrande: An Adversarial Winograd Schema Challenge at Scale

https://arxiv.org/abs/1907.10641

Although note the results are interpreted to mean LLMs are more or less memorising answers, which is right of course.

[3] Index of Bongard Problems

https://www.foundalis.com/res/bps/bpidx.htm

[4] Comparing machines and humans on a visual categorization test

https://www.pnas.org/doi/abs/10.1073/pnas.1109168108

[5] 25 years of CNNs: Can we compare to human abstraction capabilities?

https://arxiv.org/abs/1607.08366


> "Designed" is not right. What gives "AI models" (i.e. deep neural nets) a hard time is that there are very few examples in the public training and evaluation set

No, he actually made a list of cognitive skills humans have and is targeting them in the benchmark. The list of "Core Knowledge Priors" contains Object cohesion, Object persistence, Object influence via contact, Goal-directedness, Numbers and counting, and Basic geometry and topology. The dataset is built to be easy for humans to solve, but it targets areas that are hard for AI.

> "A typical human can solve most of the ARC evaluation set without any practice or verbal explanations. Crucially, to the best of our knowledge, ARC does not appear to be approachable by any existing machine learning technique (including Deep Learning), due to its focus on broad generalization and few-shot learning, as well as the fact that the evaluation set only features tasks that do not appear in the training set."


Thanks, I know about the core knowledge priors, and François Chollet's claims about them (I've read his white paper, although it was long and long-winded, and I don't remember most of it). The empirical observation, however, is that none of the systems with positive performance on ARC, on Kaggle or the new leaderboard, have anything to do with core knowledge priors. Which means core knowledge priors are not needed to solve any of the so-far solved ARC tasks.

I think Chollet is making a syllogistic error:

  a) Humans have core knowledge priors and can solve ARC tasks
  b) Some machine X can solve ARC tasks
  c) Therefore machine X has core knowledge priors
That doesn't follow; and, like I say, it is refuted by empirical observation to boot. This is particularly so for his claim that ARC "does not appear approachable" (what?) by deep learning: there are plenty of neural-net-based systems on the ARC-AGI leaderboard.
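
(Spelled out, and reading (a) as charitably as possible, i.e. as "having the core knowledge priors is sufficient to solve ARC tasks", the inference is just affirming the consequent:)

  \forall x\,\big(\mathrm{Priors}(x) \rightarrow \mathrm{SolvesARC}(x)\big),
  \quad \mathrm{SolvesARC}(X) \;\;\not\models\;\; \mathrm{Priors}(X)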

There's also no reason to assume that core knowledge priors present any particular difficulty to computers (i.e. that they're "hard for AI"). The problem seems to be more with the ability of humans to formalise them precisely enough to be programmed into a computer. That's not a computer problem, it's a human problem. But that's common in AI. For example, we don't know how to hand-code an image classifier, but we can train very accurate ones with deep neural nets. That doesn't mean computers aren't good at image classification: they are, and CNNs are the proof. It's humans who suck at coding it. Except nobody insists on image classification datasets with only three or four training examples per class, so it was possible to develop those powerful deep neural net classifiers. Chollet's choice to allow only a handful of training examples creates an artificial data bottleneck that restricts nobody in the real world, so it tells us nothing about the true capabilities of deep neural nets.

Cthulhu. I never imagined I'd end up defending deep neural nets...

I have to say this: Chollet annoys me mightily. Every time I hear him speak, he makes gigantic statements about what intelligence is, and how to create it artificially, as if he knows what tens of thousands of researchers in biology, cognitive science, psychology, neuroscience, AI, and who knows what other field, don't. That is despite the fact that he has created just as many intelligent machines as everyone else so far, which is to say: zero. Where that self-confidence comes from, I have no idea, but the results on his "AIQ test" indicate that he, just like everyone else, has no clue what intelligence is, yet he persists with absurd self-assurance. Insufferable arrogance.

Apologies for the rant.


> My guess for a long time has been that ARC will fall just like the Winograd Schema challenge (WSC) fell: someone will do the work to generate enough (tens of thousands) examples of ARC-like tasks, then train a deep neural net and go to town.

I think that this would be the real AGI (or even superintelligence) hurdle: having an essentially metacognitive AI recognise that what it has been given is a novel problem, use the given examples to automatically generate synthetic data sets, and then train itself (or a subordinate model) on them to gain the skill of solving this general type of problem.
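
(Something like the loop below, purely as a structural sketch. Every name here is a placeholder I made up; the hard part is of course that nobody knows how to build the stubs.)

  # Hypothetical outline of that "metacognitive" loop. None of these
  # capabilities exist; the stubs are placeholders for things we do not
  # know how to implement in general.

  def looks_novel(task, model):
      raise NotImplementedError("the metacognition step nobody knows how to build")

  def synthesize_variants(examples):
      raise NotImplementedError("automatic generation of similar problems")

  def finetune(model, synthetic_tasks):
      raise NotImplementedError("training a subordinate model on the synthetic data")

  def solve(task, base_model):
      if not looks_novel(task, base_model):           # metacognition step
          return base_model.predict(task)
      synthetic = synthesize_variants(task.examples)  # invent similar problems
      specialist = finetune(base_model, synthetic)    # train a subordinate model
      return specialist.predict(task)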

> The only thing we know humans can do that computers can't is identify undecidable problems (liar-paradox statements of the form "this sentence is false", the kind of self-reference exploited in Gödel's incompleteness theorems). Unfortunately we already know there is no computer that can ever do that

Where did the "ever" come from? Why wouldn't future computers be able to do this at (least at) a human level?


The "ever" comes from the Church-Turing thesis. Maybe in the future computers will not be Turing machines, but that we can't know yet.


That refers to the unsolvability of the general case, which is of course also unsolvable by humans.


I'm sorry, I'm not sure what "unsolvability" means. What I'm saying above is that humans can identify undecidable statements, i.e. we can recognise them as undecidable. If we couldn't, Gödel, Church, and Turing would not have a proof. But we can. We just don't do it algorithmically, obviously, because there is no algorithm that can do that, and so no computer that can, either.


But that's the thing, humans can't do it either, except only in some very specific simple cases. We're not magical; if we had a good way of doing it, we could implement it as an algorithm, but we don't.

There's a nice discussion of it in this CS Stack Exchange thread: https://cs.stackexchange.com/questions/47712/why-can-humans-...


I'm confused by the discussion in your link. It starts out about decidability and soon veers off into complexity, e.g. a discussion about "efficiently" (really, cheaply) solving NP-complete instances with heuristics etc.

In any case, I'm not claiming that humans can decide the truth or falsehood of undecidable statements, either in their special or general cases. I'm arguing that humans can identify that such a statement is undecidable. In other words, we can recognise them as undecidable, without having to decide their truth values.

For example, "this statement is false" is obviously undecidable and we don't have to come up with an algorithm to try and decide its truth value before we can say so. So it's an identification problem that we solve, not a decision problem. But a Turing machine can't do that, either: it has to basically execute the statement before it can decide it's undecidable. The only alternative is to rely on patter recognition, but that is not a general solution.

Another thing to note is that statements of the form "this sentence is false" are undecidable even given infinite resources (it'd be better to refer to Turing's Halting Problem examples here, but I need a refresher on that). In the thread you link to, someone says that the problem in the original question (basically higher-order unification) can be decided in a finite number of steps. I think that's wrong, but in any case there is no finite way in which "this sentence is false" can be shown to be true or false algorithmically.

I think you're arguing that we can't solve the identification problem in the general case. I think we can, because we can do what I describe above: we can solve, _non-algorithmically_ and with finite resources, problems that _algorithmically_ cannot be solved with infinite resources. Turing gives more examples, as noted. I don't know how it gets more general than that.


Sorry, but I really don't understand your claim. You say

> The only alternative is to rely on pattern recognition, but that is not a general solution.

but then you say

> we can solve, _non-algorithmically_ and with finite resources, problems that _algorithmically_ cannot be solved with infinite resources

How do you propose that we humans solve these problems in a way that is neither algorithmic nor reducible to pattern recognition? Because I'm pretty sure it's always one of these.


The first sentence you quote refers to machines, not humans, and machines must either be programmed with patterns or learn them from data. That's why I say it's not a general solution: it's restricted by the data available.

I don't know how humans do it. Whatever we do it's something that is not covered by our current understanding of computation and maybe even mathematics. I suspect that in order to answer your question we need a new science of computation and mathematics.

I think that may sound a bit kooky, but it's a bit late and I'm too tired to explain, so I guess you'll have to suspect I'm just a crank until I next find the energy to explain myself.


>We basically have no idea how to create a test for intelligence that computers cannot beat by brute force or big data approaches.

Agreed

>so we have no effective way to test computers for (artificial) intelligence.

I never quite understand stances like this, considering that, evolutionarily, human intelligence is exactly the consequence of incredible brute force and scale. Why is the introduction of brute force suddenly something that means we cannot "truly" test for intelligence in machines?


This is my entire quote:

>> We basically have no idea how to create a test for intelligence that computers cannot beat by brute force or big data approaches, so we have no effective way to test computers for (artificial) intelligence.

When I say "brute force" I mean an exhaustive search of some large search space, in real time, not in evolutionary time. For example, searching a very large database for an answer, rather than computing the answer. But, as usual, I don't understand the point you're trying to make and where the bit about evolution came from. Can you clarify?

Btw, three requests, so we can have a productive conversation with as little time wasted in misunderstandings as possible:

a) Don't Fisk me (https://www.urbandictionary.com/define.php?term=Fisking).

b) Don't quote my words out of context.

c) If you don't understand why I say something, just ask.


Ok. I guess I misunderstood you then. I didn't mean to quote you out of context.

I just meant that the human brain is the result of brute force. Evolution is a dumb biological optimizer whose objective function is to procreate. It's not exactly search, but then neither is the brute force of modern NNs.


OK, I see what you mean and thank you for the clarification.

So I think here you're mainly talking about the process by which artificial intelligence can be achieved. I don't disagree that, in principle, it should be possible to do this by some kind of brute-force, big-data optimisation programme. There is such a thing as evolutionary computation and genetic algorithms, after all. I think it's probably unrealistic to do that in practice, at least on anything other than evolutionary time scales, but that's just a hunch and not something I can really support with data, like.

But what I'm talking about is testing the intelligence of such a system, once we have it. By "testing" I mean two things: a) detecting that such a system is intelligent in the first place, and b) measuring its intelligence. Now ARC-AGI muddies the waters a bit because it doesn't make clear what kind of test of intelligence it is, a detecting kind of test or a measuring kind of test; and Chollet's white paper that introduced ARC is titled "On the Measure of Intelligence", which further confuses the issue: does he assume that artificially intelligent systems already exist, so we don't have to bother with a detection kind of test? Having read his paper, I retain the impression that the answer is: no. So it's a bit of a muddle, like I say.

In any case, to come back to the brute force issue: I assume that, with brute force approaches, we can solve any problem that humans solve (presumably using our intelligence) but without requiring intelligence. And that makes it very hard to know whether a system is intelligent or not just by looking at how well it does in a test, e.g. an IQ test for humans, or ARC, etc.

Seen another way: the ability to solve problems by brute force is a big confounding factor when trying to detect the presence of intelligence in an artificial system. My point above is that we have no good way to control for this confounder.

The question that remains is, I think, what counts as "brute force". As you say, I also don't think of neural net inference as brute force. I think of neural net training as brute force, so I'm muddling the issue a bit myself, since I said I'm talking about testing the already-trained system. Let's say that by "brute force" I mean a search of a large combinatorial space carried out at inference time, with or without heuristics to guide it. For example, minimax (as in Deep Blue) is brute force; minimax with a neural-net-learned evaluation function (as in AlphaGo and friends) is brute force; AlphaCode, AlphaProof and similar approaches (generating millions of candidates and filtering/ranking them) are brute force; SAT solving is brute force; searching for optimal plans is brute force. What is not brute force? Well, for example, SLD-Resolution is not brute force because it's a proof procedure; arithmetic is not brute force because there are algorithms for it; boolean logic is not brute force; etc. I think I'm arguing that anything for which we have an algorithm that does not require a huge amount of computational power is not brute force, and I think that may even be an intuitive definition. Or not?
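
(A toy illustration of the line I'm drawing, on a deliberately trivial problem: recovering an unknown addend. The brute-force version enumerates a candidate space at inference time; the algorithmic version just computes the answer.)

  # "Brute force": search a candidate space at inference time.
  # "Not brute force": a direct algorithm that computes the answer.

  def recover_addend_brute_force(a, total, search_space=range(10**6)):
      # Cost grows with the size of the candidate space, not the problem.
      for candidate in search_space:
          if a + candidate == total:
              return candidate
      return None

  def recover_addend_algorithmic(a, total):
      # Constant work, no search.
      return total - a

  assert recover_addend_brute_force(17, 42) == recover_addend_algorithmic(17, 42) == 25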


Thanks. I understand your position much better now. Your examples of non-brute force are fair enough, but that begs the question: is it even possible to build a self-learning system (one that starts from scratch) without brute force? We can get a calculator to perform the algorithm for addition, but how would we get one to learn how to add from scratch without brute force? This isn't even about NNs vs GOFAI vs biology. I mean, how do you control for a variable that is always part of the equation?
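
(To make "learning to add by brute force" concrete, here is a toy sketch of what I mean: enumerate tiny candidate programs until one reproduces the training examples. Everything here is invented for illustration.)

  import itertools

  # Toy brute-force "learner": enumerate tiny expressions over a and b
  # and keep the first one that fits the training examples.
  examples = [((1, 2), 3), ((5, 7), 12), ((0, 9), 9)]

  def candidate_expressions():
      atoms = ["a", "b", "0", "1"]
      yield from atoms
      for op in ["+", "-", "*"]:
          for left, right in itertools.product(atoms, atoms):
              yield f"({left} {op} {right})"

  def fits(expr):
      return all(eval(expr, {}, {"a": a, "b": b}) == out
                 for (a, b), out in examples)

  learned = next(e for e in candidate_expressions() if fits(e))
  print(learned)  # "(a + b)" -- found by exhaustive search, not by insight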


I don't know the answer to that. The problem with brute force approaches to learning is that it might take a very large amount of search, over a very large amount of time, to get to a system that can come up with arithmetic on its own.

Honestly I have no answer. I can see the problems with, essentially, scaling up, but I don't have a solution.



