Did you enable reasoning? Qwen3 32b with reasoning enabled gave me the correct a...

mdp2021 · 2025-08-10T00:10:18 1754784618

> Did you enable reasoning

Yep.

> gave me the correct answer

Try real-world tests that cannot be covered by training data or chancey guesses.

kgeist · 2025-08-10T00:35:30 1754786130

Counting letters is a known blindspot in LLMs because of how tokenization works in most LLMs - they don't see individual letters. I'm not sure it's a valid test to make any far-reaching conclusions about their intelligence. It's like saying a blind person is an absolute dumbass just because they can't tell green from red.

The fact that reasoning models can count letters, even though they can't see individual letters, is actually pretty cool.

>Try real-world tests that cannot be covered by training data

If we don't allow a model to base its reasoning on the training data it's seen, what should it base it on? Clairvoyance? :)

> chancey guesses

The default sampling in most LLMs uses randomness to feel less robotic and repetitive, so it’s no surprise it makes “chancey guesses.” That’s literally what the system is programmed to do by default.

mdp2021 · 2025-08-10T00:56:00 1754787360

> they don't see individual letters

Yet they seem to be from many other tests (characters corrections or manipulation in texts, for example).

> The fact that reasoning models can count letters, even though they can't see individual letters

To a mind, every idea is a representation. But we want the processor to work reliably on them representations.

> If we don't allow a [mind] to base its reasoning on the training data it's seen, what should it base it on

On its reasoning and judgement over what it was told. You do not repeat what you heard, or you state that's what you heard (and provide sources).

> uses randomness

That is in a way a problem, a non-final fix - satisficing (Herb Simon) after random germs instead of constructing through a full optimality plan.

In the way I used the expression «chancey guesses» though I meant that guessing by chance when the right answer falls in a limited set ("how many letters in 'but'") is a weaker corroboration than when the right answer falls in a richer set ("how many letters in this sentence").

kgeist · 2025-08-10T01:33:02 1754789582

Most people act on gut instincts first as well. Gut instinct = first semi-random sample from experience (= training data). That's where all the logical fallacies come from. Things like the bat and the ball problem, where 95% people give an incorrect answer, because most of the time, people simply pattern-match too. It saves energy and works well 95% time. Just like reasoning LLMs, they can get to a correct answer if they increase their reasoning budget (but often they don't).

An LLM is a derivative of collective human knowledge, which is intrinsically unreliable itself. Most human concepts are ill-defined, fuzzy, very contextual. Human reasoning itself is flawed.

I'm not sure why people expect 100% reliability from a language model that is based on human representations which themselves cannot realistically be 100% reliable and perfectly well-defined.

If we want better reliability, we need a combination of tools: a "human mind model", which is intrinsically unreliable, plus a set of programmatic tools (say, like a human would use a calculator or a program to verify their results). I don't know if we can make something which works with human concepts and is 100% reliable in principle. Can a "lesser" mind create a "greater" mind, one free of human limitations? I think it's an open question.

mdp2021 · 2025-08-10T07:43:35 1754811815

> Most people act on gut instincts first as well

And we do not hire «most people» as consultants intentionally. We want to ask those intellectually diligent and talented.

> language model that is based on human representations

The machine is made to process the input - not to "intake" it. To create a mocker of average-joe would be an anti-service in both that * the project was to build a processor and * we refrain to ask average-joe. The plan can never have meant to be what you described, the mockery of mediocrity.

> we want better reliability

We want the implementation of a well performing mind - of intelligence. What you described is the "incompetent mind", the habitual fool - the «human mind model» is prescriptive based on what the properly used mind can do, not descriptive on what sloppy weak minds do.

> Can a "lesser" mind create a "greater" mind

Nothing says it could not.

> one free of human limitations

Very certainly yes, we can build things with more time, more energy, more efficiency, more robustness etc. than humans.

dvrj101 · 2025-08-12T17:26:23 1755019583

2b granite model can do this in first attempt

ollama run hf.co/ibm-granite/granite-3.3-2b-instruct-GGUF:F16 >>> how many b’s are there in blueberry? The word "blueberry" contains two 'b's.

mdp2021 · 2025-08-13T20:11:37 1755115897

I did include granite (8b) in my mentioned tests. You suggest granite-3.3-2b-instruct, no prob.

  llama-cli -m granite-3.3-2b-instruct-Q5_K_S.gguf --seed 1 -sys "Count the words in the input text; count the 'a' letters in the input text; count the five-letter words in the input text" -p "If you’re tucking into a chicken curry or a beef steak, it’s safe to assume that the former has come from a chicken, the latter from a cow"

response:

  - Words in the input text: 18
  - 'a' letters in the input text: 8
  - Five-letter words in the input text: 2 (tucking, into)

All wrong.

Sorry I did not have the "F16" available

vitaflo · 2025-08-10T03:10:04 1754795404

So did Deepseek. I guess the Chinese have figured out something the West hasn't, how to count.

mdp2021 · 2025-08-10T07:50:11 1754812211

No, DeepSeek also fails. (It worked in your test - it failed in similar others.)

(And note that DeepSeek can be very dumb - in practice, as experienced in our practice, and in standard tests, where it shows an ~80 IQ, where with other tools we achieved ~120 IQ (trackingai.org). DeepSeek was in important step, a demonstration of potential for efficiency, a gift - but it is still part of the collective work in progress.)