Let's start with a multimodal LLM that doesn't fail trivially simple out-of-distribution counting problems.[1]
I need to be convinced that an LLM is smarter than a honeybee before I am willing to even consider that it might be as smart as a human child. Honeybees are smart enough to understand what numbers are. Transformer LLMs are not. In general, GPT and Claude are both dramatically dumber than honeybees at deep and mysterious cognitive abilities like planning and quantitative reasoning, even if they are better than honeybees at human subject knowledge and symbolic mathematics. It is sensible to evaluate Claude against other human knowledge tools, like an encyclopedia or Mathematica, using LLM benchmarks or "demonstrated LLM abilities." But those do not measure intelligence. To measure intelligence we need to make the LLM as ignorant as possible so it relies on its own wits, the way cognitive scientists do with bees and rats. (There is a general sickness in computer science where one poorly-reasoned thought experiment from Alan Turing somehow outweighs decades of real experiments from modern scientists.)
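For concreteness, here is a minimal sketch of the kind of "rely on its own wits" counting probe I mean; the nonsense-word design and the function names are my own illustration, not a standard benchmark. The model is shown made-up words it cannot have memorized and is asked how many there are.

```python
import random
import re
import string

def make_counting_probe(n_items: int, seed: int = 0) -> tuple[str, int]:
    """Build an out-of-distribution counting prompt: n_items nonsense
    'words' the model cannot have memorized, plus the expected count."""
    rng = random.Random(seed)
    items = [
        "".join(rng.choices(string.ascii_lowercase, k=rng.randint(6, 10)))
        for _ in range(n_items)
    ]
    prompt = (
        "Here is a list of made-up words:\n"
        + "\n".join(f"- {w}" for w in items)
        + "\nHow many made-up words are in the list? Answer with a single number."
    )
    return prompt, n_items

def score_answer(model_output: str, expected: int) -> bool:
    """Count the reply as correct only if its first integer equals the true count."""
    match = re.search(r"\d+", model_output)
    return match is not None and int(match.group()) == expected

# Send `prompt` to whichever chat model you want to test, then pass its
# reply to score_answer. Sweep n_items and track accuracy as the count grows.
prompt, expected = make_counting_probe(n_items=23, seed=42)
```

The nonsense words serve the same purpose as the bee and rat experiments: strip away memorized human knowledge so that whatever counting ability remains is the model's own.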
[1] People dishonestly claim LLMs fail at counting because of minor tokenization issues, but
a) they can count just fine if your prompt tells them how (see the sketch after this footnote), so tokenization is obviously not the problem, and
b) they are even worse at counting things in images, where text tokenization doesn't even apply, so I think tokenization is irrelevant!
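To make point (a) concrete, here is the kind of contrast I mean, again a sketch with prompt wording of my own invention rather than anyone's benchmark: a bare counting question versus one that spells out the procedure. The word is tokenized identically in both prompts, so any accuracy gap between them cannot be a tokenization effect.

```python
# Two ways to ask the same counting question; the wording is illustrative.
# WORD is tokenized identically in both prompts, so a gap in accuracy
# between them is not a tokenization artifact.
WORD = "strawberry"
TARGET = "r"

bare_prompt = f'How many times does the letter "{TARGET}" appear in "{WORD}"?'

scaffolded_prompt = (
    f'Spell "{WORD}" one letter per line, keeping a running tally of how many '
    f'times "{TARGET}" has appeared so far, then state the final tally.'
)
```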