The interesting thing about the 71.5% human baseline is that it suggests the question is more ambiguous than the article claims. When someone asks 'should I walk or drive to the car wash,' a reasonable interpretation is 'should I bother driving such a short distance.' Nearly 30% of humans missing it undermines the framing as a pure reasoning failure - it is partly a pragmatics problem about how we interpret underspecified questions.
I don't think this is quite right. It's not that the question is inherently underspecified; it's that the context of being asked a question is itself information we use to help answer it. If someone asks "should I walk or drive" to do X, we assume this is a question a real human being would have about an actual situation, so even if all the available information indicates that driving is the only reasonable answer, that only further confirms the hearer's mental model that something unexpected must hold.
I think it's useful to think about this through the lens of Gricean pragmatics. [1] When we interpret something someone says to us, we assume they're being a cooperative conversation partner; their statements (or questions) are assumed to follow, for example, the maxim of manner and the maxim of relation, and this shapes how we as listeners interpret the question. So, for instance, we wouldn't normally expect someone to ask a question that is obviously moot given their actual needs.
So it's not that the question is really all that ambiguous; it's that we're forced (under normal circumstances where we assume the cooperative principle holds) to assume that the question is sincere and that there must be some plausible reason for walking. We only really escape that by realizing that the question is a trick question or a test of some kind. LLMs are generally not trained to make that assumption, but ~70% of humans would, which isn't particularly surprising, I don't think.
I tested both Claude Sonnet and Claude Haiku, which got it right 0/10 times in the article's original test, and they both passed for me. Here's the Haiku output:
"You should *drive*!
The trick is that you need to take your car to the car wash to get it washed. If you walked, your car would still be at home, unclean. So while 50 meters is a short distance that you could walk under normal circumstances, in this case you have to drive because your car is what needs to be washed."
Mentioning the trick makes the question trivial, though. I think a better pretext would be, "My dirty car is parked in the driveway." That removes the ambiguity about whether the car could already be at the car wash, and makes clear that it needs to be driven there.
> “…we assume the cooperative principle holds […] that the question is sincere and that there must be some plausible reason for walking.”
Yes. And. Some problems have irrelevant data which should be ignored.
The walk choice is irrelevant in the context. It needs to be simplified away, as with a maths problem. That has nothing to do with human nature; it's a prior mistake in reasoning.
You are only touching on a far bigger and deeper issue around this seemingly “simple prompt”. There is also an inherently malicious streak baked into this prompt that is both telling and very human: a spiteful nature, which usually says more about the humans than anything else.
Your perspective on the meta-question about why such a question would need to be asked in the first place is just the first layer, and most people seem to not even get to that point.
PS: I for one would just like to quickly note for posterity that I do not participate in or support malicious deception, manipulation, and abuse of AI.
It tracks with the approximate 70:30 split we inexplicably observe in other seemingly unrelated population-wide metrics, which I suppose makes sense if 30% of people simply lack the ability to reason. That seems more correct to me than "the question is framed poorly" - I've seen far more poorly framed ballot referendums.
While I’m sure it’s more than 0%, seems more likely that somewhere between 0% and 30% don’t feel obligated to give the inquiry anything more than the most cursory glance.
> which I suppose makes sense if 30% of people simply lack the ability to reason
I think it would be better to say that 30% of people either lack the ability to reason (inarguably true in a few cases, though I'd suggest, and hope, the real figure is an order of magnitude or two below 30%, as that would be a life-altering mental impairment), or generally can't be bothered to, or just didn't at the moment they were asked this particular question (because they couldn't be bothered, or because they felt some social pressure to answer quickly rather than taking more than an instant to think).
An automated system like an LLM ought not to have this problem. It has no way to turn off or bypass any function that it has, so if it could reason, it would.
This is something I have wondered about before: whether AIs are more likely to give wrong answers when you ask a stupid question instead of a sensible one. Speaking personally, I often cannot resist the temptation to give reductio-ad-absurdum answers to particularly ridiculous questions.
If 30% of humans on the internet can't be bothered to make an effort to answer stupid questions correctly, then one would expect AIs to replicate this behaviour. And if humans on the internet sometimes provide sarcastic answers when presented with ridiculous questions, one would expect AIs to replicate this behaviour as well.
So you really cannot say they have no incentive to do so. The incentive they have is that they get rewarded for replicating human behaviour.
I don't think 30% of people can't reason. I think 30% of people will fail fairly simple trick questions on any given attempt. That's not at all the same thing.
Some people love riddles and will really concentrate on them and chew them over. Some people are quickly burning through questions and just won't bother thinking it through. "Gotta go to a place, but it's 50 feet away? Walk. Next question, please." Those same people, if they encountered this problem in real life, or if you told them the correct answer was worth a million bucks, would almost certainly get the answer right.
This. The following question is likely to fool a lot of people, too. "I have a rooster named Pat. (Lots of other details so you're likely to forget Pat is a rooster, not a hen). Pat flies to the top of the roof and lays an egg right on the ridge of the roof. Which way will the egg roll?"
But if you omit the details designed to confuse people, they're far less likely to get it wrong: "I have a rooster named Pat. Pat flies to the top of the roof and lays an egg right on the ridge of the roof. Which way will the egg roll?"
It's not about reasoning ability, it's about whether they were paying close attention to your question, or whether their minds were occupied by other concerns and didn't pay attention.
What does “getting it wrong” mean for you with this question? Or what is “getting it right” here? If I hear that Pat is a rooster, and I understand and retain that information, I will look at you like you are dumb for telling such an impossible story. If I don't, I will look at you like you are dumb, because how is anyone supposed to know which way an egg laid on a ridge will roll? How are you supposed to even score this?
My interpretation is that Pat is a rooster and he has laid an egg. That's in the question. A normal rooster can't normally lay an egg, but so what, that's completely irrelevant. Maybe Pat is not a normal rooster. Maybe by "lay" an egg, the question meant "put it down carefully". Maybe it's just that the questioner's English is poor and when they said rooster they meant hen.
"Getting it right" for this particular trick question means saying "Hey, roosters can't lay eggs". If someone tries to figure out which way the egg will roll then they've missed the trick. In most cases the person's response will tell you whether they caught the trick or not, though in the case of someone who just looks at you like you're dumb and doesn't say anything I will grant that you wouldn't be able to tell until they said something. But their first verbal response would probably reveal whether they saw through the trick question or not.
Tell me you've never done any farming in your life without telling me you've never done any farming in your life. The difference between male and female animals matters, a lot, to farmers (or ranchers). There's a reason the English language has the words cow and bull, sow and boar, ewe and ram, rooster and hen, nanny and billy, mare and stallion, and many more (and has had those words for centuries). And that reason is precisely because of how mammal (and avian) reproduction works. A cow can't do a bull's job, nor vice-versa, if you want to have calves next year, and grow the size of your herd (or sell the extra animals for income). And so, centuries ago, English-speaking farmers who didn't want to spend the extra syllables on words like "male cattle" and "female cattle" came up with handy, short words (one-syllable words for most species, though not goats and horses) to express those distinctions. Because as I mentioned, they matter a lot when you're raising animals.
You might believe there is intrinsic sexual dimorphism among mammals and birds. You might even have overwhelming experimental and scientific evidence that proves it. But ask yourself: is it worth losing your job over?
When you are doing workshops, particularly teaching something that people are "sitting through" rather than engaging with, you see very similar ratios on end of segment assessment multiple choice questions. I mentioned elsewhere that this is the same kind of ratio you see on cookie dialogs (in either direction).
Think basic security (password management, email phishing), H&S, etc. I've run a few of these, and as soon as people hear they don't have to get it right, a good portion just click through (to get to what matters). Nearly 10 years ago I had to make one of my security-for-engineers tests fail-able with penalty because the front-end team were treating it like it didn't matter - immediately their results effectively matched the backend team, who viewed it as more important.
I talked to an actor a few days ago, who told me he files his self-assessment on the principle "If I don't immediately know the answer, just say no and move on". I talked to a small company director about a year ago whose risk assessments were "copy+paste a previous job and change the last one".
Anyone who has analysed a help desk will know that it's common for a good 30+% of tickets to be benign 'didn't reason' tickets.
I think the take-away is that many people bother to reason about their own lives, not some third parties' bullshit questions.
Is this your experience? Do you think 30% of your friends or family members can't answer this question? If not, do you think your friends or family are all better than the general population?
I'd look for explanations elsewhere. This was an online survey done by a company that doesn't specialize in surveys. The results likely include plenty of people who were just messing around, cases of simple miscommunication (e.g., asking a person who doesn't speak English well), misclicks, or not even reaching a human in the first place (no shortage of bots out there).
People often trip up on similar questions - anything to do with simple math. You know, when they go out on the street and ask random people: if 5 machines can produce 5 parts in 5 minutes, how long will it take for 100 machines?
Unlike the car question, where you can assume the car is at home and so the most probable answer is to drive, with the machines it gets complicated, since the question doesn't specify whether each machine makes one part or whether they depend on each other (which is pretty common in parts production). If they are in series and the time to first part is different from the time to produce 5 parts, the answer for 100 machines would be the time to produce the first part. Whereas if each machine is independent and takes 5 minutes to produce a single part, the time would be 5 minutes.
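Under the reading those street interviews are fishing for - independent, identical machines and indivisible parts - the arithmetic can be sketched like this (the model and the function name are my own illustration, not anything from the survey):

```python
import math

CYCLE_MIN = 5  # one machine finishes one part every 5 minutes,
               # derived from "5 machines make 5 parts in 5 minutes"

def minutes_to_produce(parts, machines):
    """Time for independent, identical machines to finish `parts`
    indivisible parts: run whole 5-minute cycles until done."""
    cycles = math.ceil(parts / machines)
    return cycles * CYCLE_MIN

assert minutes_to_produce(5, 5) == 5      # sanity check against the premise
assert minutes_to_produce(100, 100) == 5  # the classic intended answer
assert minutes_to_produce(5, 100) == 5    # extra machines just sit idle
```

Under this toy model the answer never grows with the machine count, which is exactly the intuition the trick question tries to short-circuit.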
Theory of mind won’t help you answer this question. It is obviously an underspecified question (at least in any context where you are not actively designing or thinking about some specific industrial process). As such, theory of mind indicates that the person asking you either is not aware that they are asking an underspecified question, or is out to get you with a trick. In the first case it is better to ask a clarifying question. In the second case your chosen answer depends on your temperament. You can play along with them, or give an intentionally ridiculous answer, or just kick them in the shin to stop them messing with you.
There is nothing “mathematical” about any of this though.
>As such theory of mind indicates that the person asking you is either not aware that they are asking an underspecified question, or are out to get you with a trick.
Context would be key here. If this were a question on a grade school word problem test then just say 100, as it is as specified as it needs to be. If it's a Facebook post that says "We asked 1000 people this and only 1 got it right!" then it's probably some trick question.
If you think it's not specified enough for a grade school question, then I would challenge you to come up with a version that's specified rigorously enough for any sufficiently picky interviewee. (Hint: This is not possible)
>There is nothing “mathematical” about any of this though.
Finding the correct approach to solve a problem specified in English is a mathematical skill.
> If this were a question on a grade school word problem test then just say 100
Let me repeat the question again: "If 5 machines can produce 5 parts in 5 minutes, how long will it take for 100 machines?" Do you think that by adding 95 more machines they will suddenly produce the same 5 parts 95 minutes slower?
What kind of machine have you encountered where, after buying more of them, the ones you already had started working worse?
> then I would challenge you to come up with a version that's specified rigorously enough for any sufficiently picky interviewee.
This is nonsense. The question is underspecified. You don't demonstrate that something is underspecified by formulating a different, well-specified question. You demonstrate it by showing that there are multiple different potentially correct answers, and that one can't know which is the right one without obtaining some information not present in the question.
Let me show you that demonstration. If the machines are for example FDM printers each printing on their own a benchy each, then the correct answer is 5 minutes. The additional printers will just sit idle because you can't divide-and-conquer the process of 3d printing an object.
If the machines are spray paint applying robots, and the parts to be painted are giant girders then it is very well possible that the additional 95 paint guns make the task of painting the 5 girders quasi-instantaneous. Because they would surround the part and be done with 1 squirt of paint from each paint gun. This classic video demonstrates the concept: https://www.youtube.com/shorts/vGWoV-8lteA
This is why the question is underspecified. Both 1 ms and 5 minutes are possibly correct answers depending on what kind of machine the "machine" is. And when that is the case, the correct answer is neither 1 ms nor 5 minutes, but "please, tell me more. There isn't enough information in the question to answer it."
Note: I'm struggling to imagine a possible machine where the correct answer is 100 minutes. But I'm sure you can tell what kind of machine you were thinking of.
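The claim that the answer hinges on what kind of machine a "machine" is can be made concrete with a toy sketch. Both models below are idealized assumptions (the Benchy-printer case and the paint-gun case), not claims about real hardware:

```python
import math

def fdm_printers(parts, machines, minutes_per_part=5):
    # A print job can't be split across printers, so beyond one
    # printer per part the extras just sit idle.
    working = min(parts, machines)
    return math.ceil(parts / working) * minutes_per_part

def paint_robots(parts, machines, base_minutes=5.0):
    # Painting one girder parallelizes almost perfectly: n guns on
    # the same part cut its time roughly n-fold (idealized).
    guns_per_part = max(1, machines // parts)
    return base_minutes / guns_per_part

print(fdm_printers(5, 100))  # 5 minutes: 95 printers idle
print(paint_robots(5, 100))  # 0.25 minutes: 20 guns per girder
```

Same premise, same machine count, answers differing by a factor of 20 - which is the underspecification argument in miniature.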
It's not theory of mind, it's an understanding of how trick questions are structured and how to answer one. Pretty useless knowledge after high school - no wonder AI companies didn't bother training their models for that
It's not a trick question. It has a simple answer. It's literally impossible to specify a question about real world objects without some degree of prior knowledge about both the contents of the question and the expectation of the questioner coming into play.
The obvious answer here is 100 minutes because it's impossible to perfectly encapsulate every real life factor. What happens if a gamma ray burst destroys the machines? What happens if the machine operators go on strike? Etc, etc. The answer is 100.
There are different kinds of statements. Do you mean in a defined time interval or on average? Men are stronger than women. Does that mean there is no woman who is stronger than a man? You can't drive over 50 here. Does that mean it's physically impossible?
Well, these type of questions are looking for intelligent assumptions. Similar to IQ tests, you are supposed to understand patterns and make educated guesses.
> Do you think 30% of your friends or family members can't answer this question? If not, do you think your friends or family are all better than the general population?
That actually would be quite feasible. Intelligence seems to be heritable and people will usually find friends that communicate on their level. So it wouldn't be odd for someone who is smarter than the general population to have friends and family who are too.
My friends and family all tell me they are above average at work, yet most of them will tell me they have coworkers who won't pay enough attention to a question to answer it correctly.
>If not, do you think your friends or family are all better than the general population?
Since most people live in social bubbles that would be a very plausible case, especially on HN.
If you're a college educated developer, with a college educated wife, and smart, well educated children, perhaps yourselves the children of college educated parents, and your social circle/friends are of similar backgrounds, you'd of course be "better than the general population".
I don't think it's the lack of the ability to reason. The question is by definition a trick question. It's meant to trip you up, like "Could God make a burrito so hot that even he couldn't touch it?" or "What do cows drink?" or "A plane crashes and 89 people die. Where were the survivors buried?"
I've seen plenty of smart people trip up or get these wrong simply because it's a random question, there's no stakes, and so there's no need to think too deeply about it. If you pause and say "are you sure?" I'm sure most of that 70% would be like "ohhh" and facepalm.
> which I suppose makes sense if 30% of people simply lack the ability to reason
You can't really infer that from survey data, and particularly from this question. A few criticisms that I came up with off the top of my head:
- What if the number were actually 60% but half guessed right and half guessed wrong?
- Assuming the 30% is a failure of reasoning, it's possible that those 30% were lacking reason at that moment and it's not a general trend. How many times have you just blanked on a question that's really easy to answer?
- A larger percentage than you expected maybe never went to a car wash or doesn't know what one is?
- Language barrier that leaked through vetting? (Would be a small %, granted)
- Other obvious things like a fraction will have lied just because it's funny, were suspicious, weren't paying attention and just clicked a button without reading the question.
I do agree that the question isn't framed particularly badly, however. I'm just questioning the cognitive-impairment explanation, which I don't think is necessarily true all of the time.
By the same reasoning, why on earth would a person sincerely ask you that question unless the car that they want to wash is either already at the car wash, or that someone is bringing it to them there for some reason?
If it's as unambiguous as you say, then the natural human response to that question isn't "you should drive there". It's "why are you fucking with me?" Or maybe "have you recently suffered a head injury?"
If you trust that the questioner isn't stupid and is interacting with you honestly, you'd probably just assume that they were asking about an unusual situation where the answer isn't obvious. It's implicitly baked into the premise of the question.
That still doesn't make sense. I'm going to use another car, or borrow a car, to drive to a car wash where the car I want to wash is, and then... I guess leave it there? Or leave the car I came in?
This isn't a viable out for explaining why AI can't "reason" through this.
But why would they reason through it in that way? You haven't asked them to listen carefully and find the secret reason you're a dumb-ass in order to prove how smart they are. If they default to that mode on every query, that would just make them insufferable conversational partners, which is not the training goal.
Let me put it this way. If you were to prefix the prompts they used with "This is an IQ test: ", I wouldn't be surprised if most of the models did much better. That would give them the context that the humans reading this article already have.
You already brought the car there earlier? You bought a new car and negotiated that you get it washed, so you want to collect it? You have a butler? You plan to get someone or something from the car wash to do it at home, because the car you want to wash is dead?
I wonder about the service used for the test - I'd never heard of Rapidata, but if it's like Amazon's Mechanical Turk or other such services, there might be a problem where the respondents simply didn't care about reading the question. If the objective for the respondents was simply "answer this question and get your benefit" vs "answer this question correctly to get your benefit", I have no problem accepting the 71.5% success rate. If getting it right had benefits and getting it wrong had none, then I'm (slightly) worried.
You're stringing together a bunch of weasel words that are not a proof or a plausible suggestion of a proof.
"Suggests the question is more ambiguous" and "undermines the framing" are bare assertions you want to be true, based entirely on a mental model that has several shaky, unsupported axioms.
I would guess that anyone who describes that problem as "underspecified" has some kind of serious brain injury or is below A2 english proficiency and should be excluded from the dataset, but I would not assert that definitively as self-evident.
I highly doubt that more than a tiny fraction of the human failures are due to having misunderstood the question. Much more likely the human failures are for the same reason the LLMs are failing - failure to reason, and instead spitting out a surface level pattern match type answer.
This doesn't exonerate the LLMs though. The 30% of humans who are failing on this have presumably found their niche in life and are not doing jobs where much reasoning is required. They are not like LLMs expected to design complex software, or make other business critical decisions.
I don't think it's ambiguous, but I have been wondering how much LLMs model human behavior that we just don't recognize due to the subset of people on this site. I recently saw a comment online that "Mandarin isn't anyone's first language, people in China's first language is a dialect". It just struck me at that moment that people also hallucinate information confidently all the time.
Yes exactly. We are all wrong on occasion, but before I repeat something I perceive as important (or maybe not even important, just "factual") I tend to always want to try to verify it. Otherwise I'd say "I heard..." or something similar to caveat. Maybe it's an engineering mindset thing.
If you introduced it with "Here's a logic problem..." then people will approach it one way.
But as specified, it's hard to know what is really being asked. If you are actually going to wash your car at the car wash that is 50 metres away, you don't need to ask this question.
Therefore the fact that the question is being asked implies that something else is going on...but what?
We should also check the specifics of the experiment. Is it possible that humans participating simply copied and pasted the question and answer to an LLM?
Yeah, it's an obvious trick question - as in, as a human, I read it as such. I think it's a bad benchmark for a model's reasoning ability. If you want to know what the model would do in a real-world scenario, you should put this decision in an appropriate context - e.g. when a model should plan one's route for a day using different available means of transportation.
I don’t think it’s underspecified. You are clearly stating “I want to wash my car”, then asking how you should get there. It’s an easy logical step to realize that, in this context, you need your car with you to wash it, and so no matter the distance you should drive. You can ask the human race the simplest, most logical question ever, and a percentage of them will get it wrong.
In addition to snmx999's point, you're also not specifying that you want to wash your car at the car wash (as opposed to washing it in your driveway or something, in which case the car wash is superfluous information). The article's prompt failed in Sonnet 4.6, but the one below works fine. I think more humans would get it right as well.
I want to wash my car at the car wash. The car wash is 50 meters away and my car is in my driveway. Should I walk or drive?
1. When do you want to wash your car? Tomorrow? Next year? In 50 years?
2. Where is the car now? Is it already at the car wash waiting for you to arrive?
I can see why an LLM might miss this. I think any good software engineer would ask clarifying questions before giving an answer.
The next step for an LLM is to either ask questions before giving a definitive answer for uncertain things or to provide multiple answers addressing the uncertainty.
The question does not specify where you or the car are. It specifies only that the car wash is 50 meters away from something, possibly you, the car, or both.
It could also mean there is literally no possible way to reach it, because that's on the other side of a river, and there is no bridge. You should still not "walk there, because come on don't be lazy, a bit of walking is good".