AI coding tools are making this problem worse in a subtle way. When an agent can generate a "scalable event-driven architecture" in 5 minutes, the build cost of complexity drops to near zero. But the maintenance cost doesn't.
So now you get Engineer B's output even faster, with even more impressive-sounding abstractions, and the promotion packet writes itself in minutes too. Meanwhile the actual cost - debugging, onboarding, incident response at 3am - stays exactly the same or gets worse, because now nobody fully understands what was generated.
The real test for simplicity has always been: can the next person who touches this code understand it without asking you? AI-generated complexity fails that test spectacularly.
To be fair, a lot of the on-call people being pulled in at 3am before LLMs existed didn't understand the systems they were supporting very well, either. This will definitely make it worse, though.
I think part of charting a safe career path now involves evaluating how strong any given org's culture of understanding the code and stack is. I definitely do not ever want to be in a position again where no one in the whole place knows how something works while the higher-ups are having a meltdown because something critical broke.
People on call will use AI as well. As long as the first AI left enough documentation and implemented traceability, the diagnosing AI should have an easier time proposing a fix. Ideally, AI would prepare the PR or rollback plan. In a utopia, AI would execute it and recover the system until a human wakes up.
Or at least there is something to chat with about the issue at 3am.
This. I've been a sysadmin for a quarter of a century and have professionally written next to no software. I've debugged every system I've had to support at some point though. It's a very different skill set.
True, but I think the implication (as I read it) is that AI may be providing more complex solutions than were needed for the problem and perhaps more complex than a human engineer would have provided.
It's MUCH worse now, not just because of the massive amount of code generated with zero or very little supervision, but also because of the speed at which the systems grow in function.
Not sure if you're kidding or not, but to write great maintainable code, you need a lot of understanding that an LLM just doesn't have: history, business context, company culture, etc. Also, I doubt that its training data has a lot of good examples of great maintainable code to pull from.
Admitting you've spent two decades on a career stuck working in the kind of sweatshops that hire people who can't actually code isn't much of a flex, and certainly doesn't lend a whole lot of credence to your argument.
Not the GP, but when I take strolls through some open source project hosted on GitHub, usually I am not impressed either. Unnecessary OOPism, way too long procedures that could instead be pure functions, badly named variables, and way, way, waaay too many willy-nilly added dependencies. If that is what the LLMs mostly learn from, I am not surprised at all. But then again, this stuff was also written by humans. I remember one especially bad case in a very popular project (in its niche) that was basically a one-man show: a procedure of 300+ lines doing all kinds of stuff, changing the global state of the service it implements. But that code was, or is, relied upon by tech giants and other businesses, and no one improves it. They are happy paying that one guy probably not so much money.
Awesome! However, corporate is excited about using AI, making the coder the one at risk of getting fired for writing the exact same lousy (for the sake of the argument) code.
Or worse: for not relying as much as possible on the AI, which apparently can write code just as bad, but faster!
A subtle detail: you speak of coders, not software engineers. A SWE's value is not their code-churning speed.
This says more about you and the people you work with. I find engineers that have been at the company for a while are quite invaluable when it comes to this information, it's not just knowing the how but the when + why that's critical as well.
Acting like people can't be good at their job is frankly dehumanizing and says a lot about your mindset with how you view other fellow devs.
If only more engineers admitted that something they wrote is not good code but a product of its time, then I think we would have more realistic expectations.
It's OK to say that something you made is shit. It is OK to say that you were not given time to do xyz.
You recognize that something was built to fit when it stays in use without much change for some 3 or 4 years, and, while you are the person maintaining it, you rarely ever need to touch it, because you built it in a way that is simple enough not to have tons of bugs yet flexible enough to cover current and anticipated use cases.
Sometimes, as in the bilsbi's top level comment, the solution is to use a free tool/library/product that already exists. The solution is not always to write new code, but the agent will happily do it.
Maybe that's "the manager's job", but that's just passing the buck and getting a worse solution. Every level of management should be looking for the best solution.
This is exactly right. I maintain an AGENTS.md for my own AI assistant with similar principles - "禁止只记录不行动" (no recording without action) and strict rules about when to escalate vs. when to solve autonomously.
The key insight is that the AGENTS.md becomes a kind of "engineering culture in a file". When you onboard a human engineer, you hope they absorb the team's values over time. With AI, you can encode those values upfront.
The challenge is that principles need to be specific enough to be actionable. "Write simple code" is too vague. "Avoid single-use wrapper functions" (from the sibling comment) is better - it's enforceable.
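As a purely hypothetical sketch (these rules and their wording are invented for illustration, not quoted from any real AGENTS.md), a few principles encoded at that level of specificity might look like:

```
# AGENTS.md (hypothetical excerpt)

## Principles
- No recording without action: every issue you note must end in a fix, a ticket, or an escalation.
- Avoid single-use wrapper functions; inline logic at the call site unless it is reused.
- Escalate instead of acting autonomously when a change touches auth, billing, or data deletion.
```

Each line is checkable against a diff, which is what makes it closer to "engineering culture in a file" than a vague exhortation to keep things simple.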
I wrote something similar in a Claude Code instructions.md: "minimize cyclomatic complexity". What happened next? It generated an 8-line wrapper function called only once from a different file. So, I told it to inline that logic in the caller. The result? One. Line. Of. Code.
So, I asked it to modify its instructions.md file to not repeat that mistake. The result was the new line "Avoid single-use wrapper functions; inline logic at the call site unless reused".
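For illustration only (the function names and the "before" shape here are invented, not taken from the comment above), the pattern being banned looks roughly like this:

```python
# Before: a single-use wrapper, defined far from its only caller.
def normalize_username(raw: str) -> str:
    stripped = raw.strip()
    lowered = stripped.lower()
    return lowered

def create_user(raw_name: str) -> dict:
    return {"name": normalize_username(raw_name)}

# After: the same logic inlined at the call site, one line instead of a wrapper.
def create_user_inlined(raw_name: str) -> dict:
    return {"name": raw_name.strip().lower()}

print(create_user("  Alice ") == create_user_inlined("  Alice "))  # True
```

The rule earns its keep because it's mechanically checkable: if a function has exactly one caller and no reuse in sight, inline it.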
It reminds me a lot of people who take Code Complete too seriously. "Common sense" is not an objective or universal statement unfortunately - plus, speaking for myself, what I consider "common sense" can change on the daily, which is why I can't be trusted adding features to my own codebase long term <_<.
I think this is, at the moment, the practical limitation on using AI for everything. It's also what the coding agents themselves optimize for to some degree, or rather the slider they can play with for price vs. quality: the "thinking" models are the exact same models, just burning more tokens.
Am waiting for the next Mac Studio to come out to experiment with the "AI for everything" approach. Most likely, the open-source distilled models will be lower quality. So, another "price vs. quality" tradeoff. Still, it will be fun to code like I'm at a foundation lab.
This seems like a perfect use case for a local model. But I've found in practice that the system requirements for agents are much higher than for models that can handle simple refactoring tasks. Once tool use context is factored in, there is very little room for models that perform decently.
Whatever agent I tried would include thousands of tokens of tool-use instructions. That would use up most of the available context unless running very low-spec models. I've concluded it's best to use the big 3 for most tasks and Qwen on RunPod for more private data.
It’s not even about perfectionism. Code’s value is in processing data. Bad code does it wrongly, and if you have strange code on top of that, you cannot correct course. Happy paths are usually the low-hanging fruit. What makes developing software hard is thinking about all the failure and edge cases.
I think we'll see a decline of software as a product for this reason. If your job is to solve a problem, and you use AI to generate a tool that solves that problem, or you use money to buy a tool that solves that problem, well then it's still your job to solve that problem regardless of which tool you use.
But given how poorly bought software tends to fit the use case of the person it was bought for... eventually generate-something-custom will start making more and more sense.
If you end up generating something that nobody understands, then when you quit and get a new job, somebody else will probably use your project as context for generating something that suits the way they want to solve that problem. Time will have passed, so the needs will have changed, and they'll end up with something different. They'll also only partially understand it, but the gaps will be in different places this time around. Overall I think it'll be an improvement, because there will be less distance (both in time and along the social graph) between the software's user and its creator, them being most of the time the same person.
This is something I keep thinking about while coding with AI, and the same goes for introducing library dependencies for the simplest problems. It’s not about how quickly I can get there but about how I can keep it simple to maintain, not only for myself but for the next AI agent.
Biggest problem is that the next person is me 6 months later :) But even when it’s not a next-person problem, how much of the design can I keep in my mind at a given time? Ironically, AI has the exact same problem, aka the context window.
There's also the operational cost of running whatever is churned out. I wouldn't exactly blame that on AIs, but a large contingent of developers optimize for popular tech stacks and not ease of operations. I don't think that will change just because they start using AI. In my experience the AI won't tell you that you're massively overbuilding something, or that if we did this in C and used PostgreSQL we'd be able to run it on an old Pentium III with 4GB of RAM. If you want Kubernetes and ElasticSearch, you'll get exactly that.
There are different kinds of abstraction. There's abstraction like Jackson Pollock, and there's abstraction like what Dijkstra was suggesting: elegance. Which personally made the article a very weird read for me.
tbh codebases like that predate AI code generators. I had one job where my predecessor was not a very good developer by modern standards, but he was productive... a dangerous combination.
I also kind of respect it, it bothers me endlessly when everything isn't perfect and this guy just threw caution to the wind. Jokes on me as I'm working for him now. But it's not like anything that predates AI, I couldn't write this type of slop if I tried lol. Zero formatting, linting, or anything. Just straight goulash.
AI generators only generate that if you tell them to though - as a developer (especially senior) it's your job to know what you want and tell the AI coding tools that.
The interesting thing about the 71.5% human baseline is that it suggests the question is more ambiguous than the article claims. When someone asks 'should I walk or drive to the car wash,' a reasonable interpretation is 'should I bother driving such a short distance.' Nearly 30% of humans missing it undermines the framing as a pure reasoning failure - it is partly a pragmatics problem about how we interpret underspecified questions.
I don't think this is quite right. It's not that the question is inherently underspecified, it's that the context of being asked a question is itself information that we use to help answer the question. If someone asks "should I walk or drive" to do X, we assume that this is a question that a real human being would have about an actual situation, so even if all available information provided indicates that driving is the only reasonable answer, this only further confirms the hearer's mental model that something unexpected must hold.
I think it's useful to think about it through the lens of Gricean pragmatic semantics. [1] When we interpret something that someone says to us, we assume they're being cooperative conversation partners; their statements (or questions) are assumed to follow the maxim of manner and the maxim of relation for example, and this shapes how we as listeners interpret the question. So for example, we wouldn't normally expect someone to ask a question that is obviously moot given their actual needs.
So it's not that the question is really all that ambiguous, it's that we're forced (under normal circumstances where we assume the cooperative principle holds) to assume that the question is sincere and that there must be some plausible reason for walking. We only really escape that by realizing that the question is a trick question or a test of some kind. LLMs are generally not trained to make the assumption, but ~70% of humans would, which isn't particularly surprising I don't think.
I tested both Sonnet and Haiku from Claude, which got it right 0/10 times in their original test, and they both passed. Here's the Haiku output:
"You should *drive*!
The trick is that you need to take your car to the car wash to get it washed. If you walked, your car would still be at home, unclean. So while 50 meters is a short distance that you could walk under normal circumstances, in this case you have to drive because your car is what needs to be washed."
Mentioning the trick makes the question trivial, though. I think a better pretext would be, "My dirty car is parked in the driveway." That removes the ambiguity that the car could already be at the car wash, and that it needs to be driven there.
> “…we assume the cooperative principle holds […] that the question is sincere and that there must be some plausible reason for walking.”
Yes. And. Some problems have irrelevant data which should be ignored.
The walk choice is irrelevant in the context. It needs to be simplified, as with a maths problem. That has nothing to do with human nature, but rather a prior mistake in reasoning.
You are only touching on a far bigger and deeper issue around this seemingly “simple prompt”. There is an inherent malicious nature also baked into this prompt that is both telling and very human; a spiteful nature, which usually says more about the humans than anything else.
Your perspective on the meta-question about why such a question would need to be asked in the first place is just the first layer, and most people seem to not even get to that point.
PS: I for one would just like to quickly note for posterity that I do not participate in or am supportive of malicious deception, manipulation, and abuse of AI.
It tracks with the approximate 70:30 split we inexplicably observe in other seemingly unrelated population-wide metrics, which I suppose makes sense if 30% of people simply lack the ability to reason. That seems more correct to me than "the question is framed poorly" - I've seen far more poorly framed ballot referendums.
While I’m sure it’s more than 0%, seems more likely that somewhere between 0% and 30% don’t feel obligated to give the inquiry anything more than the most cursory glance.
> which I suppose makes sense if 30% of people simply lack the ability to reason
I think it would be better to say that 30% of people either lack the ability to reason (inarguably true in a few cases, though I'd suggest, and hope, an order of magnitude or two less than 30%, as that would be a life-altering mental impairment), or just can't generally be bothered to, or just didn't (because they couldn't be bothered, or because they felt some social pressure to answer quickly rather than taking more than an instant to think) at the time of being asked this particular question.
An automated system like an LLM ought not to have this problem. It has no path to turn off or bypass any function that it has, so if it could reason, it would.
This is something I have wondered about before: whether AIs are more likely to give wrong answers when you ask a stupid question instead of a sensible one. Speaking personally, I often cannot resist the temptation to give reductio-ad-absurdum answers to particularly ridiculous questions.
If 30% of humans on the internet can't be bothered to make an effort to answer stupid questions correctly, then one would expect AIs to replicate this behaviour. And if humans on the internet sometimes provide sarcastic answers when presented with ridiculous questions, one would expect AIs to replicate this behaviour as well.
So you really cannot say they have no incentive to do so. The incentive they have is that they get rewarded for replicating human behaviour.
I don't think 30% of people can't reason. I think 30% of people will fail fairly simple trick questions on any given attempt. That's not at all the same thing.
Some people love riddles and will really concentrate on them and chew them over. Some people are quickly burning through questions and just won't bother thinking it through. "Gotta go to a place, but it's 50 feet away? Walk. Next question, please." Those same people, if they encountered this problem in real life, or if you told them the correct answer was worth a million bucks, would almost certainly get the answer right.
This. The following question is likely to fool a lot of people, too. "I have a rooster named Pat. (Lots of other details so you're likely to forget Pat is a rooster, not a hen). Pat flies to the top of the roof and lays an egg right on the ridge of the roof. Which way will the egg roll?"
But if you omit the details designed to confuse people, they're far less likely to get it wrong: "I have a rooster named Pat. Pat flies to the top of the roof and lays an egg right on the ridge of the roof. Which way will the egg roll?"
It's not about reasoning ability, it's about whether they were paying close attention to your question, or whether their minds were occupied by other concerns and didn't pay attention.
What does “getting it wrong” mean for you with this question? Or what is “getting it right” here? If I hear that Pat is a rooster and I understand and retain that information, I will look at you like you are dumb for telling such an impossible story. If I don’t, I will look at you like you are dumb, because how is anyone supposed to know which way an egg laid on a ridge will roll? How are you supposed to even score this?
My interpretation is that Pat is a rooster and he has laid an egg. That's in the question. A normal rooster can't normally lay an egg, but so what, that's completely irrelevant. Maybe Pat is not a normal rooster. Maybe by "lay" an egg, the question meant "put it down carefully". Maybe it's just that the questioner's English is poor and when they said rooster they meant hen.
"Getting it right" for this particular trick question means saying "Hey, roosters can't lay eggs". If someone tries to figure out which way the egg will roll then they've missed the trick. In most cases the person's response will tell you whether they caught the trick or not, though in the case of someone who just looks at you like you're dumb and doesn't say anything I will grant that you wouldn't be able to tell until they said something. But their first verbal response would probably reveal whether they saw through the trick question or not.
Tell me you've never done any farming in your life without telling me you've never done any farming in your life. The difference between male and female animals matters, a lot, to farmers (or ranchers). There's a reason the English language has the words cow and bull, sow and boar, ewe and ram, rooster and hen, nanny and billy, mare and stallion, and many more (and has had those words for centuries). And that reason is precisely because of how mammal (and avian) reproduction works. A cow can't do a bull's job, nor vice-versa, if you want to have calves next year, and grow the size of your herd (or sell the extra animals for income). And so, centuries ago, English-speaking farmers who didn't want to spend the extra syllables on words like "male cattle" and "female cattle" came up with handy, short words (one-syllable words for most species, though not goats and horses) to express those distinctions. Because as I mentioned, they matter a lot when you're raising animals.
You might believe there is intrinsic sexual dimorphism among mammals and birds. You might even have overwhelming experimental and scientific evidence that proves it. But ask yourself: is it worth losing your job over?
When you are doing workshops, particularly teaching something that people are "sitting through" rather than engaging with, you see very similar ratios on end of segment assessment multiple choice questions. I mentioned elsewhere that this is the same kind of ratio you see on cookie dialogs (in either direction).
Think basic security (password management, email phishing), H&S, etc. I've run a few of these, and as soon as people hear they don't have to get it right, a good portion just click through (to get to what matters). Nearly 10 years ago I had to make one of my security-for-engineers tests fail-able with penalty because the front-end team were treating it like it didn't matter - immediately their results effectively matched the backend team's, who viewed it as more important.
I talked to an actor a few days ago, who told me he files his self-assessment on the principle "If I don't immediately know the answer, just say no and move on". I talked to a small company director about a year ago whose risk assessments were "copy+paste a previous job and change the last one".
Anyone who has analysed a help desk will know that it's common for a good 30+% of tickets to be benign 'didn't reason' tickets.
I think the take-away is that many people bother to reason about their own lives, not some third parties' bullshit questions.
Is this your experience? Do you think 30% of your friends or family members can't answer this question? If not, do you think your friends or family are all better than the general population?
I'd look for explanations elsewhere. This was an online survey done by a company that doesn't specialize in surveys. The results likely include plenty of people who were just messing around, cases of simple miscommunication (e.g., asking a person who doesn't speak English well), misclicks, or not even reaching a human in the first place (no shortage of bots out there).
People often trip up on similar questions, anything to do with simple math. You know, when they go out on the street and ask random people: if 5 machines can produce 5 parts in 5 minutes, how long will it take 100 machines?
Unlike the car question, where you can assume the car is at home and so the most probable answer is to drive, with the machines it gets complicated, since the question doesn't specify whether each machine makes one part or whether they depend on each other (which is pretty common in parts production). If they are in series and the time to the first part differs from the time to produce 5 parts, the answer for 100 machines would be the time to produce the first part. Whereas if each machine is independent and takes 5 minutes to produce a single part, the time would be 5 minutes.
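Under the standard grade-school reading of the riddle (identical machines, working independently in parallel, one part each), the arithmetic is a one-liner. A quick sketch, with the function name and parameterization being my own:

```python
def minutes_to_produce(n_machines: int, n_parts: int, rate_per_machine: float) -> float:
    """Minutes to make n_parts, assuming independent machines at a constant rate."""
    return n_parts / (n_machines * rate_per_machine)

# 5 machines make 5 parts in 5 minutes -> each machine makes 1 part per 5 minutes.
rate = 5 / (5 * 5)  # 0.2 parts per machine per minute

print(minutes_to_produce(100, 100, rate))  # approximately 5.0, not 100
```

The trick is that adding machines scales throughput, not the per-part time, so 100 machines make 100 parts in the same 5 minutes. The serial/dependent-machines readings in the comment above break exactly this independence assumption.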
Theory of mind won’t help you answer this question. It is obviously an underspecified question (at least in any context where you are not actively designing or thinking about some specific industrial process). As such, theory of mind indicates that the person asking is either not aware that they are asking an underspecified question, or is out to get you with a trick. In the first case it is better to ask a clarifying question. In the second case your chosen answer depends on your temperament. You can play along with them, or give an intentionally ridiculous answer, or just kick them in the shin to stop them messing with you.
There is nothing “mathematical” about any of this though.
>As such theory of mind indicates that the person asking you is either not aware that they are asking an underspecified question, or are out to get you with a trick.
Context would be key here. If this were a question on a grade school word problem test then just say 100, as it is as specified as it needs to be. If it's a Facebook post that says "We asked 1000 people this and only 1 got it right!" then it's probably some trick question.
If you think it's not specified enough for a grade school question, then I would challenge you to come up with a version that's specified rigorously enough for any sufficiently picky interviewee. (Hint: This is not possible)
>There is nothing “mathematical” about any of this though.
Finding the correct approach to solve a problem specified in English is a mathematical skill.
> If this were a question on a grade school word problem test then just say 100
Let me repeat the question again: "If 5 machines can produce 5 parts in 5 minutes, how long will it take for 100 machines?" Do you think that by adding 95 more machines they will suddenly produce the same 5 parts 95 minutes slower?
What kind of machine have you encountered where, after buying more of them, the ones you already had started working worse?
> then I would challenge you to come up with a version that's specified rigorously enough for any sufficiently picky interviewee.
This is nonsense. The question is underspecified. You don't demonstrate that something is underspecified by formulating a different, well-specified question. You demonstrate it by showing that there are multiple different potentially correct answers, and that one can't know which one is right without obtaining some information not present in the question.
Let me show you that demonstration. If the machines are for example FDM printers each printing on their own a benchy each, then the correct answer is 5 minutes. The additional printers will just sit idle because you can't divide-and-conquer the process of 3d printing an object.
If the machines are spray paint applying robots, and the parts to be painted are giant girders then it is very well possible that the additional 95 paint guns make the task of painting the 5 girders quasi-instantaneous. Because they would surround the part and be done with 1 squirt of paint from each paint gun. This classic video demonstrates the concept: https://www.youtube.com/shorts/vGWoV-8lteA
This is why the question is underspecified. Because both 1 ms and 5 minutes are possibly correct answers depending on what kind of machine the "machine" is. And when that is the case, the correct answer is neither 1 ms nor 5 minutes, but "please, tell me more; there isn't enough information in the question to answer it."
Note: I'm struggling to imagine a possible machine where the correct answer is 100 minutes. But I'm sure you can tell what kind of machine you were thinking of.
It's not theory of mind, it's an understanding of how trick questions are structured and how to answer one. Pretty useless knowledge after high school - no wonder AI companies didn't bother training their models for that
It's not a trick question. It has a simple answer. It's literally impossible to specify a question about real world objects without some degree of prior knowledge about both the contents of the question and the expectation of the questioner coming into play.
The obvious answer here is 100 minutes because it's impossible to perfectly encapsulate every real life factor. What happens if a gamma ray burst destroys the machines? What happens if the machine operators go on strike? Etc, etc. The answer is 100.
There are different kinds of statements. Do you mean in a defined time interval or on average? Men are stronger than women. Does that mean there is no woman who is stronger than a man? You can't drive over 50 here. Does that mean it's physically impossible?
Well, these type of questions are looking for intelligent assumptions. Similar to IQ tests, you are supposed to understand patterns and make educated guesses.
> Do you think 30% of your friends or family members can't answer this question? If not, do you think your friends or family are all better than the general population?
That actually would be quite feasible. Intelligence seems to be heritable and people will usually find friends that communicate on their level. So it wouldn't be odd for someone who is smarter than the general population to have friends and family who are too.
My friends and family all tell me they are above average at work, yet most of them will tell me they have coworkers who won't pay enough attention to a question to answer it correctly.
>If not, do you think your friends or family are all better than the general population?
Since most people live in social bubbles that would be a very plausible case, especially on HN.
If you're a college educated developer, with a college educated wife, and smart, well educated children, perhaps yourselves the children of college educated parents, and your social circle/friends are of similar backgrounds, you'd of course be "better than the general population".
I don't think it's a lack of the ability to reason. The question is by definition a trick question. It's meant to trip you up, like:
"Could God make a burrito so hot that even he couldn't touch it?" Or "what do cows drink?" or "a plane crashes and 89 people died. Where were the survivors buried?"
I've seen plenty of smart people trip up or get these wrong simply because it's a random question, there's no stakes, and so there's no need to think too deeply about it. If you pause and say "are you sure?" I'm sure most of that 70% would be like "ohhh" and facepalm.
> which I suppose makes sense if 30% of people simply lack the ability to reason
You can't really infer that from survey data, and particularly from this question. A few criticisms that I came up with off the top of my head:
- What if the number were actually 60% but half guessed right and half guessed wrong?
- Assuming the 30% is a failure of reasoning, it's possible that those 30% were lacking reason at that moment and it's not a general trend. How many times have you just blanked on a question that's really easy to answer?
- A larger percentage than you expected maybe never went to a car wash or don't know what one is?
- Language barrier that leaked through vetting? (Would be a small %, granted)
- Other obvious things like a fraction will have lied just because it's funny, were suspicious, weren't paying attention and just clicked a button without reading the question.
I do agree that the question isn't framed particularly badly, however. I'm just focusing on cognitive impairment, which I don't think is necessarily true all of the time.
By the same reasoning, why on earth would a person sincerely ask you that question unless the car that they want to wash is either already at the car wash, or that someone is bringing it to them there for some reason?
If it's as unambiguous as you say, then the natural human response to that question isn't "you should drive there". It's "why are you fucking with me?" Or maybe "have you recently suffered a head injury?"
If you trust that the questioner isn't stupid and is interacting with you honestly, you'd probably just assume that they were asking about an unusual situation where the answer isn't obvious. It's implicitly baked into the premise of the question.
That still doesn't make sense. I'm going to use another car, or borrow a car, to drive to a car wash where the car I want to wash already is, and then... I guess leave it there? Or leave the car I came in?
This isn't a viable out for explaining why AI can't "reason" through this.
But why would they reason through it in that way? You haven't asked them to listen carefully and find the secret reason you're a dumb-ass in order to prove how smart they are. If they default to that mode on every query, that would just make them insufferable conversational partners, which is not the training goal.
Let me put it this way. If you were to prefix the prompts they used with "This is an IQ test: ", I wouldn't be surprised if most of the models did much better. That would give them the context that the humans reading this article already have.
You already brought the car there earlier? You bought a new car and negotiated that you get it washed, so you want to collect it? You have a butler? You plan to get someone or something from the car wash to do it at home, because the car you want to wash is dead?
I wonder about the service used for the test - I'd never heard of Rapidata, but if it's like Amazon's Mechanical Turk or other such services, there might be a problem where the respondents simply didn't care about reading the question. If the objective for the respondents was simply "answer this question and get your benefit" vs "answer this question correctly to get your benefit", I have no problem accepting the 71.5% success rate. If getting it right had benefits and getting it wrong had none, then I'm (slightly) worried.
You're stringing together a bunch of weasel words that are not a proof or a plausible suggestion of a proof.
"Suggests is more ambiguous" and "undermines the framing" are bare assertions you want to be true based entirely on your mental model that has several shaky unsupported axioms.
I would guess that anyone who describes that problem as "underspecified" has some kind of serious brain injury or is below A2 english proficiency and should be excluded from the dataset, but I would not assert that definitively as self-evident.
I highly doubt that more than a tiny fraction of the human failures are due to having misunderstood the question. Much more likely the human failures are for the same reason the LLMs are failing - failure to reason, and instead spitting out a surface level pattern match type answer.
This doesn't exonerate the LLMs though. The 30% of humans who are failing on this have presumably found their niche in life and are not doing jobs where much reasoning is required. They are not like LLMs expected to design complex software, or make other business critical decisions.
I don't think it's ambiguous, but I have been wondering how much LLMs model human behavior that we just don't recognize due to the subset of people on this site. I recently saw a comment online that "Mandarin isn't anyone's first language, people in China's first language is a dialect". It just struck me at that moment that people also hallucinate information confidently all the time.
Yes exactly. We are all wrong on occasion, but before I repeat something I perceive as important (or maybe not even important, just "factual") I tend to always want to try to verify it. Otherwise I'd say "I heard..." or something similar to caveat. Maybe it's an engineering mindset thing.
If you introduced it with "Here's a logic problem..." then people will approach it one way.
But as specified, it's hard to know what is really being asked. If you are actually going to wash your car at the car wash that is 50 metres away, you don't need to ask this question.
Therefore the fact that the question is being asked implies that something else is going on...but what?
We should also check the specifics of the experiment. Is it possible that humans participating simply copied and pasted the question and answer to an LLM?
Yeah, it's an obvious trick question - as in as a human I read it as such. I think it's a bad benchmark for a model's reasoning ability. If you want to know what the model would do in a real world scenario, you should put this decision in an appropriate context - e.g. when a model should plan one's route for a day using different available means of transportation.
I don’t think it’s underspecified. You are clearly stating “I want to wash my car”, then asking how you should get there. It’s an easy logical step that, in this context, you need your car with you to wash it, and so no matter the distance you should drive. You can ask the human race the simplest, most logical question ever, and a percentage of them will still get it wrong.
In addition to snmx999's point, you're also not specifying that you want to wash your car at the car wash (as opposed to washing it in your driveway or something, in which case the car wash is superfluous information). The article's prompt failed in Sonnet 4.6, but the one below works fine. I think more humans would get it right as well.
I want to wash my car at the car wash. The car wash is 50 meters away and my car is in my driveway. Should I walk or drive?
1. When do you want to wash your car? Tomorrow? Next year? In 50 years?
2. Where is the car now? Is it already at the car wash waiting for you to arrive?
I can see why an LLM might miss this. I think any good software engineer would ask clarifying questions before giving an answer.
The next step for an LLM is to either ask questions before giving a definitive answer for uncertain things or to provide multiple answers addressing the uncertainty.
The question does not specify where you or the car are. It specifies only that the car wash is 50 meters away from something, possibly you, the car, or both.
It could also mean there is literally no possible way to reach it, because it's on the other side of a river and there is no bridge. You should still not "walk there, because come on, don't be lazy, a bit of walking is good".
This is a great practical application of pgvector! The HN corpus is perfect for semantic search because the discussions tend to be technical and well-structured.
I'm curious about the embedding model you chose - did you compare different options (OpenAI ada-002, Cohere, open-source models like all-MiniLM)? And how's the query performance with pgvector at scale?
One feature that would be valuable: filtering by time range or karma score. Sometimes you want recent discussions vs. classic threads with high engagement.
Interesting approach using SQLite as the persistence layer for AI agents. The local-first architecture makes a lot of sense for development workflows where latency matters.
One question: How do you handle concurrent writes from multiple agents working on the same project? SQLite has WAL mode, but I'm curious if you've encountered any race conditions in practice, especially when agents are running in parallel.
Also, the MCP (Model Context Protocol) integration is clever - having a standardized way for agents to query project state could really simplify the orchestration layer. Are you seeing other teams adopt MCP for similar use cases?
SQLite handles this well in practice. saga-mcp uses WAL mode with busy_timeout=5000 and synchronous=NORMAL, so concurrent writes queue up rather than fail. The intended use case is one agent per project per session — if you had multiple agents writing to the same .tracker.db, WAL mode serializes the writes transparently.
For MCP adoption — it's growing fast. Claude Code, Claude Desktop, Cursor, and Windsurf all support it natively now. The spec is simple (JSON-RPC over stdio or SSE), so the barrier to both building and consuming MCP servers is low.
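To give a sense of how simple the framing is, here's a minimal JSON-RPC 2.0 exchange of the shape MCP uses over stdio (a sketch: `tools/list` is a real MCP method name, but the handler below is a hypothetical stand-in, not an actual MCP server):

```python
import json

# One request/response pair in JSON-RPC 2.0 framing. Over the stdio
# transport, each message is one JSON object per line.
request_line = json.dumps(
    {"jsonrpc": "2.0", "id": 1, "method": "tools/list", "params": {}}
)

def handle(line):
    # A toy server: parse one message, echo the id back in the result.
    msg = json.loads(line)
    return json.dumps(
        {"jsonrpc": "2.0", "id": msg["id"], "result": {"tools": []}}
    )

response = json.loads(handle(request_line))
```

Because the transport is just line-delimited JSON-RPC, any language that can read stdin and write stdout can host an MCP server, which is a big part of why the barrier to entry is so low.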