Haven't seen a jump this large since I don't even know, years?
Too bad they are not releasing it anytime soon (there is no need as they are still currently the leader).
Sounds like a good opportunity to pause spending on nerfed 4.6 and wait for the new model to be released and then max out over 2 weeks before it gets nerfed again.
The performance degradation I've seen isn't in quality/completion but in duration: I get good results, but much less quickly than I did before 4.6. Still, it's just anecdata, but a lot of folks seem to feel the same.
Been reading posts like these for 3 years now. There’s multiple sites with #s. I’m willing to buy “I’m paying rent on someone’s agent harness and god knows what’s in the system prompt rn”, but in the face of numbers, gotta discount the anecdotal.
You're probably right. It's probably more likely that for some period of time I forgot that I switched to the large context Opus vs Sonnet and it was not needed for the level of complexity of my work.
I don't believe that trackers like this are trustworthy. There's an enormous financial motive to cheat and these companies have a track record of unethical conduct.
If I was VP of Unethical Business Strategy at OpenAI or Anthropic, the first thing I'd do is put in place an automated system which flags accounts, prompts, IPs, and usage patterns associated with these benchmarks and direct their usage to a dedicated compute pool which wouldn't be affected by these changes.
Y'all know they're teaching to the test. I'll wait till someone devises a novel test that isn't contained in the datasets. Sure, they're still powerful.
My understanding is GPT 6 works via synaptic space reasoning... which I find terrifying. I hope if true, OpenAI does some safety testing on that, beyond what they normally do.
“My vibes don’t match a lot of the traditional A.I.-safety stuff,” Altman said. He insisted that he continued to prioritize these matters, but when pressed for specifics he was vague: “We still will run safety projects, or at least safety-adjacent projects.” When we asked to interview researchers at the company who were working on existential safety—the kinds of issues that could mean, as Altman once put it, “lights-out for all of us”—an OpenAI representative seemed confused. “What do you mean by ‘existential safety’?” he replied. “That’s not, like, a thing.”
The absolute gall of this guy to laugh off a question about x-risks. Meanwhile, also Sam Altman, in 2015: "Development of superhuman machine intelligence is probably the greatest threat to the continued existence of humanity. There are other threats that I think are more certain to happen (for example, an engineered virus with a long incubation period and a high mortality rate) but are unlikely to destroy every human in the universe in the way that SMI could. Also, most of these other big threats are already widely feared." [1]
> We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space. Our model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test-time. This stands in contrast to mainstream reasoning models that scale up compute by producing more tokens. Unlike approaches based on chain-of-thought, our approach does not require any specialized training data, can work with small context windows, and can capture types of reasoning that are not easily represented in words. We scale a proof-of-concept model to 3.5 billion parameters and 800 billion tokens. We show that the resulting model can improve its performance on reasoning benchmarks, sometimes dramatically, up to a computation load equivalent to 50 billion parameters.
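To make the "iterating a recurrent block" idea concrete, here's a toy sketch. It is not the paper's architecture: the update rule, the `weight` value and the function names are all made up for illustration. The only point it shows is that the same block gets applied more times to spend more compute, without emitting any extra tokens.

```python
import math

def recurrent_block(state, x, weight=0.5):
    """One toy latent update: mix the previous state with the input, then squash.

    Hypothetical stand-in for a real transformer block.
    """
    return [math.tanh(weight * s + xi) for s, xi in zip(state, x)]

def latent_reason(x, depth):
    """Apply the same block `depth` times; nothing is decoded in between."""
    state = [0.0] * len(x)
    for _ in range(depth):
        state = recurrent_block(state, x)
    return state

x = [0.3, -1.2, 0.8]                 # made-up input embedding
shallow = latent_reason(x, depth=2)   # little test-time compute
deep = latent_reason(x, depth=50)     # more compute, same parameters
print(shallow)
print(deep)
```

In this toy the state just settles toward a fixed point as depth grows; whether extra depth helps a real model is an empirical question the abstract claims to answer, not something this sketch demonstrates.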
I agree that they called many things remarkably well! That doesn't change the fact that AI 2027 is not a thing which happened, so it isn't valid to point out "this killed us in AI 2027." There are many reasons to want to preserve CoT monitorability. Instead of AI 2027, I'd point to https://arxiv.org/html/2507.11473.
Actually, going from 91.3% to 94.5% is a significant jump, because it means the model has gotten a lot better at solving the hardest problems thrown at it. This has downstream effects as well: it means that during long implementation tasks, instead of getting stuck at the most challenging parts and stopping (or going in loops!), it can now get past them to finish the implementation.
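One way to see why 91.3% to 94.5% is bigger than it looks is to compare the failure rates instead of the pass rates. A quick sketch, with a helper name I made up:

```python
def relative_error_reduction(old_acc, new_acc):
    """Fraction of the previously failed problems that are now solved.

    Accuracies are given in percent.
    """
    old_err = 100.0 - old_acc
    new_err = 100.0 - new_acc
    return (old_err - new_err) / old_err

reduction = relative_error_reduction(91.3, 94.5)
print(f"{reduction:.1%}")  # prints 36.8%
```

So the failure rate drops from 8.7% to 5.5%, i.e. roughly a third of the problems that used to fail no longer do.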
A jump that we will never be able to use since we're not part of the seemingly minimum 100 billion dollar company club as requirement to be allowed to use it.
I get the security aspect, but if we've hit that point any reasonably sophisticated model past this point will be able to do the damage they claim it can do. They might as well be telling us they're closing up shop for consumer models.
They should just say they'll never release a model of this caliber to the public at this point and say out loud we'll only get gimped versions.
More than killer AI I'm afraid of Anthropic/OpenAI going into full rent-seeking mode so that everyone working in tech is forced to fork out loads of money just to stay competitive on the market. These companies can also choose to give exclusive access to hand picked individuals and cut everyone else off and there would be nothing to stop them.
This is already happening to some degree, GPT 5.3 Codex's security capabilities were given exclusively to those who were approved for a "Trusted Access" programme.
However, I’m tempted to compare to GitHub: if I join a new company, I will ask to be added to their GitHub account without hesitation. I couldn’t possibly imagine they wouldn’t have one. What makes the cost of that subscription reasonable is not just GitHub’s fear of a crowd with pitchforks showing up at their office, but also the fact that a possible answer to my non-question might be “Oh, we actually use GitLab.”
If Anthropic is as good as they say, it seems fairly doable to use the service to build something comparable: poach a few disgruntled employees, leverage the promise to undercut a many-trillion-dollar company to be a many-billion dollar company to get investors excited.
I’m sure the founders of Anthropic will have more money than they could possibly spend in ten lifetimes, but I can’t imagine there wouldn’t be some competition. Maybe this time it’s different, but I can’t see how.
You have 2 labs at the forefront (Anthropic/OpenAI), Google closely behind, and xAI/Meta/half a dozen Chinese companies all within 6-12 months. There is plenty of competition, and the price of equally intelligent tokens rapidly drops whenever a new intelligence level is achieved.
Unless the leading company uses a model to nefariously take over or neutralize another company, I don't really see a monopoly happening in the next 3 years.
I was focusing on a theoretical dynamic analysis of competition (Would a monopoly make having a competitor easier or harder?) but you are right: practically, there are many players, and they are diverse enough in their values and interests to make collusion unlikely.
We could be wrong: each of those could give birth to as many Basilisks (not sure I have a better name for those conscious, invisible, omni-present, self-serving monsters that so many people imagine will emerge) that coordinate and maintain collusion somehow, but classic economics (complementarity, competition, etc.) points at disruption and lowering costs.
Rent seeking isn't about whether the product has value or not, but about what's extracted in exchange for that value, and whether competition, lack of monopoly, lack of lock-in, etc. keep it realistic.
Rent-seeking of old was a ground rent, monies paid for the land without considering the building that was on it.
Residential rents today often have implied warrants because of modern law, so your landlord is essentially selling you a service at a particular location.
Well, don’t forget we still have competition. Were Anthropic to rent seek, OpenAI would undercut them. Were OpenAI and Anthropic to collude, that would be illegal. As for Anthropic capturing the entire coding agent market and THEN rent seeking: these days it’s never been easier to raise $1B and start a competing lab.
In practice this doesn't work though, the Mastercard-Visa duopoly is an example, two competing forces don't create aggressive enough competition to benefit the consumer. The only hope we have is the Chinese models, but it will always be too expensive to run the full models for yourself.
New companies can enter this space. Google’s competing, though behind. Maybe Microsoft, Meta, Amazon, or Apple will come out with top notch models at some point.
There is no real barrier to a customer of Anthropic adopting a competing model in the future. All it takes is a big tech company deciding it’s worth it to train one.
On the other hand, Visa/Mastercard have a lot of lock-in due to consumers only wanting to get a card that’s accepted everywhere, and merchants not bothering to support a new type of card that no consumer has. There’s a major chicken and egg problem to overcome there.
> In practice this doesn't work though, the Mastercard-Visa duopoly is an example,
MC/Visa duopoly is an example of lock-in via network effects. Not sure that that applies to a product that isn't affected by how many other people are running it.
Just in one particular country. That hurts their labs, but there are ~190 other countries in the world for Chinese labs to sell their products to, just like they do with their cars.
And businesses from these other countries would happily switch to Chinese. From security perspective both Chinese and US espionage is equally bad, so why care if it all comes down to money and performance.
Also Chinese smartphones. Huawei was about 12-18 months from becoming the biggest smartphone manufacturer in the world a few years ago. If it had been allowed to sell its phones freely in the US, I'm fairly sure Apple would have been closer to Nokia than to current-day Apple.
I don't think it will matter too much in the long run, 8 of the top 10 smartphone manufacturers are Chinese, there's nothing the US government can really do.
> More than killer AI I'm afraid of Anthropic/OpenAI going into full rent-seeking mode so that everyone working in tech is forced to fork out loads of money just to stay competitive on the market.
You should be more concerned about killer AI than rent seeking by OpenAI and Anthropic. AI evolving to the point of losing control is what scientists and researchers have predicted for years; they didn’t think it would happen this quickly but here we are.
This market is hyper competitive; the models from China and other labs are just a level or two below the frontier labs.
The thing is that the current models can ALREADY replicate most software-based products and services on the market. The open source models are not far behind. At a certain point I'm not sure it matters if the frontier models can do faster and better. I see how they're useful for really complex and cutting edge use cases, but that's not what most people are using them for.
but you are assuming that the magical wizards are the only ones who can create powerful AIs... mind you these people have been born just few decades ago. Their knowledge will be transferred and it will only take a few more decades until anyone can train powerful AIs ... you can only sit on tech for so long before everyone knows how to do it
It's not a matter of knowledge, it's a matter of resources. It takes billions of dollars of hardware to train a SOTA LLM and it's increasing all the time. You cannot possibly hope to compete as an independent or small startup.
> It takes billions of dollars of hardware to train a SOTA LLM and it's increasing all the time.
True, but it's also true that the returns from throwing money at the problem are diminishing. Unless one of those big players invents a new, proprietary paradigm, the gap between a SOTA model and an open model that runs on consumer hardware will narrow in the next 5 years.
Eventually these super expensive SXM data center GPUs will cost pennies on the dollar, and we’ll be able to snatch up H200s for our homelabs. Give it a decade.
Also eventually these WEIGHTS will leak. You can’t have the world’s most valuable data that can just be copied to a hard drive stay in the bottle forever, even if it’s worth a billion dollars. Somehow, some way, that genie’s going to get out, be it by some spiteful employee with nothing to lose, some state actor, or just a fuck up of epic proportions.
Unless, of course, the powerful manage to scare everyone about how the machines will kill us all and so AI technology needs to be properly controlled by the relevant authorities, and anyone making/using an unlicensed AI is arrested and jailed.
With Gemma-4 open and running on laptops and phones I see the flip side. How many non-HN users or researchers even need Opus 4.6e level performance? OpenAI, Anthropic and Google may be “rent seeking” from large corporations — like the Oracles and IBMs.
This is my nightmare about AI; not that the machines will kill all the humans, but that access is preferentially granted to the powerful and it's used to maintain the current power structure in blatant disregard of our democratic and meritocratic ideals, probably using "security" as the justification (as usual).
> I get the security aspect, but if we've hit that point any reasonably sophisticated model past this point will be able to do the damage they claim it can do. They might as well be telling us they're closing up shop for consumer models.
I read it like I always read the GPT-2 announcement no matter what others say: It's *not* being called "too dangerous to ever release", but rather "we need to be mindful, knowing perfectly well that other AI companies can replicate this imminently".
The important corps (so presumably including the Linux Foundation, bigger banks and power stations, and quite possibly excluding x.com) will get access now, and some other LLM which is just as capable will give it to everyone in 3 months time at which point there's no benefit to Anthropic keeping it off-limits.
This is why the EAs, and their almost comic-book-villain projects like "control AI dot com" cannot be allowed to win. One private company gatekeeping access to revolutionary technology is riskier than any consequence of the technology itself.
Having done a quick search of "control AI dot com", it seems their intent is to educate lawmakers & government in order to aid development of a strong regulatory framework around frontier AI development.
Not sure how this is consistent with "One private company gatekeeping access to revolutionary technology"?
> strong regulatory framework around frontier AI development
You have to decode feel-good words into the concrete policy. The EAs believe that the state should prohibit entities not aligned with their philosophy to develop AIs beyond a certain power level.
And what is malicious about that ideology? I think EAs tend to like the smell of their farts way too much, but their views on AI safety don't seem so bad. I think their thoughts on hypothetical super intelligence or AGI are too focused on control (alignment) and should also focus on AI welfare, but that's more a point of disagreement that I doubt they'd try to forbid.
> They should just say they'll never release a model of this caliber to the public at this point and say out loud we'll only get gimped versions.
That’s not going to happen. If you recall, OpenAI didn’t release a model a few years ago because they felt it was too dangerous.
Anthropic is giving the industry a heads up and time to patch their software.
They said there are exploitable vulnerabilities in every major operating system.
But in 6 months every frontier model will be able to do the same things. So Anthropic doesn’t have the luxury of not shipping their best models. But they also have to be responsible as well.
I think they already said somewhere that they can't release Mythos because it requires absurdly large amounts of compute. The economics of releasing it just don't work.
> A jump that we will never be able to use since we're not part of the seemingly minimum 100 billion dollar company club as requirement to be allowed to use it.
> They should just say they'll never release a model of this caliber to the public at this point and say out loud we'll only get gimped
Duh, this was fucking obvious from the start. The only people saying otherwise were zealots who needed a quick line to dismiss legitimate concerns.
Are these fair comparisons? It seems like mythos is going to be like a 5.4 ultra or Gemini Deepthink tier model, where access is limited and token usage per query is totally off the charts.
> Importantly, we find that when used in an interactive, synchronous, “hands-on-keyboard” pattern, the benefits of the model were less clear. When used in this fashion, some users perceived Mythos Preview as too slow and did not realize as much value. Autonomous, long-running agent harnesses better elicited the model’s coding capabilities. (p201)
^^ From the surrounding context, this could just be because the model tends to do a lot of work in the background which naturally takes time.
> Terminal-Bench 2.0 timeouts get quite restrictive at times, especially with thinking models, which risks hiding real capabilities jumps behind seemingly uncorrelated confounders like sampling speed. Moreover, some Terminal-Bench 2.0 tasks have ambiguities and limited resource specs that don’t properly allow agents to explore the full solution space — both being currently addressed by the maintainers in the 2.1 update. To exclusively measure agentic coding capabilities net of the confounders, we also ran Terminal-Bench with the latest 2.1 fixes available on GitHub, while increasing the timeout limits to 4 hours (roughly four times the 2.0 baseline). This brought the mean reward to 92.1%. (p188)
> ...Mythos Preview represents only a modest accuracy improvement over our best Claude Opus 4.6 score (86.9% vs. 83.7%). However, the model achieves this score with a considerably smaller token footprint: the best Mythos Preview result uses 4.9× fewer tokens per task than Opus 4.6 (226k vs. 1.11M tokens per task). (p191)
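Sanity-checking the quoted figures (the variable names are mine, the numbers are from the quote above):

```python
# Tokens per task, as quoted from the system card.
opus_tokens = 1_110_000    # 1.11M
mythos_tokens = 226_000    # 226k

token_ratio = opus_tokens / mythos_tokens
accuracy_gain = 86.9 - 83.7  # percentage points

print(f"{token_ratio:.1f}x fewer tokens, +{accuracy_gain:.1f} points")
# prints: 4.9x fewer tokens, +3.2 points
```

The 4.9× claim checks out against the raw token counts.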
The first point is along the lines of what I'd expect given that claude code is generally reliable at this point. A model's raw intelligence doesn't seem as important right now compared to being able to support arbitrary length context.
If it only really shines once you give it a long leash and hours to work, that’s a rough fit for actual hands-on coding. A lot of that pain is just cloud round-trip tax showing up in the product, which is a big part of why we’re building rig.ai around local inference and a tight interactive loop.
I'm curious if frontier labs use any forms of compression on their models to improve performance. The small % drop of Q8 or FP8 would still put it ahead of Opus, but should double token throughput. Maybe then interactive use would feel like an improvement.
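For anyone unfamiliar with what Q8 means mechanically, here's a minimal sketch of symmetric 8-bit weight quantization. This is the generic textbook scheme with function names I picked; I have no idea what, if anything, the frontier labs actually do.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the 8-bit integers."""
    return [q * scale for q in quantized]

weights = [0.12, -0.83, 0.05, 0.47, -0.31]   # made-up weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, max_error)
```

Each weight takes 8 bits instead of 16, so memory traffic per token roughly halves, which is where the hoped-for throughput gain would come from. The rounding error per weight is bounded by half the scale.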
The quote comparing them here was for BrowseComp which "tests an agent's ability to find hard-to-locate information on the open web." (for those wondering). The new model seems significantly better than Opus4.6 judging by the 'Overall results summary'
Good catch. If it's "too slow" even when run in a state-of-the-art datacenter environment, this "Mythos" model is most closely comparable to the "Deep Research" modes for GPT and Gemini, which Claude formerly lacked any direct equivalent for.
I don't think that's what's being hinted at. The system card seems to say that the model is both token efficient and slow in practice. Deep research modes generally work by having many subagents/large token spend. So this is more likely because each token just takes longer to produce, which would be because the model is simply much larger.
By Epoch AI's datacenter tracking methods, Anthropic has had access to the largest amount of contiguous compute since late last year. So this might simply be the end result of being the first to have the capacity to conduct a training run of this size. Or the first seemingly successful one at any rate.
"Slow and token-efficient" could be achieved quite trivially by taking an existing large MoE model and increasing the amount of active experts per layer, thus decreasing sparsity. The broader point is that to end users, Mythos behaves just like Deep Research: having it be "more token efficient" compared to running swarms of subagents is not something that impacts them directly.
Not discussing Mythos here, but Opus. Opus to me has been significantly better at SWE than GPT or Gemini - that gets me confused why Opus is ranking clearly lower than GPT, and even lower than Gemini.
Agree, I never actually had great success with Opus. I think it's the failures that are annoying; it's probably better than codex when it's "good", but it fails in annoying ways that I think codex very seldom does.
I wouldn't call codex considerably better. It may depend on specific codebase and your expectations, but codex produces more "abstraction for the sake of abstraction" even on simple tasks, while opus in my experience usually chooses right level of abstraction for given task.
Humanity's Last Exam (HLE) is already insanely difficult. It introduces 2,500 questions spanning mathematics, humanities, natural sciences, ancient languages, ...
I've never understood the point of things like HLE, it doesn't really prove or show anything since 99.99% of humans can't do a single question on this exam.
That is, it's easy to make benchmarks which humans are bad at, humans are really bad at many things.
Divide 123094382345234523452345111 by 0.1234243131324, guess what, humans would find that hard, computers easy. But it doesn't mean much.
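To underline how easy that is for a machine, here it is done exactly with Python's standard library:

```python
from decimal import Decimal, getcontext
from fractions import Fraction

dividend = Fraction(123094382345234523452345111)
divisor = Fraction("0.1234243131324")
quotient = dividend / divisor  # exact rational result, no rounding

# Render the exact fraction as a decimal with 40 significant digits.
getcontext().prec = 40
approx = Decimal(quotient.numerator) / Decimal(quotient.denominator)
print(approx)
```

The result is a bit under 10^27, computed instantly.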
Humanity's last exam (HLE) couldn't be completed by most of humanity, the vast majority, so it doesn't really capture anything about humanity or mean much if a computer can do it.
the point is that each question is something that a specialist in a field would be able to do, but deems challenging enough that the ability to solve it would imply significant general usefulness in that domain
Given that for a number of these benchmarks, it seems to be barely competitive with the previous gen Opus 4.6 or GPT-5.4, I don't know what to make of the significant jumps on other benchmarks within these same categories. Training to the test? Better training?
And the decision to withhold general release (of a 'preview' no less!) seems to be well, odd. And the decision to release a 'preview' version to specific companies? You know any production teams at these massive companies that would work with a 'preview' anything? R&D teams, sure, but production? Part of me wants to LoL.
What are they trying to do? Induce FOMO and stop subscriber bleed-out stemming from the recent negative headlines around problems with using Claude?
> Given that for a number of these benchmarks, it seems to be barely competitive with the previous gen
We're not reading the same numbers I think. Compared to Opus 4.6, it's a big jump nearly in every single bench GP posted. They're "only" catching up to Google's Gemini on GPQA and MMMLU but they're still beating their own Opus 4.6 results on these two.
This sounds like a much better model than Opus 4.6.
That's why I listed out the ones where it is barely competitive from @babelfish's table, which itself is extracted from Pg 186 & 187 of the System Card, which has the comparison with Opus 4.6, GPT 5.4 and Gemini 3.1 Pro.
Sure, it may be better than Opus 4.6 on some of those, but barely achieves a small increase over GPT-5.4 on the ones I called out.
It's higher than all other models except vs Gemini 3.1 Pro on MMMLU
MMMLU is generally thought to be maxed out, as in, it might not be possible to score higher than those scores.
> Overall, they estimated that 6.5% of questions in MMLU contained an error, suggesting the maximum attainable score was significantly below 100%[1]
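A rough way to turn that error rate into a ceiling. This assumes (my assumption, not the quote's) that some fraction of the flawed questions mark a correct answer as wrong:

```python
def score_ceiling(error_rate, penalized_fraction=1.0):
    """Best achievable score if flawed questions penalize correct answers.

    penalized_fraction is a hypothetical knob: the share of flawed
    questions on which a correct answer would be graded as wrong.
    """
    return 1.0 - error_rate * penalized_fraction

worst_case = score_ceiling(0.065)        # every flawed question costs a point
milder_case = score_ceiling(0.065, 0.5)  # only half of them do
print(f"{worst_case:.1%} {milder_case:.2%}")
```

Under the worst-case assumption the ceiling is 93.5%; if only half the flawed questions bite, it's 96.75%.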
Other models get close on GPQA Diamond, but it wouldn't be surprising to anyone if the max possible on that was around the 95% the top models are scoring.
barely competitive ? Mythos column is the first column.
You are the only person with this take on Hacker News; everyone else is saying "this is a massive jump". FWIW, the data you list shows the biggest jump I remember for Mythos.
Can you please stop posting comments with personal swipes in them? You've unfortunately been doing it repeatedly. It's not what this site is for, and destroys what it is for.
You're right, I apologize for that. I have been responding with annoyance rather than walking away when I receive replies that appear to be ignoring context.
Let's be clear: your entire post is just pure, unadulterated FUD. You first claim, based on cherry-picked benchmarks, that Mythos is actually only "barely competitive" with existing models, then suggest they must be training to the test, then call it "odd" that they are withholding the release despite detailed and forthcoming explanations from Anthropic regarding why they are doing that, then wrap it up with the completely unsubstantiated claim that they must be bleeding subscribers and that this must just be to stop that bleed.
Honestly we are all sleeping on GPT-5.4. Particularly with the influx of Claude users recently (and increasingly unstable platform) Codex has been added to my rotation and it's surprising me.
It really is, for complex tasks. Claude excels at low-mid complexity (CRUD apps, most business apps). For anything somewhat out of the distribution, codex at the moment has no peer.
I have always used Claude at max thinking levels since it launched. It has never been up to the task. For clarity, the task being this: https://github.com/tsoniclang/tsonic
Meanwhile, there are half a dozen other projects (business apps, web apps etc) where it works well.
GPT is shit at writing code. It's not dumb - extra high thinking is really good at catching stuff - but it's like letting a smart junior into your codebase - ignore all the conventions, surrounding context, just slop all over the place to get it working. Claude is just a level above in terms of editing code.
Very different experience for me. Codex 5.3+ on xhigh are the only models I've tried so far that write reasonably decent C++ (domains: desktop GUI, robotics, game engine dev, embedded stuff, general systems engineering-type codebases), and idiomatic code in languages not well-represented in training data, e.g. QML. One thing I like is explicitly that it knows better when to stop, instead of brute-forcing a solution by spamming bespoke helpers everywhere no rational dev would write that way.
Not always, no, and it takes investment in good prompting/guardrails/plans/explicit test recipes for sure. I'm still on average better at programming in context than Codex 5.4, even if slower. But in terms of "task complexity I can entrust to a model and not be completely disappointed and annoyed", it scores the best so far. Saves a lot on review/iteration overhead.
It's annoying, too, because I don't much like OpenAI as a company.
Same background as you, and same exact experience as you. Opus and Gemini have not come close to Codex for C++ work. I also run exclusively on xhigh. Its handling of complexity is unmatched.
At least until next week when Mythos and GPT 6 throw it all up in the air again.
Not my experience. GPT 5.4 walks all over Claude from what I've worked with, and it's Claude that is the one willing to just go do unnecessary stuff that was never asked for or implement the more hacky solutions to things without a care for maintainability/readability.
But I do not use extra high thinking unless it's for code review. I sit at GPT 5.4 high 95% of the time.
ChatGPT 5.4 with extra high reasoning has worked really well for me, and I don't notice a huge difference with Opus 4.6 with high reasoning (those are the 2 models/thinking modes I've used the most in the last month or so).
And as a bonus: GPT is slow. I’m doing a lot of RE (IDA Pro + MCP), even when 5.4 gives a little bit better guesses (rarely, but happens) - it takes x2-x4 longer. So, it’s just easier to reiterate with Opus
This. People drastically underestimate how much more useful a lightning fast, slightly dumb model is compared to a super smart but mega slow model. Sure, you may need to bust out the beef now and then. However, for the overwhelming majority of work the fast stupid model is a better fit.
I've been messing with using Claude, Codex, and Kimi even for reverse engineering at https://decomp.dev/ it's a ton of fun.
Great because matching bytes is a scoring function that's easy for the models to understand and make progress on.
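To illustrate why that's such a friendly objective, here's a hypothetical scoring function of that kind. I don't know how decomp.dev actually scores matches; this is just the simplest version I can think of.

```python
def byte_match_score(candidate: bytes, target: bytes) -> float:
    """Fraction of positions where the compiled output matches the target.

    Length mismatches count against the score via the longer length.
    """
    if not candidate and not target:
        return 1.0
    matches = sum(a == b for a, b in zip(candidate, target))
    return matches / max(len(candidate), len(target))

target = bytes([0x55, 0x48, 0x89, 0xE5])
attempt = bytes([0x55, 0x48, 0xFF, 0xE5])
print(byte_match_score(attempt, target))  # 0.75
```

A single number that goes up as you get closer gives the model something to hill-climb on, unlike most "is this code good?" judgments.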
Yes, it's becoming clear that OpenAI kinda sucks at alignment. GPT-5 can pass all the benchmarks but it just doesn't "feel good" like Claude or Gemini.
An alternative but similar formulation of that statement is that Anthropic has spent more training effort in getting the model to “feel good” rather than being correct on verifiable tasks. Which more or less tracks with my experience of using the model.
Alignment is a subspace of capability. Feeling good is nice, but it's also a manifestation of the level that the model can predict what I do and don't want it to do. The more accurately it can predict my intentions without me having to spell them out explicitly in the prompt, the more helpful it is.
GPT-5 is good at benchmarks, but benchmarks are more forgiving of a misaligned model. Many real world tasks often don't require strong reasoning abilities or high intelligence, so much as the ability to understand what the task is with a minimal prompt.
Not every shop assistant needs a physics degree, and not every physics professor is necessarily qualified to be a shop assistant. A person, or LLM, can be very smart while at the same time very bad at understanding people.
For example, if GPT-5 takes my code and rearranges something for no reason, that's not going to affect its benchmarks because the code will still produce the same answers. But now I have to spend more time reviewing its output to make sure it hasn't done that. The more time I have to spend post-processing its output, the lower its capabilities are since the measurement of capability on real world tasks is often the amount of time saved.
Whenever I come back to ChatGPT after using Claude or Gemini for an extended period, I’m really struck by the “AI-ness.” All the verbal tics and, truly, sloppishness, have been trained away by the other, more human-feeling models at this point.
It still has a very ... plastic feeling. The way it writes feels cheap somehow. I don't know why, but Claude seems much more natural to me. I enjoy reading its writing a lot more.
That said, I'll often throw a prompt into both claude and chatgpt and read both answers. GPT is frequently smarter.
This has been my experience. With very very rigid constraints it does ok, but without them it will optimize expediency and getting it done at the expense of integrating with the broader system.
Me: Let's figure out how to clone our company Wordpress theme in Hugo. Here're some tools you can use, here's a way to compare screenshots, iterate until 0% difference.
Codex: Okay Boss! I did the thing! I couldn't get the CSS to match so I just took PNGs of the original site and put them in place! Matches 100%!
My impression was entirely the opposite; the unsolved subset of SWE-bench verified problems are memorizable (solutions are pulled from public GitHub repos) and the evaluators are often so brittle or disconnected from the problem statement that the only way to pass is to regurgitate a memorized solution.
OpenAI had a whole post about this, where they recommended switching to SWE-bench Pro as a better (but still imperfect) benchmark:
> We audited a 27.6% subset of the dataset that models often failed to solve and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submissions
> SWE-bench problems are sourced from open-source repositories many model providers use for training purposes. In our analysis we found that all frontier models we tested were able to reproduce the original, human-written bug fix
> improvements on SWE-bench Verified no longer reflect meaningful improvements in models’ real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time
> We’re building new, uncontaminated evaluations to better track coding capabilities, and we think this is an important area to focus on for the wider research community. Until we have those, OpenAI recommends reporting results for SWE-bench Pro.
> My impression was entirely the opposite; the unsolved subset of SWE-bench verified problems are memorizable (solutions are pulled from public GitHub repos) and the evaluators are often so brittle or disconnected from the problem statement that the only way to pass is to regurgitate a memorized solution.
Anthropic accounts for this
> To detect memorization, we use a Claude-based auditor that compares each model-generated patch against the gold patch and assigns a [0, 1] memorization probability. The auditor weighs concrete signals—verbatim code reproduction when alternative approaches exist, distinctive comment text matching ground truth, and more—and is instructed to discount overlap that any competent solver would produce given the problem constraints.
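As a rough illustration of the simplest signal in that list (verbatim reproduction), here is a minimal sketch that scores line-level overlap between a generated patch and the gold patch using Python's difflib. This is not Anthropic's auditor, which is described as model-based and as discounting overlap that any competent solver would produce; the function name `verbatim_overlap` and the idea of using a plain similarity ratio are assumptions made for illustration only.

```python
import difflib


def verbatim_overlap(candidate_patch: str, gold_patch: str) -> float:
    """Return a similarity ratio in [0, 1] between two patches.

    Crude proxy only (hypothetical helper): it cannot tell memorization
    apart from overlap that the problem constraints force on any
    correct solution.
    """
    # Compare line by line, ignoring surrounding whitespace and blank lines.
    cand = [ln.strip() for ln in candidate_patch.splitlines() if ln.strip()]
    gold = [ln.strip() for ln in gold_patch.splitlines() if ln.strip()]
    return difflib.SequenceMatcher(None, cand, gold).ratio()
```

A high ratio on a problem with many valid fixes would be a reason to look closer, not proof of memorization.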
Funny, I made my own model at home and got even higher scores than these. I'm a bit concerned about releasing it, though, so I'm just going to keep it local for now.
I'm pretty sure that will be true with AI as well.
No accounting for taste, but part of what makes code hard for me to reason about is when it has lots of combinatorial complexity, where the number of states that can happen makes it difficult to know all the possible good and bad states that your program can be in. Combinatorial complexity is something that objectively can be expensive for any form of computer, be it a human brain or silicon. If the code is written in such a way that the number of correct and incorrect states is impossible to know, then the problem becomes undecidable.
I do think there is code that is "objectively" difficult to work with.
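To put a number on the state-space point, here is a minimal sketch that enumerates every combination of independent boolean flags. The function `count_states` is a hypothetical helper, and real programs rarely have fully independent flags, so treat this as an upper-bound illustration rather than a measurement of any actual codebase.

```python
from itertools import product


def count_states(num_flags: int) -> int:
    """Count the combinations of num_flags independent boolean flags.

    Enumerates them explicitly to show that the work doubles with
    every flag added (2 ** num_flags combinations in total).
    """
    return sum(1 for _ in product((False, True), repeat=num_flags))
```

Three flags give 8 combinations and ten flags give 1024, which is the kind of growth that is expensive for a reader and for a machine alike.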
There are a number of things that make code hard to reason about for humans, and combinatorial complexity is just one of them. Another one is, say, size of working memory, or having to navigate across a large number of files to understand a piece of logic. These two examples are not necessarily expensive for computers.
I don't entirely disagree that there is code that's objectively difficult to work with, but I suspect that the Venn diagram of "code that's hard for humans" and "code that's hard for computers" has much less overlap than you're suggesting.
Certainly with current models I have found that the Venn diagram of "code that's hard for humans" and "code that's hard for computers" has actually been remarkably similar, I suspect because it's trained on a lot of terrible code on Github.
I'm sure that these models will get better, and I agree that the overlap will be lower at that point, but I still think what I said will be true.
I wouldn't expect so. These machines have been trained on natural language, after all. They see the world through an anthropomorphic lens. IME & from what I've heard, they struggle with inexpressive code in much the same way humans do.
What do you think about the argument that we are entering a world where code is so cheap to write, you can throw the old one away and build a new one after you've validated the business model, found a niche, whatever?
I mean, it seems like that has always been true to an extent, but now it may be even more true? Once you know you're sitting on a lode of gold, it's a lot easier to know how much to invest in the mine.
It hasn't always been true, it started with rapid development tools in the late 90's I believe.
And some people thought they were building "disposable" code, only to see their hacks being used for decades. I'm thinking about VB but also behemoth Excel files.
I guess the question is, are the issues not worth fixing because implementing a fix is extremely expensive, or because the improvements from fixing it were anticipated to be minor? I assume the answer is generally a mix of the two.
Someone still has to figure out how to make the experiences of the two generations consistent in the ways they need to be and differ only in the ways they don't.
The tl;dr of this is that I don't think that the code itself is what needs to be preserved, the prompt and chat is the actual important and useful thing here. At some point I think it makes more sense to fine tune the prompts to get increasingly more specific and just regenerate the code based on that spec, and store that in Git.
> At some point I think it makes more sense to fine tune the prompts to get increasingly more specific and just regenerate the code based on that spec, and store that in Git.
Generating code using a non-deterministic code generator is a bold strategy. Just gotta hope that your next pull of the code slot machine doesn’t introduce a bug or ten.
We're already merging code that has generated bugs from the slot machine. People aren't actually reading through 10,000 line pull requests most of the time, and people aren't really reviewing every line of code.
Given that, we should instead tune the prompts well enough to not leave things to chance. Write automated tests to make sure that inputs and outputs are ok, write your specs so specifically that there's no room for ambiguity. Test these things multiple times locally to make sure you're getting consistent results.
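A minimal sketch of what that could look like, assuming a hypothetical `generate` callable that stands in for one pull of the code generator and returns a function to test. The hand-written cases are the fixed part; the generated code is the part that is allowed to vary. Names and types here are illustrative, not any tool's real API.

```python
from typing import Callable, Sequence, Tuple

# One hand-written case: (input, expected output).
Case = Tuple[int, int]


def passes_all(func: Callable[[int], int], cases: Sequence[Case]) -> bool:
    """Check one generated function against every hand-written case."""
    return all(func(x) == expected for x, expected in cases)


def consistent_over_runs(
    generate: Callable[[], Callable[[int], int]],
    cases: Sequence[Case],
    runs: int = 5,
) -> bool:
    """Regenerate `runs` times and require every result to pass all cases."""
    return all(passes_all(generate(), cases) for _ in range(runs))
```

Passing five regenerations only says the spec was tight enough for those cases on those runs; it does not make the generator deterministic.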
> Write automated tests to make sure that inputs and outputs are ok
Write them by hand or generate them and check them in? You can’t escape the non-determinism inherent in LLMs. Eventually something has to be locked in place, be it the application code or the test code. So you can’t just have the LLM generate tests from a spec dynamically either.
> write your specs so specifically that there's no room for ambiguity
Using English prose, well known for its lack of ambiguity. Even extremely detailed RFCs have historically left lots of room for debate about meaning and intention. That’s the problem with not using actual code to “encode” how the system functions.
I get where you’re coming from but I think it’s a flawed idea. Less flawed than checking in vibe-coded feature changes, but still flawed.
> Write them by hand or generate them and check them in?
Yes, written by hand. I think that ultimately you should know what valid inputs and outputs are and as such the tests should be written by a human in accordance with the spec.
> Less flawed than checking in vibe-coded feature changes, but still flawed.
This is what I'm trying to get at. I agree it's not perfect, but I'm arguing it's less evil than what is currently happening.
Observability into how a foundation-model-generated product arrived at that state is significantly more important than the underlying codebase, as it's the prompt context that is the architecture.
Yeah, I'm just a little tired of seeing these multi-thousand-line pull requests where no one has actually looked at the code.
The solution people are coming up with now is using AI for code reviews and I have to ask "why involve Git at all then?". If AI is writing the code, testing the code, reviewing the code, and merging the code, then it seems to me that we can just remove these steps and simply PR the prompts themselves.
You don't actually need source control to be able to roll back to any particular version that was in use. A series of tarballs will let you do that.
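For concreteness, a minimal sketch of the tarball approach using Python's standard tarfile module. The function names `snapshot` and `restore` and the one-archive-per-version layout are assumptions for illustration; nothing here tracks what changed between versions.

```python
import tarfile
from pathlib import Path


def snapshot(src_dir: str, archive_path: str) -> None:
    """Save the whole tree under src_dir as one gzip-compressed tarball."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname="." stores paths relative to src_dir, not absolute ones.
        tar.add(src_dir, arcname=".")


def restore(archive_path: str, dest_dir: str) -> None:
    """Unpack a snapshot into dest_dir, creating the directory if needed."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)
```

Rolling back is then just restoring an older archive, with the archive kept outside the directory being snapshotted.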
The entire purpose of source control is to let you reason about change sets to help you make decisions about the direction that development (including bug fixes) will take.
If people are still using git but not really using it, are they doing so simply to take advantage of free resources such as github and test runners, or are they still using it because they don't want to admit to themselves that they've completely lost control?
> are they still using it because they don't want to admit to themselves that they've completely lost control?
I think this is the case, or at least close.
I think a lot of people are still convincing themselves that they are the ones "writing" it because they're the ones putting their names on the pull request.
It reminds me of a lot of early Java, where it would make you feel like you were being very productive because everything that would take you eight lines in any other language would take thirty lines across three files to do in Java. Even though you didn't really "do" anything (and indeed Netbeans or IntelliJ or Eclipse was likely generating a lot of that bootstrapping code anyway), people would act like they were doing a lot of work because of a high number of lines of code.
Java is considerably less terrible now, to a point where I actually sort of begrudgingly like writing it, but early Java (IMO before Java 21 and especially before 11) was very bad about unnecessary verbosity.
> If people are still using git but not really using it, are they doing so simply to take advantage of free resources such as github and test runners,
does it have to be free to be useful? the CD part is even more important than before, and if they still use git as their input, and everyone including the LLM is already familiar with git, what's the need to get rid of it?
there's value in git as a tool everyone knows the basics of, and as a common interface of communicating code to different systems.
passing tarballs around requires defining a bunch of new interfaces for those tarballs which adds a cost to every integration that you'd otherwise get for about free if you used git
A series of tarballs is really unwieldy for that though. Even if you don't want to use git, and even if the LLM is doing everything, having discrete pieces like "added GitHub oauth to login" and "added profile picture to account page" as different commits is still valuable for when you have to ask the LLM "hey about the profile picture on the account page".
Also, the approach you described is what a number of AI for Code Review products are using under-the-hood, but human-in-the-loop is still recognized as critical.
It's the same way how written design docs and comments are significantly more valuable than uncommented and undocumented source.
Because LLMs are designed as emulators of actual human reasoning, it wouldn't surprise me if we discover that the things that make software easy for humans to reason about also make it easier for LLMs to reason about.
No per-agent auto-worktree? This is the killer feature of Conductor, having to type `/worktree` into every new chat isn't really a resolution. Not even sure what selecting 'Worktree' for a new chat does
"having to type `/worktree` into every new chat isn't really a resolution"
I don't know what you're talking about. My experience with Cursor (before this new v3) is that new Cursor agent tabs / cloud agents already intelligently manage worktrees to prevent conflicts.
Wow, maybe something is wrong with my setup. In Cursor 3, I am clicking "New Agent" at the top left. My root repository is correctly listed on top of the composer, and I clicked the icon to the right of it and selected 'Worktree'. Then, I instruct the model to run `pwd` and tell me its git status. It's always just on `main` in my root repository. I dug through the settings and couldn't find anything, and after finding this comment[0] on their forums gave up. Would you mind sharing a bit more about your setup/how it works?
They apparently removed this in 3.0. - I couldn't begin to guess why.
"Automatic management of worktrees was removed in Cursor 3.0 and replaced with the new commands /worktree and /best-of-n. We also have added worktree support for the Cursor CLI.
Management of worktrees is now fully agentic. This makes it simpler to support use cases such as starting an agent, and only doing work in a worktree later on in the chat's lifecycle.
/best-of-n makes comparing the results of multiple models much easier. The parent agent will provide commentary on the different results and you can pick the best one. Additionally, you can even ask the parent agent to merge different parts of the different implementations into a single commit.
If you had agents that were previously running in a worktree, those chats will still work. However, you will need to use the new commands to start new agents in worktrees."
Every company I've worked at has still had a few engineers who insist on working exclusively in the CLI with vim/emacs prior to AI. Every other engineer used some flavor of a desktop app ranging from more minimal editors to incredibly complex IDEs. I expect we land back on UIs long term.
On the benchmarks compared above, it is closer to other, larger open-weights models, and on par with GPT-OSS 120B, for which I also have a frame of reference.
Probably their auditors? Lying about this would be tantamount to (very serious) securities fraud. Not sure what you're basing your allegations on besides "trust me bro"
Why would lying about having E2EE be securities (as in stock market) fraud? Would that make any lie ever told by a corporation equate to stock market fraud?