

I've put that link in the top text - thanks!

Edit: looks like https://news.ycombinator.com/item?id=47692043 was posted earlier and is also on the frontpage so we'll merge thither instead.


Combined results (Claude Mythos / Claude Opus 4.6 / GPT-5.4 / Gemini 3.1 Pro)

  SWE-bench Verified:        93.9% / 80.8% / —     / 80.6%
  SWE-bench Pro:             77.8% / 53.4% / 57.7% / 54.2%
  SWE-bench Multilingual:    87.3% / 77.8% / —     / —
  SWE-bench Multimodal:      59.0% / 27.1% / —     / —
  Terminal-Bench 2.0:        82.0% / 65.4% / 75.1% / 68.5%

  GPQA Diamond:              94.5% / 91.3% / 92.8% / 94.3%
  MMMLU:                     92.7% / 91.1% / —     / 92.6–93.6%
  USAMO:                     97.6% / 42.3% / 95.2% / 74.4%
  GraphWalks BFS 256K–1M:    80.0% / 38.7% / 21.4% / —

  HLE (no tools):            56.8% / 40.0% / 39.8% / 44.4%
  HLE (with tools):          64.7% / 53.1% / 52.1% / 51.4%

  CharXiv (no tools):        86.1% / 61.5% / —     / —
  CharXiv (with tools):      93.2% / 78.9% / —     / —

  OSWorld:                   79.6% / 72.7% / 75.0% / —

Haven't seen a jump this large since I don't even know, years? Too bad they are not releasing it anytime soon (there is no need as they are still currently the leader).

There's speculation that next Tuesday will be a big day for OpenAI and possibly GPT 6. Anthropic showed their hand today.

Sounds like a good opportunity to pause spending on nerfed 4.6 and wait for the new model to be released and then max out over 2 weeks before it gets nerfed again.


the performance degradation I've seen isn't quality/completion but duration, I get good results but much less quickly than I did before 4.6. Still, it's just anecdata, but a lot of folks seem to feel the same.

Been reading posts like these for 3 years now. There’s multiple sites with #s. I’m willing to buy “I’m paying rent on someone’s agent harness and god knows what’s in the system prompt rn”, but in the face of numbers, gotta discount the anecdotal.

You're probably right. It's probably more likely that for some period of time I forgot that I switched to the large context Opus vs Sonnet and it was not needed for the level of complexity of my work.

Yeah, why trust your actual experience over numbers? Nothing surer than synthetic benchmarks

Strawman, and, synthetic benchmark? :)

I don't believe that trackers like this are trustworthy. There's an enormous financial motive to cheat and these companies have a track record of unethical conduct.

If I was VP of Unethical Business Strategy at OpenAI or Anthropic, the first thing I'd do is put in place an automated system which flags accounts, prompts, IPs, and usage patterns associated with these benchmarks and direct their usage to a dedicated compute pool which wouldn't be affected by these changes.


This just looks like random noise to me? Is it also random on short timespans, like running it 10x in a row?

Explained in the methodology at the bottom of this page: https://marginlab.ai/trackers/claude-code/

That does not sound very believable. Last time Anthropic released a flagship model, it was followed by GPT Codex literally that afternoon.

Y'all know they're teaching to the test. I'll wait till someone devises a novel test that isn't contained in the datasets. Sure, they're still powerful.

My understanding is GPT 6 works via synaptic space reasoning... which I find terrifying. I hope if true, OpenAI does some safety testing on that, beyond what they normally do.

From the recent New Yorker piece on Sam:

“My vibes don’t match a lot of the traditional A.I.-safety stuff,” Altman said. He insisted that he continued to prioritize these matters, but when pressed for specifics he was vague: “We still will run safety projects, or at least safety-adjacent projects.” When we asked to interview researchers at the company who were working on existential safety—the kinds of issues that could mean, as Altman once put it, “lights-out for all of us”—an OpenAI representative seemed confused. “What do you mean by ‘existential safety’?” he replied. “That’s not, like, a thing.”


Amusing! Even if they believe that, they should know the company communicated the opposite earlier.

No chance an OpenAI spokesperson doesn't know what existential safety is

I did not read the response as...

>Please provide the definition of Existential Safety.

I read:

>Are you mentally stable? Our product would never hurt humanity--how could any language model?


The absolute gall of this guy to laugh off a question about x-risks. Meanwhile, also Sam Altman, in 2015: "Development of superhuman machine intelligence is probably the greatest threat to the continued existence of humanity. There are other threats that I think are more certain to happen (for example, an engineered virus with a long incubation period and a high mortality rate) but are unlikely to destroy every human in the universe in the way that SMI could. Also, most of these other big threats are already widely feared." [1]

[1] https://blog.samaltman.com/machine-intelligence-part-1


Why are these people always like this.

Likely an improvement on:

> We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space. Our model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test-time. This stands in contrast to mainstream reasoning models that scale up compute by producing more tokens. Unlike approaches based on chain-of-thought, our approach does not require any specialized training data, can work with small context windows, and can capture types of reasoning that are not easily represented in words. We scale a proof-of-concept model to 3.5 billion parameters and 800 billion tokens. We show that the resulting model can improve its performance on reasoning benchmarks, sometimes dramatically, up to a computation load equivalent to 50 billion parameters.

<https://arxiv.org/abs/2502.05171>
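
For anyone wondering what "iterating a recurrent block" means mechanically, here is a toy sketch. It is not the paper's architecture: `recurrent_block`, the fixed `weight`, and the tanh update are all made up for illustration. The only point it shows is that the same block, with the same weights, is applied a variable number of times to a latent state, so more test-time compute means more iterations rather than more tokens.

```python
import math

def recurrent_block(state, embedded_input, weight=0.5):
    """One shared-weight step (hypothetical): mix the latent state with the input embedding."""
    return [math.tanh(weight * s + x) for s, x in zip(state, embedded_input)]

def latent_reason(embedded_input, num_iterations):
    """Unroll the same block num_iterations times; no tokens are produced in between."""
    state = [0.0] * len(embedded_input)
    for _ in range(num_iterations):
        state = recurrent_block(state, embedded_input)
    return state

# Same input, different test-time depth (toy numbers, not from the paper).
shallow = latent_reason([0.1, -0.2], num_iterations=1)
deep = latent_reason([0.1, -0.2], num_iterations=8)
print(shallow, deep)
```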


Oh you mean literally the thing in AI2027 that gets everyone killed? Wonderful.

AI 2027 is not a real thing which happened. At best, it is informed speculation.

Funny if you open their website and go to April 2026 you literally see this: 26b revenue (Anthropic beat 30b) + pro human hacking (mythos?).

I don’t think they were meant as predictions, but they've made great calls so far.


I agree that they called many things remarkably well! That doesn't change the fact that AI 2027 is not a thing which happened, so it isn't valid to point out "this killed us in AI 2027." There are many reasons to want to preserve CoT monitorability. Instead of AI 2027, I'd point to https://arxiv.org/html/2507.11473.

That sounds really interesting. Do you have any hints on where to read more?

Oh, of course they will /s

Is this even real? Coming off the heels of GLM5.1's announcement, this feels almost like a Llama 4 launch to head off competition.

not much of a jump 94.5% / 91.3%

Actually, going from 91.3% to 94.5% is a significant jump, because it means the model has gotten a lot better at solving the hardest problems thrown at it. This has downstream effects as well: it means that during long implementation tasks, instead of getting stuck at the most challenging parts and stopping (or going in loops!), it can now get past them to finish the implementation.

We can look at the same numbers in a different way:

  Error with 91.3% = 8.7%
  Error with 94.5% = 5.5%

  Error reduction = 8.7% - 5.5% = 3.2 percentage points
  Relative improvement = 3.2 / 8.7 = 36.8%
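
The same arithmetic as a small sketch, in case anyone wants to plug in other rows of the table. The function name `relative_error_reduction` is just an illustrative choice; the inputs are the two GPQA Diamond scores quoted in this thread.

```python
def relative_error_reduction(old_acc: float, new_acc: float) -> float:
    """Fraction of the old error rate that the new score eliminates (inputs in percent)."""
    old_err = 100.0 - old_acc
    new_err = 100.0 - new_acc
    return (old_err - new_err) / old_err

# GPQA Diamond figures from the table: Opus 4.6 at 91.3%, Mythos at 94.5%.
reduction = relative_error_reduction(91.3, 94.5)
print(f"{reduction:.1%}")
```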

A jump that we will never be able to use, since being part of the 100-billion-dollar company club is seemingly the minimum requirement to be allowed to use it.

I get the security aspect, but if we've hit that point any reasonably sophisticated model past this point will be able to do the damage they claim it can do. They might as well be telling us they're closing up shop for consumer models.

They should just say they'll never release a model of this caliber to the public at this point and say out loud we'll only get gimped versions.


More than killer AI I'm afraid of Anthropic/OpenAI going into full rent-seeking mode so that everyone working in tech is forced to fork out loads of money just to stay competitive on the market. These companies can also choose to give exclusive access to hand picked individuals and cut everyone else off and there would be nothing to stop them.

This is already happening to some degree, GPT 5.3 Codex's security capabilities were given exclusively to those who were approved for a "Trusted Access" programme.


Describing providing a highly valuable service for money as `rent seeking` is pretty wild.

It could be, formally, if they have a monopoly.

However, I’m tempted to compare to GitHub: if I join a new company, I will ask to be added to their GitHub account without hesitation. I couldn’t possibly imagine they wouldn’t have one. What makes the cost of that subscription reasonable is not just GitHub’s fear of a crowd with pitchforks showing up at their office, but also the fact that a possible answer to my non-question might be “Oh, we actually use GitLab.”

If Anthropic is as good as they say, it seems fairly doable to use the service to build something comparable: poach a few disgruntled employees, leverage the promise to undercut a many-trillion-dollar company to be a many-billion dollar company to get investors excited.

I’m sure the founders of Anthropic will have more money than they could possibly spend in ten lifetimes, but I can’t imagine there wouldn’t be some competition. Maybe this time it’s different, but I can’t see how.


> It could be, formally, if they have a monopoly.

you have 2 labs at the forefront (Anthropic/OpenAI), Google closely behind, xAI/Meta/half a dozen Chinese companies all within 6-12 months. There is plenty of competition, and the price of equally intelligent tokens rapidly drops whenever a new intelligence level is achieved.

Unless the leading company uses a model to nefariously take over or neutralize another company, I don't really see a monopoly happening in the next 3 years.


Precisely.

I was focusing on a theoretical dynamic analysis of competition (Would a monopoly make having a competitor easier or harder?) but you are right: practically, there are many players, and they are diverse enough in their values and interest to allow collusion.

We could be wrong: each of those could give birth to as many Basilisks (not sure I have a better name for those conscious, invisible, omni-present, self-serving monsters that so many people imagine will emerge) that coordinate and maintain collusion somehow, but classic economics (complementarity, competition, etc.) points at disruption and lowering costs.


> practically, there are many players, and they are diverse enough in their values and interest to allow collusion.

Not only that, but open-weight and fully open-source models are also a thing, and not that far behind.


Why, you thought rented homes aren't valuable?

Rent seeking isn't about whether the product has value or not, but about what's extracted in exchange for that value, and whether competition, lack of monopoly, lack of lock-in, etc. keep it realistic.


My housing is pretty valuable. I pay rent. Which timeline are you in?

Actually you're saying similar things:

Rent-seeking of old was a ground rent, monies paid for the land without considering the building that was on it.

Residential rents today often have implied warrants because of modern law, so your landlord is essentially selling you a service at a particular location.


thanks!


Yes I know that, read your sibling post

Two different "rent"s.

Not really, see your sibling post

Well don’t forget we still have competition. Were Anthropic to rent seek, OpenAI would undercut them. Were OpenAI and Anthropic to collude, that would be illegal. As for Anthropic capturing the entire coding agent market and THEN rent seeking: these days it’s never been easier to raise $1B and start a competing lab

In practice this doesn't work though. The Mastercard-Visa duopoly is an example: two competing forces don't create aggressive enough competition to benefit the consumer. The only hope we have is the Chinese models, but it will always be too expensive to run the full models for yourself.

New companies can enter this space. Google’s competing, though behind. Maybe Microsoft, Meta, Amazon, or Apple will come out with top notch models at some point.

There is no real barrier to a customer of Anthropic adopting a competing model in the future. All it takes is a big tech company deciding it’s worth it to train one.

On the other hand, Visa/Mastercard have a lot of lock-in due to consumers only wanting to get a card that’s accepted everywhere, and merchants not bothering to support a new type of card that no consumer has. There’s a major chicken and egg problem to overcome there.


> In practice this doesn't work though, the Mastercard-Visa duopoly is an example,

MC/Visa duopoly is an example of lock-in via network effects. Not sure that that applies to a product that isn't affected by how many other people are running it.


Chinese competition can always be banned. Example: Chinese electric car competition

Just in one particular country. That hurts their labs, but there are ~190 other countries in the world for Chinese to sell their products to, just like they do with their cars.

And businesses from these other countries would happily switch to Chinese. From a security perspective, Chinese and US espionage are equally bad, so why care if it all comes down to money and performance.


Also Chinese smartphones. Huawei was about 12-18 months from becoming the biggest smartphone manufacturer in the world a few years ago. If it had been allowed to sell its phones freely in the US I'm fairly sure Apple would have been closer to Nokia than to current day Apple.

If Huawei was never banned from using TSMC, they'd likely have a real Nvidia competitor and may have surpassed Apple in mobile chip designs.

They actually beat Apple A series to become the first phone to use the TSMC N7 node.


I don't think it will matter too much in the long run, 8 of the top 10 smartphone manufacturers are Chinese, there's nothing the US government can really do.

That's what OP was saying, I think, noting that running them locally won't be a solution.

> More than killer AI I'm afraid of Anthropic/OpenAI going into full rent-seeking mode so that everyone working in tech is forced to fork out loads of money just to stay competitive on the market.

You should be more concerned about killer AI than rent seeking by OpenAI and Anthropic. AI evolving to the point of losing control is what scientists and researchers have predicted for years; they didn’t think it would happen this quickly but here we are.

This market is hyper competitive; the models from China and other labs are just a level or two below the frontier labs.


The thing is that the current models can ALREADY replicate most software-based products and services on the market. The open source models are not far behind. At a certain point I'm not sure it matters if the frontier models can do faster and better. I see how they're useful for really complex and cutting edge use cases, but that's not what most people are using them for.

but you are assuming that the magical wizards are the only ones who can create powerful AIs... mind you these people were born just a few decades ago. Their knowledge will be transferred and it will only take a few more decades until anyone can train powerful AIs ... you can only sit on tech for so long before everyone knows how to do it

It's not a matter of knowledge, it's a matter of resources. It takes billions of dollars of hardware to train a SOTA LLM and it's increasing all the time. You cannot possibly hope to compete as an independent or small startup.

> It takes billions of dollars of hardware to train a SOTA LLM and it's increasing all the time.

True, but it's also true that the returns from throwing money at the problem are diminishing. Unless one of those big players invents a new, proprietary paradigm, the gap between a SOTA model and an open model that runs on consumer hardware will narrow in the next 5 years.


Eventually these super expensive SXM data center GPUs will cost pennies on the dollar, and we’ll be able to snatch up H200s for our homelabs. Give it a decade.

Also eventually these WEIGHTS will leak. You can’t have the world’s most valuable data that can just be copied to a hard drive stay in the bottle forever, even if it’s worth a billion dollars. Somehow, some way, that genie’s going to get out, be it by some spiteful employee with nothing to lose, some state actor, or just a fuck up of epic proportions.


at the point where those gpus cost pennies, they likely won't even be worth the electricity that goes into them, better models would run on laptops.

Presumably, the hardware to run this level of model will be democratized within the timeframe of the parent comment.


Unless, of course, the powerful manage to scare everyone about how the machines will kill us all and so AI technology needs to be properly controlled by the relevant authorities, and anyone making/using an unlicensed AI is arrested and jailed.

With Gemma-4 open and running on laptops and phones I see the flip side. How many non-HN users or researchers even need Opus 4.6e level performance? OpenAI, Anthropic and Google may be “rent seeking” from large corporations — like the Oracles and IBMs.

Everyone, once AI diffuses enough. You’ll be unhireable if you don’t use AI in a year or two.

You know, they have competitors?

This is my nightmare about AI; not that the machines will kill all the humans, but that access is preferentially granted to the powerful and it's used to maintain the current power structure in blatant disregard of our democratic and meritocratic ideals, probably using "security" as the justification (as usual).

> I get the security aspect, but if we've hit that point any reasonably sophisticated model past this point will be able to do the damage they claim it can do. They might as well be telling us they're closing up shop for consumer models.

I read it like I always read the GPT-2 announcement no matter what others say: It's *not* being called "too dangerous to ever release", but rather "we need to be mindful, knowing perfectly well that other AI companies can replicate this imminently".

The important corps (so presumably including the Linux Foundation, bigger banks and power stations, and quite possibly excluding x.com) will get access now, and some other LLM which is just as capable will give it to everyone in 3 months time at which point there's no benefit to Anthropic keeping it off-limits.


This is why the EAs, and their almost comic-book-villain projects like "control AI dot com" cannot be allowed to win. One private company gatekeeping access to revolutionary technology is riskier than any consequence of the technology itself.

Having done a quick search of "control AI dot com", it seems their intent is to educate lawmakers & government in order to aid development of a strong regulatory framework around frontier AI development.

Not sure how this is consistent with "One private company gatekeeping access to revolutionary technology"?


> strong regulatory framework around frontier AI development

You have to decode feel-good words into the concrete policy. The EAs believe that the state should prohibit entities not aligned with their philosophy to develop AIs beyond a certain power level.


And what is malicious about that ideology? I think EAs tend to like the smell of their farts way too much, but their views on AI safety don't seem so bad. I think their thoughts on hypothetical super intelligence or AGI are too focused on control (alignment) and should also focus on AI welfare, but that's more a point of disagreement that I doubt they'd try to forbid.

Couldn't agree more. The "safest" AI company is actually the biggest liability. I hope other companies make a move soon.

No it isn't lol. The consequence of the technology literally includes human extinction. I prefer 0 companies, but I'll take 1 over 5.

> They should just say they'll never release a model of this caliber to the public at this point and say out loud we'll only get gimped versions.

That’s not going to happen. If you recall, OpenAI didn’t release a model a few years ago because they felt it was too dangerous.

Anthropic is giving the industry a heads up and time to patch their software.

They said there are exploitable vulnerabilities in every major operating system.

But in 6 months every frontier model will be able to do the same things. So Anthropic doesn’t have the luxury of not shipping their best models. But they also have to be responsible as well.


I think they already said somewhere that they can't release Mythos because it requires absurdly large amounts of compute. The economics of releasing it just don't work.

Yet they quote a $20,000 cost for one of the exploits.

> A jump that we will never be able to use since we're not part of the seemingly minimum 100 billion dollar company club as requirement to be allowed to use it.

> They should just say they'll never release a model of this caliber to the public at this point and say out loud we'll only get gimped

Duh, this was fucking obvious from the start. The only people saying otherwise were zealots who needed a quick line to dismiss legitimate concerns.


Are these fair comparisons? It seems like mythos is going to be like a 5.4 ultra or Gemini Deepthink tier model, where access is limited and token usage per query is totally off the charts.

There are a few hints in the doc around this

> Importantly, we find that when used in an interactive, synchronous, “hands-on-keyboard” pattern, the benefits of the model were less clear. When used in this fashion, some users perceived Mythos Preview as too slow and did not realize as much value. Autonomous, long-running agent harnesses better elicited the model’s coding capabilities. (p201)

^^ From the surrounding context, this could just be because the model tends to do a lot of work in the background which naturally takes time.

> Terminal-Bench 2.0 timeouts get quite restrictive at times, especially with thinking models, which risks hiding real capabilities jumps behind seemingly uncorrelated confounders like sampling speed. Moreover, some Terminal-Bench 2.0 tasks have ambiguities and limited resource specs that don’t properly allow agents to explore the full solution space — both being currently addressed by the maintainers in the 2.1 update. To exclusively measure agentic coding capabilities net of the confounders, we also ran Terminal-Bench with the latest 2.1 fixes available on GitHub, while increasing the timeout limits to 4 hours (roughly four times the 2.0 baseline). This brought the mean reward to 92.1%. (p188)

> ...Mythos Preview represents only a modest accuracy improvement over our best Claude Opus 4.6 score (86.9% vs. 83.7%). However, the model achieves this score with a considerably smaller token footprint: the best Mythos Preview result uses 4.9× fewer tokens per task than Opus 4.6 (226k vs. 1.11M tokens per task). (p191)


The first point is along the lines of what I'd expect given that claude code is generally reliable at this point. A model's raw intelligence doesn't seem as important right now compared to being able to support arbitrary length context.

If it only really shines once you give it a long leash and hours to work, that’s a rough fit for actual hands-on coding. A lot of that pain is just cloud round-trip tax showing up in the product, which is a big part of why we’re building rig.ai around local inference and a tight interactive loop.

I'm curious if frontier labs use any forms of compression on their models to improve performance. The small % drop of Q8 or FP8 would still put it ahead of Opus, but should double token throughput. Maybe then interactive use would feel like an improvement.
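
A back-of-the-envelope sketch of where the "should double token throughput" intuition comes from. It assumes single-stream decoding that is purely memory-bandwidth bound (every active weight is read once per token) and ignores KV cache traffic, batching, and kernel overheads. The parameter count and bandwidth below are hypothetical placeholders, not figures for any real model or GPU.

```python
def decode_tokens_per_sec_upper_bound(active_params: float,
                                      bytes_per_param: float,
                                      mem_bandwidth_bytes_per_sec: float) -> float:
    """Rough ceiling on tokens/sec when each token requires reading all active weights once."""
    bytes_per_token = active_params * bytes_per_param
    return mem_bandwidth_bytes_per_sec / bytes_per_token

ACTIVE_PARAMS = 100e9   # hypothetical: 100B active parameters
BANDWIDTH = 3e12        # hypothetical: 3 TB/s of memory bandwidth

bf16 = decode_tokens_per_sec_upper_bound(ACTIVE_PARAMS, 2.0, BANDWIDTH)  # 16-bit weights
fp8 = decode_tokens_per_sec_upper_bound(ACTIVE_PARAMS, 1.0, BANDWIDTH)   # 8-bit weights
print(f"BF16 ceiling: {bf16:.1f} tok/s, FP8 ceiling: {fp8:.1f} tok/s, ratio: {fp8 / bf16:.1f}x")
```

Under these assumptions the ceiling scales inversely with bytes per weight, so halving precision doubles it. Real deployments batch many requests and are often compute bound, so the actual gain can be smaller.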

The quote comparing them here was for BrowseComp which "tests an agent's ability to find hard-to-locate information on the open web." (for those wondering). The new model seems significantly better than Opus4.6 judging by the 'Overall results summary'

Good catch. If it's "too slow" even when run in a state-of-the-art datacenter environment, this "Mythos" model is most closely comparable to the "Deep Research" modes for GPT and Gemini, which Claude formerly lacked any direct equivalent for.

I don't think that's what's being hinted at. The system card seems to say that the model is both token efficient and slow in practice. Deep research modes generally work by having many subagents/large token spend. So this is more likely because each token just takes longer to produce, which would be because the model is simply much larger.

By Epoch AI's datacenter tracking methods, Anthropic has had access to the largest amount of contiguous compute since late last year. So this might simply be the end result of being the first to have the capacity to conduct a training run of this size. Or the first seemingly successful one at any rate.


"Slow and token-efficient" could be achieved quite trivially by taking an existing large MoE model and increasing the amount of active experts per layer, thus decreasing sparsity. The broader point is that to end users, Mythos behaves just like Deep Research: having it be "more token efficient" compared to running swarms of subagents is not something that impacts them directly.

Not discussing Mythos here, but Opus. Opus to me has been significantly better at SWE than GPT or Gemini - that gets me confused why Opus is ranking clearly lower than GPT, and even lower than Gemini.

When did you last compare them? Codex right now is considerably better in my experience. Can't speak for Gemini.

Tried Gemini 2 weeks ago to see where it's at, with gemini-cli.

Failed to use tools, failed to follow instructions, and then went into deranged loop mode.

Essentially, it's where it was 1.5 years ago when I tried it the last time.

It's honestly unbelievable how Google managed to fail so miserably at this.


I have not experienced any issues with Gemini 3.1 Pro.

Their harness might be behind

I think failures that I observed with gemini are unrelated to the harness. Because the same failures happened with third party harnesses too.

It’s great on AI Studio. Harness issues, I agree.

Agree, I never actually had great success with Opus. I think it's the failures that are annoying. It's probably better than codex when it's "good", but it fails in annoying ways that I think codex very seldom does.

I wouldn't call codex considerably better. It may depend on the specific codebase and your expectations, but codex produces more "abstraction for the sake of abstraction" even on simple tasks, while opus in my experience usually chooses the right level of abstraction for the given task.

A secret art known to the cognoscenti as "benchmark gaming".

We're gonna need some new benchmarks...

ARC-AGI-3 might be the only remaining benchmark below 50%


Opus 4.6 currently leads the remote labor index at 4.17. GPT-5.4 isn't measured on that one though: https://www.remotelabor.ai/

GPT 5.4 Pro leads Frontier Maths Tier 4 at 35%: https://epoch.ai/benchmarks/frontiermath-tier-4/


> We're gonna need some new benchmarks...

You can't consistently benchmark something that is qualitative by nature. I'm struggling to understand how people don't understand this.


Humanity's Last Exam (HLE) is already insanely difficult. It introduces 2,500 questions spanning mathematics, humanities, natural sciences, ancient languages, ...

Here is an example question: https://i.redd.it/5jl000p9csee1.jpeg

No human could even score 5% on HLE.


I've never understood the point of things like HLE, it doesn't really prove or show anything since 99.99% of humans can't do a single question on this exam.

That is, it's easy to make benchmarks which humans are bad at, humans are really bad at many things.

Divide 123094382345234523452345111 by 0.1234243131324, guess what, humans would find that hard, computers easy. But it doesn't mean much.

Humanity's last exam (HLE) couldn't be completed by most of humanity, the vast majority, so it doesn't really capture anything about humanity or mean much if a computer can do it.


the point is that each question is something that a specialist in a field would be able to do, but deems challenging enough that the ability to solve it would imply significant general usefulness in that domain

I mean they could just feed the solutions into the training data. Then suddenly the bot will do real good at HLE.

Exactly. This is called overfitting and it's most definitely a thing.

but how does it perform on pelican riding a bicycle bench? why are they hiding the truth?!

(edit: I hope this is an obvious joke. less facetiously these are pretty jaw dropping numbers)


We are all fans of Simon’s work, and his test is, strangely enough, quite good.

> Combined results (Claude Mythos / Claude Opus 4.6 / GPT-5.4 / Gemini 3.1 Pro)

> Terminal-Bench 2.0: 82.0% / 65.4% / 75.1% / 68.5%

> GPQA Diamond: 94.5% / 91.3% / 92.8% / 94.3%

> MMMLU: 92.7% / 91.1% / — / 92.6–93.6%

> USAMO: 97.6% / 42.3% / 95.2% / 74.4%

> OSWorld: 79.6% / 72.7% / 75.0% / —

Given that for a number of these benchmarks, it seems to be barely competitive with the previous gen Opus 4.6 or GPT-5.4, I don't know what to make of the significant jumps on other benchmarks within these same categories. Training to the test? Better training?

And the decision to withhold general release (of a 'preview' no less!) seems to be well, odd. And the decision to release a 'preview' version to specific companies? You know any production teams at these massive companies that would work with a 'preview' anything? R&D teams, sure, but production? Part of me wants to LoL.

What are they trying to do? Induce FOMO and stop subscriber bleed-out stemming from the recent negative headlines around problems with using Claude?


> Given that for a number of these benchmarks, it seems to be barely competitive with the previous gen

We're not reading the same numbers I think. Compared to Opus 4.6, it's a big jump nearly in every single bench GP posted. They're "only" catching up to Google's Gemini on GPQA and MMMLU but they're still beating their own Opus 4.6 results on these two.

This sounds like a much better model than Opus 4.6.


> We're not reading the same numbers I think.

We must not be.

That's why I listed out the ones where it is barely competitive from @babelfish's table, which itself is extracted from Pg 186 & 187 of the System Card, which has the comparison with Opus 4.6, GPT 5.4 and Gemini 3.1 Pro.

Sure, it may be better than Opus 4.6 on some of those, but barely achieves a small increase over GPT-5.4 on the ones I called out.


> barely competitive

It's higher than all other models except vs Gemini 3.1 Pro on MMMLU

MMMLU is generally thought to be maxed out, as in it might not be possible to score higher than those scores.

> Overall, they estimated that 6.5% of questions in MMLU contained an error, suggesting the maximum attainable score was significantly below 100%[1]

Other models get close on GPQA Diamond, but it wouldn't be surprising to anyone if the max possible on that was around the 95% the top models are scoring.

[1] https://en.wikipedia.org/wiki/MMLU
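
A rough sketch of what that quote implies for the ceiling. Two assumptions are mine, not the source's: that a question with an error can never be scored as correct (a pessimistic simplification), and that the 6.5% figure, which was estimated for MMLU, carries over to MMMLU at all.

```python
def score_ceiling(question_error_rate: float) -> float:
    """Maximum attainable score in percent if every flawed question counts as a miss."""
    return 100.0 * (1.0 - question_error_rate)

# 6.5% is the MMLU estimate quoted above; applying it to MMMLU is an assumption.
ceiling = score_ceiling(0.065)
print(f"{ceiling:.1f}%")
```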


You are reading the percentages wrong.

Because 100% is the maximum, you should be looking at error rates instead. GPT has a 25% error rate on Terminal-Bench and the new model has 18%, almost a 1.4x reduction.


barely competitive ? Mythos column is the first column.

You are the only person with this take on Hacker News; everyone else says "this is a massive jump". FWIW, the data you list shows the biggest jump I remember, for Mythos


The biggest jump in the numbers they quoted is 6%.

Please look at the columns OTHER than Opus as well.


> Combined results (Claude Mythos / Claude Opus 4.6 / GPT-5.4 / Gemini 3.1 Pro)

> Terminal-Bench 2.0: 82.0% / 65.4% / 75.1% / 68.5%

> USAMO: 97.6% / 42.3% / 95.2% / 74.4%

> The biggest jump in the numbers they quoted is 6%.

Just in the numbers you quoted, that's a 16.6-point jump on Terminal-Bench and a 55.3-point absolute increase on USAMO over their previous Opus 4.6 model.


I don’t know if you’re willingly disregarding everything being said to you or there’s a language barrier here.

Can you please stop posting comments with personal swipes in them? You've unfortunately been doing it repeatedly. It's not what this site is for, and destroys what it is for.

If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.


You're right, I apologize for that. I have been responding with annoyance rather than walking away when I receive replies that appear to be ignoring context.

Appreciated! and of course, I know it's not easy - believe me I know...

It's higher than all other models except vs Gemini 3.1 Pro on MMMLU

this just in: HN user forgets how sigmoid functions work

Please don't respond to a bad comment by breaking the site guidelines yourself. That only makes things worse.

https://news.ycombinator.com/newsguidelines.html


Let's be clear: your entire post is just pure, unadulterated FUD. You first claim, based on cherry-picked benchmarks, that Mythos is actually only "barely competitive" with existing models, then suggest they must be training to the test, then call it "odd" that they are withholding the release despite detailed and forthcoming explanations from Anthropic regarding why they are doing that, then wrap it up with the completely unsubstantiated claim that they must be bleeding subscribers and that this must just be to stop that bleed.

Honestly we are all sleeping on GPT-5.4. Particularly with the influx of Claude users recently (and increasingly unstable platform) Codex has been added to my rotation and it's surprising me.

Totally. Best-in-class for SWE work (until Mythos gets released, if ever, but I suspect the rumored "Spud" will be out by then too)

It really isn’t. I wish it was, because work complains about overuse of Opus.

It really is, for complex tasks. Claude excels at low-mid complexity (CRUD apps, most business apps). For anything somewhat out of the distribution, codex at the moment has no peer.

I find that more experienced devs are more likely to prefer Codex… anecdotal but… it’s a thing.

This is because no one bothers to set thinking to high, as it now defaults to medium in CC.

Once you set thinking to high it works just as well as 5.4 even for pretty complex tasks


I have always used Claude at max thinking levels since it launched. It has never been up to the task. For clarity, the task being this: https://github.com/tsoniclang/tsonic

Meanwhile, there are half a dozen other projects (business apps, web apps etc) where it works well.


GPT is shit at writing code. It's not dumb - extra high thinking is really good at catching stuff - but it's like letting a smart junior into your codebase - ignore all the conventions, surrounding context, just slop all over the place to get it working. Claude is just a level above in terms of editing code.

Very different experience for me. Codex 5.3+ on xhigh are the only models I've tried so far that write reasonably decent C++ (domains: desktop GUI, robotics, game engine dev, embedded stuff, general systems engineering-type codebases), and idiomatic code in languages not well-represented in training data, e.g. QML. One thing I like is explicitly that it knows better when to stop, instead of brute-forcing a solution by spamming bespoke helpers everywhere no rational dev would write that way.

Not always, no, and it takes investment in good prompting/guardrails/plans/explicit test recipes for sure. I'm still on average better at programming in context than Codex 5.4, even if slower. But in terms of "task complexity I can entrust to a model and not be completely disappointed and annoyed", it scores the best so far. Saves a lot on review/iteration overhead.

It's annoying, too, because I don't much like OpenAI as a company.

(Background: 25 years of C++ etc.)


Same background as you, and same exact experience as you. Opus and Gemini have not come close to Codex for C++ work. I also run exclusively on xhigh. Its handling of complexity is unmatched.

At least until next week when Mythos and GPT 6 throw it all up in the air again.


Not my experience. GPT 5.4 walks all over Claude from what I've worked with, and it's Claude that is the one willing to just go do unnecessary stuff that was never asked for or implement the more hacky solutions to things without a care for maintainability/readability.

But I do not use extra high thinking unless it's for code review. I sit at GPT 5.4 high 95% of the time.


ChatGPT 5.4 with extra high reasoning has worked really well for me, and I don't notice a huge difference with Opus 4.6 with high reasoning (those are the 2 models/thinking modes I've used the most in the last month or so).

And as a bonus: GPT is slow. I’m doing a lot of RE (IDA Pro + MCP), even when 5.4 gives a little bit better guesses (rarely, but happens) - it takes x2-x4 longer. So, it’s just easier to reiterate with Opus

Yeah, need some good RE benchmarks for the LLMs. :)

RE is a very interesting problem. A lot more than SWE can be RE'd. I've found the LLMs are reluctant to assist, though you can work around that.


What is RE in this context?

Reverse engineering

This. People drastically underestimate how much more useful a lightning fast slightly dumb model is compared to a super smart but mega slow model. Sure, you may need to bust out the beef now and then. However, for the overwhelming majority of work the fast stupid model is a better fit.

I've been messing with using Claude, Codex, and Kimi even for reverse engineering at https://decomp.dev/ it's a ton of fun. Great because matching bytes is a scoring function that's easy for the models to understand and make progress on.

I want to get into RE with AI. Which model are you liking the most?

Mind sharing the use cases you're using IDA via MCP for?

Yes, it's becoming clear that OpenAI kinda sucks at alignment. GPT-5 can pass all the benchmarks but it just doesn't "feel good" like Claude or Gemini.

An alternative but similar formulation of that statement is that Anthropic has spent more training effort in getting the model to “feel good” rather than being correct on verifiable tasks. Which more or less tracks with my experience of using the model.

Alignment is a subspace of capability. Feeling good is nice, but it's also a manifestation of the level that the model can predict what I do and don't want it to do. The more accurately it can predict my intentions without me having to spell them out explicitly in the prompt, the more helpful it is.

GPT-5 is good at benchmarks, but benchmarks are more forgiving of a misaligned model. Many real world tasks often don't require strong reasoning abilities or high intelligence, so much as the ability to understand what the task is with a minimal prompt.

Not every shop assistant needs a physics degree, and not every physics professor is necessarily qualified to be a shop assistant. A person, or LLM, can be very smart while at the same time very bad at understanding people.

For example, if GPT-5 takes my code and rearranges something for no reason, that's not going to affect its benchmarks because the code will still produce the same answers. But now I have to spend more time reviewing its output to make sure it hasn't done that. The more time I have to spend post-processing its output, the lower its capabilities are since the measurement of capability on real world tasks is often the amount of time saved.


Whenever I come back to ChatGPT after using Claude or Gemini for an extended period, I’m really struck by the “AI-ness.” All the verbal tics and, truly, sloppishness, have been trained away by the other, more human-feeling models at this point.

GPT was clearly changed after its sycophantic models led to the lawsuits.

It still has a very ... plastic feeling. The way it writes feels cheap somehow. I don't know why, but Claude seems much more natural to me. I enjoy reading its writing a lot more.

That said, I'll often throw a prompt into both claude and chatgpt and read both answers. GPT is frequently smarter.


GPT is more accurate. But Claude has this way of association between things that seems smarter and more human to me.

This has been my experience. With very very rigid constraints it does ok, but without them it will optimize expediency and getting it done at the expense of integrating with the broader system.

My favorite example of this from last night:

Me: Let's figure out how to clone our company Wordpress theme in Hugo. Here're some tools you can use, here's a way to compare screenshots, iterate until 0% difference.

Codex: Okay Boss! I did the thing! I couldn't get the CSS to match so I just took PNGs of the original site and put them in place! Matches 100%!


I thought they were bluffing when they talked about the scaling laws, but looking at the benchmark scores, they were not.

I wonder if misalignment correlates with higher scores.


Wow. Mythos must be insanely good considering how good a model Opus already is. I hope it's usable on a humble subscription...

You get a single call a month. Use it wisely.

What is the meaning of life, the universe, and everything?

> Thought for 7.5 million years


Hello, Claude!

> Rate limit reached


Lots of benchmaxxing here. A few simple randomizations put it back on par with Gemini 3.1 and under 5.4 Pro in most benchmarks

The real part is SWE-bench Verified since there is no way to overfit. That's the only one we can believe.

My impression was entirely the opposite; the unsolved subset of SWE-bench verified problems are memorizable (solutions are pulled from public GitHub repos) and the evaluators are often so brittle or disconnected from the problem statement that the only way to pass is to regurgitate a memorized solution.

OpenAI had a whole post about this, where they recommended switching to SWE-bench Pro as a better (but still imperfect) benchmark:

https://openai.com/index/why-we-no-longer-evaluate-swe-bench...

> We audited a 27.6% subset of the dataset that models often failed to solve and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submissions

> SWE-bench problems are sourced from open-source repositories many model providers use for training purposes. In our analysis we found that all frontier models we tested were able to reproduce the original, human-written bug fix

> improvements on SWE-bench Verified no longer reflect meaningful improvements in models’ real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time

> We’re building new, uncontaminated evaluations to better track coding capabilities, and we think this is an important area to focus on for the wider research community. Until we have those, OpenAI recommends reporting results for SWE-bench Pro.


> My impression was entirely the opposite; the unsolved subset of SWE-bench verified problems are memorizable (solutions are pulled from public GitHub repos) and the evaluators are often so brittle or disconnected from the problem statement that the only way to pass is to regurgitate a memorized solution.

Anthropic accounts for this

>To detect memorization, we use a Claude-based auditor that compares each model-generated patch against the gold patch and assigns a [0, 1] memorization probability. The auditor weighs concrete signals—verbatim code reproduction when alternative approaches exist, distinctive comment text matching ground truth, and more—and is instructed to discount overlap that any competent solver would produce given the problem constraints.


I stand corrected.

It might have broken a couple of metrics: once you score above 90 percent, the metric may no longer be able to measure you well, right?

damn... ok that's impressive.

Funny, I made my own model at home and got even higher scores than these. I'm a bit concerned about releasing it, though, so I'm just going to keep it local for now.

Sure, for humans. Not sure they'll be the primary readers of code going forward

I'm pretty sure that will be true with AI as well.

No accounting for taste, but part of what makes code hard for me to reason about is when it has lots of combinatorial complexity, where the number of states that can happen makes it difficult to know all the possible good and bad states that your program can be in. Combinatorial complexity is something that objectively can be expensive for any form of computer, be it a human brain or silicon. If the code is written in such a way that the number of correct and incorrect states is impossible to know, then the problem becomes undecidable.

I do think there is code that is "objectively" difficult to work with.
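
To make the combinatorial point concrete, here is a rough sketch that counts how many states a handful of independent boolean flags can produce. The flag names are hypothetical and only serve as an illustration:

```python
from itertools import product


def count_states(flag_names: list[str]) -> int:
    """Enumerate every combination of independent boolean flags and count them."""
    return sum(1 for _ in product([False, True], repeat=len(flag_names)))


# Hypothetical flags on some object; each extra flag doubles the state space.
flags = ["is_loaded", "is_dirty", "is_locked", "has_error"]
for n in range(1, len(flags) + 1):
    print(n, "flags ->", count_states(flags[:n]), "possible states")
```

Most of those combinations are usually nonsensical (locked but not loaded, say), and that is the part a reader, human or otherwise, has to rule out by hand.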


All the good practices about strong typing, typically in Scala or Rust, also work great for AI.

If you make sure the compiler catches most issues, AI will run it, see it doesn't build and fix what needs to be fixed.

So I agree that a lot of things that make code good, including comments and documentation, are beneficial for AI.


There are a number of things that make code hard to reason about for humans, and combinatorial complexity is just one of them. Another one is, say, size of working memory, or having to navigate across a large number of files to understand a piece of logic. These two examples are not necessarily expensive for computers.

I don't entirely disagree that there is code that's objectively difficult to work with, but I suspect that the Venn diagram of "code that's hard for humans" and "code that's hard for computers" has much less overlap than you're suggesting.


Certainly with current models I have found that the Venn diagram of "code that's hard for humans" and "code that's hard for computers" has actually been remarkably similar, I suspect because it's trained on a lot of terrible code on Github.

I'm sure that these models will get better, and I agree that the overlap will be lower at that point, but I still think what I said will be true.


I wouldn't expect so. These machines have been trained on natural language, after all. They see the world through an anthropomorphic lens. IME & from what I've heard, they struggle with inexpressive code in much the same way humans do.

What do you think about the argument that we are entering a world where code is so cheap to write, you can throw the old one away and build a new one after you've validated the business model, found a niche, whatever?

I mean, it seems like that has always been true to an extent, but now it may be even more true? Once you know you're sitting on a lode of gold, it's a lot easier to know how much to invest in the mine.


It hasn't always been true, it started with rapid development tools in the late 90's I believe.

And some people thought they were building "disposable" code, only to see their hacks being used for decades. I'm thinking about VB but also behemoth Excel files.


I guess the question is, are the issues not worth fixing because implementing a fix is extremely expensive, or because the improvements from fixing it were anticipated to be minor? I assume the answer is generally a mix of the two.

Someone has to figure out how to make the experiences of the two generations consistent in the ways it needs to be and differ only in the ways it doesn't still.

I actually think that might be a good path forward.

I hate self-promotion but I posted my opinions on this last night https://blog.tombert.com/Posts/Technical/2026/04-April/Stop-...

The tl;dr of this is that I don't think that the code itself is what needs to be preserved; the prompt and chat are the actual important and useful thing here. At some point I think it makes more sense to fine tune the prompts to get increasingly more specific and just regenerate the code based on that spec, and store that in Git.


> At some point I think it makes more sense to fine tune the prompts to get increasingly more specific and just regenerate the code based on that spec, and store that in Git.

Generating code using a non-deterministic code generator is a bold strategy. Just gotta hope that your next pull of the code slot machine doesn’t introduce a bug or ten.


We're already merging code that has generated bugs from the slot machine. People aren't actually reading through 10,000 line pull requests most of the time, and people aren't really reviewing every line of code.

Given that, we should instead tune the prompts well enough to not leave things to chance. Write automated tests to make sure that inputs and outputs are ok, write your specs so specifically that there's no room for ambiguity. Test these things multiple times locally to make sure you're getting consistent results.


> Write automated tests to make sure that inputs and outputs are ok

Write them by hand or generate them and check them in? You can’t escape the non-determinism inherent in LLMs. Eventually something has to be locked in place, be it the application code or the test code. So you can’t just have the LLM generate tests from a spec dynamically either.

> write your specs so specifically that there's no room for ambiguity

Using English prose, well known for its lack of ambiguity. Even extremely detailed RFCs have historically left lots of room for debate about meaning and intention. That’s the problem with not using actual code to “encode” how the system functions.

I get where you’re coming from but I think it’s a flawed idea. Less flawed than checking in vibe-coded feature changes, but still flawed.


> Write them by hand or generate them and check them in?

Yes, written by hand. I think that ultimately you should know what valid inputs and outputs are and as such the tests should be written by a human in accordance with the spec.
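
As a rough sketch of what "hand-written tests in accordance with the spec" could look like: the input/output pairs are written and checked in by a human, and whatever implementation gets regenerated has to pass them. `slugify` and the cases here are hypothetical stand-ins, not anything from a real project:

```python
import re


def slugify(title: str) -> str:
    """Stand-in for regenerated code; any implementation must satisfy the cases below."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


# Hand-written (input, expected output) pairs taken from the (hypothetical) spec.
SPEC_CASES = [
    ("Hello, World!", "hello-world"),
    ("  spaces  everywhere ", "spaces-everywhere"),
    ("already-a-slug", "already-a-slug"),
    ("", ""),
]


def check_against_spec(fn) -> list[str]:
    """Return a description of every case where fn deviates from the spec."""
    failures = []
    for given, expected in SPEC_CASES:
        actual = fn(given)
        if actual != expected:
            failures.append(f"{given!r}: expected {expected!r}, got {actual!r}")
    return failures


print(check_against_spec(slugify) or "all spec cases pass")
```

The cases are the part that stays locked in place; the function body is the part that could, in principle, be thrown away and regenerated.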

> Less flawed than checking in vibe-coded feature changes, but still flawed.

This is what I'm trying to get at. I agree it's not perfect, but I'm arguing it's less evil than what is currently happening.


This is actually a pretty good callout.

Observability into how a foundation model generated product arrived at that state is significantly more important than the underlying codebase, as it's the prompt context that is the architecture.


Yeah, I'm just a little tired of seeing these multi-thousand-line pull requests where no one has actually looked at the code.

The solution people are coming up with now is using AI for code reviews and I have to ask "why involve Git at all then?". If AI is writing the code, testing the code, reviewing the code, and merging the code, then it seems to me that we can just remove these steps and simply PR the prompts themselves.


> why involve Git at all then?

I made a similar point 3 weeks ago. It wasn't very well received.

https://news.ycombinator.com/item?id=47411693

You don't actually need source control to be able to roll back to any particular version that was in use. A series of tarballs will let you do that.

The entire purpose of source control is to let you reason about change sets to help you make decisions about the direction that development (including bug fixes) will take.

If people are still using git but not really using it, are they doing so simply to take advantage of free resources such as github and test runners, or are they still using it because they don't want to admit to themselves that they've completely lost control?
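
For what it's worth, the "series of tarballs" rollback can be sketched in a few lines of standard-library Python. The file names and layout are made up for the example:

```python
import tarfile
import tempfile
from pathlib import Path


def snapshot(src: Path, archive_dir: Path, version: int) -> Path:
    """Write the whole source tree to a numbered tarball."""
    archive = archive_dir / f"v{version:04d}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname=".")
    return archive


def rollback(archive: Path, dest: Path) -> None:
    """Unpack a snapshot into a destination directory."""
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    src, archives = root / "src", root / "archives"
    src.mkdir()
    archives.mkdir()

    (src / "app.txt").write_text("version one\n")
    v1 = snapshot(src, archives, 1)

    (src / "app.txt").write_text("version two\n")
    snapshot(src, archives, 2)

    # Roll back to the first snapshot in a fresh directory.
    rollback(v1, root / "restored")
    restored_text = (root / "restored" / "app.txt").read_text()
    print(restored_text)
```

This gets you back to any version, but nothing in it says what changed between v0001 and v0002 or why, which is the part source control is actually for.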


> are they still using it because they don't want to admit to themselves that they've completely lost control?

I think this is the case, or at least close.

I think a lot of people are still convincing themselves that they are the ones "writing" it because they're the ones putting their names on the pull request.

It reminds me of a lot of early Java, where it would make you feel like you were being very productive because everything that would take you eight lines in any other language would take thirty lines across three files to do in Java. Even though you didn't really "do" anything (and indeed Netbeans or IntelliJ or Eclipse was likely generating a lot of that bootstrapping code anyway), people would act like they were doing a lot of work because of a high number of lines of code.

Java is considerably less terrible now, to a point where I actually sort of begrudgingly like writing it, but early Java (IMO before Java 21 and especially before 11) was very bad about unnecessary verbosity.


> If people are still using git but not really using it, are they doing so simply to take advantage of free resources such as github and test runners,

does it have to be free to be useful? the CD part is even more important than before, and if they still use git as their input, and everyone including the LLM is already familiar with git, what's the need to get rid of it?

there's value in git as a tool everyone knows the basics of, and as a common interface of communicating code to different systems.

passing tarballs around requires defining a bunch of new interfaces for those tarballs which adds a cost to every integration that you'd otherwise get for about free if you used git


A series of tarballs is really unwieldy for that though. Even if you don't want to use git, and even if the LLM is doing everything, having discrete pieces like "added GitHub oauth to login" and "added profile picture to account page" as different commits is still valuable for when you have to ask the LLM "hey about the profile picture on the account page".

A series of tarballs is version control.

Git gives you the series of past snapshots if that's all you want it for, but in infrastructure you don't need to re-invent.


Yep.

Also, the approach you described is what a number of AI for Code Review products are using under-the-hood, but human-in-the-loop is still recognized as critical.

It's the same way how written design docs and comments are significantly more valuable than uncommented and undocumented source.


Because LLMs are designed as emulators of actual human reasoning, it wouldn't surprise me if we discover that the things that make software easy for humans to reason about also make it easier for LLMs to reason about.

AIs struggle with tech debt as much if not more than humans.

I've noticed that they're often quite bad at refactoring, also.


Entropy and path dependence are unavoidable laws of mathematics. Not even evolution can avoid them.

I think someday it will be completely unreadable for humans. AI will have its optimized form.

No per-agent auto-worktree? This is the killer feature of Conductor, having to type `/worktree` into every new chat isn't really a resolution. Not even sure what selecting 'Worktree' for a new chat does

"having to type `/worktree` into every new chat isn't really a resolution"

I don't know what you're talking about. My experience with Cursor (before this new v3) is that new Cursor agent tabs / cloud agents already intelligently manage worktrees to prevent conflicts.


Wow, maybe something is wrong with my setup. In Cursor 3, I am clicking "New Agent" at the top left. My root repository is correctly listed on top of the composer, and I clicked the icon to the right of it and selected 'Worktree'. Then, I instruct the model to run `pwd` and tell me its git status. It's always just on `main` in my root repository. I dug through the settings and couldn't find anything, and after finding this comment[0] on their forums gave up. Would you mind sharing a bit more about your setup/how it works?

[0] https://forum.cursor.com/t/working-with-worktrees-in-cursor/...


Wow, it seems I'm the fool here.

I'm on Version: 2.6.19.

Per https://cursor.com/docs/configuration/worktrees#how-is-this-...

They apparently removed this in 3.0. - I couldn't begin to guess why.

"Automatic management of worktrees was removed in Cursor 3.0 and replaced with the new commands /worktree and /best-of-n. We also have added worktree support for the Cursor CLI.

Management of worktrees is now fully agentic. This makes it simpler to support use cases such as starting an agent, and only doing work in a worktree later on in the chat's lifecycle.

/best-of-n makes comparing the results of multiple models much easier. The parent agent will provide commentary on the different results and you can pick the best one. Additionally, you can even ask the parent agent to merge different parts of the different implementations into a single commit.

If you had agents that were previously running in a worktree, those chats will still work. However, you will need to use the new commands to start new agents in worktrees."


i would expect it before the end of the month, why not?

Model independence

That gap was closed by opencode months ago.

different products - CLI vs apps

Not really, no. Coding CLIs are hugely popular with the "App user" crowd, see Claude Code.

I think that's more fashion than anything.

Every company I've worked at has still had a few engineers who insist on working exclusively in the CLI with vim/emacs prior to AI. Every other engineer used some flavor of a desktop app ranging from more minimal editors to incredibly complex IDEs. I expect we land back on UIs long term.


Wow, 30B parameters as capable as a 1T parameter model?

On the benchmarks compared above, it is closer to other larger open weights models, and on par with GPT-OSS 120B, for which I also have a frame of reference.

Maybe? Previous valuation is $730B + $122B raised in this announcement = $852B valuation in this announcement (no actual increase in valuation)

Previous was $730B pre money. This one is $852B post money. So yeah it's the same one. Good catch.

yup and begging for retailers money.

same

Probably their auditors? Lying about this would be tantamount to (very serious) securities fraud. Not sure what you're basing your allegations on besides "trust me bro"


Why would lying about having E2EE be securities (as in stock market) fraud? Would that make any lie ever told by a corporation equate to stock market fraud?


Yes! As Matt Levine says, “everything is securities fraud”


So if Microsoft tells me upgrading to windows 11 will make my computer better, you think that's securities fraud?


Did I say that?


Yes. You said everything is securities fraud


Cursor just wrote a great blog post on this - "Fast regex search: indexing text for agent tools" https://cursor.com/blog/fast-regex-search

