That's a tautology. People think chinese models are 10x more efficient because t...

irthomasthomas · 2026-03-10T10:50:05 1773139805

Opus doubled in speed with version 4.5, leading me to speculate that they had promoted a Sonnet-size model. The new, faster Opus was the same speed as Gemini 3 Flash running on the same TPUs. I think Anthropic's margins are probably the highest in the industry, but they have to split that with Google by renting their TPUs.

F7F7F7 · 2026-03-10T15:08:50 1773155330

The conspiracy theorist side of me whispers "instead of the rumored Sonnet 5.0 you got Opus 4.6...suspicious"

aerhardt · 2026-03-10T13:38:45 1773149925

I guess more than a tautology it is an inversion of observed causes and effects?

xhevahir · 2026-03-15T10:56:31 1773572191

It's petitio principii, or begging the question, if I'm not mistaken.

grayxu · 2026-03-10T12:00:30 1773144030

This is not a valid argument. TPS is essentially QoS and can be adjusted; more GPUs allocated will result in higher speed.

yorwba · 2026-03-10T12:44:28 1773146668

There are sequential dependencies, so you can't just arbitrarily increase speed by parallelizing over more GPUs. Every token depends on all previous tokens, every layer depends on all previous layers. You can arbitrarily slow a model down by using fewer, slower GPUs (or none at all), though.
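To make the diminishing returns concrete, here is a toy latency model. It is only a sketch: the layer count, per-layer compute time, and per-layer synchronization cost are made-up assumptions, and the function names are hypothetical. Layers still run one after another, so splitting each layer across more GPUs shrinks the compute term while the fixed sync cost per layer stays.

```python
def per_token_latency_ms(n_gpus, n_layers=80, layer_compute_ms=0.4, sync_ms=0.02):
    """Toy latency for decoding one token with each layer split across n_gpus.

    All default numbers are illustrative assumptions, not measurements.
    """
    # Work inside a layer is divided across GPUs...
    per_layer = layer_compute_ms / n_gpus
    # ...but once more than one GPU is involved, every layer pays a fixed
    # synchronization cost that does not shrink with more GPUs.
    if n_gpus > 1:
        per_layer += sync_ms
    # Layers are sequential: layer i needs the output of layer i-1.
    return n_layers * per_layer


def tokens_per_second(n_gpus, **kwargs):
    # Tokens are sequential too, so single-stream speed is 1 / latency.
    return 1000.0 / per_token_latency_ms(n_gpus, **kwargs)


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16, 64):
        print(f"{n:3d} GPUs -> {tokens_per_second(n):7.1f} tok/s")
```

Under these assumptions the speed can never exceed 1000 / (n_layers * sync_ms) tokens per second for a single stream, no matter how many GPUs are added.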

erichocean · 2026-03-10T13:01:00 1773147660

Partially true, you can predict multiple tokens and confirm, which typically gives a 2-3x speedup in practice.

(Confirmation is faster than prediction.)

Many model architectures are specifically designed to make this efficient.
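A minimal sketch of the predict-and-confirm loop, assuming greedy decoding and two hypothetical callables (`target_next`, `draft_next`) that map a token list to the next token. Real systems verify all drafted positions in one batched forward pass of the large model; here that is only counted as one "pass", not actually batched.

```python
def speculative_decode(target_next, draft_next, prompt, n_new, k=4):
    """Greedy speculative decoding on plain token lists (illustrative only).

    Returns (tokens, number_of_target_passes).
    """
    tokens = list(prompt)
    passes = 0
    while len(tokens) - len(prompt) < n_new:
        # 1. The cheap draft model guesses k tokens, one after another.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The target model checks the guesses. Counted as one pass,
        #    standing in for a single batched forward pass.
        passes += 1
        accepted = []
        for guess in draft:
            correct = target_next(tokens + accepted)
            accepted.append(correct)
            if guess != correct:
                break  # first mismatch: keep the target's token, drop the rest
        else:
            # Every guess matched: the same pass also yields one bonus token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[: len(prompt) + n_new], passes


# Hypothetical toy "models": the draft agrees with the target except at
# every 5th position.
def target_next(tokens):
    return (tokens[-1] * 7 + 3) % 101


def draft_next(tokens):
    guess = target_next(tokens)
    return guess if len(tokens) % 5 else (guess + 1) % 101


if __name__ == "__main__":
    out, passes = speculative_decode(target_next, draft_next, [1], n_new=20)
    print(len(out) - 1, "new tokens in", passes, "target passes")
```

The output is the same sequence the target model would produce on its own; only the number of sequential target passes changes.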

---

Separately, your statement is only true for the same gen hardware, interconnects, and quantization.

grumpoholic · 2026-03-10T13:01:48 1773147708

With speculative decoding, however, you can use additional models to speed up generation.

salawat · 2026-03-12T23:26:18 1773357978

Yes, because speculation has NEVER bitten us in the ass before, right? Coughs in Spectre

Speculative decoding is just running more hardware to get a faster prediction. Essentially, setting more money on fire if you're being billed per token.
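A back-of-the-envelope sketch of that trade-off. It assumes each drafted token is accepted independently with probability `accept_rate`, and that one draft step costs `draft_cost` relative to one target pass; both parameters and all function names are hypothetical simplifications, and using one `draft_cost` for both wall time and compute is a further assumption.

```python
def expected_tokens_per_pass(accept_rate, k):
    """Expected tokens emitted per target pass with k drafted tokens."""
    if accept_rate >= 1.0:
        return float(k + 1)
    # Geometric series 1 + a + a^2 + ... + a^k
    return (1.0 - accept_rate ** (k + 1)) / (1.0 - accept_rate)


def wall_clock_speedup(accept_rate, k, draft_cost):
    """Speedup over plain decoding: tokens per step / time per step."""
    return expected_tokens_per_pass(accept_rate, k) / (k * draft_cost + 1.0)


def compute_per_token(accept_rate, k, draft_cost):
    """Compute spent per emitted token, relative to plain decoding (= 1.0).

    The target scores k + 1 positions per pass, and the draft runs k times.
    """
    work = (k + 1) + k * draft_cost
    return work / expected_tokens_per_pass(accept_rate, k)


if __name__ == "__main__":
    a, k, c = 0.8, 4, 0.05
    print("tokens per pass:", round(expected_tokens_per_pass(a, k), 4))
    print("speedup:", round(wall_clock_speedup(a, k, c), 4))
    print("relative compute per token:", round(compute_per_token(a, k, c), 4))
```

With these illustrative numbers the stream gets faster while the total work per emitted token goes up, which is the cost being pointed at here.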

re-thc · 2026-03-10T08:24:48 1773131088

> That's a tautology. People think chinese models are 10x more efficient because they're 10x cheaper

They do have different infrastructure / electricity costs and they might not run on nvidia hardware.

It's not just the models.

jychang · 2026-03-10T08:34:48 1773131688

Except there are providers that serve both Chinese models AND Opus. On the same hardware.

Namely, Amazon Bedrock and Google Vertex.

That means normalized infrastructure costs, normalized electricity costs, and normalized hardware performance. Normalized inference software stack, even (most likely). It's about as close to a 1-to-1 comparison as you can get.

Both Amazon and Google serve Opus at roughly ~1/2 the speed of the chinese models. Note that they are not incentivized to slow down the serving of Opus or the chinese models! So that tells you the ratio of active params for Opus and for the chinese models.
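The inference from speed to active params can be sketched with a simple bandwidth-bound model. This assumes single-stream decoding is limited by reading the active weights from memory once per token, and it ignores batching, KV-cache traffic, speculative decoding, and interconnect effects. The bandwidth figure, bytes per parameter, and function names are all illustrative assumptions.

```python
def decode_tokens_per_second(active_params_b, bytes_per_param=1.0,
                             mem_bandwidth_gbs=3000.0):
    """Rough upper bound on single-stream decode speed.

    active_params_b: active parameters per token, in billions (assumed).
    """
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return mem_bandwidth_gbs * 1e9 / bytes_per_token


def implied_active_param_ratio(speed_a, speed_b):
    """If A and B run on the same stack, A's active params relative to B's."""
    return speed_b / speed_a


if __name__ == "__main__":
    # A model served at half the speed of another implies roughly twice
    # the active parameters under this model.
    print(implied_active_param_ratio(speed_a=50.0, speed_b=100.0))
    print(decode_tokens_per_second(active_params_b=30.0))
```

The ratio only holds to the extent that both models really are served with the same hardware, quantization, and software stack.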

Shakahs · 2026-03-10T10:45:29 1773139529

AWS and GCP both have their own custom inference chips, so a better example for hosting Opus on commodity hardware would be Digital Ocean.

giancarlostoro · 2026-03-10T11:29:47 1773142187

And Microsoft's Azure. It's on all 3 major cloud providers, which tells me they can make a profit from these cloud providers without having to pay for any hardware. They just take a small enough cut.

https://code.claude.com/docs/en/microsoft-foundry

https://www.anthropic.com/news/claude-in-microsoft-foundry

re-thc · 2026-03-10T11:37:14 1773142634

> Both Amazon and Google serve Opus at roughly ~1/2 the speed of the chinese models

The claim we were responding to was about 10x, not 0.5x.

x86 vs arm64 could have different performance. The Chinese models could be optimized for different hardware so it could show massive differences.

atq2119 · 2026-03-10T12:55:08 1773147308

These providers do not run models on CPUs, x86 vs. Arm is irrelevant.

re-thc · 2026-03-10T18:17:12 1773166632

They run Nvidia and Huawei for example. And mine was just an example.

raggi · 2026-03-10T12:49:51 1773146991

Deployments like Bedrock have nowhere near SOTA operational efficiency; they are 1-2 OOM behind. The hardware is much closer, but pipeline, scheduling, cache, recomposition, routing, etc. optimizations blow naive end-to-end architectures out of the water.

Analemma_ · 2026-03-10T17:12:50 1773162770

Do you have evidence for any of this, or are you repeating a bunch of buzzwords you’ve heard breathlessly repeated on Twitter?

raggi · 2026-03-11T14:02:20 1773237740

Many techniques are documented in papers, particularly those coming out of the Asian teams. I know of work going on at Western providers that is similarly advanced. In short, read the papers.

nullstyle · 2026-03-10T14:51:45 1773154305

Evidence?

fennecfoxy · 2026-03-10T09:55:28 1773136528

I mean GN has covered the Nvidia black market in China enough that we pretty much know that they run on Nvidia hardware still.

dryarzeg · 2026-03-10T10:06:36 1773137196

How is this related to inference, may I ask? Except for some very hardware-specific optimizations of model architecture, there's nothing to prevent you from hosting these models on your own infrastructure. And that's actually what many OpenRouter providers, at least some of which are based in the US, are doing. Most of the Chinese models mentioned here are open-weight (except for Qwen, which has one proprietary "Max" model), so literally anyone can host them, not just someone from China. So it just doesn't really matter.

fennecfoxy · 2026-03-10T10:26:36 1773138396

I mean sure, but in terms of inference performance per dollar/per watt Nvidia's GPUs are pretty up there - unless China is pumping out domestic chips cheaply enough.

Also with Nvidia you get the efficiency of everything (including inference) being built on/for CUDA; even efforts to catch AMD up are still ongoing afaik.

I wouldn't be surprised if things like DS were trained and now hosted on Nvidia hardware.

re-thc · 2026-03-10T10:51:01 1773139861

> unless China is pumping out domestic chips cheaply enough

They are. Nvidia makes A LOT of profit. Hey, top stock for a reason.

> I wouldn't be surprised if things like DS were trained and now hosted on Nvidia hardware

DS is "old". I wouldn't study them. The new 1s have a mandate to at least run on local hardware. There are data center requirements.

I agree it could still be trained on Nvidia GPUs (black market etc), but not running.

yorwba · 2026-03-10T11:05:18 1773140718

> The new 1s have a mandate to at least run on local hardware.

They do? Source?

But if that's true, it would explain why Minimax, Z.ai and Moonshot are all organized as Singaporean holding companies, with claimed data center locations (according to OpenRouter) in the US or Singapore and only the devs in China. Can't be forced to use inferior local hardware if you're just a body shop for a "foreign" AI company. ;)

re-thc · 2026-03-10T11:33:22 1773142402

> with claimed data center locations (according to OpenRouter) in the US or Singapore and only the devs in China

They just have a China-only endpoint and likely a company under a different name.

Nothing to do with AI. TikTok is similar (global vs China operations).