Well, for one, Anthropic mostly uses Google TPUs and Amazon Inferentia2 chips, n...

Chamix · 2026-03-13T21:59:17 1773439157

I appreciate the detailed comment! I took the day off and am bored so have a brain dump of a reply - basically I think we are talking past each other on two major points:

1. All the discussion about model size is CRITICALLY bisected into talking about TOTAL model size vs ACTIVE parameter size (of a "head" in an "Mixture of Experts"). Everything you've said trend-wise is mostly accurate for ACTIVE parameter count, which is what determines inference cost and speed.

But I am primarily talking about TOTAL parameter count (which has to just fit inside cluster HBM). The total parameter count only affects training cost and has nothing to do with inference cost or speed. So there is no downside to making total parameter count as big as your inference cluster will fit.

2. You touch on distllation, and this heavily relates to the post-gpt-4 base model (call it 5th gen, if gpt-4 was 4th gen), which indeed was used for all models through gpt5.1.

The actual base 5th gen model was as large as OAI could fit on training clusters, and only then distilled down to whatever total size a release model targeted, and the little secret with sparse MOE is the entire model weights don't have to fit (again, plenty of public papers detailing techniques) on a single HBM pool when training. This leads to the 2nd little secret, that GPT-4.5 is ALSO using that same base model; as I said in another comment, 4.5 was all an experiment in testing a huge ACTIVE parameter model (which again is all that determines cost and speed), not so much total (which is capped by inference cluster hardware anyways!) How do you think OAI would be able to serve 4.5 at scale if the model itself was 10x total bigger than everything else? But its easy to serve a model with active parameters 10x bigger!

So this same huge 5th gen base model was distilled down and RLed over and over again in different permutations and sizes to feed the whole OAI model lineup, from o4-mini to advanced voice to gpt4.5 all the way until finally 5.2 starts using a new, "6th gen" base model (with various failed base model trainings between 5th and 6th) (shallotpeat!).

Picking up misc pieces, yes 4o was tiny when served at Q4, which is what Maia 100 did (with some Q6). We are still taking about a ~1T total model. Quantization both static and dynamic was the whole drive behind gpt4-turbo variants which led straight into 4o targeting an extremely economical deployment of 5th gen base. Economical was sorely needed (arrakis!) since this all was at the critical junction when 8xH100s had not been deployed quite at scale yet, but AI use was rocketing off to mainstream, so we had silly situations like Azure being forced to serve on 256gb clusters. (We could go into a whole separate spiel about quantization +history, but suffice it to say everything in deployment is just Q4 these days, and training is mostly Q8)

But this DOES NOT mean o1 was tiny, which conveniently was deployed right when 8xH100s WERE available at scale. We split into the instant tree, where 4.1 was bigger than 4o and 5-instant was bigger than 4.1 etc. And the thinking tree, where o1 = o3 < 5-thinking < 5.2-thinking. Again, the ACTIVE counts were very small comparatively, especially as it let you cheaply experiment and then train with substantial inference compute required for RL training/unrolling! But there was no reason not to fit increasingly large distilled versions of the 5th-gen/6th-gen base models as the inference fleet buildouts (particularly in 2H 2025) came online! The same 5th and now 6th gen base models were refined and twisted (foundry!) into totally different end models and sizes.

I just think this really all comes down to total vs active, not understanding a huge base model can be distilled into arbitrarily sized release models, and then bizarrely giving weight to Meta's completely incompetent Llama 4 training run (I was there, Gandalf!) as giving any sort of insight on what sort of sparsity ratio cutting edge labs are using. You cannot learn anything about total parameter size from active parameter count+ derivatives (token speed, cost, etc)! But on this topic we could again diverge into an entire debate; I'll just say Google is likely doing like 0.1%-OOM in some production configs (Jim Keller is basically shouting extreme sparsity from the rooftops!).

Brief rebuttal summary:

1. Incorrect as of late 2025. Whole public reporting about Anthropic dissatisfaction with "Project Ranier". Dario talking about Nvidia compute candidly on Dwarkish interview!

2. Active vs Total

3. 4o is small, 4-bit 4o on Azure even smaller. 4o is 5th gen base distilled not gpt-4 distilled.

4. 256gb at Q4 fits 1T parameters! Active vs total

5. 5th gen pretrain / base model is huge! 4.5 uses the same base as 4o and 5.1! Can be shrunk to arbitrary size before RL/post training create finished model! Active vs total

6. Active vs total

7. Active vs total, also Ironwood/TPUv7 and Blackwell give much cheaper Q4 inference

8. Don't trust the Zuck

Anyways its all a mess and I don't think its possible to avoid talking past each other or misunderstanding in semi-casual conversation - even just today Dylan Patel (who is extremely well informed!) was on Dwarkesh podcast talking about 5.4-instant having a smaller active parameter count than GPT-4 (220B active), which is completely true, but instantly gets misinterpreted on twitter et al that 5.4 is a smaller model than gpt-4, ignores that 5-4.instant are 5.4-thinking are totally different models, etc etc, just too much nuance to easily convey.

jychang · 2026-03-19T11:05:49 1773918349

1. Claiming that gpt-4o and gpt-4.5 came from the same training run is ridiculous, gpt-4.5 was not distilled from the same pretrain as 4o.

- Mark Chen has literally publicly said as much, it's a completely different pretrain run.

- And clearly if openai has a good big base model before 4.5, they would have released it back in 2024.

"How do you think OAI would be able to serve 4.5 at scale if the model itself was 10x total bigger than everything else?" through pipeline parallelism, not tensor parallelism. Don't need to synchronize an all-reduce across clusters. You lose tons of tokens/sec per user though. That's exactly what we see with gpt-4.5 in real life- slow ~10token/sec inference.

2. 4o was definitely not served fully at 4-bit/6-bit, and even at 4-bit a 1T model wouldn't fit in a Maia cluster with reasonable kv cache for users. You can't quant attention down to 4-bit/6-bit, that would give the model brain damage. A production environment would quant attention down to fp8 at most. Even local home users don't quant attention down to 4 bit. Unsloth UD Q4 quants usually quant attention to Q8. https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF/blob/mai...

blk.0.attn_qkv.weight [2 048, 8 192] Q8_0

Also, Q4/Q6/Qwhatever are quants used by llama.cpp only, and nobody in a production environment would be using llama.cpp at all. So, saying "Qwhatever" is a clear indicator you have no clue what you're talking about.

Since 4o predates widespread MLA, they're clearly using GQA and thus you can estimate the size per token from an approximate attention head size. Note that Azure offers 4o with max context of 128k tokens. That's about 4-8gb kv cache at full context size. Even at 4bit (it's not at 4bit), 4o is 500b at most, if you actually want to serve customers! Providers do not do batch=1 inference, that would leave the GPU core idle while memory bandwidth is saturated. So they'd have to batch many users onto one machine, with all their kv caches resident in memory. There's just no way you can fit a 1T model with 8+ bit attention and a bunch of users' kv cache into 256GB, even if the ffn was fp4.

3. Microsoft leaked the size of 4o, you know. And there's also other estimates. They all estimate 4o at around 200b. https://arxiv.org/pdf/2412.19260 or https://epoch.ai/gradient-updates/frontier-language-models-h...

4. "(We could go into a whole separate spiel about quantization +history, but suffice it to say everything in deployment is just Q4 these days, and training is mostly Q8)"

More accurately, most deployments are FP4 for ffn, and still 8 bit or 16 bit for attention. And only the chinese labs train at FP8. There's very little reason to train at FP8 when your AdamW states and gradients are still FP32 and FP16. And note that even deepseek uses FP16/FP32 AdamW/gradients.

https://arxiv.org/pdf/2412.19437 That's deepseek using FP8 live weight copy + FP32 master + FP32 grad + BF16 moments = 13 bytes per parameter. BF16 weights is 14 bytes per parameter. There's very reason to use FP8 weights over BF16 weights during training, you don't save that much VRAM/compute, unless you're very desperate like Deepseek. Most labs now still train for W16A16 but apply QAT, not train at FP8. Even the chinese labs do this now- Kimi K2.5 is BF16 native, and just quantize ffn down to int4 with QAT. You can tell, because Kimi K2.5 attention tensors are BF16 and not FP8.

4. "instant tree" "And the thinking tree, where o1 = o3 < 5-thinking < 5.2-thinking." What you're describing is a massive waste of money. Nobody's doing that. Each time you distill a model to a different size, you have to do that separately. That's a waste of compute. Nope, openai just took the same model, and kept on posttraining it more and more, and published some checkpoints. That's what everyone does. The various gpt-4o-2024-05-13 and gpt-4o-2024-08-06 and gpt-4o-2024-11-20 and gpt-5 and gpt-5.1 ... and o1 and o3 and gpt-5-thinking models are NOT different sizes.

Every lab takes a model, and iterate on it and train it more and more. Creating a bunch of distills is expensive. Training a model compute is approximately Compute ≈ 6(number of active params)(tokens trained). Posttraining is basically just throwing a few more tokens into the model and doing some forward and backwards passes. I don't know how many tokens they trained on, but it's somewhere in the 10T to 100T range. Distilling a model compute ≈ [2(big model active params) +6(small model active params)](tokens trained). This is way more expensive per token than training! There's less passes, but you don't get the value you think from distills.

Look at Deepseek! Deepseek V3? 671B total parameters checkpoint. R1? 671B total parameters checkpoint. V3 0324? 671B total parameters checkpoint. R1 0528? 671B total parameters checkpoint. V3.1 combined thinking and non-thinking? 671B total parameters checkpoint. V3.1 Terminus? 671B total parameters checkpoint. V3.2? 671B total parameters checkpoint.

5. Sparsity matters. Nobody currently is going below 1% sparsity.

MoE sparsity is just the ratio of total number of experts to active experts. Most labs settle on around 8 out of 256 (like Deepseek, GLM, etc) aka 6.25%. There's plenty of research showing that models break down at too high of a sparsity, which is why total params is correlated to active params.

Also, please don't use the word "head" to refer to a MoE expert. The word "head" has a specific meaning in ML and it's not that. It's referring to the component in multi-head attention. That's like using the word "transmission" when talking about a car but not referring to the actual transmission. It's making you look really weird.

Actually, we know what architecture openai was using a few years ago- because openai released it. That was the whole point of gpt-oss. Notably, it uses mxfp4 for MoE, but still uses BF16 for GQA attention, and it has 4 of 128 experts sparsity. Yes, even OpenAI realized that staying around 6.25% sparsity is a good idea. And note that OpenAI clearly did not think quantizing attention is a good idea, even if they apply QAT to create a mxfp4 ffn.

Basically, you have no clue what you're talking about. You're somehow claiming that openai is doing a ton of distills, one for each of 4o/o1/o3/gpt-5/gpt-5.1 thinking and nonthinking, to different sizes... instead of just taking a model they already have, and doing more training and more checkpoints like everyone else. They'd be insane if they were doing that.