
This is not a valid argument. TPS is essentially QoS and can be adjusted; more GPUs allocated will result in higher speed.



There are sequential dependencies, so you can't arbitrarily increase speed by parallelizing over more GPUs: every token depends on all previous tokens, and every layer depends on all previous layers. You can, however, arbitrarily slow a model down by using fewer, slower GPUs (or none at all).
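A minimal sketch of that sequential dependency, with a hypothetical `next_token` function standing in for a full model forward pass:

```python
# Why autoregressive decoding is inherently sequential: each output token
# is a function of the ENTIRE context so far, so step t cannot begin
# before step t-1 has finished. (next_token is a toy stand-in, not a real model.)

def next_token(context):
    # Stand-in for a model forward pass over the whole context.
    return sum(context) % 7

def generate(prompt, n):
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(next_token(tokens))  # strict sequential dependency
    return tokens

print(generate([1, 2], 3))  # [1, 2, 3, 6, 5]
```

More GPUs can make each `next_token` call faster (parallelism *within* a step), but they cannot run step t+1 before step t completes.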

Partially true: you can draft multiple tokens with a cheap model and then verify them with the full model, which typically gives a 2-3x speedup in practice.

(Verification is faster than generation: the full model can score all the drafted tokens in a single parallel forward pass, instead of one pass per token.)

Many model architectures are specifically designed to make this efficient.

---

Separately, your statement only holds for the same generation of hardware, interconnects, and quantization.


With speculative decoding, however, you can use additional models to speed up generation.

Yes, because speculation has NEVER bitten us in the ass before, right? Coughs in Spectre

Speculative decoding is just running more hardware to get faster predictions. Essentially, setting more money on fire if you're being billed per token.



