Same way you distill any model. Training data efficiency matters only while you train the source model/ensemble. Once you have that you are purely compute bound during distillation.
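To make the compute-bound point concrete, here's a minimal sketch of the standard distillation loss (soft-target cross entropy); all names and the temperature value are illustrative, not from the thread:

```python
# Minimal distillation sketch: the student is trained to match the teacher's
# soft output distribution, so the only bottleneck is running both models.
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(l / temperature) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between teacher and student softened distributions."""
    p = softmax(teacher_logits, temperature)   # teacher soft targets
    q = softmax(student_logits, temperature)   # student predictions
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))
```

Note the training data only enters through the inputs fed to the teacher; once the teacher exists, generating targets is pure compute.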
i think evolution meta-learns the architecture and hyperparams, plus some domain knowledge (for ex, we all perceive the world as 3d), but not much. if you compare the text consumed by a human vs an AI (and i think this is fair b/c even accounting for evolution, text is a pretty recent invention for humans), the gap is many orders of magnitude.
> you can simply generate more, and higher quality, artificial data
this is simply not true, and it's very clear if you look at continual learning, robotics, biology, etc. each has enough economic incentive to spend 1000x compute if that led to much better results, but we just don't know how to do that.
good point on chinchilla, but our models are still absurdly large no matter what standards you compare them to.
> this is simply not true. and it's very clear if you look at continual learning, robotics, biology, etc. each has enough economic incentives to spend 1000x compute if that led to much better results, but we just don't know how to do that
I'm talking (and so is the post itself) about LLMs in particular, and this is indeed true for LLMs.
yes! typically the optimizer that trains faster also gets better data efficiency. it may not be universally true, but that has been my observation so far. also see https://arxiv.org/pdf/2510.09378 for second-order methods.
Fundamentally I don't believe second-order methods improve data efficiency by themselves, but changes to the optimizer can, because the convergence behavior changes. ML theory lags behind the results in practice.
Maybe some newer references are better, but my mind went to the Model Soups paper[1]:
> The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large pre-trained models, where fine-tuned models often appear to lie in a single low error basin. We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations often improves accuracy and robustness. Unlike a conventional ensemble, we may average many models without incurring any additional inference or memory costs -- we call the results "model soups."
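The core operation in the paper is just uniform weight averaging. A toy sketch (weights as plain dicts of lists; a real implementation would average framework tensors, e.g. PyTorch state_dicts):

```python
# "Model soup": element-wise average of several fine-tuned checkpoints
# of the same architecture -- one model's inference cost, ensemble-like gains.

def uniform_soup(state_dicts):
    """Average the parameters of several same-shaped models into one."""
    n = len(state_dicts)
    return {
        k: [sum(sd[k][i] for sd in state_dicts) / n
            for i in range(len(state_dicts[0][k]))]
        for k in state_dicts[0]
    }

# Three checkpoints fine-tuned with different hyperparameters (toy values):
m1 = {"w": [1.0, 2.0], "b": [0.0]}
m2 = {"w": [3.0, 2.0], "b": [3.0]}
m3 = {"w": [2.0, 2.0], "b": [6.0]}
soup = uniform_soup([m1, m2, m3])
# soup has the same shape as a single model, so inference cost is unchanged.
```

The paper also describes a "greedy soup" variant that only adds a checkpoint to the average if it improves held-out accuracy; the sketch above is the uniform version.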
It's the opposite of a MoE architecture in many ways. MoE splits every individual feed-forward layer into many tiny subnetworks, only a small number of which contribute to the layer output, and they get trained together to complement each other.
Ensembling makes multiple copies of the entire model, trains them independently on the same task, and then has every copy contribute to the output.
Reducing computation vs. increasing it; operating at per-layer granularity vs. whole model; specialization vs. redundancy.
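The contrast above can be shown with scalar "experts" standing in for full networks (a deliberately tiny sketch, not a real implementation):

```python
# Ensemble: every full model runs, outputs averaged -> cost grows with size.
def ensemble(models, x):
    outs = [m(x) for m in models]
    return sum(outs) / len(outs)

# MoE layer: only the top-k experts (by gate score) run -> cost stays flat.
def moe_layer(experts, gate_scores, x, k=1):
    top = sorted(range(len(experts)),
                 key=lambda i: gate_scores[i], reverse=True)[:k]
    total = sum(gate_scores[i] for i in top)
    return sum(gate_scores[i] / total * experts[i](x) for i in top)

double = lambda x: 2 * x
triple = lambda x: 3 * x
ensemble([double, triple], 1.0)                 # both run: (2 + 3) / 2 = 2.5
moe_layer([double, triple], [0.9, 0.1], 1.0)    # only top expert runs: 2.0
```

In a real MoE the gate scores come from a learned router and this selection happens inside every feed-forward layer, which is exactly the per-layer granularity mentioned above.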
I see you already mention diffusion - iirc there was a result not too long ago that diffusion models keep improving with more epochs for longer than AR models do.
diffusion is promising, but it's still an open question how data efficient it is compared to AR. in practice, you can also train AR forever with high enough regularization, so let's see.
yes, agreed, modded-nanogpt is already a data-efficient variant of the original nanogpt. it's just that the kinds of algorithms it allows are somewhat constrained, because it optimizes for wall-clock time.