Same way you distill any model. Training data efficiency matters only while you train the source model/ensemble. Once you have that you are purely compute bound during distillation.
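To make the compute-bound point concrete, here's a minimal sketch of the standard distillation loss (soft-target cross entropy); all names and the temperature value are illustrative, not from the thread:

```python
# Minimal distillation sketch: the student is trained to match the teacher's
# soft output distribution, so the only bottleneck is running both models.
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(l / temperature) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between teacher and student softened distributions."""
    p = softmax(teacher_logits, temperature)   # teacher soft targets
    q = softmax(student_logits, temperature)   # student predictions
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))
```

Note the training data only enters through the inputs fed to the teacher; once the teacher exists, generating targets is pure compute.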
i think evolution meta-learns the architecture and hyperparams, plus some domain knowledge (for ex, we all perceive the world as 3d), but not much. if you compare the text consumed by a human vs an AI (and i think this is fair b/c even accounting for evolution, text is a pretty recent invention for humans), the gap is many orders of magnitude.
> you can simply generate more, and higher quality, artificial data
this is simply not true, and it's very clear if you look at continual learning, robotics, biology, etc. each has enough economic incentive to spend 1000x compute if that led to much better results, but we just don't know how to do that.
good point on chinchilla, but our models are still absurdly large no matter what standards you compare them to.
> this is simply not true. and it's very clear if you look at continual learning, robotics, biology, etc. each has enough economic incentives to spend 1000x compute if that led to much better results, but we just don't know how to do that
I'm talking (and so is the post itself) about LLMs in particular, and this is indeed true for LLMs.
yes! typically the optimizer that trains faster also gets better data efficiency. it may not be universally true, but that has been my observation so far. also see https://arxiv.org/pdf/2510.09378 for second-order methods.
Fundamentally I don't believe second-order methods improve data efficiency by themselves, but changes to the optimizer can, because the convergence behavior changes. ML theory lags behind the results in practice.
Maybe some newer references are better, but my mind went to the Model Soups paper[1]:
> The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large pre-trained models, where fine-tuned models often appear to lie in a single low error basin. We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations often improves accuracy and robustness. Unlike a conventional ensemble, we may average many models without incurring any additional inference or memory costs -- we call the results "model soups."
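The core operation in the paper is just uniform weight averaging. A toy sketch (weights as plain dicts of lists; a real implementation would average framework tensors, e.g. PyTorch state_dicts):

```python
# "Model soup": element-wise average of several fine-tuned checkpoints
# of the same architecture -- one model's inference cost, ensemble-like gains.

def uniform_soup(state_dicts):
    """Average the parameters of several same-shaped models into one."""
    n = len(state_dicts)
    return {
        k: [sum(sd[k][i] for sd in state_dicts) / n
            for i in range(len(state_dicts[0][k]))]
        for k in state_dicts[0]
    }

# Three checkpoints fine-tuned with different hyperparameters (toy values):
m1 = {"w": [1.0, 2.0], "b": [0.0]}
m2 = {"w": [3.0, 2.0], "b": [3.0]}
m3 = {"w": [2.0, 2.0], "b": [6.0]}
soup = uniform_soup([m1, m2, m3])
# soup has the same shape as a single model, so inference cost is unchanged.
```

The paper also describes a "greedy soup" variant that only adds a checkpoint to the average if it improves held-out accuracy; the sketch above is the uniform version.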
It's the opposite of a MoE architecture in many ways. MoE splits every individual feed-forward layer into many tiny subnetworks, only a small number of which contribute to the layer output, and they get trained together to complement each other.
Ensembling makes multiple copies of the entire model, trains them independently on the same task, and then has every copy contribute to the output.
Reducing computation vs. increasing it; operating at per-layer granularity vs. whole model; specialization vs. redundancy.
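The contrast above can be shown with scalar "experts" standing in for full networks (a deliberately tiny sketch, not a real implementation):

```python
# Ensemble: every full model runs, outputs averaged -> cost grows with size.
def ensemble(models, x):
    outs = [m(x) for m in models]
    return sum(outs) / len(outs)

# MoE layer: only the top-k experts (by gate score) run -> cost stays flat.
def moe_layer(experts, gate_scores, x, k=1):
    top = sorted(range(len(experts)),
                 key=lambda i: gate_scores[i], reverse=True)[:k]
    total = sum(gate_scores[i] for i in top)
    return sum(gate_scores[i] / total * experts[i](x) for i in top)

double = lambda x: 2 * x
triple = lambda x: 3 * x
ensemble([double, triple], 1.0)                 # both run: (2 + 3) / 2 = 2.5
moe_layer([double, triple], [0.9, 0.1], 1.0)    # only top expert runs: 2.0
```

In a real MoE the gate scores come from a learned router and this selection happens inside every feed-forward layer, which is exactly the per-layer granularity mentioned above.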
I see you already mention diffusion - iirc there was a result not too long ago that diffusion models keep improving with more epochs for longer than AR models do.
diffusion is promising, but it's still an open question how data efficient it is compared to AR. in practice, you can also train AR forever with high enough regularization, so let's see.
yes, agreed, modded-nanogpt is already a data-efficient variant of the original nanogpt. it's just that the kinds of algorithms it allows are somewhat constrained, because it optimizes for wall-clock time.