
oh, an ensemble can be distilled into a single model easily.


How?


Same way you distill any model. Training data efficiency matters only while you train the source model/ensemble. Once you have that, you are purely compute-bound during distillation.
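As a rough sketch of that distillation step, assuming the standard soft-target knowledge-distillation recipe (the function names and the T=2.0 softening are illustrative, not from the thread):

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax, stabilized by subtracting the max logit.
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_targets(teacher_logits_list, T=2.0):
    # Average the ensemble's logits, then soften with temperature T
    # to form the student's training targets.
    avg_logits = np.mean(teacher_logits_list, axis=0)
    return softmax(avg_logits, T=T)

def kd_loss(student_logits, soft_targets, T=2.0):
    # Cross-entropy of the student's softened predictions against the
    # ensemble's soft targets; the T^2 factor keeps gradients comparable.
    log_p = np.log(softmax(student_logits, T=T) + 1e-12)
    return -(soft_targets * log_p).sum(axis=-1).mean() * T * T
```

Once the targets are computed, the teachers never need to run again, which is why the step is compute-bound rather than data-bound.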


i think evolution meta-learns the architecture and hyperparams, plus some domain knowledge (for ex, we all perceive the world as 3d), but not much. if you compare the text consumed by a human vs an AI (and i think this is fair b/c even with evolution, text is a pretty recent invention for humans), the gap is many orders of magnitude.


Tangentially, some scientists think humans may have hardwiring for detecting snakes https://en.wikipedia.org/wiki/Snake_detection_theory


the T stands for tea :)


Ah, so it's a source of randomness! Presumably 1.0 corresponds to a really hot cup of fresh tea.
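For reference, sampling temperature divides the logits before the softmax, so 1.0 leaves the model's distribution unchanged, lower values sharpen it toward greedy decoding, and higher values flatten it toward uniform. A minimal numpy sketch (illustrative, not any particular implementation):

```python
import numpy as np

def sample_with_temperature(logits, T=1.0, rng=None):
    # Divide logits by T, softmax, then draw one token index.
    # T -> 0 approaches argmax; T = 1.0 is the raw model distribution;
    # larger T injects more randomness.
    rng = rng or np.random.default_rng()
    z = np.asarray(logits) / T
    z = z - z.max()
    p = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(p), p=p)
```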


thanks!


> you can simply generate more, and higher quality, artificial data

this is simply not true. and it's very clear if you look at continual learning, robotics, biology, etc. each has enough economic incentives to spend 1000x compute if that led to much better results, but we just don't know how to do that.

good point on chinchilla, but our models are still absurdly large no matter what standard you compare them to.


> this is simply not true. and it's very clear if you look at continual learning, robotics, biology, etc. each has enough economic incentives to spend 1000x compute if that led to much better results, but we just don't know how to do that

I'm talking (and so is the post itself) about LLMs in particular, and this is indeed true for LLMs.


continual learning is LLMs :) ultimately everything will be/already is data bottlenecked.


absolutely!


yes! typically the optimizer that trains faster also gets better data efficiency. it may not be universally true, but that has been my observation so far. also see https://arxiv.org/pdf/2510.09378 for second-order methods.


That still looks like a “converge faster” paper.

https://arxiv.org/abs/2006.10732

The above provides a nuanced theoretical view. GD's inductive bias is probably better unless your model is misspecified.


Fundamentally I don't believe second-order methods get better data efficiency by themselves, but changes to the optimizer can, because the convergence behavior changes. ML theory lags behind the results in practice.


no, ensembling means training 8 models and, during inference, averaging the logits of all 8 to make a prediction.
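In code, that recipe is just a mean over the member models' logits (sketched here with stand-in callables; anything that maps an input to logits would do):

```python
import numpy as np

def ensemble_predict(models, x):
    # Run every member model, average their logits,
    # and predict from the averaged distribution.
    logits = np.stack([m(x) for m in models])  # (n_models, n_classes)
    return logits.mean(axis=0).argmax(axis=-1)
```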


Maybe some newer references are better, but my mind went to the Model Soups paper[1]:

The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large pre-trained models, where fine-tuned models often appear to lie in a single low error basin. We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations often improves accuracy and robustness. Unlike a conventional ensemble, we may average many models without incurring any additional inference or memory costs -- we call the results "model soups."

[1]: https://arxiv.org/abs/2203.05482
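The weight averaging the abstract describes is a parameter-wise mean over checkpoints. A minimal sketch of the uniform soup (the paper's greedy variant additionally keeps a checkpoint only if held-out accuracy improves):

```python
def model_soup(state_dicts):
    # Uniform soup: average each parameter across all fine-tuned
    # checkpoints. `state_dicts` is a list of {param_name: weights}
    # mappings from models fine-tuned off the same pre-trained base.
    n = len(state_dicts)
    return {k: sum(sd[k] for sd in state_dicts) / n for k in state_dicts[0]}
```

Unlike logit averaging, the result is a single model, so inference cost stays the same as one member.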


That doesn't seem all that different to a MoE architecture.


It's the opposite of a MoE architecture in many ways. MoE splits every individual feed-forward layer into many tiny subnetworks, only a small number of which contribute to the layer output, and they get trained together to complement each other.

Ensembling makes multiple copies of the entire model, trains them independently on the same task, and then has every copy contribute to the output.

Reducing computation vs. increasing it; operating at per-layer granularity vs. whole model; specialization vs. redundancy.
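To make the contrast concrete, a toy top-k MoE layer might look like this (purely illustrative: in real MoE the routing happens per token inside each feed-forward layer and the experts are full MLPs):

```python
import numpy as np

def moe_layer(x, experts, gate_w, k=2):
    # A gate scores the experts, only the top-k actually run, and their
    # outputs are combined weighted by the gate's renormalized softmax.
    scores = x @ gate_w                    # (n_experts,)
    top = np.argsort(scores)[-k:]          # indices of the k best experts
    w = np.exp(scores[top] - scores[top].max())
    w = w / w.sum()                        # gate weights over the top-k
    return sum(wi * experts[i](x) for wi, i in zip(w, top))
```

An ensemble, by contrast, would call every expert on every input and average the results, with no gate at all.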


yeah, we do incorporate some of the findings from the paper in our repo! like aggressive regularization and ensembling.


I see you already mention diffusion - iirc there was a result not too long ago that diffusion models keep improving with more epochs for longer than AR models do.


diffusion is promising, but it's still an open question how data-efficient it is compared to AR. in practice, you can also train AR forever with high enough regularization, so let's see.


Yes, it could go either way of course.

Still, just for reference, here's the paper I remembered: https://arxiv.org/pdf/2507.15857


thanks, here's another one: https://arxiv.org/abs/2511.03276


yes, agreed, modded-nanogpt is already a data-efficient variant of the original nanogpt. it's just that the kinds of algorithms it allows are somewhat constrained, because it optimizes for wall-clock time.

