I find it hard to believe that seven-figure software engineers at top labs aren't being careful about how much post-ChatGPT-era internet content is going into their training data.
I agree - but as the Internet descends into all-slop-all-the-time (seriously, just do a search for reviews or travel advice or technical questions - or most anything - to see it), where do you expect the high-quality training material on future things to come from? I have a hard time imagining it.
Your Claude Code sessions. Every interaction. Every time the model is asked to do something and then gets feedback on that something ("this didn't work, I got this traceback").
Textbooks, company wikis, news corpora, structured reports of all kinds from far more sources than what is available on the web.
On your first line -- is it clear that's a good thing? Massive "it depends".
Sadly, enterprise fizzbuzz style is wildly successful compared to ghostty style.
Put another way, a gem of code versus the masses of mess. It's amazing new models aren't worse. And now most of this human interaction is with vibers.
LLMs trained by the crowd risk being medianizers, or rather, mediocritizers.
One need not look further than "Absolutely!" to see this in play -- user selection shapes the corpus, and the corpus shapes the model. Suddenly content everywhere is "little houses, all alike."
On your second line -- I couldn't agree more strongly.
ANTHROP\C has been sitting inside high-performance white-collar industries with top builders; that signal is priceless compared to feedback farms in Kenya.
Bet on models that see spiky, pointy mastery at play.