
Math and coding competition problems are easier to train on because of strict rules and cheap verification. But once you go beyond that to less well-defined things such as code quality, where even humans have a hard time putting down concrete axioms, models start to hallucinate more and become less useful.

We are missing the value function that allowed AlphaGo to go from a mid-range player trained on human moves to superhuman by playing itself. As we have only made progress on unsupervised learning, and RL is constrained as above, I don't see this getting better.




> I don't see this getting better.

We went from 2 + 7 = 11 to "solved a frontier math problem" in 3 years, yet people don't think this will improve?


I’ve seen this style of take so much that I’m dying for someone to name a logical fallacy for it, like “appeal to progress” or something.

Step away from LLMs for a second and recognize that “Yesterday it was X, so today it must be X+1” is a naive take, and obviously a trap that humans fall into easily (see: flying cars).


In finance we say "past performance does not guarantee future returns." Not because we don't believe that, statistically, returns will continue to grow at rate x, but because there is a chance that they won't. In reality the bias is actually in favour of these getting better faster, but there is a chance they do not.

This is true because markets are generally efficient; it's very hard to find predictive signals. That is a completely different space from what we're talking about here. Performance is incredibly predictable through scaling laws that continue to hold even at the largest scales we've built.

Even more insane than assuming the trend will continue is assuming it will not continue. We don't know for sure (especially not by pure reason), but the weight of probability sure seems to lean one direction.

Logical fallacies are vastly overrated. Unless the conversation is formal logic in the first place, "logical fallacies" are just a way to apply quick pattern matching to dismiss people without spending time on more substantive responses. In this case, both you and the other are speculating about the near future of a thing, neither of you knows.

Hard to make a more substantive response when the OP’s entire comment was a one-sentence logical fallacy. I’m not cherry-picking here.

> In this case, both you and the other are speculating about the near future of a thing, neither of you knows.

One of us is making a much grander claim than the other:

  - LLMs have limitless potential for growth; because they are not capable of something today does not mean they won’t be capable of it tomorrow
  - LLMs have fundamental limitations due to their underlying architecture and therefore are not limitless in capability

The post you replied to was:

> We went from 2 + 7 = 11 to "solved a frontier math problem" in 3 years, yet people don't think this will improve?

All that says is that the speaker thinks models will improve past where they are today. Not that it's a logical certainty (the first thing you jumped on them for), and certainly not anything about "limitless potential for growth" (which nobody even mentioned). With replies like this, invoking fallacies and attacking claims nobody made, you're adding a lot of heat and very little light here (and a few other threads on the page).


> All that says is that the speaker thinks models will improve past where they are today. Not that it's a logical certainty

Exceedingly generous interpretation in my opinion. I tend to interpret rhetorical questions of that form as “it’s so obvious that I shouldn’t even have to ask it”.


> generous interpretation

The term of art for that is steelmanning, and HN tries to foster a culture of it. Please check the guidelines link in the footer and ctrl+f "strongest".


Better put than I could have.

OK, it's not a logical fallacy, it's a false assumption.

The belief in the inevitability of progress is a bad assumption. Especially if you assume a particular technology will keep advancing.


We won't know if his assumption is false until time passes and moves future speculation into the empirical present.

A possibility is not a fact. Assuming a possibility will happen is not justified. Therefore it is false as an assumption, even if it is true that it is a possibility.

I genuinely have no idea what you're on about. One guy expressed his belief about how the future will play out, and another disagreed. Time will be the judge of it, not either of us.

Hmm... the sun coming up today is a pretty good bet that the sun will come up tomorrow.

We have robust scaling laws that continue to hold at the largest scales. It is a very safe bet that more compute + more training + algorithmic improvements will improve performance; it's not like we're rolling a trillion-dollar die.


Well, if people give the exact same 'reasons' why it could not do some task in the past that it then did manage to do, it is tiring seeing the same nonsense again. The reasoning here does not even make much sense: this result is not easily verifiable math.

Yeah, and even if we accept that models are improving in every possible way, going from this to 'AI is exponential, singularity etc.' is just as large a leap.

The comment doesn't say it must be X+1. It implies it will improve, which I would say is a pretty safe bet.


The scaling law is a power law, requiring orders of magnitude more compute and data for better accuracy from pre-training. Most companies have maxed it out.

For RL, we are arriving at a similar point https://www.tobyord.com/writing/how-well-does-rl-scale

Next stop is inference scaling with longer context windows and longer reasoning. But instead of being a one-off training cost, it becomes a running cost.

In essence we are chasing ever-smaller gains in exchange for exponentially increasing costs. This energy will run out. There needs to be something completely different from LLMs for meaningful further progress.
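As a toy illustration of what a power law implies (the exponent below is made up for the example, not a fitted constant), each extra order of magnitude of compute buys only the same fixed fractional improvement:

```python
# Toy power-law scaling curve: loss ~ C^(-alpha).
# alpha = 0.05 is an illustrative value, not a fitted constant.
def loss(compute: float, alpha: float = 0.05) -> float:
    return compute ** -alpha

# Each 10x of compute shrinks loss by the same factor, 10^-0.05 ≈ 0.89,
# so ever-larger spending buys ever-smaller absolute gains.
for c in [1e21, 1e22, 1e23, 1e24]:
    print(f"{c:.0e} FLOPs -> loss {loss(c):.3f}")
```

The multiplicative gain per decade of compute is constant while the cost grows tenfold each step, which is the "ever smaller gains for exponentially increasing costs" point.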


I tend to disagree that improvement is inherent. Really I'm just expressing an aesthetic preference when I say this, because I don't disagree that a lot of things improve. But it's not a guarantee, and it does take people doing the work and thinking about the same thing every day for years.

In many cases there's only one person uniquely positioned to make a discovery, and it's by no means guaranteed to happen. Of course, in many cases there are a whole bunch of people who seem almost equally capable of solving something first, but I think if you say things like "I'm sure they're going to make it better" you're leaving to chance something you yourself could have an impact on.

You can participate in pushing the boundaries, or even make a small push on something that accelerates someone else's work. You can also donate money to research you are interested in to help pay people who might come up with breakthroughs. Don't assume other people will build the future, you should do it too! (Not saying you DON'T.)

The problem class is very structured, which makes it "easier", yet the results are undeniably impressive.

But can it count the R's in strawberry?

That question is equivalent to asking a human to add the wavelengths of those two colors and divide it by 3.

Unless you're aware of hyperspectral image adapters for LLMs they aren't capable of that either.

Unfair - human beats AI in this comparison, as human will instantly answer "I don't know" instead of yelling a random number.

Or at best "I don't know, but maybe I can find out" and proceed to finding out. But they are unlikely to shout "6" just because they heard that number once when someone talked about light.


> human will instantly answer "I don't know" instead of yelling a random number.

Seems that you never worked with Accenture consultants?


Fair.

Yet this can be filtered with fixed rules, like "output produced by corporate structures is untrusted random data".


Why is that?

Because LLMs don't have a textual representation of any text they consume; it's just vectors to them. That's why they are so good at ignoring typos: the vector distance is so small it makes no difference to them.

Yes, it's ridiculously good at stuff like that now. I dare you to try and trick it.


What bothers me is not this particular issue, which will certainly disappear now that it has been identified, but that we have yet to identify the category of these "stupid" bugs...

We already know exactly what causes these bugs. They are not a fundamental problem of LLMs, they are a problem of tokenizers. The actual model simply doesn't get to see the same text that you see. It can only infer this stuff from related info it was trained on. It's as if someone asked you how many 1s there are in the binary representation of this text. You'd also need to convert it first to think it through, or use some external tool, even though your computer never saw anything else.

> It's as if someone asked you how many 1s there are in the binary representation of this text.

I'm actually kinda pleased with how close I guessed! I estimated 4 set bits per character, which with 491 characters in your post (including spaces) comes to 1964.

Then I ran your message through a program to get the actual number, and turns out it has 1800 exactly.
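For anyone who wants to check a text themselves, counting set bits in the UTF-8 encoding is a few lines of Python:

```python
def set_bits(text: str) -> int:
    # Sum the 1-bits across every byte of the UTF-8 encoding.
    return sum(bin(byte).count("1") for byte in text.encode("utf-8"))

print(set_bits("strawberry"))  # → 42
```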


Okay, I'm genuinely not an expert on the latest with LLMs, but isn't tokenization an inherent part of LLM construction? Kind of like support vectors in SVMs, or nodes in neural networks? Once we remove tokenization from the equation, aren't we no longer talking about LLMs?

It's not a side effect of tokenization per se, but of the tokenizers people use in actual practice. If somebody really wanted an LLM that can flawlessly count letters in words, they could train one with a naive tokenizer (like just ascii characters). But the resulting model would be very bad (for its size) at language or reasoning tasks.

Basically it's an engineering tradeoff. There is more demand for LLMs that can solve open math problems, but can't count the Rs in strawberry, than there is for models that can count letters but are bad at everything else.
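To make the tradeoff concrete, here is a toy sketch of why the model can't see the letters (the two-piece vocabulary and greedy matching are purely hypothetical; real BPE tokenizers learn tens of thousands of merges from data):

```python
# Hypothetical two-piece vocabulary for illustration only.
vocab = {"straw": 101, "berry": 102}

def tokenize(word: str) -> list[int]:
    # Greedy longest-prefix match over the toy vocabulary.
    tokens = []
    while word:
        piece = max((p for p in vocab if word.startswith(p)), key=len)
        tokens.append(vocab[piece])
        word = word[len(piece):]
    return tokens

# The model receives [101, 102]; the character 'r' never appears
# anywhere in its input, so "count the r's" must be inferred, not read.
print(tokenize("strawberry"))  # → [101, 102]
```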


LLMs in some form will likely be a key component in the first AGI system we (help) build. We might still lack something essential. However, people who keep doubting AGI is even possible should learn more about The Church-Turing Thesis.

https://plato.stanford.edu/entries/church-turing/


AGI is definitely possible - there is nothing fundamentally different in the human brain that would surpass a Turing machine's computational power (unless you believe in some higher powers, etc).

We are just meat-computers.

But at the same time, there is absolutely no indication or reason to believe that this wave of AI hype is the AGI one and that LLMs can be scaled that far. We know almost nothing about the nature of human intelligence, so we can't even really claim whether we are close or far.


This is a long read on things most people here know at least in some form. Could you point to a particular fragment or a quote?

> We went from 2 + 7 = 11 to "solved a frontier math problem" in 3 years, yet people don't think this will improve?

This is disingenuous... I don't think people were impressed by GPT 3.5 because it was bad at math.

It's like saying: "We went from being unable to take off and the crew dying in a fire to a moon landing in 2 years, imagine how soon we'll have people on Mars"


Self driving

If you let a million monkeys bash typewriters... something something a book.

This is not formally verified math, so there is no real verifiable-feedback aspect here. The best models for formalized math are still specialized ones, although general-purpose models can assist formalization somewhat.

Maybe to get a real breakthrough we have to make programming languages / tools better suited for LLM strengths, and not fuss so much about making them write code we like. What we need is correct code, not nice-looking code.

> programming languages / tools better suited for LLM strengths

The bitter lesson is that the best languages / tools are the ones for which the most quality training data exists, and that's pretty much necessarily the same languages / tools most commonly used by humans.

> Correct code not nice looking code

"Nice looking" is subjective, but simple, clear, readable code is just as important as ever for projects to be long-term successful. Arguably even more so. The aphorism about code being read much more often than it's written applies to LLMs "reading" code as well. They can go over the complexity cliff very fast. Just look at OpenClaw.


>> simple, clear, readable code is just as important as ever for projects to be long-term successful

Is it though? I'm a long-time code purist, but I am beginning to wonder about the assumptions underlying our vocation.


I guess it's hard to tell until we see more long-term AI-generated projects, but many of the ones we have so far (OpenClaw and OpenCode, for instance) are well known for their stability issues, and it seems "even more AI" is not about to fix that.

If you can’t validate the code, you can’t tell if it’s correct.

No?

That's literally the thing they suggested to move away from. That is just an issue when using tools designed for us.

Make them write in formal verification languages and we only have to understand the types.

To be clear, I don't think this is a good idea, at least not yet, but we do not have to always understand the code.


Lean might be a step in that direction.
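As a flavor of what that looks like, here is a trivially machine-checkable statement in Lean 4 (a minimal sketch; real formalizations are far more involved). The kernel verifies the proof term against the stated type, so a reviewer only has to trust the statement, and a wrong claim like `2 + 7 = 11` would simply fail to check:

```lean
-- The type is the claim; the proof is verified by the kernel.
theorem two_add_seven : 2 + 7 = 9 := by
  rfl
```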

Yes yes

Let it write a black box no human understands. Give the means of production away.


> But once you go beyond that to less defined things such as code quality

I think they have a good optimization target with SWE-Bench-CI.

You are tested on continuous changes to a repository, spanning multiple years in the original repository. Cumulative edits need to be kept maintainable and composable.

If there is something missing from "can be maintained for multiple years incorporating bugfixes and feature additions" as a definition of code quality, then more work is needed, but I think it's a good starting point.


Do we need all that if we can apply AI to solve practical problems today?

What is possible today is one thing. Sure people debate the details, but at this point it's pretty uncontroversial that AI tooling is beneficial in certain use cases.

Whether or not selling access to massive frontier models is a viable business model, or trillion-dollar valuations for AI companies can be justified... These questions are of a completely different scale, with near-term implications for the global economy.


Depends on the cost.

Except that's not how this specific instance works. In this case the problem isn't written in a formal language, and the AI's solution is not something one can automatically verify.

I mean, even if the technology stopped improving immediately and forever (which is unlikely), LLMs are already better than most humans at most tasks.

Including code quality. Not because they are exceptionally good (you are right that they aren't superhuman like AlphaGo), but because most humans are not that good at it anyway, and also somehow « hallucinate » out of tiredness.

Even today’s models are far from being exploited to their full potential, because we have developed pretty much no tools around them except tooling to generate code.

I’m also a long-time « doubter », but as a curious person I used the tool anyway, with all its flaws, over the last 3 years. And I’m forced to admit that hallucinations are pretty rare nowadays. Errors still happen, but they are very rare and it’s easier than ever to get the model back on track.

I think I’m also a « believer » now, and believe me, I really don’t want to be: as much as I’m excited by this, I’m also pretty frightened of all the bad things this tech could do to the world in the wrong hands, and I don’t feel like it’s particularly in the right hands.


LLMs already do unsupervised learning to get better at creative things. This is possible since LLMs can judge the quality of what is being produced.

LLMs can often guess the final answer, but the intermediate proof steps are always total bunk.

When doing math you only ever care about the proof, not the answer itself.


Yep, I remember a friend saying they did a maths course at university that had the correct answer given for each question - this was so that if you made some silly arithmetic mistake you could go back and fix it and all the marks were for the steps to actually solve the problem.

This would have greatly helped me. I was always at a loss as to which trick I had to apply to solve an exam problem, even while knowing the mathematics behind it. At some point you just had to add a zero that was actually part of a binomial that then collapsed the whole formula.

Not in this case: the LLM wrote the entire paper, and anyway the proof was the answer.

Once you have a working proof, no matter how bad, you can work towards making it nicer. It's like refactoring in programming.

If your proof is machine checkable, that's even easier.


That is also mostly how humans work. Once every full moon we may get an "intuition", but most of the time we lean on collective knowledge, biases and behavior patterns to make decisions, write and talk.

I haven't had success in getting AIs to output working proofs.

You'd need a completely different post-training and agent stack for that.


What’s funny is that there are total cranks in human form that do the same thing. Lots of unsolicited “proofs” being submitted by “amateur mathematicians” where the content is utter nonsense, but like a monkey with a typewriter, there’s the possibility that they stumble upon an incredible insight.

I mean, this is why everyone is making bank selling RL environments in different domains to frontier labs.


