No, I think the "reasoning" step really does make a difference here.
There's more than just next token prediction going on. Those reasoning chains of thought have undergone their own reinforcement learning training against a different category of samples.
They've seen countless examples of how a reasoning chain should look for calculating a mortgage, searching for a flight, or debugging a Python program.
So I don't think it's accurate to describe the eventual result as "just next token prediction". It's next token prediction informed by a chain of thought, which was itself shaped by a different set of specially chosen examples.
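To make the distinction concrete, here's a toy sketch (purely illustrative; the `<think>` markers, the canned strings, and the function names are made up, not any real model or API). Both the reasoning trace and the final answer come out of the same autoregressive next-token loop; what the RL-on-chain-of-thought stage changes is which continuations the weights favour.

```python
# Toy sketch only: a stand-in "model" with canned continuations, so the
# example runs without any real weights. The point: the reasoning trace
# and the answer are produced by the same next-token loop; RL on
# chain-of-thought samples only changes which continuations get favoured.

def next_token(context):
    """Stand-in for a trained model's next-token distribution.
    A real model scores the whole vocabulary given the context;
    here we just return canned continuations keyed on the context."""
    canned = {
        (): "<think>",
        ("<think>",): "monthly rate = 5% / 12",
        ("<think>", "monthly rate = 5% / 12"): "</think>",
        ("<think>", "monthly rate = 5% / 12", "</think>"): "Your payment is $X.",
    }
    return canned.get(tuple(context), "<eos>")

def generate():
    """One autoregressive loop produces the trace and the answer alike."""
    out = []
    while True:
        tok = next_token(out)
        if tok == "<eos>":
            break
        out.append(tok)
    return out

tokens = generate()
split = tokens.index("</think>")
print("reasoning trace:", tokens[1:split])      # ['monthly rate = 5% / 12']
print("final answer:   ", tokens[split + 1:])   # ['Your payment is $X.']
```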
Do you believe a given set of model weights could be produced by infinitely many different sets of training examples?
If not, why not? Explain.
If so, how does your argument address the fact that this implies any given "reasoning" model could be trained without ever being given a single example of what you would consider "reasoning"? (In fact, a "reasoning" model could, in principle, arise by random chance.)