I have a theory: all these people reporting degrading model quality over time ar...

vintermann · 2025-09-20T18:59:11 1758394751

They test specific prompts with temperature 0. It is of course possible that all their test prompts were lucky, but even then, shouldn't you see an immediate drop followed by a flat or increasing line?

Also, from what I understand from the article, it's not a difficult task but an easily machine checkable one, i.e. whether the output conforms to a specific format.
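For illustration, a machine-checkable format test of that kind could look like the sketch below. The article does not publish its harness, so everything here is an assumption: the required JSON keys, the function names `conforms` and `pass_rate_by_day`, and the shape of the stored records are all hypothetical.

```python
import json
from collections import defaultdict

# Hypothetical output format; the article does not say what format it checks.
REQUIRED_KEYS = {"answer", "confidence"}


def conforms(output: str) -> bool:
    """Return True if output is a JSON object with exactly the required keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and set(parsed) == REQUIRED_KEYS


def pass_rate_by_day(records):
    """Map each day to the fraction of its outputs that conform.

    records: iterable of (day, output) pairs collected from repeated runs.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for day, output in records:
        total[day] += 1
        passed[day] += conforms(output)  # True counts as 1, False as 0
    return {day: passed[day] / total[day] for day in total}
```

Plotting the per-day pass rate from something like this is the kind of graph people in the thread are asking for.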

Spivak · 2025-09-20T21:02:27 1758402147

If it were random luck, wouldn't you expect about half the answers to be better? Assuming the OP isn't lying, I don't think there's much room for luck when you get all the questions wrong on a T/F test.
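To put a rough number on that, here is a minimal sketch. It assumes that, if only noise were at play, each prompt would independently be as likely to look better as worse; that is a simplifying assumption, not something the article states. The function name `prob_at_least_k_worse` is made up for this example.

```python
from math import comb


def prob_at_least_k_worse(n: int, k: int, p: float = 0.5) -> float:
    """Probability that at least k of n independent prompts get worse,
    when each gets worse with probability p (binomial tail)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))
```

Under that assumption, the chance of all 10 out of 10 prompts getting worse by luck alone is `0.5**10`, a bit under 0.1%.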

lostmsu · 2025-09-21T04:00:24 1758427224

With T=0 on the same model you should get the exact same output text. If they are not getting that, other environmental factors invalidate the test result.
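One way to check that, sketched under assumptions: store the output of every run for a fixed prompt and compare hashes against the first run. The names `fingerprint` and `first_divergence` and the `(label, output)` record shape are hypothetical, not taken from the article.

```python
import hashlib


def fingerprint(text: str) -> str:
    """SHA-256 hex digest of the output text, used as a compact identity."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def first_divergence(runs):
    """Return the label of the first run whose output differs from the
    first run's output, or None if every run matches.

    runs: list of (label, output) pairs in chronological order.
    """
    if not runs:
        return None
    baseline = fingerprint(runs[0][1])
    for label, output in runs[1:]:
        if fingerprint(output) != baseline:
            return label
    return None
```

A mismatch only says that something changed; by itself it does not tell you whether that was the model or the environment around it.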

nothrabannosir · 2025-09-20T18:59:22 1758394762

TFA is about someone running the same test suite with 0 temperature and fixed inputs and fixtures on the same model over months on end.

What’s missing is the actual evidence, which I would love to see, of course. But assuming they’re not actively lying, this is not as subjective as you suggest.

chaos_emergent · 2025-09-20T18:45:28 1758393928

Yes, exactly. My theory is that the novelty of a new generation of LLMs’ performance tends to inflate people’s perceptions of the model, which then revert to a better-calibrated expectation over time. If the developer reported numerical evaluations that drifted over time, I’d be more convinced of model change.

zzzeek · 2025-09-20T18:58:48 1758394728

Your theory does not hold up for this specific article, as they carefully explained that they are sending identical inputs into the model each time and observing progressively worse results with other variables unchanged. (Though to be fair, others have noted they provided no replication details as to how they arrived at these results.)

gtsop · 2025-09-20T20:42:25 1758400945

I see your point, but no, it's getting objectively worse. I have a similar experience from casually using ChatGPT for various use cases: when 5 dropped, I noticed it was very fast but oddly got some details wrong. As time went on, it became slower and the output deteriorated.

yieldcrv · 2025-09-20T18:46:40 1758394000

fta: “I am glad I have proof of this with the test system”

I think they have receipts, but did not post them there

Aurornis · 2025-09-20T18:58:26 1758394706

A lot of the people making these claims say they have proof, but the details are never shared.

Even a simple graph of the output would be better than nothing, but instead it’s just an empty claim.

yieldcrv · 2025-09-20T19:41:27 1758397287

That's been my experience too

but I use local models, sometimes the same ones for years, and their consistency is noteworthy, while I have doubts about the consistency of quality I get from closed models in the cloud. I don't see these kinds of complaints from people using local models, which undermines the idea that people were just wowed three months ago and are less impressed now.

so perhaps it's just a matter of transparency

but I think there is continual fine-tuning occurring, alongside filters added and removed in an opaque way in front of the model

colordrops · 2025-09-20T18:46:32 1758393992

Did any of you read the article? They have a test framework that objectively shows the model getting worse over time.

Aurornis · 2025-09-20T18:59:42 1758394782

I read the article. No proof was included. Not even a graph of declining results.

colordrops · 2025-09-21T16:55:54 1758473754

Ok fair, but not including the data is not the same as the article saying it was subjective "feel".