In 2024, AI made astonishing progress across the board. Top models have improved by roughly 130 ELO points on public leaderboards, while costs for running those same models have dropped many fold. Smaller, low-latency models now offer near–state-of-the-art performance (see Gemini Flash, for example). Multimodal capability is increasingly the norm, and strong reasoning skills are widely expected. New “hard problem” leaderboards have emerged, and they, too, have seen major gains. Long-context models have become standard, and mainstream products like Google Search now integrate AI by default.
Overall, the field has seen tremendous progress -- whether or not most users realize just how far we’ve come. Marcus's predictions don't sound specific enough -- no GPT5? Correct, but what does that even mean?
Personally, I see a lot of benchmarks come and go, with scores on those benchmarks always creeping upward, but my practical day-to-day experience of using LLMs has remained pretty lukewarm.
The things they were good at a year ago they’re still good at, and what they were bad at they’re still bad at.
The products and infrastructure around them are better. Claude Artifacts are cool, for example. o1 has some clever prompting under the hood.
I just don’t know how much stock we should put in benchmarks.
I think the jump from ChatGPT to GPT4 was also something like 130 ELO points. This roughly equals 2/3 preference for the new model, 1/3 preference for the old model. That's roughly how much top models have improved for an "average query".
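The Elo-to-preference conversion above can be sanity-checked with the standard Elo expected-score formula, where "preference" just means the pairwise win rate in head-to-head comparisons (a minimal sketch, not tied to any particular leaderboard's exact methodology):

```python
def elo_win_prob(rating_diff: float) -> float:
    """Standard Elo expected score: probability that the higher-rated
    model wins a pairwise comparison, given the rating gap."""
    return 1.0 / (1.0 + 10 ** (-rating_diff / 400.0))

# A 130-point gap works out to roughly a 2/3 preference rate.
print(round(elo_win_prob(130), 3))  # ≈ 0.679
```

So a 130-point gain on a leaderboard means that when raters compare the two models on the same prompt, the newer one is preferred about two times out of three.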
Now, none of us enter average queries :) We all have specific use cases in mind. So any specific user's mileage will vary.
I personally feel substantial improvement. I don't feel the need to google every LLM answer 5 times to feel confident about it :)
I have been misled several times in very subtle ways when asking specific questions of ChatGPT and Claude.
It actually burned me at work. It made me look bad and caused my project to take an extra sprint because of incorrect info about the web History API that came with working examples.
I just can’t ever trust information from these systems unless they include links to first-party docs.