In 2024, AI made astonishing progress across the board. Top models have improved by roughly 130 ELO points on public leaderboards, while costs for running those same models have dropped many fold. Smaller, low-latency models now offer near–state-of-the-art performance (see Gemini Flash, for example). Multimodal capability is increasingly the norm, and strong reasoning skills are widely expected. New “hard problem” leaderboards have emerged, and they, too, have seen major gains. Long-context models have become standard, and mainstream products like Google Search now integrate AI by default.
Overall, the field has seen tremendous progress -- whether or not most users realize just how far we’ve come. Marcus's predictions don't sound specific enough -- no GPT5? Correct, but what does that even mean?
Personally, I see a lot of benchmarks come and go, with scores on those benchmarks always creeping upward, but my practical day-to-day experience of using LLMs has remained pretty lukewarm.
The things they were good at a year ago they’re still good at, and what they were bad at they’re still bad at.
The products and infrastructure around them are better. Claude Artifacts are cool, for example. o1 has some clever prompting under the hood.
I just don’t know how much stock we should put in benchmarks.
I think the jump from ChatGPT to GPT4 was also something like 130 ELO points. This roughly equals 2/3 preference for the new model, 1/3 preference for the old model. That's roughly how much top models have improved for an "average query".
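The Elo-to-preference conversion above can be sanity-checked with the standard Elo expected-score formula, where "preference" just means the pairwise win rate in head-to-head comparisons (a minimal sketch, not tied to any particular leaderboard's exact methodology):

```python
def elo_win_prob(rating_diff: float) -> float:
    """Standard Elo expected score: probability that the higher-rated
    model wins a pairwise comparison, given the rating gap."""
    return 1.0 / (1.0 + 10 ** (-rating_diff / 400.0))

# A 130-point gap works out to roughly a 2/3 preference rate.
print(round(elo_win_prob(130), 3))  # ≈ 0.679
```

So a 130-point gain on a leaderboard means that when raters compare the two models on the same prompt, the newer one is preferred about two times out of three.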
Now, none of us enter average queries :) We all have specific use cases in mind. So any specific user's mileage will vary.
I personally feel substantial improvement. I don't feel the need to google every LLM answer 5 times to feel confident about it :)
I have been misled several times in very subtle ways when asking specific questions of ChatGPT and Claude.
It actually burned me at work. It made me look bad and caused my project to take an extra sprint because of incorrect info about the web History API that came with working examples.
I just can’t ever trust information from these systems unless they include links to first-party docs.