Hacker News | sanxiyn's comments

Well, that's because this post is about Gen 13. In the FL2 post (presumably on the same Gen 12 servers), they report 25% lower latency.

In this case the code is public and you can see they are not cheating in that sense.

The harness seems extremely benchmark-specific, which gives them a huge advantage over what most models can use. This isn't a qualifying score for that reason.

Here is the ARC-AGI-3 specific harness by the way - lots of challenge information encoded inside: https://github.com/symbolica-ai/ARC-AGI-3-Agents/blob/symbol...


I agree it's not cheating in that restricted sense. But I'm not really convinced that it can't be cheating in a more general sense. You can try like 10^10 variations of harnesses and select the one that performs best. And if you then look at it, it probably won't look like it's cheating. But you have biased the estimator by selecting the harness according to its score.
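To make the selection effect concrete, here is a minimal simulation sketch. It assumes every harness variant has the same true success rate and that a score is just the fraction of tasks solved; the function names, the 25-task benchmark, and the 30% rate are illustrative assumptions, not anything taken from the ARC-AGI-3 setup.

```python
import random


def best_of_n_score(n_harnesses, n_tasks, true_rate, rng):
    """Best observed score among harnesses that all share one true success rate."""
    best = 0.0
    for _ in range(n_harnesses):
        # Each task is an independent coin flip with probability true_rate.
        solved = sum(rng.random() < true_rate for _ in range(n_tasks))
        best = max(best, solved / n_tasks)
    return best


def mean_best(n_harnesses, n_tasks=25, true_rate=0.3, trials=200, seed=0):
    """Average reported score when only the best-scoring harness is reported."""
    rng = random.Random(seed)
    total = sum(
        best_of_n_score(n_harnesses, n_tasks, true_rate, rng)
        for _ in range(trials)
    )
    return total / trials


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, round(mean_best(n), 3))
```

With one harness the average reported score sits near the true rate; as more variants are tried and only the best is kept, the reported score drifts upward even though no variant is actually better.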

Once the model has seen the questions and answers in the training stage, the questions are worthless. Only a test using previously unseen questions has merit.

They aren't training new models for this. This is an agent harness for Opus 4.6.

All traffic is monitored, and all signal sources are eventually incorporated into the training set in one way or another. The person you're responding to is correct: even a single API call to any AI provider is sufficient to discount future results from the same provider.

ok! So if someone uses an existing, checkpointed, open source model then the answer is yes the results are valid and it doesn't matter that the tests are public.

Yes, assuming the checkpoint was before the announcement & public availability of the test set.

You live in a conspiracy world. Those AI providers don't update the models that fast. You can try asking them to solve ARC-AGI-3 without a harness and see for yourself that they struggle just as they did yesterday.

Which part is the conspiracy? Be as concrete as possible.

They are definitely cheating: they have crafted prompts[1] that explain the game rules rather than having the model explore and learn.

1. https://github.com/symbolica-ai/ARC-AGI-3-Agents/blob/symbol...


Where do you see that? I only skimmed the prompts but don't see any aspects of any of the games explained in there. There are a few hints which are legitimate prior knowledge about games in general, though some look too inflexible to me. Prior knowledge ("Core priors") is a critical requirement of the ARC series; read the reports.

There is. The official leaderboard is without a harness, and the community leaderboard is with a harness. Read the ARC-AGI-3 Technical Paper for details.

I went through the technical paper again, and while they explain why they decided against the harness, I disagree with them - my take is that if harnesses are overfitting, then they should be penalized on the hidden test set.

Anyway, searching both in ARC-AGI's paper and website and directly on kaggle, I failed to find a with-harness leaderboard; can you please give the link?



Ah, it's based on this repo [0] and there's only 1 non-example submission there [1], from 2 weeks ago (so it only covers the preview games), and their schema doesn't have a field to show that it's only the preview, nor does the page properly parse the score or cost into the table. And the biggest issue is that apparently there's no validation whatsoever: submissions are never run on the hidden test games, so it is essentially useless as a comparison.

[0] https://github.com/arcprize/ARC-AGI-Community-Leaderboard
[1] https://github.com/arcprize/ARC-AGI-Community-Leaderboard/bl...


That works well. Anthropic wrote a writeup on it.

https://www.anthropic.com/engineering/harness-design-long-ru...


Yes, but a programming language with a proverbial sufficiently smart compiler. That is very useful.

Try writing an exhaustive spec for anything non-trivial and you might see the problem.

Been saying this for a while now. I work in aerospace, and I can tell you from firsthand experience that software engineers don't know what designing a spec is.

Aero, mechanical, and electrical engineers spend years designing a system. Design, requirements, reviews, redesign, more reviews, more requirements. Every single corner of the system is well understood before anything gets made. It's a detailed, time consuming, arduous process.

Software engineers think they can duplicate that process with a few skills and a weekend planning session with Claude Code. Because implementation is cheaper we don't have to go as hard as the mechanical and electrical folks, but to properly spec a system is still a massive amount of up front effort.


And software isn't as constrained by physics as hardware, which massively expands both the design space and the number of ways things can go wrong.

While I think all of your design choices are defensible, I do think you should release the full human baseline data. The second best action count is fine, but other choices are reasonable as well.

The study is likely "SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration". Regression rate plot is figure 6.

Read the study to understand what it is measuring and how it was measured. As I understand it, the parent's summary is fine, but you should understand the study before repeating it to others.

https://arxiv.org/abs/2603.03823


Observation 3


You can disable this at Settings > Capabilities > Memory > Search and reference chats.


Not really. Due to combinatorial explosion, some paths are hard to hit randomly in this kind of source code. I would have preferred that, after 2M random battles, the reference implementation had 99% code coverage rather than a 99% pass rate.

I don't know anything about Pokemon, but I briefly looked at the code. "weather" seemed like a self-contained thing I could potentially understand. Looking at https://github.com/vjeux/pokemon-showdown-rs/blob/master/src...

> NOTE: ignoringAbility() and abilityState.ending not fully implemented

So it is almost certain that, even with a 99.96% pass rate, the tests never hit a battle with a weather-suppressing Pokemon whose ability is ignored. A code-coverage-driven testing loop would have found and fixed this one easily.
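A toy sketch of why pass rate and coverage can diverge. Everything here is hypothetical: `reference`, `port`, and the two boolean flags are invented for illustration and do not mirror the actual pokemon-showdown-rs code, and tracking which input combinations were exercised is only a stand-in for real branch coverage.

```python
import itertools
import random


def reference(suppressed: bool, ignored: bool) -> str:
    """Toy reference: weather is cleared only by a suppressor whose ability is not ignored."""
    return "clear" if suppressed and not ignored else "rain"


def port(suppressed: bool, ignored: bool) -> str:
    """Toy port with a deliberate gap: it never looks at `ignored`."""
    return "clear" if suppressed else "rain"


def random_campaign(trials, p_rare=0.001, seed=0):
    """Run random differential tests; return (pass rate, input combinations exercised)."""
    rng = random.Random(seed)
    covered = set()
    passed = 0
    for _ in range(trials):
        # Both flags are rare, so the (True, True) combination is very rare.
        case = (rng.random() < p_rare, rng.random() < p_rare)
        covered.add(case)
        passed += port(*case) == reference(*case)
    return passed / trials, covered


def uncovered(covered):
    """List the flag combinations the campaign never exercised."""
    all_cases = set(itertools.product([False, True], repeat=2))
    return sorted(all_cases - covered)


if __name__ == "__main__":
    rate, covered = random_campaign(10_000)
    print("pass rate:", rate)
    print("never exercised:", uncovered(covered))
```

The pass rate stays near 100% because the buggy combination is almost never generated, while the list of unexercised combinations points directly at the case worth testing on purpose.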


Good catch. I should really look at the code before commenting on it.


Yes, but in principle it isn't that different from running on Trainium or Inferentia (it's a matter of degree), and plenty of non-AI organizations adopted Trainium/Inferentia.

