A good benchmark is not simply a set of unit tests.
What you want in a benchmark is a set of tasks that measures general improvement: a better score should mean a particular failure mode has become less likely. Doing this in a way that generalizes beyond specific sub-problems, or even beyond the specific inputs in the benchmark suite, is difficult. Building a benchmark suite that's large and comprehensive enough that generalization isn't necessary is also a challenge.
Think about an analogy to software security. Exploiting a SQL injection vulnerability in insecure code is easy. Coming up with a set of unit tests that ensures an entire black-box software system is free of SQL injection vulnerabilities is quite a bit more difficult. It's red teaming versus blue teaming, except the blue team doesn't get source code in this case, so the security guarantee has to come from unit tests alone, not systematic design decisions. Just like in software security, knowing that you've systematically eliminated a problem is much more difficult than finding one instance of the problem.
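To make the asymmetry concrete, here is a minimal sketch in Python using the standard library's sqlite3 module. Everything in it is made up for illustration: the `users` table, the `lookup_secret` function standing in for the black box, and the three-test suite. The black box rejects one known attack string but still builds its query by string concatenation, so the suite passes while a slightly different input leaks every row.

```python
import sqlite3


def make_db():
    # Hypothetical in-memory database standing in for the system's storage.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def lookup_secret(conn, name):
    # Hypothetical black box: it blocks one known attack string,
    # but still concatenates user input into the SQL text.
    if "' OR '1'='1" in name:
        raise ValueError("rejected")
    query = "SELECT secret FROM users WHERE name = '" + name + "'"
    return [row[0] for row in conn.execute(query)]


def passes_unit_tests(conn):
    # Blue team without source code: only inputs and outputs are visible.
    if lookup_secret(conn, "alice") != ["a-secret"]:
        return False
    if lookup_secret(conn, "nobody") != []:
        return False
    try:
        lookup_secret(conn, "x' OR '1'='1")
    except ValueError:
        pass  # the known attack string is rejected, as the test expects
    else:
        return False
    return True


conn = make_db()
suite_passed = passes_unit_tests(conn)

# Red team: one variant the suite never tried. The trailing "--" comments
# out the closing quote, and "OR 1=1" matches every row.
leaked = lookup_secret(conn, "x' OR 1=1 --")

print("suite passed:", suite_passed)
print("leaked secrets:", sorted(leaked))
```

The suite here is deliberately tiny, but adding more attack strings only moves the problem: each new test rules out one input, not the class of inputs. The systematic fix (a parameterized query) is a design decision inside the box, which is exactly what a black-box test suite can't see.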