Yes, I think that we really don't have a good way of benchmarking these systems.

For example, GPT-3.5-turbo apparently beats davinci on every benchmark that OpenAI has, yet anecdotally most people who try to use them both end up strongly preferring davinci despite the much higher cost.

Presumably, this is what OpenAI is trying to resolve with their 'Evals' project, but based on what I have seen so far it won't help much.