Here’s the problem: nearly half of the questions on these AI benchmarks already appear in the models’ training data. Imagine sitting an exam after the teacher handed you last year’s answer sheet. That’s what’s happening here. GPT-4, for example, can guess the right answer more than half the time even when the answer is hidden from it, simply because it has seen the question before. Most of these benchmarks have been public for years, and the models have already absorbed them during training. So a high score might just mean the AI is a good memorizer, not a good thinker.
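To make the memorization point concrete, here is a rough sketch of the kind of probe researchers use to check for contamination: hide the gold answer, ask the model to reproduce it, and count how often it matches exactly. This is an illustration, not the paper’s procedure, and `query_model` is a hypothetical placeholder for whatever model API you actually call.

```python
import random

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call (e.g., a vendor API request)."""
    raise NotImplementedError

def memorization_probe(items, n_samples=100, seed=0):
    """Estimate how often the model reproduces a hidden gold answer verbatim.

    `items` is a list of dicts with "question" and "answer" keys.
    """
    rng = random.Random(seed)
    sample = rng.sample(items, min(n_samples, len(items)))
    hits = 0
    for item in sample:
        # Show only the question, with the gold answer (and any multiple-choice
        # options) withheld, so the model can't lean on provided candidates.
        prompt = (
            "Complete this benchmark item from memory.\n"
            f"Question: {item['question']}\n"
            "Answer:"
        )
        guess = query_model(prompt).strip().lower()
        if guess == item["answer"].strip().lower():
            hits += 1
    return hits / len(sample)
```

If the model recovers hidden answers far more often than blind guessing would allow, that is a strong hint the benchmark leaked into its training corpus.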

It gets worse. The teams building these models often cherry-pick their best results, or keep reusing the same old tests until the models ace them. Take SuperGLUE: it was designed to be tough, yet models hit its ceiling almost immediately. And when a big-name model was accused of improving on a benchmark simply because the test data had been fed into its training set, the episode showed how easily the system can be gamed.

So, if you can’t trust the headline numbers, what should you do? The paper suggests a few questions you should always ask before you believe the hype:

  • Was the model trained before or after this benchmark was published?
  • Does the vendor report scores across multiple independent evaluations, or cherry-pick favorable results?
  • Can they demonstrate capability on fresh, unreleased test sets rather than public benchmarks?
  • What is the methodology behind reported scores?

The solution? Stop letting the test-takers grade their own exams. The paper argues for independent, community-run evaluations whose test items keep changing, so models can’t simply memorize the answers. One proposal, called PeerBench, keeps the test items sealed and only reveals them after the evaluation has run. It isn’t perfect yet, but it’s a step in the right direction.
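To give a flavor of how “sealed until after the fact” can work in practice, here is a minimal commit-and-reveal sketch. To be clear, this is not PeerBench’s actual protocol; it is just the general idea of publishing a cryptographic fingerprint of the sealed test items before anyone is scored, then releasing the items afterward so outsiders can verify nothing was swapped.

```python
import hashlib
import json

def commit(items: list[dict], salt: str) -> str:
    """Digest published before the evaluation; the items and salt stay private."""
    payload = json.dumps(items, sort_keys=True) + salt
    return hashlib.sha256(payload.encode()).hexdigest()

def verify(items: list[dict], salt: str, published_digest: str) -> bool:
    """After scoring, the items and salt are revealed so anyone can re-check."""
    return commit(items, salt) == published_digest
```

The evaluator posts `commit(test_items, salt)` ahead of time, runs the models on the still-sealed items, then releases the items and the salt so third parties can confirm the test was fixed before the scores came out.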

Why does this matter? Because you need to know what you’re really buying. Don’t just take the vendor’s word for it. Ask for proof that the model can handle new, unseen problems. If you’re a researcher, remember that a jump in scores might just mean the model got better at remembering, not thinking. And if you’re making the rules, don’t assume that public test scores tell the whole story. After all, we don’t let doctors or pilots grade their own exams. So why should we let AI companies do it?

Read the position paper on arXiv