By Rohit Kumar Singh

We live in an age captivated by the rapid ascent of artificial intelligence (AI). Machines that can write poetry, generate stunning artwork, and even hold conversations are becoming commonplace. It feels like we are on the cusp of something revolutionary. But how do we actually know how smart these AI tools are becoming? How do we measure their progress? Just like students take exams, AI developers rely on tests called “benchmarks” to grade their creations. These benchmarks have become the de facto report card for AI, guiding trillions of dollars in investment and shaping the future of the technology.

But what if the tests are flawed? What if the report card isn’t telling the whole story? Imagine using a third-grade spelling test to assess a university professor’s overall intellect. They would ace it, sure, but it wouldn’t tell you much about their ability to conduct complex research or lecture on quantum physics. According to a growing chorus of experts, we might be facing a similar situation with AI. The benchmarks we have relied on, some with rather colourful acronyms like “HellaSwag”, are increasingly seen as inadequate rulers for measuring the burgeoning capabilities of modern AI.

The ‘Wild West’ of AI testing

Researchers are sounding the alarm. Many common benchmarks, they argue, are “easily gamed, outdated, or do a bad job of taking stock of a model’s actual skills”. Think of it like intensive IIT-JEE coaching at Kota: AI models can become very good at scoring high on specific benchmarks without necessarily developing broader, more flexible intelligence. A revealing study called BetterBench evaluated popular AI tests and found that their quality left much to be desired. Anka Reuel of the Stanford Institute for Human-Centered AI paints a stark picture, describing the current situation as “kind of like the Wild West when it comes to benchmarks”.

When AIs ace the test

This reliance on outdated tests becomes particularly problematic as AI models grow more capable at lightning speed. Alice Gatti, a researcher at the Center for AI Safety, notes that advanced AIs are now “routinely ‘acing’ earlier benchmarks like MMLU (Massive Multitask Language Understanding)”, a previously challenging test covering diverse subjects. When the best student in the class gets 100% on every test, the tests stop being useful for measuring further growth. To address this, Gatti and her colleagues developed a formidable new benchmark called “Humanity’s Last Exam” (HLE). They gathered nearly 3,000 complex multiple-choice and short-answer questions from leading experts across numerous fields — questions designed to be difficult even for human specialists and specifically “Google-proofed” to prevent simple look-ups. For now, HLE reveals that the best AIs still struggle with truly expert-level reasoning.

Measuring answers vs. asking questions

Perhaps the biggest challenge lies in what we are measuring. Are we assessing true understanding, reasoning, and creativity, or just the ability to regurgitate information and find patterns? True intelligence isn’t just about having the right answers; it’s also about curiosity, critical thinking, formulating new ideas, and understanding context. Our current benchmarks often fall short of evaluating these deeper cognitive abilities. We need tests that probe not just what AI knows, but how it thinks.

Why should the average citizen care about any of this?

Because the benchmarks being used today aren’t just academic tools. They directly influence how AI is adopted in everything from education and healthcare to criminal justice and financial services. If an AI system is labelled “safe” or “human-level” based on weak tests, it could be deployed in ways that harm people or reinforce bias. In India, where AI is being integrated into governance, welfare delivery, and digital public infrastructure, the risks are even more acute. Without robust, context-sensitive benchmarks, we risk importing flawed models from global tech giants and deploying them in environments they were never designed for. What’s needed is not just stronger evaluation standards, but Indian participation in creating and governing them.

As AI becomes central to public decision-making, our frameworks for evaluating it must evolve. We need benchmarks that are not only harder but also smarter — tests that reflect the complexity of human language, values, and context. That means involving ethicists, domain experts, and yes, everyday users — not just engineers — in the design of these tests. The old saying goes: “What gets measured gets managed.” If we measure AI with the wrong yardsticks, we will manage it badly. And in a world where AI is making life-changing decisions — from who gets a loan to how a disease is diagnosed — we can’t afford that.

So, the next time you hear that an AI system passed some test with flying colours, ask a different question: was it the right test?

The writer is member, National Consumer Disputes Redressal Commission.

Disclaimer: Views expressed are personal and do not reflect the official position or policy of FinancialExpress.com. Reproducing this content without permission is prohibited.