On Tuesday, OpenAI unveiled IndQA, a benchmark designed to evaluate how well artificial intelligence (AI) models understand and reason about questions rooted in Indian languages and cultural contexts.
IndQA marks the AI giant's first focused effort to create a region-specific benchmark, and OpenAI aims to build similar benchmarks for other languages and regions going forward.
Srinivas Narayan, CTO of B2B applications at OpenAI, noted that India was selected "as an obvious starting point given its market size, linguistic diversity with approximately one billion people who don't use English as their primary language, and cultural richness."
Meanwhile, India represents OpenAI's second-largest market for ChatGPT, which counts roughly 800 million weekly active users globally.

IndQA employs a rubric-based approach in which each response is graded against criteria written by domain experts for that specific question. These criteria outline what an ideal answer should include or avoid, and each criterion carries a weighted point value based on its importance. A model-based grader then checks whether each criterion is met, with the final score calculated as the sum of points earned out of the total possible, as sketched below.

The benchmark currently comprises 2,278 questions spanning 11 Indian languages (Hindi, Hinglish, Gujarati, Punjabi, Kannada, Odia, Marathi, Malayalam, Tamil, Bengali, and Telugu) and 10 cultural domains (Law and ethics, Architecture and design, Food and cuisine, Everyday life, Religion and spirituality, Sports and recreation, Literature and linguistics, Media and entertainment, Arts and culture, and History), developed in collaboration with 261 domain experts including journalists, linguists, scholars, artists, and industry practitioners.
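OpenAI has not published grading code alongside the announcement, but the scoring scheme described above reduces to a weighted checklist. A minimal sketch in Python, with hypothetical criterion descriptions and weights, might look like this:

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str  # what an ideal answer should include or avoid
    weight: float     # point value assigned by the domain expert
    met: bool         # the model-based grader's judgment on this criterion

def rubric_score(criteria: list[Criterion]) -> float:
    """Score a response as points earned out of total possible points."""
    total = sum(c.weight for c in criteria)
    earned = sum(c.weight for c in criteria if c.met)
    return earned / total if total else 0.0

# Hypothetical rubric for a single question, applied to one model response.
rubric = [
    Criterion("Identifies the correct literary tradition", weight=3.0, met=True),
    Criterion("Explains the regional and historical context", weight=2.0, met=True),
    Criterion("Avoids conflating it with a similar tradition", weight=1.0, met=False),
]
print(f"{rubric_score(rubric):.1%}")  # 83.3%
```

On this made-up rubric, the response earns 5 of 6 weighted points, or about 83 percent; IndQA's reported scores aggregate such per-question results across the benchmark.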
Each question underwent adversarial filtering: it was tested against OpenAI's strongest models at the time of its creation (GPT-4o, OpenAI o3, and GPT-4.5, with GPT-5 added for questions written after its public launch), and only questions those models struggled to answer well were retained.
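The exact retention rule is not spelled out here, so the cutoff and grader below are illustrative assumptions; a rough sketch of the filtering idea, assuming a question is kept only when no frontier model clears the bar, could look like this:

```python
import random

# The model list follows the article; the pass/fail cutoff and the grader
# below are assumptions for illustration, not OpenAI's published pipeline.
FRONTIER_MODELS = ["gpt-4o", "o3", "gpt-4.5"]  # GPT-5 joined after its launch
KEEP_THRESHOLD = 0.5  # assumed rubric-score cutoff

def grade_response(model: str, question: str) -> float:
    """Stand-in for the rubric-based grader; returns a score in [0, 1]."""
    return random.random()

def is_hard_enough(question: str) -> bool:
    """Retain a question only if no frontier model clears the cutoff."""
    return all(grade_response(m, question) < KEEP_THRESHOLD
               for m in FRONTIER_MODELS)

candidates = ["question 1 ...", "question 2 ...", "question 3 ..."]
benchmark = [q for q in candidates if is_hard_enough(q)]
print(f"kept {len(benchmark)} of {len(candidates)} candidate questions")
```

Filtering of this kind explains why even top models score low on IndQA: questions that frontier models could already answer well were removed by design.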
According to the benchmark results, OpenAI’s GPT-5 Thinking High model scored highest with 34.9%, followed by Google’s Gemini 2.5 Pro Thinking at 34.3%, Gemini 2.5 Flash Thinking (29.7%), Grok 4 (28.5%), OpenAI o3 High (28.1%), Gemini 2.5 Flash No Thinking (26.1%), OpenAI o3 (23.3%), GPT-4o (20.3%), and GPT-4 Turbo (12.1%).
A breakdown by language showed that GPT-5 Thinking High generally outperformed Gemini 2.5 Pro and Grok 4 across most of the benchmark's languages. All models scored highest in Hindi and Hinglish, with GPT-5 reaching approximately 45% and 44% respectively, while performance was lowest across the board in Bengali and Telugu.
However, OpenAI cautions that IndQA should not be interpreted as a language leaderboard. Since questions are not identical across languages, cross-language scores cannot be directly compared. Instead, the company plans to use IndQA to measure improvement over time within a model family or configuration.
In the domain-specific analysis, GPT-5 Thinking High showed its strongest performance in ‘Law and ethics’ (42%), while Google’s Gemini 2.5 Pro notably scored highest of all models in ‘Literature and linguistics’ (41%). All models demonstrated their lowest performance in the ‘History’ domain.
The company said it is committed to improving performance on these metrics and will share IndQA results for future model releases.
