In the early stages of large language models, success was mostly measured by English fluency and performance on benchmarks such as SQuAD or MMLU. Current expectations extend well beyond benchmark accuracy. Users now expect models to handle local traditions, cultural idioms, and region-specific knowledge reliably. A Tamil speaker inquiring about Pongal customs, a Gujarati speaker seeking general information on family law norms, or a Kannada speaker asking about local festivals expects culturally grounded responses rather than literal translations or generic text.
This shift in expectations has revealed a persistent gap in modern AI systems. They can produce surface-level fluency in many languages, but still lack the cultural context necessary for accurate interpretation. This gap highlights the need for a different class of benchmark.
In November 2025, OpenAI introduced IndQA, a benchmark designed to assess the ability of AI systems to understand and reason about Indian languages and cultural contexts. It is not a translation dataset. It is a reasoning challenge built from the ground up by human experts across India. It tests whether models truly understand people, customs, stories, traditions, and daily life in South Asia.
This newsletter examines what IndQA is, its significance, the current results it reveals, and the opportunities that lie ahead for builders seeking to create culturally aware AI systems.
IndQA (Indian Question-Answering) is a large benchmark designed to evaluate AI models on culturally grounded questions from India. According to OpenAI, the dataset comprises 2,278 questions spanning 12 languages and 10 cultural domains. The 12 languages cover ten Indian languages, Hinglish (a Hindi-English hybrid), and English, which is included because of its widespread use in Indian education, media, and public life.
Each question includes a native language prompt, an English translation for auditing, an expert-written ideal answer, and a rubric with weighted evaluation criteria.
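To make that structure concrete, here is a minimal sketch of how an IndQA-style item might be represented in code. The field names and types are illustrative assumptions for this newsletter, not OpenAI’s published schema:

```python
from dataclasses import dataclass, field

@dataclass
class IndQAItem:
    prompt_native: str    # question as authored in the target language
    prompt_english: str   # literal translation, used only for auditing
    ideal_answer: str     # expert-written gold-standard response
    language: str         # e.g., "Tamil" (hypothetical example value)
    domain: str           # e.g., "Food and cuisine"
    # Maps each rubric criterion to its relative weight in scoring.
    rubric_weights: dict[str, float] = field(default_factory=dict)
```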
The benchmark was created with the help of 261 domain experts across India. These experts include linguists, historians, writers, journalists, and scholars with deep knowledge of specific cultural regions. They designed questions that reflect authentic local knowledge, such as:
The varied symbolic uses of marigold flowers across regional festival traditions in India
The linguistic and literary impact of the Bhakti movement on specific regional languages
The context-dependent roles of community elders in village dispute resolution
The architectural differences between Odia and Dravidian temple styles
IndQA is therefore a cultural reasoning benchmark.
Before AI systems can become genuinely useful at a global scale, they must move beyond surface-level fluency and demonstrate real cultural reasoning. IndQA highlights how large language models often excel at grammar and vocabulary, yet fail at context-specific understanding. This gap is not a minor flaw. It cuts directly into trust, usability, and product readiness.
Linguistic diversity is the reality
India’s Constitution recognizes 22 scheduled languages, and hundreds of additional languages and dialects are spoken across the country. At least seven Indian languages have more than 50 million native speakers within India alone. More broadly, around 80 percent of the world’s population does not speak English as their primary language.
Yet most AI training data and evaluation benchmarks remain English-centered. This mismatch produces models that sound fluent but do not reason correctly about culture. IndQA directly targets this issue by evaluating the kinds of knowledge that translation benchmarks or factual datasets cannot capture.
Existing multilingual benchmarks are saturated
Models such as GPT-4o, Gemini 2.0, and Claude 3.5 Sonnet achieve near-ceiling performance on many multilingual benchmarks. When models begin maxing out these datasets, the benchmarks stop being useful for measuring progress. Researchers refer to this as benchmark saturation.
IndQA avoids this through adversarial filtering: during construction, candidate questions were run against strong models, and only the questions those models failed to answer well were kept in the final dataset, preserving headroom for future models.
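OpenAI has not published the filtering procedure in full, but the core idea can be sketched as follows. The grade function, the 0.5 cutoff, and the rule that every strong model must fail are assumptions for illustration, not the documented method:

```python
def adversarial_filter(candidate_items, strong_models, grade, threshold=0.5):
    """Keep only questions that strong reference models fail.

    `grade` is assumed to return a normalized score in [0, 1] for one
    model's answer to one item.
    """
    kept = []
    for item in candidate_items:
        best = max(grade(model, item) for model in strong_models)
        if best < threshold:   # even the best strong model scores poorly
            kept.append(item)  # retain: headroom for future models
    return kept
```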
Culture matters more than grammar
Fluency alone does not guarantee understanding. In Hindi, Bengali, Marathi, Tamil, and many other languages, the same word can carry different meanings depending on region, caste, or social context. A model that translates accurately may still misinterpret the intended meaning. IndQA evaluates these nuances directly.
IndQA is relevant for real products
A virtual tutor that cannot understand a Marathi idiom, an assistant that misinterprets a Punjabi religious ritual, or a chatbot that misunderstands a Tamil proverb will break user trust. IndQA encourages AI developers to consider real-world applications rather than relying solely on laboratory performance.
IndQA’s design combines native language prompts, expert-written ideal answers, and detailed scoring rubrics. It follows a formal evaluation pipeline similar to academic assessment frameworks and high-stakes educational testing. The structure includes four core components.
Native language prompts
All questions are authored directly in the target language by domain experts. They are not translated from English templates. Each prompt is crafted to reflect authentic cultural context, regional nuance, and naturally occurring phrasing rather than standardized textbook language. This ensures:
Natural linguistic variation
Inclusion of idioms, honorifics, and culturally grounded references
Realistic code-switching patterns where appropriate, such as those seen in Hinglish
This design choice avoids the distortions that occur when English-centric prompts are translated into other languages.
English translations
Each item includes an English translation to enable transparent auditing. These translations are provided for evaluators, not for the models, and are intentionally literal rather than culturally embellished. Their purpose is to:
Allow non-native evaluators to verify model behavior.
Avoid injecting new cultural context into the translated version.
Maintain consistency across the grading pipeline.
The translated version is never used as the primary prompt during evaluation.
Expert-written ideal answers
Every question includes a full ideal answer written by a subject matter expert. These ideal answers are not simple fact lists. They often contain:
A clear explanation of the underlying cultural or historical concept
References to regional distinctions
Correct usage of culturally specific terminology and register
Avoidance of stereotypes, oversimplifications, or insensitive framing
The ideal answer serves as the gold standard against which candidate responses are scored.
Rubric-based scoring
IndQA employs a weighted rubric for each question, a methodology derived from educational research and large-scale human evaluation studies. Each rubric typically contains:
Multiple criteria, such as factual accuracy, cultural specificity, contextual understanding, and sensitivity.
Weights for each criterion, reflecting their relative importance to the question.
Binary or graded scoring for each criterion (yes, partially, or no).
Summation of weighted points to produce a final numeric score.
This structure allows partial credit. A model may identify the correct concept but miss details, or provide context but omit culturally specific elements.
The total possible score varies by question depending on the number and weight of criteria. IndQA normalizes aggregated performance across questions for cross-model comparison.
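As a concrete illustration, the sketch below scores a single question against a hypothetical two-criterion rubric. The criterion names, weights, and the 0.5 partial-credit value are assumptions; IndQA’s actual rubrics and credit scheme may differ:

```python
# Credit per grader verdict; the 0.5 partial-credit value is an assumption.
CREDIT = {"yes": 1.0, "partially": 0.5, "no": 0.0}

def score_question(weights, verdicts):
    """Weighted rubric score for one question, normalized to [0, 1].

    `weights` maps each criterion name to its weight; `verdicts` maps the
    same names to a grader judgment of "yes", "partially", or "no".
    """
    earned = sum(w * CREDIT[verdicts[name]] for name, w in weights.items())
    possible = sum(weights.values())
    return earned / possible

# Worked example: full credit on the heavier criterion, partial on the other.
weights = {"identifies the correct festival context": 3.0,
           "uses culturally appropriate terminology": 2.0}
verdicts = {"identifies the correct festival context": "yes",
            "uses culturally appropriate terminology": "partially"}
print(score_question(weights, verdicts))  # (3.0 + 1.0) / 5.0 = 0.8
```

In this hypothetical case the model earns 4 of 5 weighted points, a normalized score of 0.8: credit for identifying the concept, a penalty for imprecise terminology.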
How scoring is operationalized
Although OpenAI did not publish every internal detail, public materials and reporting indicate that:
Human graders or judge models compare the candidate’s response to the ideal answer.
Each rubric criterion is evaluated independently.
Points are added only when the model meets the criterion fully or partially.
Errors such as cultural insensitivity or factual mistakes result in deductions.
The final score is aggregated across all criteria and normalized.
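Putting those steps together, one plausible shape for the grading loop is sketched below. It reuses score_question and the item fields from the earlier sketches; the judge stub stands in for a human grader or judge model, and taking a plain mean of normalized per-question scores is an assumption about aggregation, since the exact internal pipeline is not public:

```python
def judge(criterion, candidate_answer, ideal_answer):
    """Stand-in for a human grader or judge model: compares the candidate
    response with the expert ideal answer on one rubric criterion and
    returns "yes", "partially", or "no"."""
    raise NotImplementedError  # placeholder; the real grader is not public

def evaluate_model(items, generate):
    """Mean normalized rubric score for one model across IndQAItem objects.

    `generate` is the candidate model's answer function. Deductions for
    cultural insensitivity are assumed to surface as "no" verdicts here;
    a separate penalty term is omitted for simplicity.
    """
    scores = []
    for item in items:
        candidate = generate(item.prompt_native)  # model sees only the native prompt
        verdicts = {name: judge(name, candidate, item.ideal_answer)
                    for name in item.rubric_weights}  # criteria judged independently
        scores.append(score_question(item.rubric_weights, verdicts))
    return sum(scores) / len(scores)
```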
Why is this design significant?
IndQA deliberately avoids simple automated accuracy checks or translation-based grading. Instead, it measures whether a model can:
Reason about culturally grounded knowledge.
Avoid hallucinating invented cultural “facts.”
Maintain register, tone, and context appropriate to the region.
Understand relationships between customs, history, and daily practice.
This moves evaluation closer to how real multilingual users assess model quality.
IndQA evaluates model reasoning across 12 languages that reflect the linguistic, cultural, and script diversity of India. These languages represent the major language families used across the country and capture meaningful real-world variation in how people communicate with AI systems.
The following languages are included in the IndQA benchmark:
Bengali
English
Gujarati
Hindi
Hinglish
Kannada
Malayalam
Marathi
Odia
Punjabi
Tamil
Telugu
Note: Hinglish, the widely used Hindi-English code-mixed register, is intentionally included because it reflects how a large share of Indian users actually write and speak, both online and with AI systems.
These languages were selected because they represent both scale and diversity. Several of them are among the largest first-language communities in the world. Hindi, Bengali, Marathi, Telugu, Tamil, and Gujarati each have more than 50 million native speakers according to Indian census data. This scale makes them central to real-world AI usage in India and critical for evaluating cultural reasoning at the national level.
The set also includes languages with significant regional and cultural influence, languages written in distinct scripts, and the widely used code-mixed register Hinglish. Together, they span the Indo-Aryan and Dravidian language families and cover scripts including Devanagari, Bengali, Odia, Malayalam, Gurmukhi, and Latin.
Excluding Urdu is a missed opportunity.
Tens of millions of people across India and Pakistan speak Urdu. It has a rich literary tradition and is written in Nastaliq, a Perso-Arabic script style that modern models often struggle with. Excluding Urdu removes an important cultural language and a chance to test reasoning across a script family that receives less representation in AI evaluation.
IndQA spans 10 domains:
Architecture and design
Arts and culture
Everyday life
Food and cuisine
History
Law and ethics
Literature and linguistics
Media and entertainment
Religion and spirituality
Sports and recreation
These domains strike a balance between formal cultural knowledge and lived experience. They encompass a range of standard topics, including architecture, law, and religion, as well as everyday cultural practices, media consumption, and community traditions. These domains:
Cover multiple aspects of life and culture.
Require multi-step reasoning and a deep understanding of context in many questions.
Reflect what real users ask in Indian languages.
Evaluate whether models avoid stereotypes and cultural errors.
By combining diverse domains with multilingual coverage, IndQA tests abilities that extend far beyond translation or memorization, moving evaluation toward cultural intelligence.
IndQA provides one of the clearest pictures of how frontier models handle culturally grounded reasoning across Indian languages. The results show that even the strongest systems perform well below the levels they typically reach on English-centric benchmarks. This section summarizes model performance using three visualizations.
The chart below shows overall IndQA scores for leading frontier models. Each bar represents the normalized final score across all languages and domains.
These results make two points clear. First, IndQA is significantly more challenging than familiar multilingual or English benchmarks, many of which show near saturation. Second, the gap between current model capabilities and truly culture-aware reasoning is substantial.
Variation by language
Model performance varies across languages, although these differences cannot be interpreted as a ranking of language difficulty, as each language uses a distinct set of questions. The chart below visualizes these patterns.
Higher scores tend to appear in languages that have greater digital presence or richer training data availability, such as Hindi and Hinglish. Lower scores are observed in languages such as Bengali and Telugu. These patterns reflect the distribution of data rather than inherent differences in language complexity.
Because IndQA questions are linguistically and culturally specific within each language, cross-language comparison is explicitly invalid. A higher score in one language does not mean that the language is easier or that models perform better in an absolute sense.
Variation by domain
IndQA also reveals meaningful differences in performance across cultural domains. Some domains are more structured and knowledge-driven, while others require historical reasoning or nuanced cultural interpretation. The chart below highlights these patterns.
Domains such as Law and Ethics and Food and Cuisine tend to produce higher scores, which suggests that models handle structured cultural knowledge better than deeply historical or context-dependent topics. In contrast, domains such as History consistently show lower performance across models, indicating that historical narratives, regional heritage, and culturally embedded memories remain a significant challenge for current systems.
These domain-level breakdowns show that models do not simply “speak the language.” They must interpret cultural context, infer relationships, and avoid factual or cultural errors. IndQA exposes where those gaps remain.
IndQA is a crucial step toward culturally grounded evaluation, but like any benchmark, it has its limitations and areas for future development.
Limitations
The benchmark includes 12 languages, which represent major linguistic groups but do not capture India’s full range of regional languages, tribal languages, or dialects.
Questions differ across languages, which means scores cannot be used for direct cross-language comparison.
The evaluation focuses on single-turn question answering rather than multi-turn conversation, long context reasoning, or multimodal inputs.
Some languages have significantly less public digital data, which may contribute to lower model performance.
Adversarial filtering makes the benchmark intentionally hard, so low scores do not necessarily imply poor real-world usability but rather represent challenging test cases.
Future opportunities
Expanding the benchmark to cover more Indian languages and scripts, including those with smaller but culturally rich communities.
Introducing multimodal tasks such as evaluating images of festivals, architecture, or regional food paired with text.
Adding multi-turn conversational tasks to evaluate how models maintain cultural context across dialogue.
Including additional cultural domains such as folklore, traditional medicine, regional music, and occupational practices.
Supporting community contributions from universities, cultural institutions, and regional experts to extend domain coverage and cultural depth.
IndQA represents a significant shift in how AI systems are evaluated. It moves beyond translation quality or surface fluency to test whether a model can understand cultural meaning, context, and nuance. The results show that while frontier models have made progress, culturally grounded reasoning remains a frontier of its own. For builders aiming at a global or regional scale, IndQA offers a clear path for what to measure and where to improve.
Want to take the next step from understanding culturally grounded evaluation to building multilingual, context-aware AI systems? Take a look at the following course:
In this hands-on course, you will learn how to use OpenAI’s platform to develop intelligent, real-world AI applications. You’ll begin by exploring how AI development has evolved and gain practical coding experience with OpenAI’s APIs, setting a strong foundation for creative experimentation and applied problem-solving.

Next, you will explore OpenAI’s core capabilities in text, audio, images, and embeddings. You’ll learn to build conversational systems, use web search and function calling, process multimedia inputs, and evaluate model performance. In the process, you’ll develop the technical fluency required to connect models with real-world workflows.

Finally, you’ll learn to build and deploy agentic AI systems. You’ll create autonomous agents, design workflows visually with the Agent Builder, integrate ChatKit for user interfaces, and implement security and monitoring. By the end, you’ll be equipped to develop and ship reliable, production-grade AI applications.