Why AI Performance Varies Across Languages
The structural reasons why “multilingual” AI solutions underperform in most languages, the implications for users worldwide, and what product builders can do to close the gap
AI will have an immense impact on people worldwide, but leading AI models don’t deliver the same results across languages. Most people, including product builders themselves, assume that “multilingual” models will “work” effectively because they seem fluent, but this isn’t a reliable indicator of how they actually perform. There’s a real performance gap between languages, and it creates serious disadvantages for people on the wrong side of it.
AI solutions are only truly useful when they can interpret what people mean, not just what they say. However, the models powering AI systems are primarily designed around a few languages. They underperform in most languages used by the global majority. Whether you build or use AI products, understanding how language influences AI capabilities is critical for recognizing what these technologies can really do in a global context.
Why Performance Gaps Exist Between Languages
Large language models (LLMs) are trained on huge amounts of data, most of which is scraped from the internet. However, the composition of online content does not reflect the distribution of spoken languages. Almost 50% of online content is in English, yet only ~20% of the world’s population speaks it. This disparity matters because a model’s training data strongly influences its behavior. In a sense, LLMs’ “intelligence” is primarily shaped by the English language. The structures, patterns, and signals encoded within the language itself likely get encoded within the model.
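To make the mismatch concrete, the two figures above can be turned into a simple representation ratio: a language’s share of online content divided by its share of the world’s speakers. This is only a back-of-the-envelope sketch. The function name is made up for illustration, and lumping every other language into one bucket (the remaining 50% of content and 80% of people) is a simplifying assumption that ignores multilingual speakers.

```python
def representation_ratio(content_share: float, speaker_share: float) -> float:
    """Share of online content divided by share of the world's speakers.

    A value above 1 means the language is overrepresented online relative
    to how many people speak it; below 1 means it is underrepresented.
    """
    return content_share / speaker_share


# Figures from the paragraph above: ~50% of content, ~20% of speakers.
english = representation_ratio(0.50, 0.20)

# Simplifying assumption: everything else shares the remaining content
# and the remaining speakers.
all_others = representation_ratio(1 - 0.50, 1 - 0.20)

print(f"English: {english:.2f}x")
print(f"All other languages combined: {all_others:.3f}x")
print(f"Gap between the two: {english / all_others:.1f}x")
```

Under these assumptions, English is overrepresented by roughly 2.5x, while everything else combined sits at about 0.6x, a fourfold difference before data quality is even considered.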
When a model can’t handle variations in how the same information is communicated between different languages, it can fail to interpret context correctly. While leading LLMs can technically operate in other languages, their performance generally tends to degrade in non-English languages. Performance is improving across many languages as models improve, but users generally get better results when interacting with frontier models in English. Two structural causes explain the gap: a lack of high-quality training data in non-English languages and fundamental model limitations resulting from how models are built.
1) Limited Multilingual Training Data
There’s a lack of high-quality training data in many languages.
LLMs generally have a greater understanding of the dominant language in their training data. Researchers found that frontier models work well for the 1.52 billion people who speak English, but underperform for hundreds of millions of people who speak low-resource languages where there isn’t enough high-quality training data. This doesn’t always mean the raw data doesn’t exist, since people may still have online conversations and produce content in these languages, but labeled training data may not be readily available.
A language may be low-resource because it has few speakers or because, despite being widely used, little of it has been digitized. For example, Swahili has 200 million speakers, but there isn’t a proportional amount of labeled data.
Two key limitations in low-resource languages impact model performance:
Data Quantity: There’s a scarcity of labeled data in the target language (e.g., Swahili, Bengali), and in many cases a lack of even raw, unlabeled language data (e.g., Mongolian, Amharic).
Data Quality: The available labeled data in the target language is of poor quality.
While data quantity issues can be solved with enough investment in data collection, data quality issues are much harder to address. Large-scale collection initiatives often don’t involve the communities whose language they are collecting. The people have little say in how the data is labeled and used, so it’s unlikely that critical social and cultural context is accurately reflected in the labeled training data. There’s also a real scarcity of skilled annotators in many languages.
Major LLMs are not attuned to the contexts and cultures of the non-English-speaking world. When operating in low-resource languages, AI models simply lack sufficient sociocultural context. A literal translation doesn’t always convey meaning accurately. An AI system can generate an intelligible response in a user’s language but still fail to understand their request accurately and communicate its response precisely.
In critical contexts such as healthcare, law, and finance, this lack of precision can have serious consequences. Researchers have observed that models produce responses in non-English languages that technically “make sense,” but use highly unnatural, culturally incorrect phrasings. Even outside critical contexts, this issue can create serious usability problems.
2) Fundamental Model Limitations
Leading frontier models are designed primarily to work for English.
While training data matters, even with sufficient data in most major languages, AI companies would still need to make fundamental architectural changes to the models themselves. Researchers found that model limitations like tokenizer inefficiencies and English-centric reasoning account for roughly 70-80% of failures during testing.
Tokenizer Inefficiencies
LLMs process information as tokens. They break text into tokens (whole words or fragments of words) and convert them into the numerical representations they operate on. Models trained predominantly on English often break non-English text into inefficient fragments, consuming significantly more tokens to express the same ideas in other languages. Higher token consumption means the model’s context window fills up faster, causing accuracy to drop significantly in longer sessions.
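One piece of this can be seen without any model at all. Many tokenizers are byte-level, meaning a character the tokenizer has no learned merge for falls back to its raw UTF-8 bytes. The sketch below assumes such a tokenizer and uses UTF-8 byte counts as a worst-case ceiling on token counts. It is a rough proxy, not a measurement: real counts depend on each tokenizer’s learned vocabulary, the helper name is hypothetical, and the sample greetings are illustrative rather than a benchmark.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes needed per character of text."""
    return len(text.encode("utf-8")) / len(text)


# Illustrative "How are you?" style greetings; not a benchmark.
samples = {
    "English": "How are you?",
    "Swahili": "Habari yako?",
    "Bengali": "আপনি কেমন আছেন?",
    "Amharic": "እንዴት ነህ?",
}

for language, text in samples.items():
    # Worst case for a byte-level tokenizer: one token per byte.
    ceiling = len(text.encode("utf-8"))
    print(
        f"{language:8} chars={len(text):3} byte_ceiling={ceiling:3} "
        f"bytes/char={utf8_bytes_per_char(text):.2f}"
    )
```

Swahili looks as cheap as English here because it uses the Latin script, and this proxy only captures the cost of the script itself. The rest of the gap comes from which character sequences the tokenizer learned to merge, which can only be measured by running the actual tokenizer on representative text.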
If tokenizer inefficiencies are not addressed, even if models “work” in a specific language, it’s computationally and financially more expensive for regional providers to build solutions that deliver good results. At the individual session level, these differences may not be large, but at scale, these costs may be substantial. Providers (and/or their users) are effectively paying more for the same (or more likely lower) performance whenever the system is not operating in English.
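A quick calculation shows how a small per-session difference compounds. Every number below is hypothetical: the request volume, the price per million tokens, the 128,000-token window, and especially the 2.5x token multiplier are placeholders to be replaced with figures measured for a specific model and language.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    requests_per_month: int
    english_tokens_per_request: int   # tokens the same request needs in English
    price_per_million_tokens: float   # hypothetical price, in dollars


def monthly_cost(workload: Workload, token_multiplier: float = 1.0) -> float:
    """Monthly spend when each request needs `token_multiplier` times
    as many tokens as its English equivalent."""
    tokens = (
        workload.requests_per_month
        * workload.english_tokens_per_request
        * token_multiplier
    )
    return tokens / 1_000_000 * workload.price_per_million_tokens


def effective_context(window_tokens: int, token_multiplier: float) -> float:
    """Context window expressed in English-equivalent tokens."""
    return window_tokens / token_multiplier


workload = Workload(
    requests_per_month=1_000_000,
    english_tokens_per_request=1_000,
    price_per_million_tokens=2.00,
)

print(f"English baseline: ${monthly_cost(workload):,.2f}")
print(f"At a 2.5x multiplier: ${monthly_cost(workload, 2.5):,.2f}")
print(f"Effective context: {effective_context(128_000, 2.5):,.0f} tokens")
```

With these placeholder figures, the same workload costs 2.5 times as much, and the usable context shrinks to 40% of its nominal size, which is the sense in which providers pay more for less.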
English-Centric Reasoning
When working in other languages, the model doesn’t “think in English” in a conscious sense, but its internal representations are shaped by its English-dominant training data. It can fail to map culturally or linguistically nuanced concepts between languages when operating in low-resource languages. The dominant training language will usually have an outsized influence on the model’s reasoning. For example, many leading open-source models built by Chinese AI labs face a similar problem, since these models sometimes apply Chinese-centric reasoning.
Examples of failures caused by this dominant language-centric reasoning:
Failures in language understanding: Models often leave recognizable signals of misunderstanding within their reasoning traces. They fail to accurately interpret meaning, which leads to worse task performance. The gap tends to increase in languages less similar to the model’s dominant reasoning language, resulting in inconsistent, divergent outputs. Many concepts also don’t have exact equivalents across languages, making more complex information harder to translate, reducing performance on complex tasks.
Issues with linking references: Many tasks involve references to people, objects, or entities that the model needs to track. In certain languages, these relationships are harder for the model to track due to multiple language-specific factors. For example, an item could be referenced using formal/informal terms, gendered/gender-neutral terms, or context-specific identifiers. While this may be perfectly understandable for a native speaker, the model can fail to make the right connections and treat the same referenced item as separate items, or vice versa.
Misaligned latent space: AI models often use a latent space when processing information. This is a compressed representation of relevant information that keeps the model focused on data points that actually matter for the task. This latent space is often heavily biased toward the predominant language used in model training. If a model is trained primarily on English data, when given the same information in English and another language, its reasoning is more likely to be accurate in the former case.
What This Means for Product Builders
Companies have always localized their products to serve users in geographically, culturally, and linguistically diverse regions. This has mostly meant user-experience-level optimizations, such as translating product copy for the local context, along with some system-level changes. With AI products, they also have to consider how the system’s “intelligence” must be adapted. Although a model may have multilingual capabilities, it will not perform equally well across every language. If product builders don’t take steps to close the performance gaps, they can’t build effective solutions for regional markets.
When building AI solutions, product builders must optimize the system around the model’s intelligence. They need to add specific architectural affordances to ensure that their solutions actually work in the user’s regional context. For example, if they are offering an AI fintech solution, they will need to design their context management system to ensure that the model fully understands the cultural context around their core use cases: How do users communicate their financial needs, and what specific cultural context is necessary to correctly interpret their intent? The model will not be able to infer many details on its own.
The model has a limited understanding of cultural nuances, even if it can technically communicate in the target language. For example, in certain languages, definitive statements are often softened, so the model may assume that a fact is uncertain, even when it’s true. The system must give the model guidance on how to handle these scenarios to ensure the model accurately interprets context. The system may also need to be designed to gather more context when operating in certain languages. For example, in many cultures, details are implied rather than explicitly stated, so the system may need to ask the user more questions to clarify context.
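One way to picture this kind of guidance is as per-locale configuration that the system folds into the model’s instructions. The sketch below is a minimal illustration, not a recommended architecture. The class, the function, the locale keys (`locale-a`, `locale-b`), and the wording of each note are all hypothetical; in practice the notes would need to be written and validated by native speakers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocaleGuidance:
    interpretation_notes: tuple = ()
    ask_clarifying_questions: bool = False


# Hypothetical entries; real guidance should come from native speakers.
GUIDANCE = {
    "locale-a": LocaleGuidance(
        interpretation_notes=(
            "Speakers often soften definitive statements; do not treat "
            "polite hedging as factual uncertainty.",
        ),
    ),
    "locale-b": LocaleGuidance(
        interpretation_notes=(
            "Key details are frequently implied rather than stated.",
        ),
        ask_clarifying_questions=True,
    ),
}


def build_system_prompt(base_prompt: str, locale: str) -> str:
    """Append locale-specific interpretation guidance to a base prompt.

    Unknown locales fall back to the base prompt unchanged.
    """
    guidance = GUIDANCE.get(locale, LocaleGuidance())
    lines = [base_prompt]
    lines.extend(f"- {note}" for note in guidance.interpretation_notes)
    if guidance.ask_clarifying_questions:
        lines.append(
            "- Before acting, ask the user to confirm any implied details."
        )
    return "\n".join(lines)


print(build_system_prompt("You are a fintech assistant.", "locale-b"))
```

Keeping the guidance in data rather than hard-coding it into prompts makes it easier for regional experts to review and update it without touching the rest of the system.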
The performance gap also creates safety risks. Models are more likely to comply with harmful requests in non-English languages. Safety mechanisms like content filters, toxicity detection, and misinformation classifiers are language-dependent. Researchers tested guardrail performance on harmful prompts and found that refusal rates dropped sharply when these prompts were in West African languages compared to English. Ineffective mechanisms could either permit harmful responses or reject benign requests due to misinterpretation. Alignment and safety mechanisms won’t be as reliable across languages, which is a more fundamental problem than quality issues alone. Product builders have to address this to ensure their solutions don’t harm users simply because they use a different language.
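Builders can check for this in their own systems by running the same set of harmful test prompts in each supported language and comparing refusal rates against the English baseline. The sketch below assumes the results have already been collected as (language, refused) pairs. The function names, the 10-point threshold, the `lang-x` label, and the sample data are all invented for illustration and do not come from any study.

```python
from collections import Counter


def refusal_rates(results):
    """Compute the refusal rate per language.

    `results` is an iterable of (language, was_refused) pairs, one per
    harmful test prompt.
    """
    totals, refused = Counter(), Counter()
    for language, was_refused in results:
        totals[language] += 1
        if was_refused:
            refused[language] += 1
    return {language: refused[language] / totals[language] for language in totals}


def languages_below_baseline(rates, baseline="en", max_drop=0.10):
    """List languages whose refusal rate trails the baseline by more
    than `max_drop` (an arbitrary threshold chosen for this sketch)."""
    base = rates[baseline]
    return sorted(
        language for language, rate in rates.items() if base - rate > max_drop
    )


# Made-up results: 10 harmful prompts per language.
results = (
    [("en", True)] * 9 + [("en", False)]
    + [("lang-x", True)] * 5 + [("lang-x", False)] * 5
)

rates = refusal_rates(results)
print(rates)
print(languages_below_baseline(rates))
```

This covers only one failure mode. Measuring the opposite problem, benign requests being rejected due to misinterpretation, would need a separate set of benign prompts scored the same way.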
What’s At Stake For Society At Large
Language isn’t a nice‑to‑have, it’s what determines whether technology actually empowers communities or leaves them out.
— Inbal Becker‑Reshef, Managing Director of Microsoft’s AI for Good Lab
Since ChatGPT’s launch in late 2022, AI usage has been increasing globally. While more than a billion people now use it to some degree, adoption remains uneven. Nearly twice as many people in the Global North use AI compared to the Global South. The performance gap across languages is probably a major reason for this difference. If AI doesn’t work for people in their language, they are less likely to use it, and even if they do, they are less likely to fully benefit from it. This creates asymmetry at scale because AI works well only for a few dominant languages. The performance gaps ensure that certain populations disproportionately benefit from AI.
As AI becomes more embedded in global infrastructure, this gap becomes more problematic. Many countries are now interested in using AI to improve how public services are delivered and how government institutions operate. For example, AI may be involved in processing people’s loan applications, insurance claims, or tax filings, even when these systems have poorer performance in many regional languages. This could create disparities even within multilingual societies, where people whose language AI handles better gain a systemic advantage, worsening inequality within communities and even entire nations.
Case Study: AI In Healthcare
Performance gaps between languages impact health outcomes.
Healthcare is an area where AI promises incredible benefits. However, any AI solution that involves human interaction will have inconsistent or inferior performance in many linguistic and cultural contexts. Earlier this year, the Gates Foundation and OpenAI announced $50M in funding to deploy AI tools in 1,000 primary health clinics across Africa, intended to assist with patient triage and medical advice in local languages. While the initiative may be well-intentioned, it assumes these tools can operate effectively in regional languages.
Researchers tested LLM performance on medical knowledge in 8 African languages and found that while performance improves with newer models, it still lags performance in English. This may create a new layer of healthcare inequality. In regions with severely limited healthcare access, these performance gaps may be less of a concern because the tools help people who previously had little to no access to healthcare at all. However, this assumes that these systems can be reliable in these regional contexts. A system still needs a fundamental understanding of the language to correctly associate what users report with relevant conditions or treatment plans.
The Problem Is Fixable, But Only To An Extent
While performance will generally increase across many languages as models improve, the approach that leading AI labs are currently taking effectively makes the performance gap systemic. They are unlikely to prioritize fully addressing this gap, even though their models are not really optimized for the ~80% of people who don’t use English. There isn’t a strong financial incentive for them to do so. They are treating language purely as a resource used to access markets, instead of an essential medium for enabling the model’s “intelligence.” Fixing the problem at the model level is the only real way to improve a model’s ability to reason accurately and efficiently in other languages.
Building foundation models from scratch is not a realistic option in many cases. However, solution providers can implement targeted post-training. This can be a practical, cost-effective way to help models perform well in specific languages. Researchers propose deliberately post-training models on more linguistically diverse data and curating precise datasets to address specific model weaknesses. These strategies can be effective in scenarios where a target language was not directly included in model training. However, the efficacy drops as languages become more dissimilar to the high-resource languages in the model’s core training data.
While it’s possible to address the performance gap with post-training, implementing it may not even be an option in many regions of the world. The scarcity of labeled datasets in the target language, the costs of model training, and the lack of specialized expertise make post-training entirely out of reach for many communities. Even if regional providers want to make AI solutions work more effectively for their communities, it will be an uphill battle in many scenarios (if it’s feasible at all). They might succeed in reducing the performance gaps, but closing them entirely will be challenging.
Conclusion
The performance gap across languages is a consequence of how the leading models were built and who they were built for. The leading AI labs are pushing for global adoption, even though their models perform best in only a few dominant languages, while underperforming in the languages the majority of people speak. Their models will continue to improve, and multilingual performance will likely improve as a byproduct, but parity will remain elusive. In many cases, the people most affected by the gap won’t have any meaningful influence over how it gets addressed.
While many people globally may benefit from AI, they will not benefit equally. AI leaders such as OpenAI CEO Sam Altman have even acknowledged this. The upside will be concentrated amongst those societies for whom this technology is designed by default. When product builders deploy systems that work well for some users and poorly for others, they are choosing to deliver unequal outcomes to the people they serve. For AI to truly benefit everyone, they have to close the gap with deliberate design.
Key References
arXiv
Digital Divide Data | Low-Resource Languages in AI: Closing the Global Language Data Gap
Fast Company | AI isn’t built for all languages and cultures. There’s a push to fix that
Stanford University Human-Centered AI (HAI)
Microsoft
Tech Policy Press | The Multilingual AI Gap Is Not Closing — It Is Being Rebranded
The Economist | Top AI Models Underperform in Languages Other Than English
The Strategic Linguist | From Language Policy to Algorithmic Advantage


