Researchers evaluated three leading models: OpenAI's GPT-4, Anthropic's Claude 3 Opus, and Meta's Llama 3.
In each experiment, the researchers presented the model with a short fictional user biography before asking a question. They systematically varied three factors:
- Education level (higher or lower)
- English proficiency (native or non-native)
- Country of origin (United States, Iran, or China)
The questions were drawn from two datasets: one designed to measure honesty and truthfulness, and another consisting of scientific questions to evaluate factual accuracy.
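To make the experimental design concrete, the sketch below shows how biography-conditioned prompts could be generated by crossing the three factors. The biography wording, the `build_prompt` helper, and the commented-out `query_model` call are illustrative assumptions, not the study's actual code or datasets.

```python
# Minimal sketch of the study's factorial design (illustrative only: the exact
# biography templates, question datasets, and model-query code are not given here).
from itertools import product

EDUCATION = ["has a graduate degree", "did not finish high school"]        # higher vs. lower education
ENGLISH   = ["a native English speaker", "a non-native English speaker"]   # English proficiency
COUNTRY   = ["the United States", "Iran", "China"]                         # country of origin

def build_prompt(education: str, english: str, country: str, question: str) -> str:
    """Prepend a short fictional user biography to the question, as in the study's setup."""
    bio = f"The user is from {country}, {education}, and is {english}."
    return f"{bio}\n\n{question}"

question = "Why is the sky blue?"  # placeholder; the real items came from the two benchmark datasets
for edu, eng, country in product(EDUCATION, ENGLISH, COUNTRY):
    prompt = build_prompt(edu, eng, country, question)
    # response = query_model(prompt)  # hypothetical call to GPT-4, Claude 3 Opus, or Llama 3
    # the response would then be scored for accuracy or truthfulness against reference answers
    print(prompt, end="\n---\n")
```

Each question is thus asked twelve times per model, once for every combination of the three factors, so that any difference in answer quality can be attributed to the biography rather than the question itself.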
Accuracy Drops for Non-Native Speakers
Across all three models, responses were significantly less accurate when questions were attributed to users with lower levels of education or non-native English proficiency. The decline was most pronounced when both factors were combined—lower education and non-native English.
This finding is particularly notable because large language models are often marketed as tools that can help bridge educational and informational gaps—especially for users who may not have advanced formal training.
Iran at the Center of Performance Gaps
When comparing users from the United States, China, and Iran with similar education levels, the study found that Claude 3 Opus performed worse for users identified as being from Iran. The gap appeared both in scientific accuracy and in measures of honesty and truthfulness.
The disparity extended beyond answer quality: the model was also more likely to refuse to respond to these users.
According to the published data, Claude 3 Opus declined to answer approximately 11 percent of questions from users described as less educated, non-native English speakers. Under control conditions, in which no user biography was provided, the refusal rate was 3.6 percent, so attaching such a biography roughly tripled the rate of refusals.
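For readers checking the figures, the ratio between the two refusal rates can be worked out directly; the values below simply restate the percentages quoted above.

```python
# Quick check of the reported refusal rates (figures as quoted in the article).
control_rate = 0.036  # refusals when no user biography was provided
treated_rate = 0.11   # refusals for less educated, non-native English users
print(treated_rate / control_rate)  # ~3.06, i.e. roughly a threefold increase
```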
Dismissive Tone and Selective Refusal
A manual review of responses found that in 43.7 percent of cases involving less educated users, the model’s tone contained dismissive or condescending elements. For highly educated users, this figure was below one percent. In some instances, the model appeared to mimic broken English or exaggerate an accent.
The study also reported selective refusals. On topics such as nuclear energy, human anatomy, and certain historical events, the model declined to provide information to less educated users identified as being from Iran or Russia—while answering the same questions for other users.
Such patterns raise concerns about consistency, fairness, and global trust in AI systems.
Reflecting Social Biases in Training Data
The researchers caution that these outcomes may not stem from deliberate design choices but rather from biases embedded in training data. Social science research has long documented that non-native English speakers are sometimes subconsciously perceived as less competent in certain contexts. The new findings suggest that similar patterns may be reflected in large language models.
One of the study’s authors noted that if language models are to meaningfully reduce global information inequality, their embedded biases must be systematically identified and mitigated. Otherwise, the technology risks reinforcing the very disparities it claims to address.
Personalization and the Risk of Amplified Inequality
The findings come at a time when AI developers—including OpenAI—are expanding personalization features such as persistent memory, which allows systems to retain user information across conversations. While these tools can enhance user experience, they also introduce the possibility that models may treat different groups of users differently if safeguards are not carefully implemented.
For countries like Iran, where users increasingly rely on AI tools but have limited involvement in their development and training, these results carry broader implications. The issue extends beyond technical accuracy; it touches on digital equity, algorithmic transparency, and equal access to knowledge.
The MIT study ultimately raises a fundamental question for the AI industry: are these systems truly global, or do they still view the world through narrow linguistic and geographic lenses?