A quiet but far-reaching shift is taking place across the internet—one that puts already-marginalized voices at greater risk. Generative artificial intelligence (AI) is no longer just a tool of innovation; it is fast becoming a mechanism that amplifies existing inequalities. For communities that communicate in underrepresented languages, the impact goes beyond biased outputs. It threatens to further exclude them from digital spaces, distort their realities, and expose them to new forms of harm—with little oversight and even less accountability.
The once-blurry line between AI developers and social platforms has disappeared. Companies like Meta and xAI now repurpose massive user datasets, including posts, images, and behavioral patterns, to train AI models, often without consent or meaningful safeguards. These actions frequently take place under vague claims of “legitimate interest,” triggering legal and ethical backlash. Meta, for instance, trains its models by default on nearly all Facebook and Instagram content posted by adult users, excluding only content explicitly set to private. Users in regions like the European Union can opt out, while others cannot.
The growing integration of social media and generative AI systems is quietly reshaping our understanding of privacy, freedom of expression, and even truth itself. This entanglement imposes disproportionate risks on vulnerable communities: the models not only overlook their languages and experiences, they can also algorithmically distort them, reinforcing misrepresentation, exclusion, and bias in the very content they generate.
Direct AI training from user data
The practice of scraping online content to train AI is not new, but what is unprecedented is the scale and directness of the data extraction. Companies like Meta, X, and TikTok now train models on vast volumes of public user data, including posts, images, and behavioral patterns, raising serious questions about consent and surveillance.
In January 2025, LinkedIn was hit with a lawsuit in the United States for allegedly using users’ private messages to train AI models. Around the same time, Meta openly acknowledged that it had been training its AI systems on public Facebook and Instagram content going back as far as 2007. Elon Musk’s integration of X and xAI further illustrates how platform ownership and AI development have merged. This is not just a shift in business strategy—it marks the beginning of a new era in which platforms no longer simply host user content, but quietly repurpose it to fuel AI systems.
Learning hate, producing bias
One of the gravest risks is the amplification of discrimination and hate embedded in training data. AI models do not inherently distinguish harmful from harmless content; they learn from whatever they are fed. If that material includes racist, misogynistic, or violent content, their outputs will reproduce those biases, now disguised in the seemingly neutral, polished language of AI.
After Elon Musk’s takeover of Twitter in October 2022 and its subsequent rebranding as X, content moderation policies were significantly relaxed. In early 2025, Grok 3 generated racist images of Black footballers and misogynistic slurs in Hindi, demonstrating how unfiltered training data can translate directly into harmful outputs.
Moreover, the awareness that one’s online activity may be used to train AI leads many users, especially those from marginalized communities, to self-censor. Generative AI also reinforces disinformation by mimicking and amplifying false narratives, eroding trust in digital content.
Privacy violations at scale
When platforms use behavioral data to train AI—including data users believe to be private or even deleted—the line between public and personal becomes dangerously blurred.
Under the EU General Data Protection Regulation (GDPR), the processing of personal data must rely on one of six legal bases: consent, contract performance, legal obligation, vital interests, public interest, or legitimate interest. For special categories of data such as sexual orientation, health information, or political and religious beliefs, explicit consent is generally required. While some companies claim “legitimate interest” as their basis for training AI models on publicly accessible online content, this justification does not hold when individuals have no reasonable expectation that their data will be repurposed for AI training.
Over a decade ago, a 2013 study showed that Facebook “likes” alone could predict a person’s race, religion, and even IQ with startling accuracy. Today, with far more advanced technology and far more data, platforms can infer much more, and users have little visibility into or control over how those inferences are made and used.
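To show why such inference is technically trivial, here is a minimal sketch of how a simple classifier can recover a hidden attribute from nothing more than a binary “like” matrix. It uses synthetic data and scikit-learn; the page counts, the correlated attribute, and the reported accuracy are illustrative assumptions, not results from the 2013 study or any real platform.

```python
# Minimal sketch of trait inference from "likes", using synthetic data only.
# Assumes scikit-learn and NumPy; the page counts, the hidden attribute, and
# the resulting accuracy are illustrative assumptions, not real findings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_users, n_pages = 2000, 300

# Binary matrix: 1 means the user "liked" the page.
likes = rng.integers(0, 2, size=(n_users, n_pages))

# Hypothetical sensitive attribute weakly tied to ten specific pages.
signal_pages = rng.choice(n_pages, size=10, replace=False)
trait = (likes[:, signal_pages].sum(axis=1) + rng.normal(0, 1, n_users) > 5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(likes, trait, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Held-out accuracy: {model.score(X_test, y_test):.2f}")
```

Even this toy setup, with no behavioral data beyond which pages were liked, is enough to recover the hidden attribute far better than chance, which is the core of the concern.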
Why users of non-mainstream languages are especially at risk
Users who speak and write in underrepresented languages—such as Persian and many others outside dominant global tongues—face unique and often overlooked risks in the age of generative AI. Most AI systems are trained primarily on content in English and a handful of major languages. As a result, content in less widely spoken languages is often misunderstood, misrepresented, or ignored altogether. Platforms tend to invest fewer resources in content moderation for these languages, allowing misinformation, hate speech, and harmful narratives to spread unchecked.
In countries under sanctions or political isolation, users lack meaningful legal remedies when their rights are violated online. At the same time, censorship and poor connectivity limit access to education and inhibit digital literacy. These users are also excluded from platform transparency efforts, rarely receiving safety updates or AI disclosures in their own language—making accountability nearly impossible.
In these contexts, AI systems don’t act as neutral tools—they can become engines of digital inequality, reinforcing surveillance, exclusion, and harm.
How to mitigate the harm
Despite the mounting risks, a rights-based path forward is possible. At the technical level, cleaning and diversifying AI training datasets is essential to reduce the reproduction of harmful biases. But technical fixes alone are not enough. Independent oversight must be built into the development and deployment of AI systems—including public audits of training processes and model outputs.
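At that technical level, the curation the previous paragraph calls for can be pictured as a filtering and rebalancing pass over the training corpus. The sketch below is a deliberately simplified illustration, assuming records already carry a language tag and a harmful-content flag; real pipelines rely on trained classifiers and far more nuanced criteria.

```python
# Minimal sketch of one dataset-curation step: dropping flagged content and
# capping over-represented languages before training. The fields, the flag,
# and the per-language cap are illustrative assumptions, not a real pipeline.
from collections import defaultdict

def curate(records, max_per_language):
    """Keep unflagged records, limiting how many each language contributes."""
    kept, per_lang = [], defaultdict(int)
    for rec in records:
        if rec["flagged"]:                        # remove content flagged as harmful
            continue
        if per_lang[rec["lang"]] >= max_per_language:
            continue                              # rebalance: skip over-represented languages
        per_lang[rec["lang"]] += 1
        kept.append(rec)
    return kept

records = [
    {"text": "an English-language post", "lang": "en", "flagged": False},
    {"text": "a post flagged as hate speech", "lang": "en", "flagged": True},
    {"text": "a Persian-language post", "lang": "fa", "flagged": False},
]
print(curate(records, max_per_language=1))
```

The point is not the specific code but the design choice it encodes: what gets filtered out, and which languages get capped or boosted, are human decisions that deserve the independent audits described above.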
Civil society actors have a vital role to play here, not only by holding powerful tech companies accountable through reporting and policy analysis, but also by advancing digital literacy efforts that help people recognize algorithmic bias and critically engage with AI-generated content.
Stronger legal protections are also urgently needed, particularly for users in regions where international sanctions limit access to justice. Iranian users, for example, often have no meaningful avenue to challenge how their data is collected or used. Without enforceable safeguards and public scrutiny, AI risks becoming yet another tool of exclusion and inequality—rather than one of empowerment.