A language of no importance: the consequences of neglecting marginalized languages in the digital world

Most of the internet’s content, services, and tools are available in only a few dozen economically powerful languages, leaving speakers of marginalized languages unable to fully enjoy the rights-related benefits that internet access provides. Multi-stakeholder collaboration between private companies, public bodies, and civil society initiatives is urgently needed to bridge this digital linguistic divide.

The internet and associated digital technologies have fostered myriad opportunities for people all around the world to produce, share, and consume information and ideas. They also reflect—and in many ways perpetuate—the very same global inequalities in knowledge production and access that predated the digital revolution. 

Many of the core foundations of internet communications technology—such as Unicode characters, QWERTY keyboards, digital scripts, and the languages of software and programming—were developed by both government agencies and private actors of states with considerable geopolitical and economic power. 

As such, many of those core technologies reflect the design choices of those using left-to-right, Roman alphabet-based Western languages, and were not designed with speakers using different alphabets or orthographies, with oral language histories, or with particular accessibility requirements, in mind. 

Furthermore, as the internet gradually assumed a more central role in our day-to-day work and private lives, these same wealthier states were able to build the physical infrastructures required for internet access, give their citizens relatively cheap access to increasingly fast broadband connections, and incorporate digital literacy and ICT skills into their state education systems. 

Individuals in these states, therefore, have historically been able to not only consume but also contribute to the internet ecosystem far more than many of their counterparts in less developed countries. 

The picture of internet access is changing. Today, 75% of the internet’s five billion users are from the Global South. Yet, despite these regions being home to 90% of the world’s 7000+ languages, most major platforms and services are still only available or functional in a small number of geopolitically dominant languages. 

A recent report from Whose Knowledge, The Oxford Internet Institute, and the Centre for Internet and Society explores linguistic differences in the interfaces of major platforms, in Google Maps geography, and in Wikipedia knowledge ecosystems. The researchers demonstrate through these numerical studies that even languages with tens to hundreds of millions of speakers are considerably under-resourced in terms of both the types and volumes of knowledge available online. 

While many argue that most internet users speak a regional or international language in addition to their first language, and can use it to access information and express themselves online, this fluency requires considerable exposure to the language in question through either immersion or schooling. Many around the world, particularly those who are most disadvantaged along other metrics, have access to neither.

Furthermore, the absence of digital content on the internet in marginalized languages usually entails the absence of the large linguistic datasets needed to train emerging natural language processing (NLP) applications—such as online translators, spelling and grammar checks and predictive text applications, voice assistants, text summarizers, keyword filters, and chatbots. These assistive tools increasingly allow those in the developed world using influential languages to save more time and money, and are also serving as vital tools in the fight against harmful and illegal content on online platforms. 

Yet both the tools and their prerequisite databases simply do not exist for those who are already being ‘left behind’ in myriad other ways. As these technological divides continue to widen, even the unseen costs of the increasingly advanced English-language automated tools will likely be borne by those least likely to ever benefit from them.

These stark inequalities in the available content, services, and moderation systems for marginalized language communities are increasingly on the radar of international institutions. International instruments like the Declaration on Minorities and the Declaration on the Rights of Indigenous Peoples have emphasized the positive obligations on states to protect and promote marginalized cultures and languages, including through education and technological development. 

The UNESCO Global Action Plan on the International Decade of Indigenous Languages includes goals related to the building of digital infrastructures and datasets for marginalized languages. And the OHCHR’s B-Tech Team is exploring how the UN Guiding Principles on Business and Human Rights apply to technology companies and their obligations to consider how their products might particularly impact the rights of marginalized groups, exacerbate discrimination, or contribute to physical harm.

However, the gaps between standards and implementation are considerable. Many linguistically diverse states have limited resources and funds to achieve these goals, and many online platforms with global influence and multi-billion-dollar budgets are expending resources only on languages and linguistic technologies of strategic business importance.

Both private technology companies and governments bear responsibilities to protect the rights of minorities, and ensure that marginalized language speakers have equal access to the rights enabled by digital and internet technologies. 

As such, companies facilitating information and content-sharing online must invest in the development of robust digital infrastructure and interfaces for all of the languages of their user base, and provide avenues for content creation and curation in local languages where appropriate. This might include supporting and partnering with language digitization initiatives led and powered by local experts—like The O Foundation’s Open Speaks project, which equips language activists with the necessary tools and frameworks for comprehensive audiovisual documentation of their languages, or the Living Dictionaries app from the Living Tongues Institute for Endangered Languages, which curates searchable digital lexicons of endangered languages all around the world. 

At the state level, there is clearly a need to fund and resource the collection of data and literature in marginalized languages to ensure their survival and to lay the groundwork for the development of appropriate language technologies. Improving and expanding ICT education for speakers of marginalized languages will also be of vital importance, and partnership with translation companies might aid progress in this regard. 

In the long term, digital roadmaps—both national and international—should explicitly acknowledge and address the specific needs of marginalized language communities in accessing and enjoying their rights online.


*Kiriol translation provided by Zacchaeus Yisa