Language Biases in Tech: A Full Stack Problem


Take a minute to imagine you’re a newcomer to the internet. First of all, you are not alone. The web has been around for decades, yes, but on the scale of the world’s population, regular connectivity is still technically a minority experience. With an estimated 3.3 billion internet users out of a world population of 7.2 billion, and a stunning 833 percent growth rate over the past five years, we can expect diversity on the internet to increase significantly, especially as the world internet population inches toward a tipping point.

Now imagine you don’t speak English, Chinese, Arabic, Spanish or another majority language on the internet. Imagine you speak Bihari or Ilokano, minority languages in India and the Philippines, respectively. Again, your experience isn’t unique. With the so-called “next billion” coming online, we can expect a significant increase in language diversity on the internet.

For English speakers, the internet might seem like a teeming wonderland of information and games and social connections, but for those who are just coming online, the internet has a dearth of content—if any—in their native languages. The pipelines for voice and civic action that we’ve seen for much of the world are facing a significant challenge: crossing language and cultural barriers.

For one, some languages are completely invisible and unusable on browsers, operating systems, and keyboards. In the words of Tibetan blogger Dechen Pemba, who can’t access the Tibetan language on a phone:

Given that the Tibetan literary tradition goes back to the 7th century and its linguistic influence reaches far across the Himalayas encompassing areas of India, Bhutan, Mongolia, Russia and Pakistan, my pet hate is when Tibetan language is described as “obscure”. I wonder how it is possible that the language of Tibetan Buddhism and Tibetan Buddhists, comprising of as many as 60 million people, can be wilfully left behind in terms of modern technology? For instance, Google has failed to incorporate a Tibetan font into its Android software, failed to develop a Tibetan language interface and failed to include Tibetan in Google Translate, the most useful of tools. At least Apple has seen the light there.

In a recent series of lectures at UCLA hosted by the Digital Media Arts program and the Processing Foundation, I talked through some of these issues, drawing on an essay I’d written for the Digital Asia Hub, a new think tank in Hong Kong that’s grown out of the Berkman Center for Internet and Society.

Here’s a summary of the key points I think we should be paying attention to with regard to the language biases inherent in our technologies. These are pulled directly from the Digital Asia Hub essay and from transcripts of the UCLA talk provided by the terrific Open Transcripts, with minor editing to contextualize the words for this piece:

Language biases create sharp divides in the global web—laying the foundation for digital ghettos of information and community.

Without improved language and writing script support, new netizens run the risk of living in digital ghettos created by their native tongues. Any online actions they engage in or media they create will be largely invisible and unappreciated by those outside their cultural-linguistic spheres. This can have significant effects, for instance, on human rights advocacy, which can depend so heavily on using social media and email to raise awareness among international news sources.

New internet users who don’t speak majority languages will likely be unable to participate in global internet culture and conversations as either readers or contributors. A number of internet researchers looking at language divides online have noted that minority-language speakers, especially those from the global south, will experience substantial information inequality online. Indeed, people’s inability to speak English can significantly affect their very adoption and use of the internet, even if they are aware of its existence.

The internet has proven to be a crucial pipeline for attention for those who have traditionally been marginalized. But language barriers can prevent the broader public from understanding their voices.

I think a lot of us are familiar with the internet’s role in building social movements and the ability to amplify one’s perspective and words. Certainly the Umbrella Movement in Hong Kong and the Black Lives Matter movement here in the U.S. rely on the ability to broadcast a message, to use hashtags, and to create a pipeline from social media to mainstream media, and then hopefully to other audiences.

And certainly we can think about major hashtags and major movements that have been in English or a majority language: #TweetLikeAForeignJournalist in Kenya was a critique of media coverage of East Africa. And then there was #JeSuisCharlie, a simple enough French phrase for people to remember, understand and repeat online and offline.

But there are a number of other movements in other languages that are more difficult to understand, and get significantly less attention: There’s #sassoufit in Congo; there’s the gau wu (#鳩嗚) movement, part of the Hong Kong Umbrella Movement, but also a tangential group with different aims and strategies. As I argued at a recent panel on the topic of biased data, language is one important barrier that prevents these movements from reaching a wider audience.

Ultimately, language biases in our technologies are a full stack problem. These compound on each other, and as technologists, we have to think holistically about solutions.

In technology design we talk about the full stack, the series of layers, such as the code and the user interface, on which software is built. As we noted during the biased data panel discussion, the human-facing part of that code is in English. Admittedly, much of code is constructed from simple phrases, like “if” and “then”. Yes, you can learn those phrases, but imagine trying to relearn code in a language that you don’t speak, and suddenly having to learn two languages: the programming language and then the language in which the programming language is expressed.
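To make that concrete, here is a minimal sketch in Python, chosen only as a familiar example of a programming language. It uses the standard library's keyword list to check that every reserved word is plain ASCII, and shows that a programmer can name a variable in their own language but cannot swap out “if”. The helper names and the variable `año` are my own, for illustration only.

```python
import keyword


def keywords_are_ascii():
    """Return True if every reserved word in this Python version is ASCII-only."""
    return all(word.isascii() for word in keyword.kwlist)


def is_reserved(word):
    """Return True if the word is one of Python's reserved keywords."""
    return keyword.iskeyword(word)


# Identifiers may use non-ASCII letters, so a Spanish speaker can write this...
año = 2016

# ...but the control-flow vocabulary itself stays in English:
# "if" is reserved, while its Spanish equivalent "si" is just an ordinary name.
if_is_keyword = is_reserved("if")
si_is_keyword = is_reserved("si")
```

The point of the sketch is narrow: the names you choose can be localized, but the grammar you have to learn first cannot.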

And then it moves up to the pressures of text input. The ability to input Arabic on a mobile phone was severely limited until recently, and Arabic speakers developed “Arabizi”, a chat language made of Roman letters and numbers, to express their language online. This was incredibly creative, but it was also a response to a lack of support for the Arabic script. The same lack of support affects many other languages whose primary script is not Latin.
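As a rough illustration of how Arabizi works, the sketch below lists a few digit-for-letter substitutions that are commonly seen, where a numeral stands in for an Arabic letter with no close Roman equivalent. The table is a small, assumed subset and not a standard: conventions vary by region and by writer, and the function name is hypothetical.

```python
# Illustrative subset of digit substitutions often seen in Arabizi.
# This table is an assumption for the example; usage varies by region and writer.
ARABIZI_DIGITS = {
    "2": "\u0621",  # hamza
    "3": "\u0639",  # ayn
    "5": "\u062E",  # kha
    "7": "\u062D",  # ha
}


def explain_digits(word):
    """Return (digit, Arabic letter) pairs for each mapped digit in an Arabizi word."""
    return [(ch, ARABIZI_DIGITS[ch]) for ch in word if ch in ARABIZI_DIGITS]


# "mar7aba" is a common Arabizi spelling of a greeting; the 7 stands for the letter ha.
pairs = explain_digits("mar7aba")
```

Note that the sketch only annotates the digits. Converting a full Arabizi message back into Arabic script is much harder, because the Roman letters are ambiguous and vowels are written inconsistently.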

Then it goes up from there into content. If you want to engage with the broader internet, you have to have access, and we can include language as a form of access. As one example, Stack Overflow is a critical go-to source for the open source community and coders in general, but the majority of the knowledge on the site is only available in English and Portuguese right now. If someone who speaks neither language wants to put a question to this rich community of more experienced practitioners, whom could they ask?

And then the stack moves all the way to the typography. We’re talking about the political decisions around typography. In languages that use Latin letters, you have a wide variety of typography and fonts that you can use, and if you have that kind of critical knowledge about the implications of all these fonts you can really make important design decisions. But if you have access to only one or two fonts, suddenly the ability for you to create a space around the very content and the sites that you’re trying to create again becomes limited and you’re inheriting someone else’s designs around your typography.

To be clear, language biases in tech are an extension of the language biases we live with in broader society. As we discuss what it means to “speak American” in this diverse, multilingual country, and as we look to a multilingual global internet, it’s important to remember how often language barriers manifest. Just recently, I wrote about U.S. candidates’ attempts at Spanish-language engagement on Twitter, which sometimes fall flat for native speakers. Both Clinton and Sanders have been taken to task online for their not-always-perfect Spanish:

https://speakbridge.io/medias/embed/democratic-debates-2016/democratic-debates-2016-general/725


https://speakbridge.io/medias/embed/democratic-debates-2016/democratic-debates-2016-general/706

This is a bias of content, one that is higher up on the technology stack, but that creates a barrier between a candidate and their electorate. Whether a language is misunderstood, or, like Tibetan, completely invisible, the barrier of understanding creates a barrier to access. Solving this at all levels will take a lot of work, but it will be essential for a truly interconnected, accessible, and civically engaged internet.