
Can AI help to increase access to all languages?

'Since translation is an integral part of globalisation efforts in science, technology, and commerce, the more languages that can be translated, the better.' Photo: Getty Images

Analysis: the No Language Left Behind AI project is looking to create an effective and efficient way to translate between 200 languages

Language is our main medium of communication, but more than 7,100 languages are spoken around the world. When people do not share a language, it is harder to communicate, to understand one another and to build trust. The ability to translate between languages therefore makes it easier to communicate across borders and makes information more accessible.

With advances in technology and artificial intelligence, online translators such as Google Translate, DeepL and Bing Translator have made communication much easier between speakers of different languages. Such applications generally use machine translation, the automatic conversion of text from one language to another.

Because human language is flexible and highly variable, machine translation is one of the most challenging tasks in artificial intelligence. Good translation is typically not a literal word-for-word conversion, and automated systems have often struggled with idiomatic expressions and with ambiguity that can only be resolved from context. For example, the Irish greeting "Lá fhéile Pádraig sona dhuit" ("Happy St Patrick's Day to you") was still tripping up Google Translate in 2018: rendered word for word, it comes out as something like "day of the festival of Patrick happy to you".

Such translation failures arose because machine translation was carried out using a rule-based approach, in which the grammar, syntax and semantics of a language's words, phrases and parts of speech were written down as fixed rules. The translator simply applied these rules, and so it failed to translate accurately many of the new sentences it came across.
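
To see why fixed rules break down on idioms, consider a minimal word-for-word sketch in Python. The tiny lexicon below is purely illustrative, not a real rule set:

```python
# A toy rule-based "translator": a fixed Irish-to-English dictionary applied
# word by word. The lexicon is illustrative only.
LEXICON = {
    "lá": "day",
    "fhéile": "festival",
    "pádraig": "patrick",
    "sona": "happy",
    "dhuit": "to-you",
}

def rule_based_translate(sentence: str) -> str:
    """Translate word for word; unknown words pass through unchanged."""
    return " ".join(LEXICON.get(word, word) for word in sentence.lower().split())

print(rule_based_translate("Lá fhéile Pádraig sona dhuit"))
# -> "day festival patrick happy to-you", nothing like the idiomatic
#    "Happy St Patrick's Day to you"
```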

However, machine translation is increasingly based on a more sophisticated, data-driven approach. To build a translator, the machine is given a large dataset of parallel text as examples. Parallel text pairs phrases or sentences in one language with their proper translations in another.
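
In code, a parallel corpus can be as simple as a list of aligned sentence pairs; the Irish-English examples below are illustrative:

```python
# A parallel corpus: each entry pairs a source sentence with its proper
# translation. Real training corpora contain millions of such aligned pairs.
parallel_corpus = [
    ("Dia dhuit", "Hello"),
    ("Go raibh maith agat", "Thank you"),
    ("Lá fhéile Pádraig sona dhuit", "Happy St Patrick's Day to you"),
]

for source, target in parallel_corpus:
    print(f"{source!r} -> {target!r}")
```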

Whenever the machine translates from one language to another, it has to decide how each word or phrase should be rendered in its context, and which parts (proper names, for instance) should be carried over untranslated. Hence, the more parallel text the translator has seen, the better the translation.

Why are most existing online translators limited to a few widely spoken languages?

But not all languages have enough data for AI translators to learn from in this way. While English, Spanish, French and Mandarin are high-resource languages, with many examples for an AI translator to use, many low-resource languages lack such data.

This is one of the reasons why most existing online translators are limited to a few widely spoken languages. To address this, Meta's AI research team recently introduced the No Language Left Behind project, the first AI project able to translate effectively between 200 languages, including low-resource languages such as Luganda, Asturian and Urdu.

The project addresses the lack of data for low-resource languages by building a massive dataset of parallel text covering all 200 languages. The researchers did this by automatically searching for and matching equivalent sentences across documents in different languages.
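
One common way to do this matching, sketched below, is to embed sentences from both languages into a shared vector space and pair each sentence with its nearest neighbour. The project used a multilingual encoder for this step; the `embed` function here is only a random-vector placeholder, so the matches it prints are meaningless:

```python
import numpy as np

# Sketch of bitext mining: embed sentences from two languages into a shared
# vector space, then pair each sentence with its nearest neighbour.
# Real pipelines also keep only pairs whose similarity clears a threshold.

def embed(sentences: list[str]) -> np.ndarray:
    """Placeholder encoder: random unit vectors, NOT a real model."""
    rng = np.random.default_rng(len(sentences))
    vectors = rng.normal(size=(len(sentences), 8))
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

english = ["Hello", "Thank you", "Good night"]
irish = ["Oíche mhaith", "Dia dhuit", "Go raibh maith agat"]

similarity = embed(english) @ embed(irish).T   # cosine similarity matrix
for i, sentence in enumerate(english):
    j = int(similarity[i].argmax())
    print(f"{sentence!r} <-> {irish[j]!r} (score {similarity[i, j]:.2f})")
```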

Dealing with toxic content

The main risk in automatically constructing such a massive dataset is toxic content: translations or sentences that might be considered offensive or profane. If this is not handled properly, the machine translation system will keep learning from the toxic content and produce offensive translations of its own.

The Meta project handles toxic content by filtering out offensive translations before the translator learns from them. This is done by sampling translations from a language and testing whether they are offensive. If a word or phrase is found to be toxic, the project substitutes another parallel translation, and the reviewer then tests whether this more neutral replacement is itself at all offensive.
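
A minimal sketch of this kind of filtering, assuming per-language blocklists of offensive terms (the blocklist entries and sentence pairs below are harmless placeholders):

```python
# Sketch of toxicity filtering on mined sentence pairs: drop any pair where
# either side contains a term from that language's blocklist, so the model
# never trains on it. Entries and pairs here are placeholders only.
BLOCKLISTS = {
    "eng": {"offensiveword"},
    "gle": {"drochfhocal"},  # Irish for "bad word", standing in for real entries
}

def is_toxic(sentence: str, lang: str) -> bool:
    return any(tok in BLOCKLISTS.get(lang, set())
               for tok in sentence.lower().split())

def filter_pairs(pairs, src_lang, tgt_lang):
    """Keep only pairs where neither side trips its language's blocklist."""
    return [(s, t) for s, t in pairs
            if not is_toxic(s, src_lang) and not is_toxic(t, tgt_lang)]

pairs = [("Dia dhuit", "Hello"),
         ("drochfhocal anseo", "offensiveword here")]
print(filter_pairs(pairs, "gle", "eng"))   # -> [('Dia dhuit', 'Hello')]
```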

To help the project learn low-resource languages, the researchers are also looking at efficient ways to use limited language resources, such as sharing a single machine translation model across multiple languages. They are also examining how machine translators can work alongside traditional human translators.
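
One common way a single model is shared across many languages, sketched below, is to tag every training example with a target-language token, so one network learns all translation directions at once and low-resource pairs benefit from related high-resource ones. The tagging scheme shown is a widely used convention, not necessarily the project's exact format:

```python
# Sketch of multilingual sharing: one model, many directions. Each source
# sentence is prefixed with a target-language tag, letting a single network
# learn every pair and transfer knowledge to low-resource languages.
# Tags follow the NLLB-style "language_Script" naming; pairs are illustrative.
examples = [
    ("eng_Latn", "fra_Latn", "Hello", "Bonjour"),
    ("eng_Latn", "gle_Latn", "Hello", "Dia dhuit"),
    ("fra_Latn", "gle_Latn", "Merci", "Go raibh maith agat"),
]

def to_training_pair(src_lang: str, tgt_lang: str, src: str, tgt: str):
    """Prefix the source text with the target-language tag."""
    return (f"<{tgt_lang}> {src}", tgt)

for example in examples:
    print(to_training_pair(*example))
```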

The more languages that can be translated, the better

To date, the project has expanded the horizons of online translation for a large number of languages. Since translation is an integral part of globalisation efforts in science, technology and commerce, the more languages that can be translated, the better. Wider coverage also promotes the inclusion of minority groups by making information accessible in their own languages.

Meta are also being transparent about the work, sharing all the data, implementation details, challenges and evaluation criteria associated with the project. In doing so, they are encouraging open-source collaboration, inviting researchers around the globe to pitch in and expand the project to even more languages.
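
For instance, the released checkpoints can be run through the Hugging Face transformers library. The snippet below is a sketch assuming the published facebook/nllb-200-distilled-600M checkpoint and NLLB's "language_Script" codes; it requires transformers and torch to be installed:

```python
# Sketch: translating English to Irish with a released NLLB-200 checkpoint
# via Hugging Face transformers (assumes `pip install transformers torch`).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(MODEL, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

inputs = tokenizer("Translation makes information more accessible.",
                   return_tensors="pt")
# Force the decoder to start with the Irish (gle_Latn) language token.
output_ids = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("gle_Latn"),
    max_length=64,
)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```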


The views expressed here are those of the author and do not represent or reflect the views of RTÉ