It will soon be easier to view Facebook and Instagram posts in less-spoken global languages, but one expert suggests Meta should talk to native speakers to improve the tool.
It will soon be easier to view Facebook and Instagram posts in 200 lesser-spoken languages around the world.
The team behind Meta's No Language Left Behind (NLLB) project announced in a paper published this month that it has scaled up its original technology.
The project includes a dozen “under-resourced” European languages, such as Scottish Gaelic, Galician, Irish, Ligurian, Bosnian, Icelandic and Welsh.
According to Meta, an under-resourced language is one with fewer than one million sentences of data that can be used.
Experts say Meta should consult with native speakers and language experts to improve the service as the tool still needs work.
How the project works
Meta trains its artificial intelligence (AI) with data from OPUS, an open-source repository of authentic spoken and written texts in many languages that can be used to train machine learning models.
Contributors to the dataset are experts in natural language processing (NLP): the subset of AI research that gives computers the ability to translate and understand human language.
Meta said it also uses data mined from sources such as Wikipedia in its database.
The data is used to create what Meta calls a multilingual language model (MLM), in which AI can translate “between any pair of languages without relying on English data,” according to its website.
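For readers who want to see what this looks like in practice, the sketch below uses the publicly released NLLB-200 research checkpoint on Hugging Face (facebook/nllb-200-distilled-600M) with the transformers library to translate directly from Galician into Scottish Gaelic, with no English pivot. It illustrates the open research model, not the system Meta runs on Facebook and Instagram, and the sample sentence is a placeholder.

```python
# A minimal sketch using the openly released NLLB-200 research checkpoint.
# It shows direct translation between two under-resourced languages
# (Galician -> Scottish Gaelic) without pivoting through English.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "facebook/nllb-200-distilled-600M"  # public distilled model
tokenizer = AutoTokenizer.from_pretrained(checkpoint, src_lang="glg_Latn")  # Galician
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

text = "Bos días, como estás?"  # placeholder Galician sentence
inputs = tokenizer(text, return_tensors="pt")

# Force the decoder to start in Scottish Gaelic (FLORES-200 code "gla_Latn").
output_ids = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("gla_Latn"),
    max_new_tokens=50,
)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```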
The NLLB team tests the quality of its translations against a benchmark of human-translated sentences, which is also open source. It includes a list of “toxic” words or phrases that humans can teach the software to filter out when translating text.
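As a rough illustration of that kind of check, the sketch below scores a model output against a human reference with the chrF metric (via the sacrebleu library, one of the metrics reported in the NLLB paper) and flags output containing terms from a word list. The Gaelic sentences and the list entries are placeholders, not taken from NLLB's actual benchmark or toxicity lists.

```python
# A simplified sketch of benchmark scoring plus a toxicity word-list check.
# The sentences and list entries below are placeholders for illustration only.
import sacrebleu

hypothesis = ["Tha an t-sìde brèagha an-diugh."]   # model output (placeholder)
references = [["Tha an aimsir àlainn an-diugh."]]  # human translation (placeholder)

# chrF: a character-level metric the NLLB paper reports alongside BLEU.
chrf = sacrebleu.corpus_chrf(hypothesis, references)
print(f"chrF: {chrf.score:.1f}")

# Stand-in for the open-source toxicity lists: flag any output containing
# a listed term. The real project ships per-language lists of such terms.
toxic_terms = {"placeholder_slur", "placeholder_profanity"}
flagged = [h for h in hypothesis if any(t in h.lower() for t in toxic_terms)]
print("Flagged outputs:", flagged)
```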
According to its latest paper, the NLLB team has increased translation accuracy by 44 percent compared with its first model, released in 2020.
When the technology is fully implemented, Meta estimates there will be more than 25 billion translations per day across Facebook's News Feed, Instagram and its other platforms.
'Talk to the people'
William Lamb, Professor of Gaelic Ethnology and Linguistics at the University of Edinburgh, is an expert on Scottish Gaelic, one of the under-resourced languages identified by Meta in its NLLB project.
About 2.5 percent of Scotland's population, some 130,000 people, reported in the 2022 census that they had some proficiency in the 13th-century Celtic language.
There are also about 2,000 Gaelic speakers in eastern Canada, where it is a minority language. UNESCO classifies the language as “vulnerable” to extinction because so few people speak it regularly.
Lamb notes that Meta's translations into Scottish Gaelic are “not very good yet”, despite their “heart being in the right place” with the crowdsourced data they are using.
“What they should do … if they really want to improve the translation is to talk to the people, the native Gaelic speakers who still live and breathe the language,” Lamb said.
That's easier said than done, continued Lamb. Most native speakers are in their 70s and don't use computers, and younger speakers “don't habitually use Gaelic the way their grandparents did”.
A good alternative, he suggested, would be for Meta to enter into a licensing agreement with the BBC, which works to preserve the language by creating high-quality online content.
'This needs to be done by experts'
Alberto Bugarín-Diz, a professor of AI at Spain's University of Santiago de Compostela, believes linguists like Lamb should work with big tech companies to improve the datasets available to them.
“This needs to be done by experts who can revise the texts, correct them and update them with metadata that we can use,” said Bugarín-Diz.
“There is a real need for people from humanities and technical backgrounds like engineers to work together,” he added.
Using Wikipedia has an advantage for Meta, Bugarín-Diz continued, because the data reflects “almost every aspect of human life,” meaning the quality of the language can be much better than if only formal texts were used.
But Bugarín-Diz suggests that Meta and other AI companies take the time to find quality data online and then meet the legal requirements needed to use it without breaking intellectual property laws.
Lamb, meanwhile, said he wouldn't recommend that people use the tool until Meta corrects the errors in its dataset.
“I wouldn't say their translation capabilities are at the point where the tools are really useful,” Lamb said.
“I still wouldn't encourage anyone to use it as a reliable language tool; I think they'd be upfront in saying that too”.
Bugarín-Diz takes a different stance.
He believes that if people don't use Meta's translations, the company “won't be willing to invest the time and resources to improve them.”
As with other AI tools, Bugarín-Diz believes it's important to know the technology's weaknesses before using it.