Machine learning deciphers lost languages

By Rich Pell

So-called “dead” languages, for which little is known about grammar, vocabulary, or syntax, can’t be deciphered using machine-translation algorithms. They often don’t have a well-researched “relative” language for comparison, and may even lack traditional dividers like white space and punctuation. Now, say the researchers, they have developed a new system that has been shown to automatically decipher a lost language without needing advance knowledge of its relation to other languages.

In addition, the system can itself determine relationships between languages, and it was used to corroborate recent scholarship suggesting that Iberian is not actually related to Basque. Ultimately, say the researchers, their goal is for the system to decipher lost languages that have eluded linguists for decades, using just a few thousand words.

The system relies on several principles grounded in insights from historical linguistics, such as the fact that languages generally evolve only in certain predictable ways. A given language rarely adds or deletes an entire sound, for example, but certain sound substitutions are likely to occur: a word with a “p” in the parent language may change into a “b” in the descendant language, but changing to a “k” is less likely due to the significant pronunciation gap.

By incorporating these and other linguistic constraints, say the researchers, they developed a decipherment algorithm that can handle the vast space of possible transformations and the scarcity of a guiding signal in the input. The algorithm learns to embed language sounds into a multidimensional space where differences in pronunciation are reflected in the distance between corresponding vectors.
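To make the idea of a sound space concrete, here is a minimal Python sketch. It is not the researchers' model, which learns its embedding from data; instead it assumes a hand-coded, two-dimensional feature vector per consonant (place of articulation and voicing). The table `SOUND_VECTORS` and the function `sound_distance` are hypothetical names introduced only for illustration.

```python
import math

# Assumption: hand-coded features stand in for the learned embedding.
# Each sound is (place of articulation, voicing).
# Place: 0 = labial, 1 = alveolar, 2 = velar. Voicing: 0 = voiceless, 1 = voiced.
SOUND_VECTORS = {
    "p": (0.0, 0.0),
    "b": (0.0, 1.0),
    "t": (1.0, 0.0),
    "d": (1.0, 1.0),
    "k": (2.0, 0.0),
    "g": (2.0, 1.0),
}


def sound_distance(a: str, b: str) -> float:
    """Euclidean distance between two sounds in the toy feature space."""
    return math.dist(SOUND_VECTORS[a], SOUND_VECTORS[b])


if __name__ == "__main__":
    # "p" and "b" differ only in voicing; "p" and "k" differ in place.
    print("p -> b:", sound_distance("p", "b"))
    print("p -> k:", sound_distance("p", "k"))
```

In this toy space, “p” sits closer to “b” than to “k,” mirroring the article's example of a likely versus an unlikely substitution.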

This design, say the researchers, enables them to capture pertinent patterns of language change and express them as computational constraints. The resulting model can segment words in an ancient language and map them to counterparts in a related language.
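As a rough illustration of segmenting and mapping, the sketch below splits an unsegmented string into chunks and pairs each chunk with its cheapest match in a known vocabulary. It swaps the researchers' learned model for a plain weighted edit distance plus dynamic programming, and every name and number in it (`FEATURES`, `INDEL_COST`, `segment`, the toy words) is an assumption made up for the example, not real linguistic data.

```python
# Assumption: toy (place, voicing) features; vowels are placed far from consonants.
FEATURES = {
    "p": (0, 0), "b": (0, 1),
    "t": (1, 0), "d": (1, 1),
    "k": (2, 0), "g": (2, 1),
    "a": (5, 1), "o": (6, 1),
}
# Assumption: adding or dropping a sound is rare, so it is penalized heavily.
INDEL_COST = 3.0


def sub_cost(x: str, y: str) -> float:
    """Substitution cost grows with the feature gap between two sounds."""
    (p1, v1), (p2, v2) = FEATURES[x], FEATURES[y]
    return float(abs(p1 - p2) + abs(v1 - v2))


def edit_cost(src: str, tgt: str) -> float:
    """Weighted edit distance (Wagner-Fischer) between two words."""
    prev = [j * INDEL_COST for j in range(len(tgt) + 1)]
    for i, s in enumerate(src, 1):
        cur = [i * INDEL_COST]
        for j, t in enumerate(tgt, 1):
            cur.append(min(
                prev[j] + INDEL_COST,         # drop a sound from src
                cur[j - 1] + INDEL_COST,      # add a sound from tgt
                prev[j - 1] + sub_cost(s, t)  # substitute (or match)
            ))
        prev = cur
    return prev[-1]


def segment(text: str, vocab: list[str], max_len: int = 4):
    """Return (total cost, [(chunk, matched word), ...]) for the cheapest split."""
    best = [(0.0, [])] + [(float("inf"), [])] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            chunk = text[start:end]
            word = min(vocab, key=lambda w: edit_cost(chunk, w))
            cost = best[start][0] + edit_cost(chunk, word)
            if cost < best[end][0]:
                best[end] = (cost, best[start][1] + [(chunk, word)])
    return best[len(text)]


if __name__ == "__main__":
    # Invented toy strings, not attested words in any language.
    print(segment("batatoko", ["pata", "doko"]))
```

With the invented input above, the cheapest split pairs “bata” with “pata” and “toko” with “doko,” since each pair differs by a single voicing change.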

The project builds on research from last year that deciphered the dead languages of Ugaritic and Linear B, the latter of which had previously taken humans decades to decode. A key difference, however, is that in the earlier project the team knew these languages were related to early forms of Hebrew and Greek, respectively.

With the new system, the relationship between languages is inferred by the algorithm. Identifying the related language, say the researchers, is one of the biggest challenges in decipherment. In the case of Linear B, it took several decades to discover the correct known descendant. For Iberian, scholars still cannot agree on the related language: some argue it is Basque, while others claim that Iberian doesn’t relate to any known language.

The proposed algorithm, say the researchers, can assess the proximity between two languages; when tested on known languages, it can even accurately identify language families. The researchers applied their algorithm to Iberian, considering Basque as well as less likely candidates from the Romance, Germanic, Turkic, and Uralic families. While Basque and Latin were closer to Iberian than the other languages, they were still too different to be considered related.
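One simple way to picture a proximity test is sketched below in Python. It is a stand-in, not the researchers' measure: it scores each candidate language by the average normalized edit distance from every lost-language word to its closest known word, then applies a cutoff. The function names, the `threshold` value, and the placeholder vocabularies `lang_A` and `lang_B` are all assumptions invented for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def proximity(lost_words: list[str], known_vocab: list[str]) -> float:
    """Mean normalized distance to the closest known word (lower = closer)."""
    total = 0.0
    for w in lost_words:
        total += min(levenshtein(w, k) / max(len(w), len(k)) for k in known_vocab)
    return total / len(lost_words)


def rank_candidates(lost_words, candidates, threshold=0.3):
    """Rank candidate languages; also report whether the closest one
    falls under an assumed cutoff for being called 'related'."""
    scores = {name: proximity(lost_words, vocab) for name, vocab in candidates.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1])
    related = ranked[0][1] <= threshold
    return ranked, related


if __name__ == "__main__":
    # Invented placeholder vocabularies, not real language data.
    lost = ["bata", "toko"]
    candidates = {"lang_A": ["pata", "doko"], "lang_B": ["miru", "senu"]}
    print(rank_candidates(lost, candidates))
    print(rank_candidates(lost, candidates, threshold=0.2))
```

The second call shows the situation described for Iberian: a candidate can rank closest among those tested and still fall short of the cutoff for being considered related.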

Looking ahead, the researchers hope to expand their work beyond connecting texts to related words in a known language, an approach referred to as “cognate-based decipherment.” This paradigm assumes that such a known language exists, but the example of Iberian shows that this is not always the case. The researchers’ new approach would involve identifying the semantic meaning of words, even if they don’t know how to read them.

“For instance,” says MIT Professor Regina Barzilay, “we may identify all the references to people or locations in the document which can then be further investigated in light of the known historical evidence. These methods of ‘entity recognition’ are commonly used in various text processing applications today and are highly accurate, but the key research question is whether the task is feasible without any training data in the ancient language.”

Ultimately, say the researchers, lost languages are more than a mere academic curiosity, and without them, we miss an entire body of knowledge about the people who spoke them. For more, see “Deciphering Undersegmented Ancient Scripts Using Phonetic Prior.”
