Seeking to understand language by learning from translations



MultiMT, one of CSC's Puhti supercomputer pilot projects, uses deep learning and neural machine translation to discover meaning that is not dependent on any single language.

Ever since the 1950s, serviceable machine translation has seemed to be just around the corner, something that would become reality within the next ten years. For a long time, however, it remained mostly a tool for preprocessing raw material and easing the work of the actual human translator. In the past decade, machine translation finally reached the point where the technology is directly useful to end users.

Machine translation can be based on a variety of methods, including linguistic rules, statistics, and neural networks trained on large amounts of data. Deep learning applications have developed rapidly in recent years and now power services such as Google Translate.

To truly compete with human translation, the machine needs to acquire natural language understanding beyond the capacity of fixed symbolic rules and simple statistics. This is also the goal of MultiMT, one of CSC's Puhti pilot projects based at the University of Helsinki.

Discovering language-independent meaning representations

MultiMT is based on the FoTran project (Found in Translation: Natural Language Understanding with Cross-lingual Grounding) led by Jörg Tiedemann, professor of language technology. By interpreting the semantics of over a thousand natural languages through parallel corpora (texts translated by humans), it aims to discover meaning representations that are not tied to any single language.
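To make the idea of a parallel corpus concrete, here is a minimal Python sketch. The sentence pairs are purely illustrative, and the length-ratio check is a common corpus-cleaning heuristic in general, not necessarily the one used in FoTran:

```python
# A toy parallel corpus: each pair holds a sentence and its human
# translation. Real training corpora contain millions of such pairs.
corpus = [
    ("The cat sat on the mat.", "Kissa istui matolla."),
    ("It is raining today.", "Tänään sataa."),
    ("Good morning!", "Hyvää huomenta!"),
]

def length_ratio_ok(src: str, tgt: str, max_ratio: float = 2.0) -> bool:
    """A common cleaning heuristic: discard pairs whose lengths differ
    too much, since they are often misaligned rather than translations."""
    shorter, longer = sorted((len(src), len(tgt)))
    return shorter > 0 and longer / shorter <= max_ratio

clean = [pair for pair in corpus if length_ratio_ok(*pair)]
print(len(clean))  # all three toy pairs pass the filter
```

Filters like this matter because parallel data is harvested at scale from the web, where misaligned or partial translations are frequent.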

Linguistic ambiguity is one of the main challenges of translation, human or machine. By covering such a large portion of the world's linguistic diversity, cross-lingual grounding aims to overcome this ambiguity through meaning representations that emerge from the source data.

Developing this model requires extensive computing resources that CSC's supercomputers can provide, especially the Puhti-AI artificial intelligence partition.

– Without the services provided by CSC, most of our work would not be possible. We require heavy computing and especially GPU-powered services. CSC is a great resource and enables the large-scale development we do. The IT support is also quick, very much appreciated, and has helped in many cases, Tiedemann says.

– We use CSC and its HPC facilities extensively. We run our development mainly on Puhti (formerly also on Taito) and we also rely on the Allas object storage for our data. Furthermore, we utilize the cPouta cloud services for web services and demos. The Grand Challenge pilot study was a good experience. We managed to run quite a lot of experiments and could start the large-scale development we continue working on now, Tiedemann continues.

Another essential product of the project was the OPUS-MT repository of pre-trained translation models that contains tools and resources for open translation services.

Puhti pilot projects were chosen in CSC's Grand Challenge call. Grand Challenge projects are scientific research projects that require computational resources beyond the usual project allocation; CSC's Scientific Customer Panel selects them based on their expected impact.

Read more about Puhti Supercomputer

Read more about Grand Challenge Calls


Tero Aalto