DeepFin: State-of-the-art natural language processing for Finnish



BERT (Bidirectional Encoder Representations from Transformers) is a natural language processing method originally developed by Google that enables a variety of language understanding tools. The technique was implemented for Finnish in DeepFin, one of CSC's 2019 Grand Challenge pilot projects.


Published in 2018, BERT quickly became a state-of-the-art approach to tasks such as general language understanding, question answering and natural language inference. At first it was available only for English. Google uses BERT in its search engine to better understand users' queries.

BERT is a bidirectional method, meaning that when processing text it looks both forward and backward in the sentence on every analysis layer. This gives the model a fuller picture of each word's context and leads to better predictions. Multilingual BERT models have also been developed, but single-language models typically perform better.
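
As an illustration of what bidirectional context buys, the minimal sketch below (not the project's own code) compares the vectors a Finnish BERT assigns to the ambiguous word "kuusi", which can mean either "spruce" or "six", in two different sentences. It assumes the Hugging Face transformers library, PyTorch and the TurkuNLP FinBERT checkpoint name TurkuNLP/bert-base-finnish-cased-v1; because every layer attends to the words on both sides, the same surface form ends up with clearly different embeddings.

```python
# A minimal sketch, not the DeepFin training code. Assumes the Hugging Face
# transformers library, PyTorch and the TurkuNLP FinBERT checkpoint name.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "TurkuNLP/bert-base-finnish-cased-v1"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def embedding_of(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual embedding of `word` within `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # shape: (tokens, hidden_size)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    # Assumes `word` is a single item in the model's vocabulary (not split into subwords).
    return hidden[tokens.index(word)]

# "kuusi" means "spruce" in the first sentence and "six" in the second.
spruce = embedding_of("Pihalla kasvaa suuri kuusi.", "kuusi")
number = embedding_of("Minulla on kuusi omenaa.", "kuusi")
print(torch.cosine_similarity(spruce, number, dim=0))  # noticeably below 1.0
```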

The bidirectionality is based on the Transformer deep learning architecture developed at Google, which models the contextual relations between words. BERT is pretrained with the masked language modeling approach: some of the words in each sentence are hidden, and the model is tasked with predicting them from the surrounding context.
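
In the original BERT recipe, roughly 15% of input tokens are masked during pretraining. The sketch below illustrates the same objective at inference time with a ready-made fill-mask pipeline; it again assumes the transformers library and the TurkuNLP/bert-base-finnish-cased-v1 checkpoint, and is an illustration rather than the project's actual pretraining code.

```python
# A minimal sketch of the masked language modeling idea. Real pretraining masks
# ~15% of tokens across a huge corpus, not one word in a single sentence.
from transformers import pipeline

# Assumed checkpoint name; any Finnish BERT model could be substituted.
fill = pipeline("fill-mask", model="TurkuNLP/bert-base-finnish-cased-v1")

# The model predicts the hidden word from the context on both sides of the mask.
for candidate in fill("Helsinki on Suomen [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))
# A plausible top prediction is "pääkaupunki" (capital); illustrative, not a verified output.
```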

A major challenge in implementing the method for other languages has been the sheer volume of text and the computational power required to train the model properly. Billions of words of Finnish text and supercomputing resources made it possible to create a Finnish language model that competes both with models based on other approaches and with BERT implementations for other languages.

– Training large-scale language models based on deep learning is computationally intensive, and training the FinBERT models would not have been possible without the GPU partition resources in Puhti, says Sampo Pyysalo, Associate Professor of Language and Speech Technology at the University of Turku.

– The models were computed on Puhti using eight Nvidia V100 GPUs across two nodes, and pretraining took approximately 12 days per model variant. CSC's computing resources were also used to preprocess text in the Finnish Internet Parsebank project, whose texts compiled from the Finnish internet provided the largest single data source for the training, Pyysalo continues.

– The Grand Challenge pilot was a very positive experience. There were very few technical difficulties, and we received fast and professional support. The pilot has also spawned a number of spin-off projects whose computations likewise run on Puhti, he concludes.

Since the pilot, the project has extended to training BERT models for many other languages as well.

GPU-based natural language processing will greatly benefit from CSC’s new LUMI supercomputing environment.

The Puhti pilot projects were chosen in CSC's Grand Challenge call. Grand Challenge projects are scientific research projects that require computational resources beyond the usual project allocation; CSC's Scientific Customer Panel selects them in a Grand Challenge call based on their expected impact.

The TurkuNLP research group

Read more about the Puhti supercomputer

Read more about Grand Challenge calls


Tero Aalto