
OpenEuroLLM
Open European Family of Large Language Models
The OpenEuroLLM project will build an Open European Family of Large Language Models (OpenEuroLLM), publishing the data, training sources and models openly. The models will cover all official European languages as well as other socially and economically important ones.
The models and other outputs will be distributed so that they are easy and cheap to use for further development, fine-tuning or any other purpose, especially by small and medium-sized enterprises (SMEs) in Europe. The transparent and compliant open-source models will democratise access to high-quality AI technologies. They will also strengthen the ability of European companies to compete in the global market and of public organisations to produce impactful public services.
The OpenEuroLLM consortium brings together leading academic and commercial researchers and developers with a proven track record, positioning it to establish the procedures and workflows needed for this ambitious goal.
CSC is participating in the project task that will collect and categorise existing training datasets for all targeted languages; these datasets will later be curated. CSC will also participate in technical optimisation and distributed training on HPC resources. This will set the stage for a code base with strong scalability across thousands of GPUs, ensuring that it works on the LUMI supercomputer as well as on other relevant HPC systems.
In addition, CSC will play a role in deriving scaling laws by training models at smaller scales. This involves comparing multiple model architectures, such as dense transformers, hybrid models and mixtures of experts, as well as hyperparameters such as the learning rate and the choice of optimiser. CSC will then contribute to scaling the selected architectures up to full capacity, aiming to create strong base models with the generalisation and transfer capabilities predicted by the derived scaling laws.
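To illustrate the idea behind deriving scaling laws from small-scale runs, the sketch below fits a simple power law to hypothetical (model size, loss) pairs and extrapolates it to a larger model. The data points, the single-term power-law form and the target size are illustrative assumptions, not OpenEuroLLM measurements or the project's actual methodology.

```python
import numpy as np

# Hypothetical small-scale results (illustrative only): model size in
# parameters vs. final validation loss of each training run.
model_sizes = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
losses = np.array([4.20, 3.70, 3.26, 2.87, 2.53])

# Assume a power law L(N) = a * N^(-b). In log space this becomes
# log L = log a - b * log N, so ordinary least squares recovers a and b.
slope, intercept = np.polyfit(np.log(model_sizes), np.log(losses), 1)
a, b = np.exp(intercept), -slope

# Extrapolate the fitted law to a larger model size to predict what a
# full-scale run might achieve (hypothetical target size).
predicted = a * (1e10) ** (-b)
print(f"fitted exponent b = {b:.3f}, predicted loss at 1e10 params = {predicted:.2f}")
```

In practice, published scaling-law work fits richer forms (e.g. with an irreducible loss floor, or joint dependence on parameters and training tokens), but the workflow is the same: measure many cheap small runs, fit the trend, and use it to choose the configuration for the expensive full-scale run.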
