Published 2023-06-08

Schibsted’s key role in developing a Norwegian language model

This summer the first version of a generative language model in Norwegian will be ready. Schibsted has been heavily involved in making it happen.

Schibsted Chief Data & Technology Officer Sven Størmer Thaulow.

ChatGPT has been the big buzzword in recent months, especially because of its ability to generate natural, human-like language – even in Scandinavian languages. ChatGPT is based on a large language model developed by OpenAI, and never before has a new consumer application grown so fast in number of users.

“But now we need a Norwegian language model, built mainly on Norwegian text,” says Schibsted Chief Data & Technology Officer Sven Størmer Thaulow.

Invited media companies to work together

At the Nordic Media Days in Bergen in May, he invited all media companies in Norway to contribute content to the work of building a solid Norwegian language model as a local alternative to ChatGPT. The response was overwhelmingly positive.

Sven is currently the chair of the Norwegian Research Center for AI Innovation (NorwAI) at NTNU (the Norwegian University of Science and Technology) in Trondheim. Schibsted is one of several industrial partners of NorwAI and contributes both expertise and data.

One of the big projects at NorwAI is to build a generative language model for the Norwegian language. The work has been ongoing for two years, and the first version will be launched this summer. Schibsted has contributed thousands of articles for the model to be trained on. For now, this is a non-commercial research project, and Schibsted will be among the first to test how well it performs compared to the big American models.

But the ambition is larger, and the plan is now to build a language model that is almost twice as big, with about 40 billion parameters. This will require two times the amount of content compared to the first version, which has 23 billion parameters.

Graphic illustration of the Norwegian language, created with DALL-E

Why a Norwegian language model is needed

Why is a Norwegian language model needed when ChatGPT apparently works quite well in all the Scandinavian languages?

Sven gave the media leaders in Bergen three main reasons:

Reason 1: Better in Norwegian

Firstly, a model trained primarily on content in the Norwegian language will likely also be much better in Norwegian. By comparison, well below 1% of the content ChatGPT was trained on was in Norwegian.

Reason 2: Control over our own infrastructure

Secondly, it is important to have control over our own infrastructure. Artificial intelligence is already turning into a global industrial political race. It is not obvious that the technology will be democratised. Therefore we must develop our own infrastructure in the Nordic countries to stay in control. This requires effort and cooperation from private corporations in many sectors as well as government authorities.

Reason 3: Consistent with Norwegian culture

Thirdly, we need language models that are consistent with Norwegian culture and world views, and not dominated by American perspectives. Take our children as an example. For them, ChatGPT is already becoming something of a “personalized textbook”. Just as our countries have always taken responsibility for the textbooks we offer our children, it will now be essential to ensure that language models are consistent with the values our societies are built upon. This matters because language models tend to inherit the values – and prejudices – inherent in the content they have been trained on, and in the people who have trained and aligned the model after the raw model has been created by the machines.

Huge effort to build a language model

“We have much to win by working together across media companies, other private corporations, and the government authorities to develop a Norwegian infrastructure for AI,” was Sven’s main message.

But building a large language model is a huge effort. It requires large amounts of text, specialized competence and enormous computing power.

“We need content that is representative of the full Norwegian society, from news articles, simple chats, government documents, court verdicts – to the most beautiful novel,” Sven said.

Will create a lot of value

A successful Norwegian language model can be a shared resource that will create lots of value for Norwegian society as a whole, for companies like Schibsted and others, and for individuals, Sven points out.

And that is also an important reason why Schibsted has chosen to take a proactive role in working together with academic experts in developing an AI infrastructure for Norway.

Schibsted and NorwAI also collaborate with their Swedish counterpart, AI Sweden, as well as the German research institute Fraunhofer, to create an LLM for Germanic languages.