NLP resources for Norwegian

An important priority for LTG in recent years has been to create NLP resources for the Norwegian language, in terms of both models and datasets. This page provides an overview of our existing and ongoing projects to support Norwegian NLP.

Language models and embeddings

LTG has released several pre-trained large language models for Norwegian, based on the GPT-like BLOOM and Mistral architectures, each with 7 billion parameters. The models are freely available on Hugging Face under NORA.LLM.

Through the NorLM initiative, LTG has previously provided several large-scale language models for Norwegian, including encoder models based on the ELMo and BERT architectures and encoder-decoders like NorT5. Furthermore, the NLPL vector repository contains more than 70 pre-trained word embedding models for Norwegian, in addition to models for several other languages.

Pre-processing and parsing

The Norwegian Dependency Treebank (NDT) contains annotations of lemmas, part-of-speech tags, dependency relations, morphological features, and more. The treebank comprises both Nynorsk and Bokmål, and there is also a version adapted to the format of Universal Dependencies. NDT can be used to train and evaluate models for text pre-processing of Norwegian, and pre-trained models are provided in tools like Stanza, UDPipe, and spaCy.
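To give a sense of what the Universal Dependencies version looks like in practice, here is a minimal sketch of reading the ten-column CoNLL-U format used by UD treebanks. The sample sentence is hand-written for illustration and is not taken from NDT; the function names parse_conllu and parse_feats are likewise just names chosen for this example.

```python
# Column names of the CoNLL-U format, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")

# Hypothetical, hand-written fragment for illustration only (not from NDT).
SAMPLE = "\n".join([
    "# text = Hun leser boka.",
    "1\tHun\thun\tPRON\t_\tCase=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Prs\t2\tnsubj\t_\t_",
    "2\tleser\tlese\tVERB\t_\tMood=Ind|Tense=Pres|VerbForm=Fin\t0\troot\t_\t_",
    "3\tboka\tbok\tNOUN\t_\tDefinite=Def|Gender=Fem|Number=Sing\t2\tobj\t_\tSpaceAfter=No",
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
    "",
])


def parse_feats(value):
    """Turn 'A=x|B=y' into {'A': 'x', 'B': 'y'}; '_' means no features."""
    if value == "_":
        return {}
    return dict(pair.split("=", 1) for pair in value.split("|"))


def parse_conllu(text):
    """Return a list of sentences, each a list of token dictionaries."""
    sentences, tokens = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if tokens:
                sentences.append(tokens)
                tokens = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (e.g. '# text = ...') are skipped here.
            continue
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        token = dict(zip(FIELDS, columns))
        # Skip multiword-token ranges ('1-2') and empty nodes ('1.1').
        if not token["id"].isdigit():
            continue
        token["id"] = int(token["id"])
        token["head"] = int(token["head"]) if token["head"].isdigit() else None
        token["feats"] = parse_feats(token["feats"])
        tokens.append(token)
    if tokens:
        sentences.append(tokens)
    return sentences


if __name__ == "__main__":
    for sentence in parse_conllu(SAMPLE):
        for tok in sentence:
            print(tok["id"], tok["form"], tok["lemma"], tok["upos"],
                  tok["head"], tok["deprel"])
```

For real work, the pre-trained pipelines mentioned above or a dedicated CoNLL-U library are likely a better choice; the sketch only shows how lemmas, part-of-speech tags, morphological features, and dependency relations sit side by side in one token line.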

Named Entity Recognition

The NorNE corpus adds annotations of named entities to the texts in NDT (both Bokmål and Nynorsk). Pre-trained models based on NorNE are available in spaCy.

Sentiment analysis

The SANT project has created several resources for sentiment analysis for Norwegian. The starting point is the Norwegian Review Corpus (NoReC) – a multi-domain corpus of professional reviews, each with a rating on a scale of 1–6 that can be used for modeling document-level polarity. A subset of NoReC has been further annotated for fine-grained or structural sentiment analysis in NoReC_fine. These annotations have also been aggregated to the sentence level in NoReC_sentence. There is also an automatically created sentiment lexicon available: NorSentLex.
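As one illustration of how a 1–6 rating could be turned into document-level polarity labels, the sketch below bins ratings into three classes. The thresholds (1–2 negative, 3–4 neutral, 5–6 positive) are an assumption made for this example, not a mapping prescribed by NoReC, and the ratings in the usage part are made up.

```python
from collections import Counter


def rating_to_polarity(rating, negative_max=2, positive_min=5):
    """Map a 1-6 review rating to a coarse polarity label.

    The default thresholds are an assumption for illustration:
    ratings <= negative_max are 'negative', ratings >= positive_min
    are 'positive', and everything in between is 'neutral'.
    """
    if not isinstance(rating, int) or not 1 <= rating <= 6:
        raise ValueError(f"rating must be an integer from 1 to 6, got {rating!r}")
    if rating <= negative_max:
        return "negative"
    if rating >= positive_min:
        return "positive"
    return "neutral"


if __name__ == "__main__":
    # Hypothetical ratings, not taken from the corpus.
    ratings = [1, 3, 4, 5, 6, 2, 4]
    labels = [rating_to_polarity(r) for r in ratings]
    print(Counter(labels))
```

Passing negative_max=3 and positive_min=4 gives a binary split instead; which binning is appropriate depends on the task and on how the ratings are distributed.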

Other resources

Negation: The NoReC_neg dataset adds annotations of negation cues and their scopes to the texts in NoReC_fine.
Gender analysis: For the subset of book reviews in NoReC, the dataset NoReC_gender annotates the genders of both critics and book authors.
Co-reference: In the ongoing NARC project, a collaboration between LTG, the National Library and the Text Laboratory, the texts of NDT are annotated for anaphora and co-reference.
Semantic change: The NorDiaChange dataset contains manual annotation of diachronic semantic change for a set of Norwegian words across different time periods.
Dialect modeling: NorDial is a preliminary corpus for Norwegian dialect classification, annotating a set of tweets as Bokmål, Nynorsk, Dialectal, or Mixed.
Parliamentary speeches: The Talk-of-Norway (ToN) corpus comprises parliamentary speeches from 1998–2016, with a rich set of metadata, including party affiliations and gender.
Synonyms: The Norwegian Synonymy Test Set was created by extracting words and synonyms from the digital version of Kunnskapsforlaget's Norske synonymer blåordbok, and can be used for evaluating synonym detection.
Analogies: The Norwegian Analogy Test Set was created by semi-automatically translating and adapting the existing Google analogies test set from English to Norwegian, and is intended for evaluating analogical reasoning.
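
Analogy test sets of this kind are commonly used to evaluate word embeddings with a vector-offset method: for a question "a is to b as c is to ?", the predicted answer is the vocabulary word closest to b - a + c. The sketch below shows that evaluation loop under stated assumptions: the three-dimensional vectors and the two questions are toy values invented for this example, not real embeddings or items from the test set, and the function names are hypothetical.

```python
import math

# Toy 3-dimensional "embeddings", invented for illustration only.
VECTORS = {
    "mann": [1.0, 0.0, 0.0],
    "kvinne": [1.0, 1.0, 0.0],
    "konge": [1.0, 0.0, 1.0],
    "dronning": [1.0, 1.0, 1.0],
    "gutt": [1.0, 0.0, -1.0],
    "jente": [1.0, 1.0, -1.0],
}

# Each question is (a, b, c, expected): a is to b as c is to expected.
QUESTIONS = [
    ("mann", "kvinne", "konge", "dronning"),
    ("mann", "kvinne", "gutt", "jente"),
]


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def solve_analogy(a, b, c, vectors):
    """Return the word whose vector is most similar to b - a + c.

    The three query words are excluded from the candidates, as is usual
    in this kind of evaluation.
    """
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(questions, vectors):
    """Fraction of questions where the predicted word equals the expected one."""
    correct = sum(solve_analogy(a, b, c, vectors) == expected
                  for a, b, c, expected in questions)
    return correct / len(questions)


if __name__ == "__main__":
    for a, b, c, expected in QUESTIONS:
        print(a, b, c, "->", solve_analogy(a, b, c, VECTORS), "(expected:", expected + ")")
    print("accuracy:", analogy_accuracy(QUESTIONS, VECTORS))
```

With real embeddings, the vectors would be loaded from a trained model and questions involving out-of-vocabulary words would need to be skipped or counted as errors, a choice that affects the reported accuracy.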

Tags: NLP, Natural Language Processing, language technology, language models, corpora, Machine Learning

Published Mar. 28, 2022 1:49 PM - Last modified Mar. 13, 2024 8:51 AM