Sentiment Analysis for Norwegian Text

The SANT project develops resources for Sentiment Analysis for Norwegian Text. While coordinated by the Language Technology Group (LTG) at IFI/UiO, collaborating partners include NRK, Schibsted and Aller Media.

Sentiment Analysis (SA)

One of the most widely used applications of Language Technology (LT) in recent years is opinion mining, also known as sentiment analysis (SA). In broad terms, SA is the task of automatically identifying opinions or attitudes in text, defined as subjective expressions of positive or negative polarity.

SANT

The goal of SANT is to create open resources for sentiment analysis for Norwegian. The project is a collaboration between the Language Technology Group (LTG) at the Department of Informatics at the University of Oslo and three of Norway's largest media groups: the public broadcaster NRK/P3 and the privately held Schibsted Media Group and Aller Media. The media partners provide data in the form of reviews, collected across a range of different domains: music, literature, restaurants, home electronics, and more. As reviews are by definition packed with subjective opinions and evaluations, they are ideally suited for sentiment analysis.

Below we describe some of the main resources created in the project.

Document-level SA: Reviews as training data

The Norwegian Review Corpus (NoReC) is a collection of reviews across a wide range of domains. We take advantage of a peculiarity of the way reviews and critiques are typically summarized in Norwegian arts and consumer journalism, namely by an explicit rating on a scale of 1–6, represented as a throw of a die. Treating these ratings as labels of overall text polarity, we can train and evaluate machine-learned models for sentiment analysis at the document level. You can find further documentation in the associated GitHub and HuggingFace repositories.
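As a minimal sketch of how die ratings can serve as document labels, the snippet below collapses the six ratings into three coarse polarity classes. The cut-offs (1–2 negative, 3–4 neutral, 5–6 positive), the function name and the example reviews are assumptions made for illustration only; the six ratings could equally well be used directly as classes.

```python
def rating_to_polarity(rating: int) -> str:
    """Map a die rating (1-6) to a coarse polarity label.

    The cut-offs are an assumption made for this sketch.
    """
    if not 1 <= rating <= 6:
        raise ValueError(f"die rating must be between 1 and 6, got {rating}")
    if rating <= 2:
        return "negative"
    if rating <= 4:
        return "neutral"
    return "positive"


# Hypothetical (text, rating) pairs standing in for real reviews.
reviews = [
    ("En fantastisk plate.", 6),
    ("Helt grei, men ikke mer.", 3),
    ("Helt mislykket.", 1),
]

# Pair each review text with its derived polarity label.
labelled = [(text, rating_to_polarity(rating)) for text, rating in reviews]
```

The resulting (text, label) pairs have the shape expected by most document-level text classifiers.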

Fine-grained SA

For some applications, however, it is desirable to have models that can make more granular predictions at the (sub-)sentence level, by identifying the individual polar expressions as well as the targets and holders of the opinions. To enable such models, a subset of the review corpus has been manually annotated with fine-grained and 'structured' in-sentence polarity information, resulting in a dataset dubbed NoReCfine, as described in our paper A Fine-grained Sentiment Dataset for Norwegian by Øvrelid et al. 2020. For more information, please see the dedicated GitHub repository.

We have also made a pre-trained model available on the Hugging Face model hub, based on the approach we described in the paper Direct Parsing to Sentiment Graphs, by Samuel et al. 2022. You can also try the model in an online demo at Hugging Face.

We have also created a simplified version of this data where only the target expressions and their associated polarities are retained, corresponding to so-called targeted SA. See the corresponding GitHub repository for more information.
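The relation between the structured annotations and the targeted simplification can be sketched as follows. The class and field names are hypothetical and do not reflect the actual file format of the datasets; the sketch only shows the idea of keeping targets and polarities while dropping holders and polar expressions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Opinion:
    """One structured opinion inside a sentence (field names are hypothetical)."""
    holder: Optional[str]   # who holds the opinion; often left implicit
    target: Optional[str]   # what the opinion is about; may be implicit
    expression: str         # the polar expression itself
    polarity: str           # "positive" or "negative"


def to_targeted(opinions: List[Opinion]) -> List[Tuple[str, str]]:
    """Keep only (target, polarity) pairs for opinions with an explicit target."""
    return [(o.target, o.polarity) for o in opinions if o.target is not None]


# Invented example: "The waiter was friendly but the soup was cold."
opinions = [
    Opinion(holder=None, target="waiter", expression="friendly", polarity="positive"),
    Opinion(holder=None, target="soup", expression="cold", polarity="negative"),
]
targeted = to_targeted(opinions)
```

Here `targeted` contains one pair per explicit target, which is the information a targeted SA model is trained to predict.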

Sentence-level SA

We have also created a simplified version of the dataset above that allows for training sentence-level polarity classifiers, NoReCsentence. We have made available datasets for three different configurations of labels: binary ('positive'/'negative'), ternary (includes 'neutral') and multi-label (like ternary, but taking account of mixed polarity by allowing sentences to be both positive and negative at the same time). For more information, please see the GitHub repo.
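The three label configurations can be illustrated with a small sketch that derives each of them from the set of polarities found in a sentence. The function names are hypothetical, and the treatment of mixed sentences in the binary and ternary configurations (here they are dropped, signalled by None) is an assumption of this sketch rather than a description of how the released datasets were built.

```python
from typing import List, Optional, Set


def multi_label(polarities: Set[str]) -> List[str]:
    """A sentence may be positive, negative, both, or (if neither) neutral."""
    labels = [p for p in ("positive", "negative") if p in polarities]
    return labels or ["neutral"]


def ternary(polarities: Set[str]) -> Optional[str]:
    """Single label out of positive/negative/neutral; mixed sentences yield None."""
    labels = multi_label(polarities)
    return labels[0] if len(labels) == 1 else None


def binary(polarities: Set[str]) -> Optional[str]:
    """Like ternary, but neutral sentences are also left out (None)."""
    label = ternary(polarities)
    return None if label == "neutral" else label
```

For a sentence with both a positive and a negative expression, only the multi-label configuration retains it, with both labels.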

Sentiment and gender

NoReCgender comprises only book reviews from NoReC, expanded with annotations of the gender of both the book authors and the critics (review authors). This has enabled analysis of the role of gender in sentiment and reviewing. This dataset is described in the paper Gender and Sentiment, Critics and Authors: a Dataset of Norwegian Book Reviews by Touileb et al. 2020.

A data analysis shows that female critics review female authors more harshly than what we find for other gender configurations. Moreover, experimental results show that models can be trained to predict the gender of not just the book authors, but also the critics, indicating that there are indeed differences in the language being used (even when controlling for gender cues in the text). These follow-up studies are described in the paper by Touileb et al. (2021), Using Gender- and Polarity-Informed Models to Investigate Bias, and in the MSc thesis by Tellef Seierstad (2023), Analyzing Gender and Sentiment in Norwegian Book Reviews.

Sentiment lexicon

A sentiment lexicon is simply a list of potentially sentiment-bearing words and their prior positive/negative polarity. While such context-independent polarity values obviously have several shortcomings, the simplicity and transparency of lexicon-based approaches to SA still make them attractive for many applications. NorSentLex is a Norwegian sentiment lexicon semi-automatically created on the basis of the English lexicon generated by Hu and Liu (2004). The lexicon was introduced in our paper Lexicon Information in Neural Sentiment Analysis: A Multi-Task Learning Approach by Barnes et al. 2019.
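To illustrate why lexicon-based approaches are considered simple and transparent, here is a minimal sketch of a lexicon-based scorer. The word lists are toy entries invented for this example and are not taken from NorSentLex; the tokenization is deliberately naive.

```python
# Toy entries for illustration only; they are not taken from NorSentLex.
POSITIVE = {"god", "flott", "fin"}
NEGATIVE = {"kjedelig", "svak", "trist"}


def lexicon_score(text: str) -> int:
    """Count positive minus negative lexicon hits in a whitespace-tokenized text."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    positives = sum(t in POSITIVE for t in tokens)
    negatives = sum(t in NEGATIVE for t in tokens)
    return positives - negatives


# One positive hit ("god") and two negative hits ("svak", "kjedelig").
score = lexicon_score("Maten var god, men servicen var svak og kjedelig.")
```

A score above zero suggests positive polarity and a score below zero negative polarity. Because each word contributes independently of its context, a scorer like this exhibits exactly the shortcomings mentioned above.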

Language models

Language Models (LMs) are an important cornerstone of current NLP: rather than starting from scratch when training models for specific applications, like sentiment analysis, we build on the knowledge already embedded in LMs pre-trained on vast amounts of raw text. An important contribution of the SANT project has been the development of the (then) first LMs for Norwegian, based on the well-known transformer architectures BERT and T5. For more information, see the HuggingFace repositories for NorBERT and NorT5.

An essential part of developing LMs is being able to systematically evaluate and compare their performance on different downstream tasks. The SANT project has also been part of creating NorBench, a test suite for benchmarking Norwegian LMs on a range of different tasks, where several of our NoReC-derived datasets are included.

Negation

Of the many compositional effects in language that can impact sentiment, negation is arguably the most important and well-studied. (To provide a simple example, in a sentence like ‘the food could hardly be called tasty’, the word ‘hardly’ is a negator, scoping over the polar expression ‘tasty’, thereby flipping its polarity from positive to negative.) In NoReCneg we expanded our fine-grained SA dataset with manual annotation of negation cues and their related scopes, comprising the first dataset for negation in Norwegian.
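The effect of a negation scope on polarity can be sketched in a few lines. The tiny prior-polarity table, the function name and the exact extent of the scope are assumptions made for this illustration; they do not reflect the annotation format of the dataset.

```python
from typing import List, Set

# Hypothetical prior polarities: +1 for positive words, -1 for negative words.
PRIOR = {"tasty": 1, "bland": -1}


def polarity_with_negation(tokens: List[str], scope: Set[int]) -> int:
    """Sum prior polarities, flipping the sign of tokens inside a negation scope.

    `scope` holds the indices of the tokens covered by a negation cue.
    """
    total = 0
    for i, token in enumerate(tokens):
        prior = PRIOR.get(token.lower(), 0)
        total += -prior if i in scope else prior
    return total


tokens = "the food could hardly be called tasty".split()
cue = tokens.index("hardly")
# Assumed scope for this sketch: every token following the cue.
scope = set(range(cue + 1, len(tokens)))

with_negation = polarity_with_negation(tokens, scope)
without_negation = polarity_with_negation(tokens, set())
```

With the scope taken into account the sentence comes out as negative, whereas ignoring it leaves the positive prior of 'tasty' untouched.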

Financing

The project is funded through the IKTPLUSS initiative of the Research Council of Norway (RCN) until 2024.

Tags: Sentiment Analysis, Language Technology, Natural Language Processing, Machine Learning, NLP, AI, Artificial intelligence, deep learning, data science
Published June 12, 2017 11:12 PM - Last modified June 17, 2024 10:05 AM
