This project is no longer available

Data-Augmentation for Sentiment Analysis

Sentiment analysis (SA) is the task of detecting positive and negative opinions expressed in text, and it is one of the most widely used applications of Natural Language Processing. SA can be carried out at many levels of granularity, e.g., the level of documents, sentences, or individual expressions. The SANT project has created resources for all these levels of analysis for Norwegian texts, based on arts and consumer reviews gathered from online news sources.


One important challenge in SA, shared with many other areas of NLP, is the limited availability of labeled training data. At the same time, we know that the number of training examples is the most important driver of model performance. The problem, of course, is that creating labeled data is typically a manual process requiring human experts, which makes it costly in terms of time, effort, and money. As a result, the creation of labeled training data represents a major bottleneck for supervised learning.

In this project we will explore different strategies for data augmentation, i.e., ways to automatically generate additional labeled training data, e.g., by modifying existing examples or by otherwise creating synthetic data (for instance through the use of generative language models). We will explore the use of data augmentation to improve the performance of SA at several levels of analysis, e.g., document-level, sentence-level, or structured / fine-grained SA. The starting point for experiments will be the Norwegian Review Corpus (NoReC), comprising more than 43,000 full-text reviews from a range of different domains, including literature, movies, video games, restaurants, music, products, etc. The original ratings assigned by the professional reviewers, on a scale of 1–6 (as represented by the dots of a die), can be used as labels for training supervised classifiers to predict document-level sentiment. NoReC_fine contains a subset of the reviews that have additionally been annotated with fine-grained information like polar expressions, sentiment targets, etc.
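As a rough illustration of the "modifying existing examples" idea, the sketch below generates extra labeled examples by randomly swapping and deleting words, with each copy inheriting the label of its source text. It is a minimal sketch, not the project's method: the function names, the parameters (`n_copies`, `p_delete`, `n_swaps`), and the two example reviews are hypothetical, and whitespace tokenization is assumed for simplicity.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Return a copy of tokens with n_swaps random pairs of positions exchanged."""
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_deletion(tokens, p_delete, rng):
    """Drop each token with probability p_delete, always keeping at least one."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p_delete]
    return kept if kept else [rng.choice(tokens)]


def augment(examples, n_copies=2, p_delete=0.1, n_swaps=1, seed=0):
    """Create n_copies perturbed variants of each (text, label) pair.

    Each variant keeps the label of the example it was derived from
    (an assumption that the perturbation does not change the sentiment).
    """
    rng = random.Random(seed)  # fixed seed for reproducible output
    augmented = []
    for text, label in examples:
        tokens = text.split()  # assumed: simple whitespace tokenization
        if not tokens:
            continue
        for _ in range(n_copies):
            swapped = random_swap(tokens, n_swaps, rng)
            perturbed = random_deletion(swapped, p_delete, rng)
            augmented.append((" ".join(perturbed), label))
    return augmented


# Hypothetical reviews with die-roll ratings (1-6) as labels.
examples = [("en veldig god film", 5), ("kjedelig og lang bok", 2)]
for text, label in augment(examples, n_copies=3):
    print(label, text)
```

Word-level perturbations like these can change the meaning of a text, for example by deleting a negation, so the assumption that the label is preserved would need to be checked empirically.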

Good programming skills, experience with machine learning, and a solid background in NLP are relevant qualifications.

Published 16 Sep. 2022 13:12 - Last modified 7 Dec. 2022 14:54

Supervisor(s)

Erik Velldal Universitetet i Oslo
