This project is no longer available

Norwegian-English machine translation


Neural machine translation (NMT, or simply MT) is one of the most researched subfields of NLP these days, which is unsurprising given its direct applicability in practice. MT has improved greatly in recent years and approaches human performance in some domains. Unfortunately, there is currently no academic initiative focusing specifically on Norwegian (Bokmål or Nynorsk) machine translation. A good MT system does not only satisfy academic curiosity; it is also valuable for the general public: your work can improve our existing open translation service and even compete with closed services such as Google Translate. This thesis may address Norwegian-to-English MT, English-to-Norwegian MT, or both directions simultaneously, preferably for both written varieties of Norwegian.

The first part of this thesis will focus on gathering, cleaning and analyzing available Norwegian-English parallel data. The Open Parallel Corpus (OPUS) will be an especially valuable resource for this task. However, some of its datasets are of very low quality and require careful filtering to obtain reasonable translation quality. In addition, OPUS has 5 different language codes for Norwegian, and the distinction between Bokmål and Nynorsk is not reliable. The analysis of available resources will be an invaluable contribution to Norwegian NLP. An important part of the thesis will also consist of identifying (or compiling) test sets for evaluating Norwegian-English machine translation.
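To give a feeling for what "careful filtering" can mean in practice, here is a minimal sketch of a rule-based filter for sentence pairs. It is only an illustration, not a prescribed method: the function name `filter_pairs` and all thresholds (`max_ratio`, `min_chars`, `max_chars`) are assumptions made for this example, and a real pipeline would likely add language identification and more careful heuristics.

```python
from typing import Iterable, Iterator, Tuple


def filter_pairs(
    pairs: Iterable[Tuple[str, str]],
    max_ratio: float = 2.5,   # assumed threshold, tune on real data
    min_chars: int = 3,       # assumed threshold
    max_chars: int = 500,     # assumed threshold
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs that pass a few simple sanity checks."""
    seen = set()
    for src, tgt in pairs:
        # Normalize whitespace so that trivial variants compare equal.
        src = " ".join(src.split())
        tgt = " ".join(tgt.split())

        # Drop segments that are empty, very short or very long.
        if not (min_chars <= len(src) <= max_chars):
            continue
        if not (min_chars <= len(tgt) <= max_chars):
            continue

        # Drop pairs whose lengths differ too much (likely misaligned).
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_ratio:
            continue

        # Drop pairs where the "translation" is a copy of the source.
        if src.casefold() == tgt.casefold():
            continue

        # Drop exact duplicates (case-insensitive).
        key = (src.casefold(), tgt.casefold())
        if key in seen:
            continue
        seen.add(key)

        yield src, tgt
```

Calling `list(filter_pairs(pairs))` on a list of `(Norwegian, English)` tuples returns the pairs that survive all checks, in their original order.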

After this initial part, there are two ways to develop the topic further:

  1. We can delve deeper into the data question and evaluate the possibility of utilizing monolingual data, perhaps via backtranslation or via fine-tuning monolingual language models.
  2. We can train a translation model using state-of-the-art deep learning techniques and then perform a detailed error analysis of its outputs: are they fluent, and do they preserve the original meaning? What are the linguistic phenomena where the model often struggles? How does it perform in different domains? Automatic evaluation metrics are often misleading, and it is important to explore the limits of a "black box" neural model. These findings will also be very useful for future improvements of Norwegian translation models.
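As a rough illustration of the backtranslation idea mentioned in the first option, the sketch below builds synthetic training pairs from monolingual target-side text. Everything named here is hypothetical: `backtranslate` is an illustrative helper, `reverse_translate` stands in for any existing target-to-source model, and the `<BT>` tag marking synthetic sources is one optional convention, not a requirement of the project.

```python
from typing import Callable, Iterable, List, Tuple


def backtranslate(
    monolingual_target: Iterable[str],
    reverse_translate: Callable[[str], str],  # hypothetical target-to-source model
    tag: str = "<BT>",                        # assumed marker for synthetic sources
) -> List[Tuple[str, str]]:
    """Pair machine-translated sources with genuine target sentences."""
    pairs = []
    for tgt in monolingual_target:
        tgt = tgt.strip()
        if not tgt:
            continue  # skip empty lines
        synthetic_src = reverse_translate(tgt).strip()
        if not synthetic_src:
            continue  # skip sentences the reverse model could not translate
        # The target side is human-written; only the source side is synthetic.
        pairs.append((f"{tag} {synthetic_src}", tgt))
    return pairs


# Toy stand-in for a real English-to-Norwegian model, for demonstration only.
toy_model = {"I like coffee.": "Jeg liker kaffe."}
synthetic = backtranslate(
    ["I like coffee.", "   ", "Unknown sentence."],
    lambda sentence: toy_model.get(sentence, ""),
)
```

The resulting pairs could then be mixed with genuine parallel data when training a Norwegian-to-English model; how much synthetic data helps is exactly the kind of question this option would investigate.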

Please do not hesitate to contact David if you have any questions or you simply want to hear more about this project.
 

Prerequisites: Fluency in Norwegian and English, linguistic curiosity, basic Python skills.

 

Published Sep. 26, 2022 01:40 - Last modified Nov. 13, 2023 09:59

Supervisor(s)

Scope (credits)

60