This topic is no longer available

Language models and linguistic representativeness: a cross-lingual perspective

This topic has now been merged with Computational linguistic typology.

Large-scale language models based on deep learning architectures (such as ELMo, BERT, or GPT-3) are the backbone of modern natural language processing. They are very successful at solving many language-related tasks after training on huge amounts of textual data. However, it is still not clear whether their generalization abilities will ever match those of human children.

Language models obviously benefit from increasing amounts of training data. As the size of the training corpus grows, the model becomes more representative of the language in question and generalizes better to different tasks. These scaling effects have recently been studied for English, but not for most other languages of the world. The purpose of this Master's thesis is to fill this gap.
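
To illustrate what such a data-size study could look like, here is a minimal sketch (with purely hypothetical numbers) that fits a saturating power-law learning curve to benchmark scores measured at increasing training-corpus sizes, and then extrapolates how much data would be needed to reach a target score. The functional form, the threshold, and all data points are illustrative assumptions, not results.

```python
# A minimal sketch: fit a saturating learning curve to (corpus size, score)
# measurements and estimate the data needed to reach a target score.
# All numbers below are hypothetical, for illustration only.
import numpy as np
from scipy.optimize import curve_fit

def saturating(n, a, b, c):
    """Score as a function of corpus size n: a - b * n**(-c) (power-law saturation)."""
    return a - b * n ** (-c)

# Hypothetical measurements: corpus sizes (millions of tokens) and benchmark scores.
sizes = np.array([10, 30, 100, 300, 1000, 3000], dtype=float)
scores = np.array([0.52, 0.61, 0.70, 0.76, 0.81, 0.84])

params, _ = curve_fit(saturating, sizes, scores, p0=[0.9, 1.0, 0.3], maxfev=10000)
a, b, c = params

target = 0.80  # an arbitrary threshold standing in for "representative"
# Invert the fitted curve: a - b * n**(-c) = target  =>  n = (b / (a - target))**(1/c)
needed = (b / (a - target)) ** (1.0 / c)
print(f"Fitted asymptote: {a:.3f}")
print(f"Estimated corpus size to reach score {target}: {needed:.0f}M tokens")
```

Repeating such a fit per language would give directly comparable data-requirement estimates, which is the kind of comparison this project is after.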

We will rely on language models (LMs) for European languages that will be trained within the High-Performance Language Technology (HPLT) project in the forthcoming years. The aim is to compare languages based on how much textual data is required to train a linguistically representative LM for each of them. Of course, linguistic representativeness is by necessity a subjective concept: one of the tasks within this Master's project will be to establish a reasonable approximation of such "representativeness". For this, we will use existing NLP datasets and benchmarks aligned so that they are comparable across languages.
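
One possible (and admittedly crude) first approximation of representativeness is the perplexity of a trained LM on held-out text in the target language: the better the model predicts unseen text, the better it has captured the language. The sketch below uses the Hugging Face transformers library with gpt2 as a stand-in model; the future HPLT models are an assumption here, and the held-out sentence is a placeholder. Note that comparing perplexities across languages would additionally require controlling for tokenization differences.

```python
# A minimal sketch of perplexity as a crude proxy for "representativeness".
# The model name is a placeholder (gpt2); swap in an HPLT model once released.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder, not an HPLT model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

held_out = "A held-out sentence in the language under study."  # placeholder text
encoding = tokenizer(held_out, return_tensors="pt")
with torch.no_grad():
    # Passing input_ids as labels makes the model return the mean
    # cross-entropy loss over the sequence; exp(loss) is the perplexity.
    output = model(**encoding, labels=encoding["input_ids"])
perplexity = math.exp(output.loss.item())
print(f"Perplexity: {perplexity:.1f}")
```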

Do the majority of languages require more or less the same amount of training data to yield a "representative" LM? Or can some languages get away with much less training data, while others require a significantly larger corpus? Does this have anything to do with genetic or typological differences between languages? Let's find out!

Prerequisites: None (you will learn most of the required skills in the IN5550 course).

Recommended reading: see the hyperlinks to the papers in the text above.

Scope (credits)

60