Language and topic identification under distributional shifts

This topic is closely related to the ongoing HPLT [1] and OSCAR [2] projects, which are devoted to collecting Web-crawled multilingual text corpora, analyzing and cleaning them, and finally training large language models (LLMs) that can process hundreds of natural languages. An important sub-task is to ensure that LLMs are trained on data balanced across languages and topics; otherwise, the models learn only the few languages and topics that are most frequent on the Web.

A popular approach to language identification, currently employed in both projects, is a classifier based on character or word n-grams (e.g. FastText [3]) trained from scratch on existing language identification datasets [4,5]. This approach achieves good results on test subsets drawn from the same datasets, but degrades when applied to web-crawled data, which covers a much wider range of topics and genres [6]. N-gram-based classifiers are vulnerable to such distributional shifts between training and real-world data, and there is no hope of fixing this by covering the full diversity of web-crawled data for all languages in the training set. The goal of this project is to compare existing language identification methods when applied to texts whose topics and genres are missing from the training set, to study existing ML methods for addressing distributional shifts, and to apply them to improve the robustness of language identification. An additional goal can be to develop a method for sampling training examples balanced across languages and topics when no language or topic labels are observed.
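To make the fragility concrete, here is a minimal sketch of a character n-gram language classifier in the spirit of the approach above. It is not the HPLT/OSCAR pipeline: scikit-learn stands in for FastText, and all sentences and labels are toy illustrations.

```python
# Hypothetical sketch: a character n-gram language classifier,
# with scikit-learn standing in for FastText. Toy data only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny in-domain training set: formal prose in English and German.
train_texts = [
    "The government announced new measures today.",
    "The committee will meet next week to discuss the report.",
    "Die Regierung hat heute neue Massnahmen angekuendigt.",
    "Der Ausschuss trifft sich naechste Woche zur Beratung.",
]
train_labels = ["en", "en", "de", "de"]

# Character 1-3-gram features approximate what n-gram LID models use.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
clf.fit(train_texts, train_labels)

# In-domain test sentence: classified reliably.
print(clf.predict(["The report was discussed by the committee."]))  # → ['en']

# Out-of-domain genre (social-media slang): the model sees very few
# familiar n-grams, so its decision rests on little evidence -- the
# distributional-shift problem described above.
print(clf.predict_proba(["lol thx gonna brb"]))
```

In a real evaluation one would train on large LID datasets and test on held-out web genres; the toy setup only illustrates why unseen genres undermine n-gram features.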

In the scope of this project you will learn how to deal with distributional shifts, arguably the most fundamental problem in applying ML/DL methods to real applications. You will have a great opportunity to study and apply in practice techniques such as unsupervised and self-supervised training, fine-tuning of multilingual LLMs, metric learning, and disentangled representation learning. You will also be able to work with researchers from the HPLT and OSCAR projects and to integrate your results into those projects.
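The additional goal mentioned above, sampling training examples balanced across languages and topics when no labels are observed, can be approached with unsupervised clustering: cluster the unlabeled texts, then draw the same number of examples from each cluster so that frequent clusters do not dominate. A minimal sketch follows; the feature space, the choice of k-means, and all data are illustrative assumptions, not a prescribed method.

```python
# Hypothetical sketch: label-free balanced sampling via clustering.
# Cluster unlabeled texts, then sample uniformly per cluster.
import random
from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def balanced_sample(texts, n_clusters=2, per_cluster=1, seed=0):
    """Cluster `texts` and sample `per_cluster` items from each cluster."""
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3))
    features = vec.fit_transform(texts)
    labels = KMeans(
        n_clusters=n_clusters, n_init=10, random_state=seed
    ).fit_predict(features)

    buckets = defaultdict(list)
    for text, label in zip(texts, labels):
        buckets[label].append(text)

    rng = random.Random(seed)
    sample = []
    for bucket in buckets.values():
        sample.extend(rng.sample(bucket, min(per_cluster, len(bucket))))
    return sample


# Toy corpus: English sentences dominate; one German sentence.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the bird sat on the branch",
    "die katze sitzt auf der matte",
]
print(balanced_sample(corpus, n_clusters=2, per_cluster=1))
```

In practice the features would come from a multilingual encoder rather than character TF-IDF, and the number of clusters would be tuned; the point is only that per-cluster sampling upweights rare languages and topics without ever observing their labels.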

References

  1. https://hplt-project.org/
  2. https://oscar-project.org/
  3. https://fasttext.cc/
  4. Laurie Burchell, Alexandra Birch, Nikolay Bogoychev, and Kenneth Heafield. 2023. An Open Dataset and Model for Language Identification. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 865–879, Toronto, Canada. Association for Computational Linguistics.
  5. Tommi Jauhiainen, Heidi Jauhiainen, and Krister Lindén. 2022. HeLI-OTS, Off-the-shelf Language Identifier for Text. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 3912–3922, Marseille, France. European Language Resources Association.
  6. Dominic Widdows and Chris Brew. 2021. Language Identification with a Reciprocal Rank Classifier. arXiv preprint arXiv:2109.09862. https://arxiv.org/abs/2109.09862
Published 9 Oct. 2023 12:04 - Last modified 9 Oct. 2023 12:04

Supervisor(s)

Scope (ECTS credits)

60