Comparing supervised, unsupervised and zero-shot/few-shot machine translation

The standard training setup for neural machine translation systems relies on large amounts of parallel data. However, for many language pairs, parallel corpora are not naturally available in sufficient quantities. To address this issue, "unsupervised" machine translation was proposed a few years ago, with the aim of using only monolingual data for training: in a nutshell, word embedding spaces are learned independently for the source and the target language, and the two embedding spaces are then aligned. More recently, pretrained multilingual language models have been evaluated on their ability to produce translations without being trained on parallel data (in zero-shot and few-shot settings).
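As a concrete illustration of the alignment step, the sketch below recovers an orthogonal mapping between two embedding spaces with the Procrustes solution, given a small seed dictionary of word pairs. The matrices here are synthetic toy data; in practice one would start from pre-trained monolingual embeddings (e.g. fastText) and use tools such as MUSE or VecMap.

import numpy as np

# Toy sketch: align a "source" embedding space to a "target" space with the
# orthogonal Procrustes solution. All data below is synthetic.
def procrustes_align(src_emb, tgt_emb):
    # W = U V^T, where U S V^T is the SVD of src_emb^T tgt_emb, minimises
    # ||src_emb @ W - tgt_emb||_F over orthogonal matrices W.
    u, _, vt = np.linalg.svd(src_emb.T @ tgt_emb)
    return u @ vt

rng = np.random.default_rng(0)
src = rng.normal(size=(5, 4))                        # 5 seed pairs, 4-dim embeddings
hidden_w, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # an unknown rotation
tgt = src @ hidden_w                                 # target embeddings of the same words
w = procrustes_align(src, tgt)
print(np.allclose(src @ w, tgt))                     # True: the rotation is recovered

In fully unsupervised settings, the seed dictionary itself must first be induced from the monolingual data (e.g. via adversarial training or identical character strings) before a refinement step of this kind is applied.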

For this thesis, you will work on a specific language pair, identify both monolingual and parallel resources, and train different types of machine translation systems in controlled settings. Easy-to-use toolkits are available for all approaches.
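For the supervised baseline, a minimal fine-tuning sketch using the Hugging Face transformers and datasets libraries is given below; the checkpoint name, the file train.jsonl, and its "src"/"tgt" fields are placeholders to adapt to the chosen language pair and toolkit.

# A minimal supervised fine-tuning sketch, assuming the Hugging Face
# `transformers` and `datasets` libraries; checkpoint, file name, and
# field names are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

checkpoint = "Helsinki-NLP/opus-mt-en-de"  # placeholder seq2seq MT checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Hypothetical JSONL file with one {"src": ..., "tgt": ...} object per line.
data = load_dataset("json", data_files={"train": "train.jsonl"})

def preprocess(batch):
    enc = tokenizer(batch["src"], truncation=True, max_length=128)
    enc["labels"] = tokenizer(text_target=batch["tgt"], truncation=True,
                              max_length=128)["input_ids"]
    return enc

train_set = data["train"].map(preprocess, batched=True,
                              remove_columns=["src", "tgt"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="mt-baseline",
                                  per_device_train_batch_size=16,
                                  num_train_epochs=3),
    train_dataset=train_set,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()

Some suggested starting points: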

Unsupervised MT: e.g. https://aclanthology.org/P19-1019/
Few-shot MT: e.g. https://aclanthology.org/2023.eamt-1.16/
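Along the lines of the few-shot paper linked above, the following sketch prompts a pretrained multilingual language model with a handful of in-context translation examples; the BLOOM checkpoint and the English-French pair are arbitrary placeholders, not part of the thesis specification.

# A sketch of few-shot prompted translation, assuming the Hugging Face
# `transformers` library; the multilingual checkpoint and the language pair
# are placeholders.
from transformers import pipeline

generator = pipeline("text-generation", model="bigscience/bloom-560m")

# A handful of in-context examples (the "shots"), then the sentence to translate.
prompt = (
    "English: Good morning.\nFrench: Bonjour.\n"
    "English: Thank you very much.\nFrench: Merci beaucoup.\n"
    "English: Where is the train station?\nFrench:"
)
output = generator(prompt, max_new_tokens=30, do_sample=False)[0]["generated_text"]
# Keep only the first line the model appends after the prompt.
print(output[len(prompt):].split("\n")[0].strip())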

Supervisor(s)

Scope (credits)

60