This thesis topic is no longer available

What Can Character-level Language Models Tell Us About Themselves?

Some people say the future of language modeling lies in processing text as raw sequences of characters. This future has come closer with the recent introduction of the ByT5 language model. Can we use the task of grammatical error correction to assess its linguistic understanding?

The landscape of natural language processing changed dramatically with the advent of rich contextualized embeddings (obtained from models such as ELMo or BERT) and pre-trained generative language models (such as T5 or GPT). However, these massive models are born with a severe limitation: they only work with subword tokens. This means they have no way of knowing what is "inside" the words. As a result, they have problems understanding puns or rhymes and struggle with noisy text containing spelling mistakes.
 
On paper, the solution to this subword issue looks simple: just represent the text as a sequence of characters! However, researchers have not been very successful in training language models on such long sequences, that is, until this May, when the byte-level multilingual language model ByT5 was introduced. ByT5 matches the performance of its subword sibling mT5 while using raw UTF-8 bytes as input.
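To make the distinction concrete, here is a minimal sketch of how the same sentence looks to a subword model versus a byte-level model, assuming the Hugging Face transformers library and the public google/mt5-small and google/byt5-small checkpoints; the outputs shown in the comments are illustrative only.

```python
# Compare subword tokenization (mT5) with raw byte input (ByT5).
from transformers import AutoTokenizer

sentence = "Their was a speling mistake."

# mT5 splits the text into subword tokens from a fixed vocabulary,
# so a misspelled word may be broken into unfamiliar pieces.
subword_tok = AutoTokenizer.from_pretrained("google/mt5-small")
print(subword_tok.tokenize(sentence))
# e.g. ['▁The', 'ir', '▁was', '▁a', '▁spel', 'ing', '▁mistake', '.']

# ByT5 uses the raw UTF-8 bytes of the text (shifted by a few
# special-token ids), so every character is visible to the model.
byte_tok = AutoTokenizer.from_pretrained("google/byt5-small")
print(byte_tok(sentence)["input_ids"][:10])
# e.g. [87, 107, 104, 108, 117, 35, 122, 100, 118, 35]
```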
 
ByT5 has already been used successfully to establish new state-of-the-art results on lexical normalization, so it seems natural to also try it on a similar (and more challenging) task: Grammatical Error Correction (GEC). The main goal of the thesis would be to shed some light on the differences between byte-level and subword-level language models by investigating the patterns of errors these models make. Are there any mistakes that subword-level models cannot identify but byte-level models can? GEC is already a mature field of research with a multitude of datasets (and languages) to choose from. Are there any types of languages where the differences are more apparent? For example, can morphologically rich, inflected languages benefit from character-level representation?
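As a rough illustration of the setup, GEC can be cast as a text-to-text problem: the ungrammatical sentence is the input and the corrected sentence is the target. The sketch below assumes the Hugging Face transformers library and the public google/byt5-small checkpoint; in practice the model would first be fine-tuned on a GEC dataset before its generations are useful.

```python
# A minimal sketch of GEC as a text-to-text task with ByT5.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

# One training pair: noisy sentence in, corrected sentence out.
inputs = tokenizer("She go to school yesterday.", return_tensors="pt")
labels = tokenizer("She went to school yesterday.", return_tensors="pt").input_ids

# Standard seq2seq cross-entropy loss; an optimizer step would follow
# in a real fine-tuning loop.
loss = model(**inputs, labels=labels).loss
loss.backward()

# After fine-tuning, corrections are produced by ordinary generation.
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```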
 

Prerequisites

Some familiarity with deep learning and NLP is expected; interest in the topic :)
 

Recommended reading

Keywords: NLP, deep learning, language modeling, error correction, GEC, ByT5
Published 18 Oct. 2021 23:57 - Last modified 7 Dec. 2022 14:54

Supervisor(s)

Scope (credits)

60