Comparing Transformer variants on character-level transduction tasks

Character-level transduction tasks include lemmatization, grapheme-to-phoneme conversion, and historical text normalization. These tasks are most sensibly modelled at the character level, which gives them several properties that distinguish them from word-level transduction tasks such as machine translation:

  • The number of tokens per instance is high (a sentence of a historical text easily consists of several hundred characters, whereas a typical sentence in machine translation contains fewer than a hundred words or subwords).
  • The tokens are ambiguous (within a sentence, most characters occur several times, whereas only a few words do).
  • The tasks are typically monotonic (there are no long-distance reorderings of characters or character sequences).
  • A large proportion of tokens can simply be copied from the source to the target (see the sketch after this list).
  • The training datasets are typically much smaller.

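To make the first and fourth points above concrete, here is a minimal sketch in plain Python that counts characters per instance and estimates how much of the target could be copied verbatim from the source; the historical-normalization pair is invented purely for illustration:

    from difflib import SequenceMatcher

    # Hypothetical historical-normalization pair (invented for illustration).
    source = "thou didst vppon the hille abyde"
    target = "thou didst upon the hill abide"

    print(f"characters per instance: source={len(source)}, target={len(target)}")
    print(f"words per instance:      source={len(source.split())}")

    # Fraction of target characters that lie inside blocks matching the source,
    # i.e. characters that could in principle simply be copied.
    matcher = SequenceMatcher(a=source, b=target)
    copied = sum(block.size for block in matcher.get_matching_blocks())
    print(f"copyable fraction of target: {copied / len(target):.2f}")
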
Due to these properties, vanilla Transformer models with hyperparameter settings carried over from NMT have underperformed on character-level tasks. The goal of this thesis is to identify promising Transformer variants (e.g. Longformer, Copy Transformer, Levenshtein Transformer) and to evaluate them on a range of character-level transduction tasks.
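
As an illustration of what adjusting parameter settings for character-level data could mean in practice, below is a minimal PyTorch sketch of a character-level Transformer that is smaller than typical NMT defaults (d_model=512, 6+6 layers, 8 heads); all sizes here are illustrative assumptions, not values prescribed by this project description:

    import torch
    import torch.nn as nn

    VOCAB_SIZE = 100   # small character inventory (assumption)
    MAX_LEN = 512      # character sequences can be long

    class CharTransformer(nn.Module):
        def __init__(self, d_model=256, nhead=4, num_layers=4,
                     dim_ff=1024, dropout=0.3):
            super().__init__()
            self.embed = nn.Embedding(VOCAB_SIZE, d_model)
            self.pos = nn.Embedding(MAX_LEN, d_model)
            self.transformer = nn.Transformer(
                d_model=d_model, nhead=nhead,
                num_encoder_layers=num_layers, num_decoder_layers=num_layers,
                dim_feedforward=dim_ff, dropout=dropout, batch_first=True)
            self.out = nn.Linear(d_model, VOCAB_SIZE)

        def forward(self, src, tgt):
            # Learned positional embeddings, broadcast over the batch.
            pos_s = torch.arange(src.size(1), device=src.device)
            pos_t = torch.arange(tgt.size(1), device=tgt.device)
            h = self.transformer(
                self.embed(src) + self.pos(pos_s),
                self.embed(tgt) + self.pos(pos_t),
                tgt_mask=nn.Transformer.generate_square_subsequent_mask(
                    tgt.size(1)))
            return self.out(h)

    model = CharTransformer()
    src = torch.randint(0, VOCAB_SIZE, (2, 30))  # batch of 2, 30 source chars
    tgt = torch.randint(0, VOCAB_SIZE, (2, 28))
    print(model(src, tgt).shape)  # torch.Size([2, 28, 100])

The variants named above would modify such a baseline in different ways: a Copy Transformer adds a pointer/copy distribution over source characters on top of the decoder output, while Longformer-style attention replaces full self-attention with a windowed pattern to handle long character sequences.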

Published 6 Oct. 2023 10:25 - Last modified 6 Oct. 2023 10:25

Supervisor(s)

Scope (credits)

60