Retrieval-augmented LLMs for analyzing word senses and generating glosses

Consulting a glossary, i.e. a dictionary describing the meanings of words, is a natural way for a person or a computer to understand an unknown word or a familiar word in a novel context. For centuries people have been compiling glossaries that describe word senses for a particular text (e.g. a glossary at the end of a book), a particular domain (e.g. a glossary of financial terms), or a language in general. However, manual construction of such resources is extremely labor-intensive and requires both expertise in gloss writing and good knowledge of the target domain. This is why existing glossaries have poor coverage of words and senses.

With the recent development of Large Language Models (LLMs), automatically generating a definition of a word's meaning in a given context is no longer a dream. Several methods have been proposed, [4] being the most recent, and a shared task [2] was even organized, with 11 teams participating. However, a single context is hardly enough to generate the definition of a word sense that is not yet known to the model. These models therefore rely mostly on knowledge of word senses learnt from their training data, and will likely work well only for words that are well represented in the training corpora, and only for the few most frequent senses of those words, which are often already present in existing glossaries.
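
To make the task concrete, here is a minimal sketch of contextual definition generation with an off-the-shelf instruction-tuned model. The model choice (google/flan-t5-base) and the prompt wording are illustrative assumptions, not the setup used in [2] or [4]:

```python
# Minimal sketch: ask an instruction-tuned seq2seq model for a definition
# of a word as it is used in one particular context.
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-base")

word = "bank"
context = "We had a picnic on the bank of the river."
prompt = (
    f"Give a short dictionary definition of the word '{word}' "
    f"as it is used in this sentence: {context}"
)

definition = generator(prompt, max_new_tokens=40)[0]["generated_text"]
print(definition)
```

Note that the model here sees only one usage example, which is exactly the limitation discussed above for rare words and senses.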

The ambitious goal of this project is to automate the process of glossary construction, with a focus on improving the quality of gloss generation for less frequent words and senses. You will develop and compare methods that, given a text corpus and a word, generate a glossary entry describing all senses of that word which appear in the corpus. To achieve this, you will experiment with recent retrieval-augmented LLMs, developed mostly for open-domain question answering, which can mine large text corpora for additional information while generating an answer to a user query ([5] among others). Relying on multilingual LLMs and their zero-shot cross-lingual transfer ability [6,7], you will try to make a single model work with many different languages. Combining neural information retrieval models and multilingual LLMs to solve NLP tasks is becoming an essential skill for a data scientist or NLP practitioner and will likely be useful in your future work as well.
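
The following sketch illustrates the retrieval-augmented idea in its simplest form: before generating a gloss, retrieve further occurrences of the target word from the corpus and include them in the prompt. The model names, the toy corpus, the number of retrieved sentences and the prompt are illustrative assumptions, not the architecture of [5] or a prescribed project setup:

```python
# Sketch of retrieval-augmented gloss generation: a multilingual sentence
# encoder retrieves corpus sentences similar to the query usage, and a
# generator writes a definition conditioned on all retrieved examples.
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

retriever = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
generator = pipeline("text2text-generation", model="google/flan-t5-base")

word = "bank"
query = "We had a picnic on the bank of the river."
# Toy stand-in for the corpus to be mined; in the project this would be a
# large collection of sentences containing the target word.
corpus = [
    "The bank raised its interest rates again this year.",
    "Fishermen lined the bank, waiting for the tide to turn.",
    "She deposited the cheque at the bank on Monday.",
    "The canoe drifted slowly towards the grassy bank.",
]

# Rank corpus sentences by embedding similarity to the query occurrence,
# so that examples of the same sense are more likely to be retrieved.
scores = util.cos_sim(retriever.encode(query), retriever.encode(corpus))[0]
top = [corpus[int(i)] for i in scores.argsort(descending=True)[:2]]

prompt = (
    f"Here are sentences using the word '{word}':\n"
    + "\n".join([query] + top)
    + f"\nWrite a dictionary definition of '{word}' as used in these sentences."
)
print(generator(prompt, max_new_tokens=60)[0]["generated_text"])
```

A real system would of course need sense clustering to produce one entry per sense rather than one definition per query, and a stronger retriever and generator, but the division of labor between retrieval and generation stays the same.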

Related work

  1. Gardner, N., Khan, H., & Hung, C. C. (2022). Definition modeling: literature review and dataset analysis. Applied Computing and Intelligence, 2(1), 83-98.
  2. Timothee Mickus, Kees Van Deemter, Mathieu Constant, and Denis Paperno. 2022. Semeval-2022 Task 1: CODWOE – Comparing Dictionaries and Word Embeddings. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), pages 1–14, Seattle, United States. Association for Computational Linguistics.
  3. Artyom Gadetsky, Ilya Yakubovskiy, and Dmitry Vetrov. 2018. Conditional Generators of Words Definitions. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 266–271, Melbourne, Australia. Association for Computational Linguistics.
  4. Mario Giulianelli, Iris Luden, Raquel Fernández, and Andrey Kutuzov. 2023. Interpretable Word Sense Representations via Definition Generation: The Case of Semantic Change Analysis. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3130–3148, Toronto, Canada. Association for Computational Linguistics.
  5. Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., ... & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459-9474.
  6. Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.
    See also: https://ai.meta.com/blog/-xlm-r-state-of-the-art-cross-lingual-understanding-through-self-supervision/
    https://slideslive.com/38928776/unsupervised-crosslingual-representation-learning-at-scale
  7. Zewen Chi, Shaohan Huang, Li Dong, Shuming Ma, Bo Zheng, Saksham Singhal, Payal Bajaj, Xia Song, Xian-Ling Mao, Heyan Huang, and Furu Wei. 2022. XLM-E: Cross-lingual Language Model Pre-training via ELECTRA. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6170–6182, Dublin, Ireland. Association for Computational Linguistics.

There are also nice videos explaining retrieval-augmented LLMs; watch them before reading the papers to get a general understanding:

Published Oct. 9, 2023 11:50 - Last modified Oct. 9, 2023 11:50

Supervisor(s)

Scope (credits)

60