Outlier detection methods for novel and lost sense detection

Human languages are constantly changing. In particular, words change their meanings over time, a phenomenon known as Lexical Semantic Change (LSC). Historical linguists study how individual words lose previously used senses and acquire new ones by searching for and analyzing examples of word usage from different epochs. This is a very time-consuming process, and the resulting dictionaries describe the change of at most several dozen words. It is also extremely difficult to find and study rare senses of words with this manual approach. Lexical Semantic Change Detection (LSCD) [1,2] is a growing area of NLP research that aims to develop tools and models to automate the study of this phenomenon.

The goal of this project is to develop methods and tools that, for a given word, can find examples of its lost and novel senses in two given text collections representing different epochs. To achieve this goal, you will experiment with fine-tuning large language models on various related tasks and with various loss functions [3-5], then compare existing methods of outlier detection and/or develop new ones [6-8]. An important sub-task is developing a tool that allows us to conveniently study the examples of lost or novel senses of a word. Relying on multilingual LLMs and their zero-shot cross-lingual transfer ability [9,10], you will try to build a single model that works with many different languages.
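One possible way to frame the outlier detection step is sketched below, using the novelty detection mode of LocalOutlierFactor from scikit-learn [6]. This is only an illustration under stated assumptions, not the method the project prescribes: the vectors are synthetic stand-ins for the contextual embeddings a fine-tuned language model would produce for each usage of the target word, and the helper name find_sense_outliers, the dimensionality, and the cluster positions are all hypothetical. The detector is fitted on usages from one epoch and applied to usages from the other, so usages flagged in the new epoch are candidates for novel senses, and swapping the two collections gives candidates for lost senses.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor


def find_sense_outliers(reference_vecs, target_vecs, n_neighbors=10):
    """Return indices of target usages that are outliers w.r.t. the reference usages.

    Hypothetical helper: both arguments are (n_usages, dim) arrays of usage vectors.
    """
    # novelty=True lets the detector score data it was not fitted on.
    detector = LocalOutlierFactor(n_neighbors=n_neighbors, novelty=True)
    detector.fit(reference_vecs)
    labels = detector.predict(target_vecs)  # +1 = inlier, -1 = outlier
    return np.flatnonzero(labels == -1)


rng = np.random.default_rng(0)
dim = 16  # assumed embedding size, chosen only to keep the example small

# Synthetic stand-ins for usage embeddings of one target word.
# A sense shared by both epochs is centred at 0; a sense that occurs only in
# the old epoch is centred at -8; a sense that occurs only in the new epoch at +8.
old_epoch = np.vstack([
    rng.normal(0.0, 1.0, size=(200, dim)),   # shared sense, indices 0..199
    rng.normal(-8.0, 1.0, size=(5, dim)),    # lost sense, indices 200..204
])
new_epoch = np.vstack([
    rng.normal(0.0, 1.0, size=(50, dim)),    # shared sense, indices 0..49
    rng.normal(8.0, 1.0, size=(5, dim)),     # novel sense, indices 50..54
])

# Usages in the new epoch that look unlike anything in the old epoch.
novel_candidates = find_sense_outliers(old_epoch, new_epoch)
# Usages in the old epoch that look unlike anything in the new epoch.
lost_candidates = find_sense_outliers(new_epoch, old_epoch)

print("novel sense candidates:", novel_candidates)
print("lost sense candidates:", lost_candidates)
```

With real data the senses will not be this cleanly separated, so the choice of detector, its neighbourhood size, and the embedding model are exactly the kind of settings the project would need to compare.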

Related work

  1. Andrey Kutuzov, Lilja Øvrelid, Terrence Szymanski, and Erik Velldal. 2018. Diachronic word embeddings and semantic shifts: a survey. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1384–1397, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
  2. Stefano Montanelli, Francesco Periti. 2023. A Survey on Contextualised Semantic Shift Detection.
  3. Maxim Rachinskiy and Nikolay Arefyev. GlossReader at LSCDiscovery: Train to Select a Proper Gloss in English – Discover Lexical Semantic Change in Spanish. In Proc. of the Workshop on Computational Approaches to Historical Language Change (LChange), pages 198–203, Dublin, Ireland, May 2022. Association for Computational Linguistics (ACL).
  4. Maxim Rachinskiy and Nikolay Arefyev. Zeroshot Crosslingual Transfer of a Gloss Language Model for Semantic Change Detection. In Proc. of the Conference on Computational Linguistics and Intellectual Technologies (Dialogue), (online), 2021.
  5. Nikolay Arefyev, Maksim Fedoseev, Vitaly Protastov, Daniil Homiskiy, Adis Davletov, and Alexander Panchenko. DeepMistake: Which Senses are Hard to Distinguish for a Word-in-Context Model. In Proc. of the Conference on Computational Linguistics and Intellectual Technologies (Dialogue), (online), 2021.
  6. https://scikit-learn.org/stable/modules/outlier_detection.html
  7. Katrin Erk. 2006. Unknown word sense detection as outlier detection. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics.
  8. Wenxuan Zhou, Fangyu Liu, Muhao Chen. 2021. Contrastive Out-of-Distribution Detection for Pretrained Transformers. In Proceedings of EMNLP 2021.
  9. Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.
  10. Zewen Chi, Shaohan Huang, Li Dong, Shuming Ma, Bo Zheng, Saksham Singhal, Payal Bajaj, Xia Song, Xian-Ling Mao, Heyan Huang, and Furu Wei. 2022. XLM-E: Cross-lingual Language Model Pre-training via ELECTRA. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6170–6182, Dublin, Ireland. Association for Computational Linguistics.
Published Oct. 9, 2023 12:00 - Last modified Oct. 9, 2023 12:00

Supervisor(s)

Scope (credits)

60