Postdoctoral Research Fellow in the Language Technology Group (LTG) at the Department of Informatics. Currently working on the High-Performance Language Technology (HPLT) project.
My preferred transliteration, which differs from the one in my passport, is Nikolay Arefyev; all my publications appear under this name.
Publications
-
de Gibert, Ona; Nail, Graeme; Arefev, Nikolay; Bañón, Marta; van der Linde, Jelmer & Ji, Shaoxiong
et al. (13 contributors in total)
(2024).
A New Massive Multilingual Dataset for High-Performance Language Technologies.
In Calzolari, Nicoletta; Kan, Min-Yen; Hoste, Veronique; Lenci, Alessandro; Sakti, Sakriani & Xue, Nianwen (Ed.),
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024).
European Language Resources Association.
ISBN 978-2-493814-10-4.
p. 1116–1128.
Full text in Research Archive
We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of ≈ 5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a great resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.
-
Kutuzov, Andrei; Fedorova, Mariia; Schlechtweg, Dominik & Arefev, Nikolay
(2024).
Enriching Word Usage Graphs with Cluster Definitions.
In Calzolari, Nicoletta; Kan, Min-Yen; Hoste, Veronique; Lenci, Alessandro; Sakti, Sakriani & Xue, Nianwen (Ed.),
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024).
European Language Resources Association.
ISBN 978-2-493814-10-4.
p. 6189–6198.
Full text in Research Archive
We present a dataset of word usage graphs (WUGs), where the existing WUGs for multiple languages are enriched with cluster labels functioning as sense definitions. They are generated from scratch by fine-tuned encoder-decoder language models. The conducted human evaluation has shown that these definitions match the existing clusters in WUGs better than the definitions chosen from WordNet by two baseline systems. At the same time, the method is straightforward to use and easy to extend to new languages. The resulting enriched datasets can be extremely helpful for moving on to explainable semantic change modeling.
-
Zhang, Bingyu & Arefyev, Nikolay
(2022).
The Document Vectors Using Cosine Similarity Revisited.
In Tafreshi, Shabnam; Sedoc, João; Rogers, Anna; Drozd, Aleksandr; Rumshisky, Anna & Akula, Arjun (Ed.),
Proceedings of the Third Workshop on Insights from Negative Results in NLP.
Association for Computational Linguistics.
ISBN 978-1-955917-40-7.
p. 129–133.
doi: 10.18653/v1/2022.insights-1.17.
-
Rachinskiy, Maxim & Arefyev, Nikolay
(2022).
GlossReader at LSCDiscovery: Train to Select a Proper Gloss in English – Discover Lexical Semantic Change in Spanish.
In Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change.
Association for Computational Linguistics.
ISBN 978-1-955917-42-1.
p. 198–203.
doi: 10.18653/v1/2022.lchange-1.22.
-
Davletov, Adis; Gordeev, Denis; Arefyev, Nikolay & Davletov, Emil
(2021).
LIORI at SemEval-2021 Task 8: Ask Transformer for measurements.
In Palmer, Alexis; Schneider, Nathan; Schluter, Natalie; Emerson, Guy; Herbelot, Aurelie & Zhu, Xiaodan (Ed.),
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021).
Association for Computational Linguistics.
ISBN 978-1-954085-70-1.
p. 1249–1254.
doi: 10.18653/v1/2021.semeval-1.178.
-
Davletov, Adis; Arefyev, Nikolay; Gordeev, Denis & Rey, Alexey
(2021).
LIORI at SemEval-2021 Task 2: Span Prediction and Binary Classification approaches to Word-in-Context Disambiguation.
In Palmer, Alexis; Schneider, Nathan; Schluter, Natalie; Emerson, Guy; Herbelot, Aurelie & Zhu, Xiaodan (Ed.),
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021).
Association for Computational Linguistics.
ISBN 978-1-954085-70-1.
p. 780–786.
doi: 10.18653/v1/2021.semeval-1.103.
-
Razzhigaev, Anton; Arefyev, Nikolay & Panchenko, Alexander
(2021).
SkoltechNLP at SemEval-2021 Task 2: Generating Cross-Lingual Training Data for the Word-in-Context Task.
In Palmer, Alexis; Schneider, Nathan; Schluter, Natalie; Emerson, Guy; Herbelot, Aurelie & Zhu, Xiaodan (Ed.),
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021).
Association for Computational Linguistics.
ISBN 978-1-954085-70-1.
p. 157–162.
doi: 10.18653/v1/2021.semeval-1.16.
-
Rachinskiy, Maxim & Arefyev, Nikolay
(2021).
GlossReader at SemEval-2021 Task 2: Reading Definitions Improves Contextualized Word Embeddings.
In Palmer, Alexis; Schneider, Nathan; Schluter, Natalie; Emerson, Guy; Herbelot, Aurelie & Zhu, Xiaodan (Ed.),
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021).
Association for Computational Linguistics.
ISBN 978-1-954085-70-1.
p. 756–762.
doi: 10.18653/v1/2021.semeval-1.100.
-
Arefyev, Nikolay; Sheludko, Boris; Podolskiy, Alexander & Panchenko, Alexander
(2020).
Always Keep your Target in Mind: Studying Semantics and Improving Performance of Neural Lexical Substitution.
In Proceedings of the 28th International Conference on Computational Linguistics.
Association for Computational Linguistics.
ISBN 978-1-952148-27-9.
p. 1242–1255.
doi: 10.18653/v1/2020.coling-main.107.
View all works in Cristin
-
Stroev, Vyacheslav; Grechka, Dmitry; Eliseeva, Maria & Arefyev, Nikolay
(2022).
Kashtanka.pet: employing AI to help lost pets return to their homes.
Talk abstract:
The Kashtanka.pet project addresses the problem of searching for lost pets efficiently. There are numerous platforms and groups in social networks collecting ads about missing and found cats and dogs. However, it is often almost impossible for a human to find a specific lost pet among millions of ads about found pets distributed across all those websites. We develop a system that allows pet owners and volunteers to find lost pets efficiently with the help of AI. It crawls websites for ads about lost and found pets, and retrieves pairs of ads announcing that the same pet was lost and then found. The retrieved pairs are then inspected and further processed by humans. On the poster we present the architecture of the Kashtanka.pet system. We then address the problem of evaluating and improving the quality of the underlying AI models for lost-pet retrieval. Standard manual annotation of a dataset for our task requires finding the matching pairs of lost and found ads, which makes the annotation process prohibitively difficult. Thus, we generate matching pairs automatically by splitting the sets of photos from ads containing several photos into two parts. However, simple random splitting results in both parts sharing some photos taken in the same place. This may encourage models to search for the same background rather than the same pet, and makes the setup unrealistically simple, because in reality a pet is lost and found in different places. To mitigate this, we propose a method that finds those ads that contain photos of a pet taken in several places and splits them accordingly. In order to estimate the quality of this method and select its hyperparameters, we additionally annotated pairs of photos from random ads, asking annotators whether those photos were taken in the same or different places. Several methods solving the task are proposed and compared.
The pipeline currently deployed at kashtanka.pet is based on YOLOv4 for pet detection and cropping, EfficientNet for obtaining their embeddings, and a GRU for aggregating those embeddings across several images from the same ad. The model is trained with the triplet loss on the target dataset. Two other methods employ the pre-trained multimodal BLIP and SLIP models, recently introduced improvements over the popular CLIP model. We found that even without fine-tuning on data from our target domain, the image embeddings from the multimodal models significantly outperform the currently deployed pipeline.
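The automatic pair generation described in the abstract can be sketched roughly as follows. This is a minimal, hypothetical illustration (all names and the cluster-assignment input are assumptions, not the project's actual code): an ad's photos are split into a pseudo "lost" ad and a pseudo "found" ad along place-cluster boundaries when possible, falling back to a random split when all photos share one place.

```python
# Hypothetical sketch of splitting an ad's photos into a pseudo "lost"
# ad and a pseudo "found" ad. `place_of` maps each photo to a
# place-cluster id, assumed to come from a separate place-matching model.
import random

def split_ad(photos, place_of):
    """photos: list of photo ids; place_of: photo id -> place-cluster id."""
    clusters = {}
    for p in photos:
        clusters.setdefault(place_of[p], []).append(p)
    if len(clusters) >= 2:
        # Put whole place clusters on each side so the two halves
        # share no background (the realistic setting).
        sides = sorted(clusters.values(), key=len, reverse=True)
        lost = sides[0][:]
        found = [p for group in sides[1:] for p in group]
        return lost, found
    # Single place: plain random split (the unrealistically easy case).
    shuffled = photos[:]
    random.shuffle(shuffled)
    mid = max(1, len(shuffled) // 2)
    return shuffled[:mid], shuffled[mid:]

lost, found = split_ad(["a", "b", "c", "d"],
                       {"a": 0, "b": 0, "c": 1, "d": 1})
# Photos from place 0 and place 1 end up on different sides.
```

The key design point from the abstract is the cluster-boundary split: a model trained on these pairs cannot succeed by matching backgrounds, only by matching the pet itself.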
Published May 15, 2023 9:24 AM · Last modified May 23, 2023 6:01 PM