I am an associate professor in the Language Technology Group, University of Oslo. In addition, I currently serve as the Norwegian on-site manager of the High-Performance Language Technology (HPLT) project.
I prefer my first name to be spelled as "Andrey". Unfortunately, my current passport disagrees.
Academic interests
Computational linguistics and natural language processing; semantic change detection and diachronically aware language models; distributional semantics, machine learning, large-scale language models.
Among other things, I participated in designing and training the NorBERT and NorELMo models, as well as the very large-scale NORA.LLM generative models.
In 2022, I received the Norwegian Artificial Intelligence Research Consortium (NORA) award as a Distinguished Early Career Researcher.
You may also want to have a look at WebVectors, the web service we created for exploring static and contextualized word embeddings for English and Norwegian.
Courses taught
Background
Read full CV
On November 13, 2020, I defended my PhD thesis "Distributional word embeddings in modeling diachronic semantic change". The thesis is available here.
I received my Master's degree in Computational Linguistics from the National Research University Higher School of Economics (Moscow) in 2014, with the thesis "Semantic clustering of Russian web search results: possibilities and problems".
Below is a list of my recent publications.
Tags:
Machine Learning,
Natural Language Processing,
Computational Linguistics,
Corpus Linguistics,
Word Embeddings,
Distributional Semantics,
Diachronic Word Embeddings,
Semantic Shifts,
Semantic Change Detection,
Language Models,
NorBERT,
NorELMo,
HPLT
Publications
-
de Gibert, Ona; Nail, Graeme; Arefev, Nikolay; Bañón, Marta; van der Linde, Jelmer & Ji, Shaoxiong
(2024).
A New Massive Multilingual Dataset for High-Performance Language Technologies.
In Calzolari, Nicoletta; Kan, Min-Yen; Hoste, Veronique; Lenci, Alessandro; Sakti, Sakriani & Xue, Nianwen (Ed.),
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024).
European Language Resources Association.
ISBN 9782493814104.
p. 1116–1128.
Full text in Research Archive
We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of ≈ 5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a great resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.
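The document-level de-duplication mentioned above can be illustrated with a toy sketch. This is not the HPLT pipeline (which relies on dedicated large-scale tools, and production systems typically use near-duplicate methods such as MinHash); the exact-hash `dedup` helper below is only a simplified illustration of the idea:

```python
import hashlib

def dedup(docs):
    """Keep the first occurrence of each distinct document, using an
    exact hash of whitespace-normalised, lower-cased text as the key."""
    seen, kept = set(), []
    for doc in docs:
        key = hashlib.sha1(" ".join(doc.split()).lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

print(len(dedup(["Hello  world", "hello world", "Goodbye"])))  # prints 2
```

Here the two "hello world" variants normalise to the same key, so only the first survives.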
-
Kutuzov, Andrei; Fedorova, Mariia; Schlechtweg, Dominik & Arefev, Nikolay
(2024).
Enriching Word Usage Graphs with Cluster Definitions.
In Calzolari, Nicoletta; Kan, Min-Yen; Hoste, Veronique; Lenci, Alessandro; Sakti, Sakriani & Xue, Nianwen (Ed.),
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024).
European Language Resources Association.
ISBN 9782493814104.
p. 6189–6198.
Full text in Research Archive
We present a dataset of word usage graphs (WUGs), where the existing WUGs for multiple languages are enriched with cluster labels functioning as sense definitions. They are generated from scratch by fine-tuned encoder-decoder language models. The conducted human evaluation has shown that these definitions match the existing clusters in WUGs better than the definitions chosen from WordNet by two baseline systems. At the same time, the method is straightforward to use and easy to extend to new languages. The resulting enriched datasets can be extremely helpful for moving on to explainable semantic change modeling.
-
Giulianelli, Mario; Luden, Iris; Fernandez, Raquel & Kutuzov, Andrei
(2023).
Interpretable Word Sense Representations via Definition Generation: The Case of Semantic Change Analysis,
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
Association for Computational Linguistics.
ISBN 978-1-959429-72-2.
p. 3130–3148.
Full text in Research Archive
We propose using automatically generated natural language definitions of contextualised word usages as interpretable word and word sense representations. Given a collection of usage examples for a target word, and the corresponding data-driven usage clusters (i.e., word senses), a definition is generated for each usage with a specialised Flan-T5 language model, and the most prototypical definition in a usage cluster is chosen as the sense label. We demonstrate how the resulting sense labels can make existing approaches to semantic change analysis more interpretable, and how they can allow users (historical linguists, lexicographers, or social scientists) to explore and intuitively explain diachronic trajectories of word meaning. Semantic change analysis is only one of many possible applications of the ‘definitions as representations’ paradigm. Beyond being human-readable, contextualised definitions also outperform token or usage sentence embeddings in word-in-context semantic similarity judgements, making them a promising new type of lexical representation for NLP.
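Picking "the most prototypical definition in a usage cluster" can be approximated as selecting the medoid of the cluster in embedding space. A minimal sketch, assuming the definitions have already been embedded as vectors (my simplification, not the paper's exact procedure):

```python
import numpy as np

def most_prototypical(embeddings):
    """Return the index of the medoid: the vector with the highest
    average cosine similarity to all members of the cluster."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return int(np.argmax((X @ X.T).mean(axis=1)))

# Toy 2-d "definition embeddings": the middle vector is most central.
cluster = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(most_prototypical(cluster))  # prints 1
```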
-
Samuel, David; Kutuzov, Andrei; Øvrelid, Lilja & Velldal, Erik
(2023).
Trained on 100 million words and still in shape: BERT meets British National Corpus.
In Vlachos, Andreas & Augenstein, Isabelle (Ed.),
Findings of the Association for Computational Linguistics: EACL 2023.
Association for Computational Linguistics.
ISBN 978-1-959429-47-0.
p. 1954–1974.
While modern masked language models (LMs) are trained on ever larger corpora, we here explore the effects of down-scaling training to a modestly-sized but representative, well-balanced, and publicly available English text source – the British National Corpus. We show that pre-training on this carefully curated corpus can reach better performance than the original BERT model. We argue that this type of corpora has great potential as a language modeling benchmark. To showcase this potential, we present fair, reproducible and data-efficient comparative studies of LMs, in which we evaluate several training objectives and model architectures and replicate previous empirical results in a systematic way. We propose an optimized LM architecture called LTG-BERT.
-
Aksenova, Anna; Gavrishina, Ekaterina; Rykov, Elisei & Kutuzov, Andrei
(2022).
RuDSI: Graph-based Word Sense Induction Dataset for Russian,
Proceedings of TextGraphs-16: Graph-based Methods for Natural Language Processing.
Association for Computational Linguistics.
ISBN 978-1-955917-22-3.
p. 77–88.
We present RuDSI, a new benchmark for word sense induction (WSI) in Russian. The dataset was created using manual annotation and semi-automatic clustering of Word Usage Graphs (WUGs). RuDSI is completely data-driven (based on texts from Russian National Corpus), with no external word senses imposed on annotators. We present and analyze RuDSI, describe our annotation workflow, show how graph clustering parameters affect the dataset, report the performance that several baseline WSI methods obtain on RuDSI and discuss possibilities for improving these scores.
-
Kutuzov, Andrei; Velldal, Erik & Øvrelid, Lilja
(2022).
Contextualized embeddings for semantic change detection: Lessons learned.
Northern European Journal of Language Technology (NEJLT).
ISSN 2000-1533.
8(1).
doi:
10.3384/nejlt.2000-1533.2022.3478.
We present a qualitative analysis of the (potentially erroneous) outputs of contextualized embedding-based methods for detecting diachronic semantic change. First, we introduce an ensemble method outperforming previously described contextualized approaches. This method is used as a basis for an in-depth analysis of the degrees of semantic change predicted for English words across 5 decades. Our findings show that contextualized methods can often predict high change scores for words which are not undergoing any real diachronic semantic shift in the lexicographic sense of the term (or at least the status of these shifts is questionable). Such challenging cases are discussed in detail with examples, and their linguistic categorization is proposed. Our conclusion is that pre-trained contextualized language models are prone to confound changes in lexicographic senses with changes in contextual variance; the latter naturally stem from their distributional nature and are different from the types of issues observed in methods based on static embeddings. Additionally, they often merge together syntactic and semantic aspects of lexical entities. We propose a range of possible future solutions to these issues.
-
Barnes, Jeremy; Oberlaender, Laura; Troiano, Enrica; Kutuzov, Andrei; Buchmann, Jan & Agerri, Rodrigo
(2022).
SemEval 2022 Task 10: Structured Sentiment Analysis.
In Emerson, Guy; Schluter, Natalie; Stanovsky, Gabriel; Kumar, Ritesh; Palmer, Alexis; Schneider, Nathan; Singh, Siddarth & Ratan, Shyam (Ed.),
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022).
Association for Computational Linguistics.
ISBN 978-1-955917-80-3.
p. 1280–1295.
-
Kutuzov, Andrei; Touileb, Samia; Mæhlum, Petter; Enstad, Tita & Witteman, Alexandra
(2022).
NorDiaChange: Diachronic Semantic Change Dataset for Norwegian,
Proceedings of the Language Resources and Evaluation Conference.
European Language Resources Association.
ISBN 979-10-95546-72-6.
p. 2563–2572.
Full text in Research Archive
We describe NorDiaChange: the first diachronic semantic change dataset for Norwegian. NorDiaChange comprises two novel subsets, covering about 80 Norwegian nouns manually annotated with graded semantic change over time. Both datasets follow the same annotation procedure and can be used interchangeably as train and test splits for each other. NorDiaChange covers the time periods related to pre- and post-war events, oil and gas discovery in Norway, and technological developments. The annotation was done using the DURel framework and two large historical Norwegian corpora. NorDiaChange is published in full under a permissive licence, complete with raw annotation data and inferred diachronic word usage graphs (DWUGs).
-
Giulianelli, Mario; Kutuzov, Andrei & Pivovarova, Lidia
(2022).
Do Not Fire the Linguist: Grammatical Profiles Help Language Models Detect Semantic Change,
Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change.
Association for Computational Linguistics.
ISBN 978-1-955917-42-1.
p. 54–67.
Morphological and syntactic changes in word usage — as captured, e.g., by grammatical profiles — have been shown to be good predictors of a word’s meaning change. In this work, we explore whether large pre-trained contextualised language models, a common tool for lexical semantic change detection, are sensitive to such morphosyntactic changes. To this end, we first compare the performance of grammatical profiles against that of a multilingual neural language model (XLM-R) on 10 datasets, covering 7 languages, and then combine the two approaches in ensembles to assess their complementarity. Our results show that ensembling grammatical profiles with XLM-R improves semantic change detection performance for most datasets and languages. This indicates that language models do not fully cover the fine-grained morphological and syntactic signals that are explicitly represented in grammatical profiles. Interesting exceptions are the test sets where the time spans under analysis are much longer than the time gap between them (for example, century-long spans with a one-year gap between them). Morphosyntactic change is slow, so grammatical profiles do not detect it in such cases. In contrast, language models, thanks to their access to lexical information, are able to detect fast topical changes.
-
Kutuzov, Andrei; Giulianelli, Mario & Pivovarova, Lidia
(2021).
Grammatical Profiling for Semantic Change Detection,
Proceedings of the 25th Conference on Computational Natural Language Learning.
Association for Computational Linguistics.
ISBN 978-1-955917-05-6.
p. 423–434.
Semantics, morphology and syntax are strongly interdependent. However, the majority of computational methods for semantic change detection use distributional word representations which encode mostly semantics. We investigate an alternative method, grammatical profiling, based entirely on changes in the morphosyntactic behaviour of words. We demonstrate that it can be used for semantic change detection and even outperforms some distributional semantic methods. We present an in-depth qualitative and quantitative analysis of the predictions made by our grammatical profiling system, showing that they are plausible and interpretable.
-
Iazykova, Tatyana; Kapelyushnik, Denis; Bystrova, Olga & Kutuzov, Andrei
(2021).
Unreasonable Effectiveness of Rule-Based Heuristics in Solving Russian SuperGLUE Tasks.
Komp'yuternaya Lingvistika i Intellektual'nye Tekhnologii.
ISSN 2221-7932.
20,
p. 302–318.
doi:
10.28995/2075-7182-2021-20-302-317.
Leaderboards like SuperGLUE are seen as important incentives for active development of NLP, since they provide standard benchmarks for fair comparison of modern language models. They have driven the world's best engineering teams and their resources to collaborate on solving a set of tasks for general language understanding. Their performance scores are often claimed to be close to or even higher than human performance.
These results encouraged more thorough analysis of whether the benchmark datasets featured any statistical cues that machine learning based language models can exploit. For English datasets, it was shown that they often contain annotation artifacts. This allows solving certain tasks with very simple rules and achieving competitive rankings.
In this paper, a similar analysis was done for the Russian SuperGLUE (RSG), a recently published benchmark set and leaderboard for Russian natural language understanding. We show that its test datasets are vulnerable to shallow heuristics. Approaches based on simple rules often outperform or come close to the results of the notorious pre-trained language models like GPT-3 or BERT. It is likely (as the simplest explanation) that a significant part of the SOTA models' performance on the RSG leaderboard is due to exploiting these shallow heuristics and has nothing in common with real language understanding. We provide a set of recommendations on how to improve these datasets, making the RSG leaderboard even more representative of the real progress in Russian NLU.
-
Kutuzov, Andrei & Pivovarova, Lidia
(2021).
Three-part diachronic semantic change dataset for Russian,
Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change.
Association for Computational Linguistics.
ISBN 978-1-954085-60-2.
p. 7–13.
We present a manually annotated lexical semantic change dataset for Russian: RuShiftEval. Its novelty is ensured by a single set of target words annotated for their diachronic semantic shifts across three time periods, while the previous work either used only two time periods, or different sets of target words. The paper describes the composition and annotation procedure for the dataset. In addition, it is shown how the ternary nature of RuShiftEval makes it possible to trace specific diachronic trajectories: ‘changed at a particular time period and stable afterwards’ or ‘was changing throughout all time periods’. Based on the analysis of the submissions to the recent shared task on semantic change detection for Russian, we argue that correctly identifying such trajectories can be an interesting sub-task in itself.
-
Kutuzov, Andrei; Barnes, Jeremy; Velldal, Erik; Øvrelid, Lilja & Oepen, Stephan
(2021).
Large-Scale Contextualised Language Modelling for Norwegian.
In Dobnik, Simon & Øvrelid, Lilja (Ed.),
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa).
Linköping University Electronic Press.
ISBN 978-91-7929-614-8.
p. 30–40.
We present the ongoing NorLM initiative to support the creation and use of very large contextualised language models for Norwegian (and in principle other Nordic languages), including a ready-to-use software environment, as well as an experience report for data preparation and training. This paper introduces the first large-scale monolingual language models for Norwegian, based on both the ELMo and BERT frameworks. In addition to detailing the training process, we present contrastive benchmark results on a suite of NLP tasks for Norwegian. For additional background and access to the data, models, and software, please see: http://norlm.nlpl.eu
-
Rodina, Julia; Trofimova, Yuliya; Kutuzov, Andrei & Artemova, Ekaterina
(2021).
ELMo and BERT in Semantic Change Detection for Russian,
Proceedings of AIST 2020: Analysis of Images, Social Networks and Texts.
Springer.
ISBN 978-3-030-72610-2.
p. 175–186.
We study the effectiveness of contextualized embeddings for the task of diachronic semantic change detection for Russian language data. Evaluation test sets consist of Russian nouns and adjectives annotated based on their occurrences in texts created in pre-Soviet, Soviet and post-Soviet time periods. ELMo and BERT architectures are compared on the task of ranking Russian words according to the degree of their semantic change over time. We use several methods for aggregation of contextualized embeddings from these architectures and evaluate their performance. Finally, we compare unsupervised and supervised techniques in this task.
-
Kutuzov, Andrei & Kuzmenko, Elizaveta
(2021).
Representing ELMo embeddings as two-dimensional text online,
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations.
Association for Computational Linguistics.
ISBN 978-1-954085-05-3.
p. 143–148.
We describe a new addition to the WebVectors toolkit which is used to serve word embedding models over the Web. The new ELMoViz module adds support for contextualized embedding architectures, in particular for ELMo models. The provided visualizations follow the metaphor of ‘two-dimensional text’ by showing lexical substitutes: words which are most semantically similar in context to the words of the input sentence. The system allows the user to change the ELMo layers from which token embeddings are inferred. It also conveys corpus information about the query words and their lexical substitutes (namely their frequency tiers and parts of speech). The module is well integrated into the rest of the WebVectors toolkit, providing lexical hyperlinks to word representations in static embedding models. Two web services have already implemented the new functionality with pre-trained ELMo models for Russian, Norwegian and English.
-
Kutuzov, Andrei & Giulianelli, Mario
(2020).
UiO-UvA at SemEval-2020 Task 1: Contextualised Embeddings for Lexical Semantic Change Detection,
Proceedings of the Fourteenth Workshop on Semantic Evaluation.
Association for Computational Linguistics.
ISBN 978-1-952148-31-6.
p. 126–134.
We apply contextualised word embeddings to lexical semantic change detection in the SemEval-2020 Shared Task 1. This paper focuses on Subtask 2, ranking words by the degree of their semantic drift over time. We analyse the performance of two contextualising architectures (BERT and ELMo) and three change detection algorithms. We find that the most effective algorithms rely on the cosine similarity between averaged token embeddings and the pairwise distances between token embeddings. They outperform strong baselines by a large margin (in the post-evaluation phase, we have the best Subtask 2 submission for SemEval-2020 Task 1), but interestingly, the choice of a particular algorithm depends on the distribution of gold scores in the test set.
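The cosine-similarity algorithm mentioned above can be sketched in a few lines. Toy random vectors stand in for real BERT/ELMo token embeddings, and `change_score` is my illustrative name, not the submitted system's API:

```python
import numpy as np

def change_score(vectors_t1, vectors_t2):
    """Cosine distance between the averaged token embeddings of a word's
    usages in two time periods; higher means more semantic change."""
    p1, p2 = vectors_t1.mean(axis=0), vectors_t2.mean(axis=0)
    cos = p1 @ p2 / (np.linalg.norm(p1) * np.linalg.norm(p2))
    return 1.0 - cos

# Toy stand-ins for contextualised usage embeddings of a single word:
rng = np.random.default_rng(0)
sense = rng.normal(size=16)                        # a "meaning" direction
old = sense + 0.1 * rng.normal(size=(20, 16))      # usages in period 1
same = sense + 0.1 * rng.normal(size=(20, 16))     # period 2, meaning stable
other = -sense + 0.1 * rng.normal(size=(20, 16))   # period 2, meaning shifted

print(change_score(old, same) < change_score(old, other))  # prints True
```

A stable word's two prototype vectors stay nearly parallel (distance near 0), while a shifted word's prototypes diverge (distance approaching 2 in this extreme toy case).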
-
Rodina, Julia & Kutuzov, Andrei
(2020).
RuSemShift: a dataset of historical lexical semantic change in Russian,
Proceedings of the 28th International Conference on Computational Linguistics.
Association for Computational Linguistics.
ISBN 978-1-952148-27-9.
p. 1037–1047.
We present RuSemShift, a large-scale manually annotated test set for the task of semantic change modeling in Russian for two long-term time period pairs: from the pre-Soviet through the Soviet times and from the Soviet through the post-Soviet times. Target words were annotated by multiple crowd-source workers. The annotation process was organized following the DURel framework and was based on sentence contexts extracted from the Russian National Corpus. Additionally, we report the performance of several distributional approaches on RuSemShift; they achieve promising results, while still leaving room for improvement by other researchers.
-
Kutuzov, Andrei; Fomin, Vadim; Rodina, Julia & Mikhailov, Vladislav
(2020).
ShiftRy: web service for diachronic analysis of Russian news.
Komp'yuternaya Lingvistika i Intellektual'nye Tekhnologii.
ISSN 2221-7932.
19,
p. 485–501.
We present the ShiftRy web service. It helps to analyze temporal changes in the usage of words in news texts from Russian mass media. For that, we employ diachronic word embedding models trained on large Russian news corpora from 2010 up to 2019. The users can explore the usage history of any given query word, or browse the lists of words ranked by the degree of their semantic drift between any two years. Visualizations of the words’ trajectories through time are provided. Importantly, users can obtain corpus examples with the query word before and after the semantic shift (if any). The aim of ShiftRy is to ease the task of studying word history over short time spans, and the influence of social and political events on word usage. The service will be updated with new data yearly.
-
Logacheva, Varvara; Teslenko, Denis; Shelmanov, Artem; Remus, S.; Ustalov, Dmitry & Kutuzov, Andrei
(2020).
Word Sense Disambiguation for 158 Languages using Word Embeddings Only.
In Calzolari, Nicoletta; Béchet, Frédéric; Blache, Philippe; Choukri, Khalid; Cieri, Christopher; Declerck, Thierry; Goggi, Sara; Isahara, Hitoshi; Maegaard, Bente; Mariani, Joseph; Mazo, Hélène; Moreno, Asuncion; Odijk, Jan & Piperidis, Stelios (Ed.),
Proceedings of The 12th Language Resources and Evaluation Conference.
European Language Resources Association.
ISBN 979-10-95546-34-4.
p. 5945–5954.
Disambiguation of word senses in context is easy for humans, but is a major challenge for automatic approaches. Sophisticated supervised and knowledge-based models were developed to solve this task. However, (i) the inherent Zipfian distribution of supervised training instances for a given word and/or (ii) the quality of linguistic knowledge representations motivate the development of completely unsupervised and knowledge-free approaches to word sense disambiguation (WSD). They are particularly useful for under-resourced languages which do not have any resources for building either supervised and/or knowledge-based models. In this paper, we present a method that takes as input a standard pre-trained word embedding model and induces a fully-fledged word sense inventory, which can be used for disambiguation in context. We use this method to induce a collection of sense inventories for 158 languages on the basis of the original pre-trained fastText word embeddings by Grave et al. (2018), enabling WSD in these languages. Models and system are available online.
-
-
Droganova, Kira; Kutuzov, Andrei; Mediankin, Nikita & Zeman, Daniel
(2019).
ÚFAL-Oslo at MRP 2019: Garage Sale Semantic Parsing.
In Oepen, Stephan; Abend, Omri; Hajic, Jan; Hershcovich, Daniel; Kuhlmann, Marco; O’Gorman, Tim & Nianwen, Xue (Ed.),
Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning.
Association for Computational Linguistics.
ISBN 978-1-950737-60-4.
p. 158–165.
doi:
10.18653/v1/K19-2015.
-
Kutuzov, Andrei & Kuzmenko, Elizaveta
(2019).
To Lemmatize or Not to Lemmatize: How Word Normalisation Affects ELMo Performance in Word Sense Disambiguation.
In Nivre, Joakim; Derczynski, Leon; Ginter, Filip; Lindi, Bjørn; Oepen, Stephan; Søgaard, Anders & Tiedemann, Jörg (Ed.),
Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing.
Linköping University Electronic Press.
ISBN 978-91-7929-999-6.
p. 22–28.
In this paper, we critically evaluate the widespread assumption that deep learning NLP models do not require lemmatized input. To test this, we trained versions of contextualised word embedding ELMo models on raw tokenized corpora and on the corpora with word tokens replaced by their lemmas. Then, these models were evaluated on the word sense disambiguation task. This was done for the English and Russian languages. The experiments showed that while lemmatization is indeed not necessary for English, the situation is different for Russian. It seems that for rich-morphology languages, using lemmatized training and testing data yields small but consistent improvements, at least for word sense disambiguation. This means that decisions about text pre-processing before training ELMo should consider the linguistic nature of the language in question.
-
Kutuzov, Andrei; Dorgham, Mohammad; Oliynyk, Oleksiy; Biemann, Chris & Panchenko, Alexander
(2019).
Making Fast Graph-based Algorithms with Graph Metric Embeddings.
In Korhonen, Anna; Traum, David & Màrquez, Lluís (Ed.),
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
Association for Computational Linguistics.
ISBN 978-1-950737-48-2.
p. 3349–3355.
doi:
10.18653/v1/P19-1325.
Graph measures, such as node distances, are inefficient to compute. We explore dense vector representations as an effective way to approximate the same information. We introduce a simple yet efficient and effective approach for learning graph embeddings. Instead of directly operating on the graph structure, our method takes structural measures of pairwise node similarities into account and learns dense node representations reflecting user-defined graph distance measures, such as e.g. the shortest path distance or distance measures that take information beyond the graph structure into account. We demonstrate a speed-up of several orders of magnitude when predicting word similarity by vector operations on our embeddings as opposed to directly computing the respective path-based measures, while outperforming various other graph embeddings on semantic similarity and word sense disambiguation tasks.
-
Rodina, Julia; Bakshandaeva, Daria; Fomin, Vadim; Kutuzov, Andrei; Touileb, Samia & Velldal, Erik
(2019).
Measuring Diachronic Evolution of Evaluative Adjectives with Word Embeddings: the Case for English, Norwegian, and Russian.
In Tahmasebi, Nina; Borin, Lars; Jatowt, Adam & Xu, Yang (Ed.),
Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change.
Association for Computational Linguistics.
ISBN 978-1-950737-31-4.
p. 202–209.
doi:
10.18653/v1/W19-4725.
Full text in Research Archive
We measure the intensity of diachronic semantic shifts in adjectives in English, Norwegian and Russian across 5 decades. This is done in order to test the hypothesis that evaluative adjectives are more prone to temporal semantic change. To this end, 6 different methods of quantifying semantic change are used. Frequency-controlled experimental results show that, depending on the particular method, evaluative adjectives either do not differ from other types of adjectives in terms of semantic change or appear to actually be less prone to shifting (particularly, to ‘jitter’-type shifting). Thus, in spite of many well-known examples of semantically changing evaluative adjectives (like ‘terrific’ or ‘incredible’), it seems that such cases are not specific to this particular type of words.
-
Kutuzov, Andrei; Velldal, Erik & Øvrelid, Lilja
(2019).
One-to-X Analogical Reasoning on Word Embeddings: a Case for Diachronic Armed Conflict Prediction from News Texts.
In Tahmasebi, Nina; Borin, Lars; Jatowt, Adam & Xu, Yang (Ed.),
Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change.
Association for Computational Linguistics.
ISBN 978-1-950737-31-4.
p. 196–201.
doi:
10.18653/v1/W19-4724.
Full text in Research Archive
We extend the well-known word analogy task to a one-to-X formulation, including one-to-none cases, when no correct answer exists. The task is cast as a relation discovery problem and applied to historical armed conflicts datasets, attempting to predict new relations of type ‘location:armed-group’ based on data about past events. As the source of semantic information, we use diachronic word embedding models trained on English news texts. A simple technique to improve diachronic performance in such task is demonstrated, using a threshold based on a function of cosine distance to decrease the number of false positives; this approach is shown to be beneficial on two different corpora. Finally, we publish a ready-to-use test set for one-to-X analogy evaluation on historical armed conflicts data.
-
Kutuzov, Andrei; Dorgham, Mohammad; Oliynyk, Oleksiy; Biemann, Chris & Panchenko, Alexander
(2019).
Learning Graph Embeddings from WordNet-based Similarity Measures.
In Mihalcea, Rada; Shutova, Ekaterina; Ku, Lun-Wei; Evang, Kilian & Poria, Soujanya (Ed.),
Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019).
Association for Computational Linguistics.
ISBN 978-1-948087-93-3.
p. 125–135.
doi:
10.18653/v1/S19-1014.
We present path2vec, a new approach for learning graph embeddings that relies on structural measures of pairwise node similarities. The model learns representations for nodes in a dense space that approximate a given user-defined graph distance measure, such as e.g. the shortest path distance or distance measures that take information beyond the graph structure into account. Evaluation of the proposed model on semantic similarity and word sense disambiguation tasks, using various WordNet-based similarity measures, show that our approach yields competitive results, outperforming strong graph embedding baselines. The model is computationally efficient, being orders of magnitude faster than the direct computation of graph-based distances.
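The core idea, learning node vectors whose dot products approximate user-defined pairwise graph similarities, can be sketched with plain SGD. The toy graph, target values, and hyperparameters below are mine; the actual path2vec model adds batching, negative sampling and regularisation:

```python
import numpy as np

# Toy target similarities between graph nodes (e.g., derived from
# shortest-path distances in a WordNet-like graph); values are made up.
pairs = {(0, 1): 0.9, (1, 2): 0.8, (0, 2): 0.5, (2, 3): 0.7, (0, 3): 0.2}
n_nodes, dim, lr = 4, 8, 0.05

rng = np.random.default_rng(42)
emb = 0.1 * rng.normal(size=(n_nodes, dim))  # random initial node vectors

for _ in range(5000):  # SGD on the squared error of each node pair
    for (i, j), target in pairs.items():
        err = emb[i] @ emb[j] - target   # dot product vs. desired similarity
        emb[i], emb[j] = emb[i] - lr * 2 * err * emb[j], emb[j] - lr * 2 * err * emb[i]

worst = max(abs(emb[i] @ emb[j] - t) for (i, j), t in pairs.items())
print(worst < 0.05)  # prints True: all pairwise similarities are fitted
```

After training, similarity queries reduce to fast dot products over the learned vectors, which is the source of the speed-up reported in the paper.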
-
Fomin, Vadim; Bakshandaeva, Daria; Rodina, Julia & Kutuzov, Andrei
(2019).
Tracing Cultural Diachronic Semantic Shifts in Russian Using Word Embeddings: Test Sets and Baselines.
Komp'yuternaya Lingvistika i Intellektual'nye Tekhnologii.
ISSN 2221-7932.
2019-May(18),
p. 213–227.
The paper introduces manually annotated test sets for the task of tracing diachronic (temporal) semantic shifts in Russian. The two test sets are complementary: the first covers comparatively strong semantic changes occurring to nouns and adjectives from pre-Soviet to Soviet times, while the second covers comparatively subtle socially and culturally determined shifts occurring between 2000 and 2014. Additionally, the second test set offers a more granular classification of shift degree, but is limited to adjectives only. The introduction of these test sets allowed us to evaluate several well-established algorithms for semantic shift detection (posing it as a classification problem), most of which had never been tested on Russian material. All of these algorithms use distributional word embedding models trained on the corresponding in-domain corpora. The resulting scores provide solid comparison baselines for future studies tackling similar tasks. We publish the datasets, code and trained models in order to facilitate further research in automatically detecting temporal semantic shifts for Russian words, over time periods of different granularities.
-
Bakarov, A; Kutuzov, Andrei & Nikishina, I
(2018).
Russian computational linguistics: Topical structure in 2007-2017 conference papers.
Komp'yuternaya Lingvistika i Intellektual'nye Tekhnologii.
ISSN 2221-7932.
2018-May(17).
The Russian NLP community has existed for at least several decades; however, academic works analyzing it are scarce. The present paper fills this gap by topic modeling the proceedings of three major Russian NLP conferences (Dialogue, AIST and AINL) for the years 2007 to 2017. The resulting corpus consists of about 500 academic papers. We focus on the analysis of developing research trends manifested in topical drift over time. As a result, we show statistically how the interests of the Russian NLP community are moving towards machine learning and how Dialogue (as the largest venue) influences the whole computational linguistics landscape.
-
Kutuzov, Andrei; Øvrelid, Lilja; Szymanski, Terrence & Velldal, Erik
(2018).
Diachronic word embeddings and semantic shifts: a survey,
Proceedings of the 27th International Conference on Computational Linguistics.
Association for Computational Linguistics.
ISBN 978-1-948087-50-6.
p. 1384–1397.
Full text in Research Archive
Recent years have witnessed a surge of publications aimed at tracing temporal changes in lexical semantics using distributional methods, particularly prediction-based word embedding models. However, this vein of research lacks the cohesion, common terminology and shared practices of more established areas of natural language processing. In this paper, we survey the current state of academic research related to diachronic word embeddings and semantic shift detection. We start by discussing the notion of semantic shifts, and then continue with an overview of existing methods for tracing such time-related shifts with word embedding models. We propose several axes along which these methods can be compared, and outline the main challenges facing this emerging subfield of NLP, as well as its prospects and possible applications.
-
Ustalov, Dmitry; Panchenko, Alexander; Kutuzov, Andrei; Biemann, Chris & Ponzetto, Simone
(2018).
Unsupervised semantic frame induction using triclustering,
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).
Association for Computational Linguistics.
ISBN 978-1-948087-34-6.
p. 55–62.
doi: 10.18653/v1/P18-2010.
Full text in Research Archive
We use dependency triples automatically extracted from a Web-scale corpus to perform unsupervised semantic frame induction. We cast the frame induction problem as a triclustering problem, a generalization of clustering to triadic data. Our replicable benchmarks demonstrate that the proposed graph-based approach, Triframes, shows state-of-the-art results on this task on a FrameNet-derived dataset and performs on par with competitive methods on a verb class clustering task.
-
Kutuzov, Andrei
(2018).
Russian Word Sense Induction by Clustering Averaged Word Embeddings.
Komp'yuternaya Lingvistika i Intellektual'nye Tekhnologii.
ISSN 2221-7932.
2018-May(17),
p. 391–403.
Full text in Research Archive
The paper reports our participation in the shared task on word sense induction and disambiguation for the Russian language (RUSSE’2018). Our team was ranked 2nd on the wiki-wiki dataset (containing mostly homonyms) and 5th on the bts-rnc and active-dict datasets (containing mostly polysemous words) among all 19 participants. The method we employed was extremely naive: we represented the contexts of ambiguous words as averaged word embedding vectors, using off-the-shelf pre-trained distributional models. These vector representations were then clustered with mainstream clustering techniques, producing groups corresponding to the ambiguous word’s senses. As a side result, we show that word embedding models trained on small but balanced corpora can be superior to those trained on large but noisy data, not only in intrinsic evaluation but also in downstream tasks like word sense induction.
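The pipeline just described (average the context embeddings, then cluster) is simple enough to sketch end to end. The k-means below, with farthest-point initialisation, merely stands in for the "mainstream clustering techniques" the abstract mentions; the toy vectors are illustrative.

```python
import numpy as np

def context_vector(tokens, model):
    """Represent an ambiguous word's context as the mean of the
    embeddings of its context words (skipping OOV tokens)."""
    return np.mean([model[t] for t in tokens if t in model], axis=0)

def kmeans(points, k=2, iters=20):
    """Minimal k-means: farthest-point initialisation keeps the seeds
    apart, followed by standard Lloyd iterations."""
    centers = [points[0]]
    for _ in range(k - 1):
        d = np.min(((points[:, None] - np.array(centers)[None]) ** 2).sum(-1), axis=1)
        centers.append(points[int(np.argmax(d))])
    centers = np.array(centers, dtype=float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        labels = np.argmin(((points[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = points[labels == c].mean(axis=0)
    return labels
```

Each resulting cluster of context vectors is then read as one sense of the ambiguous word.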
-
Kutuzov, Andrei & Kunilovskaya, Maria
(2018).
Size vs. Structure in Training Corpora for Word Embedding Models: Araneum Russicum Maximum and Russian National Corpus.
Lecture Notes in Computer Science (LNCS).
ISSN 0302-9743.
10716 LNCS,
p. 47–58.
doi: 10.1007/978-3-319-73013-4_5.
In this paper, we present a distributional word embedding model trained on one of the largest available Russian corpora: Araneum Russicum Maximum (over 10 billion words crawled from the web). We compare this model to one trained on the Russian National Corpus (RNC). The two corpora differ considerably in their size and compilation procedures. We test these differences by evaluating the trained models against the Russian part of the Multilingual SimLex999 semantic similarity dataset. We detect and describe numerous issues in this dataset and publish a new, corrected version.
Aside from the already known fact that the RNC is generally a better training corpus than web corpora, we enumerate and explain fine-grained differences in how the models handle the semantic similarity task, which parts of the evaluation set are difficult for particular models, and why. Additionally, we describe the learning curves for both models, showing that the RNC is generally more robust as training material for this task.
-
Kunilovskaya, Maria & Kutuzov, Andrei
(2017).
Universal Dependencies-based syntactic features in detecting human translation varieties.
In Hajič, Jan (Ed.),
Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories.
Association for Computational Linguistics.
ISBN 978-80-88132-04-2.
p. 27–36.
In this paper, syntactic annotation is used to reveal linguistic properties of translations. We employ the Universal Dependencies framework to represent learner and professional translations of English mass-media texts into Russian (along with non-translated Russian texts of the same genre), with the aim of discovering and describing the syntactic specificity of translations produced at different levels of competence. The search for differences between the varieties of translation and the native texts is augmented with results obtained from a series of machine learning classification experiments. We show that syntactic structures have considerable predictive power in translationese detection, on par with the known low-level lexical features.
-
Kutuzov, Andrei & Kuzmenko, Elizaveta
(2017).
Two centuries in two thousand words: Neural embedding models in detecting diachronic lexical changes,
Quantitative Approaches to the Russian Language.
Routledge.
ISBN 9781138097155.
doi: 10.4324/9781315105048-5.
In this paper, we show how Continuous Bag-of-Words (Mikolov et al., 2013) models trained on time-separated sub-corpora of the Russian National Corpus can be used to automatically detect words that may have undergone semantic change. Our central assumption is that online training of such models with new textual data results in a “drift” of word vectors in the semantic space. Given that vectors represent the “meaning” of entities, this drift can be taken to reflect semantic shifts in the words experiencing it. As a result, we were able to closely replicate manually compiled lists of semantically changed Russian words from the existing body of research and to substantially extend them in a largely unsupervised way. This idea is one of the reasons for the title of this paper, which in a way serves as a complement to the “20 words” in (Daniel & Dobrushina, 2016).
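The drift assumption above translates directly into a ranking procedure: compare each word's vector before and after the online update and sort by cosine distance. This is a minimal sketch (names and toy vectors are illustrative), assuming the incremental training keeps the two vector spaces aligned, as described in the abstract.

```python
import numpy as np

def semantic_drift(model_old, model_new):
    """Rank shared-vocabulary words by cosine distance between their
    vectors before and after online updates with newer texts (the
    spaces are assumed aligned by the incremental training itself)."""
    drifts = {}
    for w in set(model_old) & set(model_new):
        a, b = model_old[w], model_new[w]
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        drifts[w] = 1.0 - float(cos)
    return sorted(drifts.items(), key=lambda p: -p[1])
```

Words at the top of the ranking are the candidates for having undergone semantic change.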
-
Kutuzov, Andrei; Velldal, Erik & Øvrelid, Lilja
(2017).
Temporal dynamics of semantic relations in word embeddings: an application to predicting armed conflict participants,
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.
Association for Computational Linguistics.
ISBN 978-1-945626-83-8.
p. 1825–1830.
doi: 10.18653/v1/D17-1194.
Full text in Research Archive
This paper deals with using word embedding models to trace the temporal dynamics of semantic relations between pairs of words. The set-up is similar to the well-known analogies task, but expanded with a time dimension. To this end, we apply incremental updating of the models with new training texts, including incremental vocabulary expansion, coupled with learned transformation matrices that let us map between members of the relation.
The proposed approach is evaluated on the task of predicting insurgent armed groups based on geographical locations. The gold standard data for the time span 1994–2010 is extracted from the UCDP Armed Conflicts dataset. The results show that the method is feasible and outperforms the baselines, but also that important work still remains to be done.
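One simple way to realize the "learned transformation matrices" mentioned above is a linear least-squares map from source-member vectors (locations) to target-member vectors (armed groups); prediction then becomes a nearest-neighbour search over candidates. This sketch, with illustrative names and toy vectors, is an assumption about the general shape of such a mapping, not the paper's exact procedure.

```python
import numpy as np

def learn_relation_map(sources, targets):
    """Fit a linear map W (least squares) sending source-member
    vectors to target-member vectors."""
    X, Y = np.asarray(sources), np.asarray(targets)
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def predict_target(W, source_vec, candidates):
    """Map a new source vector through W and return the candidate
    whose vector is most cosine-similar to the mapped point."""
    q = np.asarray(source_vec) @ W
    best, best_sim = None, -np.inf
    for name, vec in candidates.items():
        v = np.asarray(vec)
        sim = q @ v / (np.linalg.norm(q) * np.linalg.norm(v))
        if sim > best_sim:
            best, best_sim = name, sim
    return best
```

In the temporal setting, W would be refit as the embedding model is incrementally updated year by year.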
-
Kutuzov, Andrei; Velldal, Erik & Øvrelid, Lilja
(2017).
Tracing armed conflicts with diachronic word embedding models.
In Caselli, Tommaso (Ed.),
Proceedings of the Events and Stories in the News Workshop.
Association for Computational Linguistics.
ISBN 978-1-945626-63-0.
p. 31–36.
doi: 10.18653/v1/W17-2705.
Full text in Research Archive
Recent studies have shown that word embedding models can be used to trace time-related (diachronic) semantic shifts for particular words. In this paper, we evaluate some of these approaches on the new task of predicting the dynamics of global armed conflicts on a year-to-year basis, using a dataset from the field of conflict research as the gold standard and the Gigaword news corpus as the training data. The results show that much work still remains in extracting ‘cultural’ semantic shifts from diachronic word embedding models. At the same time, we present a new task complete with an evaluation set and introduce the ‘anchor words’ method, which outperforms previous approaches on this data.
-
Kunilovskaya, Maria & Kutuzov, Andrei
(2017).
Testing target text fluency: A machine learning approach to detecting syntactic translationese in English-Russian translation,
New perspectives on cohesion and coherence: Implications for translation.
Language Science Press.
ISBN 978-3-946234-72-2.
p. 75–103.
doi: 10.5281/zenodo.814452.
This research is aimed at the semi-automatic detection of divergences in sentence structure between Russian translated texts and non-translations. We focus on atypical syntactic features of translations, because they have a greater negative impact on overall textual quality than lexical translationese. Inadequate syntactic structures cause various problems with target text fluency, which reduce readability and the reader's ability to grasp the text's message. From a procedural viewpoint, faulty syntax implies more post-editing effort.
Within this research, we reveal cases of syntactic translationese as dissimilarities between patterns of selected morphosyntactic and syntactic features (such as part of speech and sentence length) in the context of sentence boundaries, observed in comparable monolingual corpora of learner-translated and non-translated texts in Russian.
To establish these syntactic differences, we resort to a machine learning approach rather than the usual statistical significance analyses. To this end, we employ models that predict unnatural sentence boundaries in translations and highlight the factors responsible for their ‘foreignness’.
In the first stage of the experiment, we train a decision tree model to describe the contextual features of sentence boundaries in the reference corpus of Russian texts. In the second stage, we use the results of this first multifactorial analysis as indicators of learner translators' choices that run counter to the regularities of the standard language variety. The predictors and their combinations are evaluated for their efficiency on this task. As a result, we are able to extract translated sentences whose structure is atypical of Russian texts produced without the constraints of the translation process and which can therefore be tentatively considered less fluent. These sentences represent cases of translationese.
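The classification step can be illustrated in miniature with a one-node decision stump standing in for the paper's decision tree: exhaustively pick the feature/threshold pair that best separates the two classes of sentence boundaries. The features (e.g. sentence length) and the data are hypothetical, chosen purely for illustration.

```python
def best_stump(samples, labels):
    """Find the (feature, threshold) pair minimising classification
    error when predicting label 1 for samples[f] > threshold.
    A one-node 'decision tree' over boundary features."""
    best_feat, best_thr, best_err = 0, 0.0, len(labels) + 1
    for f in range(len(samples[0])):
        for thr in sorted({s[f] for s in samples}):
            preds = [int(s[f] > thr) for s in samples]
            err = sum(p != y for p, y in zip(preds, labels))
            if err < best_err:
                best_feat, best_thr, best_err = f, thr, err
    return best_feat, best_thr, best_err
```

A full decision tree repeats this split recursively on each branch; the stump already shows how a single interpretable predictor can flag unnatural boundaries.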
-
Kutuzov, Andrei
(2017).
Arbitrariness of Linguistic Sign Questioned: Correlation between Word Form and Meaning in Russian.
Komp'yuternaya Lingvistika i Intellektual'nye Tekhnologii.
ISSN 2221-7932.
1(16),
p. 109–120.
Full text in Research Archive
In this paper, we present the results of preliminary experiments on finding the link between the surface forms of Russian nouns (as represented by their graphic forms) and their meanings (as represented by vectors in a distributional model trained on the Russian National Corpus). We show that there is a strongly significant correlation between these two sides of a linguistic sign (in our case, a word). This correlation coefficient is 0.03 as calculated on a set of 1,729 monosyllabic nouns, and in some subsets of words starting with particular two-letter sequences the correlation rises as high as 0.57. The overall correlation value is higher than the one reported in similar experiments for English (0.016).
Additionally, we report correlation values for noun subsets related to different phonaesthemes, supposedly represented by the initial characters of these nouns.
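A form/meaning correlation of this kind can be approximated by correlating pairwise edit distances of word forms with pairwise cosine distances of their embeddings. This Mantel-style proxy is an assumption about the general approach, not the paper's exact computation, and the toy vectors are illustrative.

```python
import numpy as np

def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def form_meaning_correlation(words, model):
    """Pearson correlation between pairwise edit distances of word
    forms and cosine distances of their embedding vectors."""
    forms, meanings = [], []
    ws = list(words)
    for i in range(len(ws)):
        for j in range(i + 1, len(ws)):
            forms.append(levenshtein(ws[i], ws[j]))
            a, b = model[ws[i]], model[ws[j]]
            meanings.append(1 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return float(np.corrcoef(forms, meanings)[0, 1])
```

A positive coefficient means that words spelled similarly also tend to have similar vectors, i.e. the linguistic sign is not entirely arbitrary.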
-
Lison, Pierre & Kutuzov, Andrei
(2017).
Redefining Context Windows for Word Embedding Models: An Experimental Study.
In Tiedemann, Jörg (Ed.),
Proceedings of the 21st Nordic Conference on Computational Linguistics (NoDaLiDa).
Linköping University Electronic Press.
ISBN 978-91-7685-601-7.
p. 284–288.
Full text in Research Archive
Distributional semantic models learn vector representations of words through the contexts they occur in. Although the choice of context (which often takes the form of a sliding window) has a direct influence on the resulting embeddings, the exact role of this model component is still not fully understood. This paper presents a systematic analysis of context windows based on a set of four distinct hyperparameters. We train continuous Skip-Gram models on two English-language corpora for various combinations of these hyperparameters, and evaluate them on both lexical similarity and analogy tasks. Notable experimental results are the positive impact of cross-sentential contexts and the surprisingly good performance of right-context windows.
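Two of the window hyperparameters varied above (window direction, and whether the window may cross sentence boundaries) can be made concrete with a small context-pair extractor. This is an illustrative sketch, not the paper's actual training code.

```python
def window_contexts(sentences, window=2, right_only=False, cross_sentential=False):
    """Enumerate (target, context) pairs for Skip-Gram-style training.
    `right_only` restricts the window to following tokens;
    `cross_sentential` lets the window span sentence boundaries."""
    stream = [[t for s in sentences for t in s]] if cross_sentential else sentences
    pairs = []
    for sent in stream:
        for i, target in enumerate(sent):
            lo = i if right_only else max(0, i - window)
            for j in range(lo, min(len(sent), i + window + 1)):
                if j != i:
                    pairs.append((target, sent[j]))
    return pairs
```

Comparing models trained on the different pair sets is one way to measure the effects the abstract reports, such as the gain from cross-sentential contexts.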
-
Kutuzov, Andrei; Fares, Murhaf; Oepen, Stephan & Velldal, Erik
(2017).
Word vectors, reuse, and replicability: Towards a community repository of large-text resources.
In Tiedemann, Jörg (Ed.),
Proceedings of the 21st Nordic Conference on Computational Linguistics (NoDaLiDa).
Linköping University Electronic Press.
ISBN 978-91-7685-601-7.
p. 271–276.
Full text in Research Archive
This paper describes an emerging shared repository of large-text resources for creating word vectors, including pre-processed corpora and pre-trained vectors for a range of frameworks and configurations. This will facilitate reuse, rapid experimentation, and replicability of results.
-
Kutuzov, Andrei; Kuzmenko, Elizaveta & Pivovarova, Lidia
(2017).
Clustering of Russian Adjective-Noun Constructions using Word Embeddings.
In Pivovarova, Lidia; Piskorski, Jakub & Erjavec, Tomaž (Eds.),
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing.
Association for Computational Linguistics.
ISBN 978-1-945626-45-6.
p. 3–13.
doi: 10.18653/v1/W17-1402.
Full text in Research Archive
This paper presents a method of automatic construction extraction from a large corpus of Russian. The term ‘construction’ here means a multi-word expression in which a variable can be replaced with another word from the same semantic class, for example, ‘a glass of [water/juice/milk]’. We deal with constructions that consist of a noun and its adjective modifier. We propose a method of grouping such constructions into semantic classes via two-step clustering of word vectors in distributional models. We compare it with other clustering techniques and evaluate it against A Russian-English Collocational Dictionary of the Human Body, which contains manually annotated groups of constructions with nouns denoting human body parts.
The best performing method is used to cluster all adjective-noun bigrams in the Russian National Corpus. The results of this procedure are publicly available and can be used to build a Russian construction dictionary, accelerate theoretical studies of constructions, and facilitate teaching Russian as a foreign language.
-
Kutuzov, Andrei & Kuzmenko, Elizaveta
(2015).
Semi-automated typical error annotation for learner English essays: integrating frameworks,
Proceedings of the 4th workshop on NLP for Computer Assisted Language Learning at NODALIDA 2015.
Linköping University Electronic Press.
ISBN 978-91-7519-036-5.
p. 35–41.
-
Kutuzov, Andrei & Kuzmenko, Elizaveta
(2015).
Comparing Neural Lexical Models of a Classic National Corpus and a Web Corpus: The Case for Russian.
In Gelbukh, Alexander (Eds.),
Computational Linguistics and Intelligent Text Processing.
Springer Publishing Company.
ISBN 978-3-319-18111-0.
p. 47–58.
doi: 10.1007/978-3-319-18111-0_4.
-
View all works in Cristin
-
Kutuzov, Andrei; van der Aalst, Wil M. P.; Batagelj, Vladimir & Ignatov, Dmitry
(2021).
Proceedings of AIST 2020: Analysis of Images, Social Networks and Texts.
Springer.
ISBN 978-3-030-72610-2.
480 p.
Published Oct. 14, 2015 6:11 PM - Last modified Mar. 19, 2024 3:51 PM