LTG research seminar

NLP researchers from both within and outside LTG present their findings in an informal environment, followed by questions and discussion.


The Language Technology Group research seminar is a biweekly event. The regular time slot for the seminar is Tuesdays, 12:15 - 13:00 CEST. Talks usually consist of a 30-minute presentation and 15 minutes of discussion with the audience.

The seminar is conducted in hybrid form: both in person at Ole-Johan Dahls hus, UiO, and online in Zoom (link available by request). For questions or suggestions related to the LTG seminar, please contact Lucas Charpentier.

Forthcoming talks

Past talks

The LTG research seminar has a long history, but this page only covers talks from Fall 2021 onwards.

May 7th, 2024

Topic: A journey from tokenization to instruction-finetuning (NorMistral)
Room: Styrerom 4118
We will present the whole process of creating the Norwegian language models, specifically NorMistral: from tokenization to pretraining, through model evaluation and quantization, and finishing with instruction fine-tuning, limitations and future plans.

April 9th, 2024

Topic: Socio-political Events of Conflict and Unrest
There is a large and growing body of literature on datasets created to facilitate the study of socio-political events of conflict and unrest. However, the datasets, and the approaches taken to create them, vary a lot depending on the type of research they are intended to support. For example, while scholars from natural language processing (NLP) tend to focus on annotating specific spans of text indicating various components of an event, scholars from the disciplines of political science and conflict studies tend to focus on creating databases that code an abstract but structured representation of the event, less tied to a specific source text. The survey aims to map out the current landscape of available event datasets within the domain of social and political conflict and unrest – both from the NLP and political science communities – offering a unified view of the work done across different disciplines.

March 26th, 2024

Topic: Emotion Analysis of Tweets Banning Education in Afghanistan
We introduce the first emotion-annotated dataset for the Dari variant of Persian spoken in Afghanistan. The LetHerLearn dataset contains 7,600 tweets posted in reaction to the Taliban’s ban of women’s rights to education in 2022 and has been manually annotated according to Ekman’s emotion categories. We here detail the data collection and annotation process, present relevant dataset statistics as well as initial experiments on the resulting dataset, benchmarking a number of different neural architectures for the task of Dari emotion classification.

March 12th, 2024

Topic: Third-semester report

February 27th, 2024

Topic: Consistent anomalies in the grammatical number of nouns: an overview
Elena Spaziani (La Sapienza)
Grammatical number is one of the most studied categories in linguistics. Given the apparently straightforward relationship between grammatical values and the numerical properties they denote, grammatical number can be considered one of the simplest categories. However, there are many difficulties, epitomised by what have long been called “anomalies”, that arise on closer inspection. An overview will be given of two anomalous cases, namely lexical plurals, i.e. plural forms that have a non-grammatical meaning, and singularia and pluralia tanta nouns, i.e. nouns that are used in only one grammatical form.

February 13th, 2024

Topic: More Room For Language: Investigating the Effect of Retrieval on Language Models
Retrieval-augmented language models pose a promising alternative to standard language modeling. During pretraining, these models search in a corpus of documents for contextually relevant information that could aid the language modeling objective. We introduce an 'ideal retrieval' methodology to study these models in a fully controllable setting. We conduct an extensive evaluation to examine how retrieval augmentation affects the behavior of the underlying language model. Among other things, we observe that these models: (i) store substantially less world knowledge in their weights, (ii) are better at understanding local context and inter-word dependencies, but (iii) are worse at comprehending global context.
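As a toy illustration of the retrieval-augmentation idea (our own minimal sketch, not the paper's actual pipeline; the document collection, the word-overlap scoring and the [SEP] separator are all invented for this example), retrieved context can simply be prepended to the language model's input:

```python
# Toy retrieval-augmented input construction: documents are scored by
# simple word overlap with the training segment, and the best match is
# prepended to the language model input.

def retrieve(segment, documents):
    """Return the document sharing the most (lowercased) words with `segment`."""
    seg_words = set(segment.lower().split())
    return max(documents, key=lambda d: len(seg_words & set(d.lower().split())))

def augment(segment, documents):
    """Prepend the retrieved context, separated by a [SEP] marker."""
    return retrieve(segment, documents) + " [SEP] " + segment

docs = [
    "Oslo is the capital of Norway.",
    "Transformers process tokens in parallel.",
]
print(augment("The capital of Norway is", docs))
# -> Oslo is the capital of Norway. [SEP] The capital of Norway is
```

With the relevant fact available in the context, the model has less need to memorize it in its weights, which is one intuition behind finding (i) above.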

December 14th, 2023

Topic: BabyLM Challenge: What is it? Best Models and Papers
The shared task of the CoNLL workshop was the BabyLM challenge. One of the objectives of this shared task was to democratize pre-training by creating two small, high-quality datasets (10M and 100M words). The shared task produced some interesting results, which will be briefly presented. In addition, we will present our model, which won both the strict and strict-small tracks, as well as the winner of the loose track and the two outstanding papers.

November 30th, 2023

Topic: Dictionary-like definitions for Semantic Change datasets
Slides
Semantic change is a change of the meanings of a language unit (a morpheme, a word, a phrase etc.). The concept of meaning (sense) in computational linguistics is data-driven, namely, the senses may be induced from a large number of word usages without any additional knowledge. The senses may be defined as clusters of target word usages, but one has to explore the word usages manually in order to understand which particular senses have changed. Since a word may have tens of senses, an automated approach to labeling the sense clusters is required. One of the possible approaches is generating dictionary-like definitions of the word usage clusters. We will discuss some definition generation methods and their evaluation.

November 16th, 2023

Topic: Employing AI to Help Lost Pets Return to Their Homes
The Kashtanka.pet (https://kashtanka.pet) project addresses the problem of searching for lost pets efficiently. There are numerous platforms and groups in social networks collecting ads about missing and found cats and dogs. However, it is often almost impossible for a human to find a specific lost pet among millions of ads about found pets distributed across all those websites. The project aims at helping pet owners and volunteers to find lost pets efficiently with the help of AI. It crawls websites for ads about lost and found pets, and retrieves the pairs of ads announcing the same pet was lost and then found. The retrieved pairs are then inspected and further processed by humans.

November 2nd, 2023

Topic: Benchmarking transformer language models on natural language understanding tasks
Benchmarking has found broad acceptance in NLP as a conventional approach to comparing language models (LMs) with respect to specific evaluation criteria, such as performance, efficiency, and fairness. However, benchmarking suffers from low linguistic diversity and from result aggregation procedures that do not account for benchmark complexity. To this end, we propose the first large-scale benchmarks for the Russian language that cover a broad scope of NLU tasks. We develop novel aggregation procedures that rely on rankings in each evaluation criterion and allow aggregating heterogeneous information. We present the results of benchmarking over one hundred LMs, comparing their performance to human-level performance across various experimental setups.
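The ranking-based aggregation idea can be sketched in a few lines (our own toy example, not the paper's exact procedure; the models and scores are invented): models are ranked within each evaluation criterion, and ranks rather than raw scores are averaged, so criteria with different scales contribute equally.

```python
# Toy rank-based aggregation: rank models within each criterion
# (higher raw score = better = rank 1), then average the ranks.

def mean_rank(scores):
    """scores: {criterion: {model: value}} with higher values being better."""
    ranks = {}
    for vals in scores.values():
        ordered = sorted(vals, key=vals.get, reverse=True)
        for r, model in enumerate(ordered, start=1):
            ranks.setdefault(model, []).append(r)
    return {m: sum(rs) / len(rs) for m, rs in ranks.items()}

scores = {
    "accuracy": {"A": 0.91, "B": 0.89, "C": 0.80},
    "fairness": {"A": 0.60, "B": 0.75, "C": 0.70},
}
print(mean_rank(scores))  # -> {'A': 2.0, 'B': 1.5, 'C': 2.5}
```

Here model B wins overall despite not topping the accuracy criterion, which is exactly the kind of heterogeneous trade-off rank aggregation is meant to capture.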

October 19th, 2023

Topic: Graph-based Anomaly detection
Graph-based anomaly detection utilizes graph theory and network analysis to identify abnormal patterns or outliers within complex data structures, representing data elements as nodes and their interactions as edges. This approach is crucial due to the intricate relationships graph structures can capture, which traditional machine learning models may struggle to handle effectively. Moreover, the real-time aspect of graph-based algorithms is vital, providing swift anomaly detection for immediate action in critical domains such as network security and fraud detection. These algorithms swiftly uncover atypical behaviors or events, aiding timely interventions and responses in interconnected data scenarios.
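As a minimal, hypothetical illustration of one such structural signal (invented for this page; real systems combine many graph features), a node whose degree deviates strongly from the rest of the graph can be flagged as anomalous:

```python
from statistics import mean, pstdev

def degree_outliers(edges, threshold=2.0):
    """Flag nodes whose degree z-score exceeds `threshold`."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    mu, sigma = mean(degree.values()), pstdev(degree.values())
    return {n for n, d in degree.items() if sigma and (d - mu) / sigma > threshold}

# A "hub" account transacting with everyone is a classic fraud signal.
edges = [("hub", f"acct{i}") for i in range(10)] + [("acct1", "acct2")]
print(degree_outliers(edges))  # -> {'hub'}
```

The same pass over an edge stream can run incrementally, which is what makes such structural checks usable in the real-time settings mentioned above.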

September 28, 2023

Topic: Deep Learning For Unsupervised Relation Extraction

Capturing the interrelations between concepts is fundamental to natural language understanding. It constitutes a bridge between two historically separate approaches of artificial intelligence: the use of symbolic and distributed representations. However, tackling this problem without human supervision poses several issues, and unsupervised models have difficulties echoing the expressive breakthroughs of supervised ones. We address two supervision gaps we identified: the problem of regularizing sentence-level discriminative models and the problem of leveraging relational information from dataset-level structures. The first gap arises from the increased use of discriminative approaches, such as deep neural network classifiers, in the supervised setting. These models tend to collapse without supervision. To overcome this limitation, we introduce two relation distribution losses to constrain the relation classifier into a trainable state. The second gap arises from the development of dataset-level (aggregate) approaches. We show that unsupervised models can leverage a large amount of additional information from the structure of the dataset, even more so than supervised models. We close this gap by adapting existing unsupervised methods to capture topological information using graph convolutional networks. Furthermore, we show that we can exploit the mutual information between topological (dataset-level) and linguistic (sentence-level) information to design a new training paradigm for unsupervised relation extraction.

September 14, 2023

Topic: Corpus-based computational dialectology: Data, methods and results
OBS: This will be held in Kristen Nygaards sal, Room 5370

The CorCoDial (corpus-based computational dialectology) project aims to infer dialect classifications from variation-rich corpora, focusing in particular on the dialect-to-standard normalization task to introduce comparability between different texts. I will start by presenting a multilingual collection of phonetically transcribed and orthographically normalized corpora. This collection forms the data basis of several case studies. In the first study, we investigate to what extent topic models can find dialectological rather than semantic topics. In the second experiment, we evaluate character alignment methods from different research traditions on a range of desirable and undesirable characteristics. In the last study, we focus on neural dialect-to-standard normalization and investigate what the embeddings of speaker labels can tell us about the origin of the speakers.

June 15, 2023

Topic: Interpretable Word Sense Representations via Definition Generation: The Case of Semantic Change Analysis

We propose using automatically generated natural language definitions of contextualised word usages as interpretable word and word sense representations. Given a collection of usage examples for a target word, and the corresponding data-driven usage clusters (i.e., word senses), a definition is generated for each usage with a specialized Flan-T5 language model, and the most prototypical definition in a usage cluster is chosen as the sense label.
We demonstrate how the resulting sense labels can make existing approaches to semantic change analysis more interpretable, and how they can allow users (historical linguists, lexicographers, or social scientists) to explore and intuitively explain diachronic trajectories of word meaning. Semantic change analysis is only one of many possible applications of the "definitions as representations" paradigm. Beyond being human-readable, contextualised definitions also outperform token or usage sentence embeddings in word-in-context semantic similarity judgements, making them a new promising type of lexical representation for NLP.
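The "most prototypical definition" step can be sketched as follows (a toy illustration with bag-of-words vectors; the actual work uses Flan-T5 generations and proper contextual representations, and the example definitions are invented): the prototype is the definition most similar, on average, to the others in its cluster.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two bags of words."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def prototypical(definitions):
    """Index of the definition most similar, on average, to the others."""
    vecs = [Counter(d.lower().split()) for d in definitions]
    def avg_sim(i):
        return sum(cosine(vecs[i], vecs[j]) for j in range(len(vecs)) if j != i)
    return max(range(len(definitions)), key=avg_sim)

defs = [
    "a small domesticated feline animal",
    "a small domesticated feline kept as a pet",
    "a type of construction vehicle",
]
print(defs[prototypical(defs)])  # -> a small domesticated feline kept as a pet
```

The outlier definition ("construction vehicle") drags down its own average similarity, so one of the two feline glosses is chosen as the human-readable sense label.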

April 27, 2023

Topic: Training a new generation of LTG language models
Can a competitive masked language model be trained on a small text corpus of 100 million words? The short answer is yes! And as a reward, we also get an optimized training recipe for language models. Can we then apply this recipe to a huge Norwegian text corpus? Can we apply it to generative language models? And which Norwegian corpus should we use? And how should we evaluate the models? This talk will try to answer all of these questions and much more.

April 20, 2023

Topic: The RWKV language model: An RNN with the advantages of a transformer
Johan Sokrates Wind (UiO, Dep. of Mathematics)

Large pretrained transformers such as ChatGPT have demonstrated impressive performance across a diverse range of tasks. Recently, the RWKV architecture was developed as an RNN that can be trained like a transformer, resulting in a model with 14 billion parameters, making it the largest RNN ever. Moreover, unlike traditional RNNs, RWKV performs comparably to transformers on benchmark tests. In this presentation, I will explain how the RWKV architecture enables effective training and inference.

March 30, 2023

Topic: Phonotactics as an Aid in Low Resource Loan Word Detection and Morphological Analysis in Sakha

Obtaining information about loan words and irregular morphological patterns can be difficult for low-resource languages. Using Sakha as an example,  we show that it is possible to exploit known phonemic regularities such as vowel harmony and consonant distributions to identify loan words and irregular patterns, which can be helpful in rule-based downstream tasks such as parsing and POS-tagging. We evaluate phonemically inspired methods for loanword detection, combined with bi-gram vowel transition probabilities to inspect irregularities in the morphology of loanwords. We show that both these techniques can be useful for the detection of such patterns.
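The vowel-harmony signal can be sketched in a few lines (a deliberately simplified toy: real Sakha harmony also involves rounding and diphthongs, and both the vowel split and the word list here are only illustrative): a word mixing front and back vowels is a loanword candidate.

```python
# Very simplified front/back vowel split for Sakha Cyrillic script.
FRONT = set("еэиөү")
BACK = set("аыоу")

def violates_harmony(word):
    """A native Sakha word should not mix front and back vowels."""
    letters = set(word.lower())
    return bool(letters & FRONT) and bool(letters & BACK)

# Native words keep one vowel class; Russian loans often mix them:
for w in ["киһи", "уол", "телефон"]:
    print(w, violates_harmony(w))
# -> киһи False, уол False, телефон True
```

Words flagged this way can then be routed to separate rules in a downstream parser or POS-tagger, since loanwords tend to inflect irregularly.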

March 16, 2023

Topic: Fine-tuning of cross-lingual language models for lexical semantic change detection

Contextualized word embeddings from out-of-the-box neural language models and masked language models are known to be poor representations of word meaning due to a strong orthographic/grammatical bias. Fine-tuning such models on labeled datasets for the final task is the standard way to achieve good performance. However, it is not obvious how to fine-tune them directly for the lexical semantic change detection (LSCD) task, in which examples are single words and targets are scores corresponding to the change in their meaning between two time periods.
In this talk I will present our experiments on fine-tuning language models for the LSCD task. Our models achieved SOTA results in the RuShiftEval-2021 and LSCDiscovery-2022 shared tasks on lexical semantic change detection for Russian and Spanish, respectively. Surprisingly, even models fine-tuned on English data only achieve near-SOTA performance, thanks to the zero-shot cross-lingual abilities of the underlying cross-lingual masked language model. Another appealing property of the fine-tuned models is their ability to annotate word occurrences with glosses, which suggests a way to make LSCD predictions interpretable.

March 2, 2023

Topic: Language Models and the Discomfort of Not Knowing
Mark Anderson (Norwegian Computing Center)
 
In language technology, we need to evaluate the capabilities of models. Often, the apparently concrete nature of quantitative metrics can give a false sense of certainty. In this talk, I will vaguely, and at times indirectly, discuss my general feeling of discomfort about evaluation in NLP and about claims made about the nature of models.

February 9, 2023

Topic: Meaning Making with Artificial Interlocutors and Risks of Language Technology
Emily M. Bender (University of Washington)
OBS: This will be held in Kristen Nygaards sal, Room 5370

Humans make sense of language in context, bringing to bear their own understanding of the world including their model of their interlocutor's understanding of the world. In this talk, I will explore various potential risks that arise when we as humans bring this sense-making capacity to interactions with artificial interlocutors. That is, I will ask what happens in conversations where one party has no (or extremely limited) access to meaning and all of the interpretative work rests with the other, and briefly explore what this entails for the design of language technology.

February 2, 2023

Topic: Learning Linguistic Tree Structures with Text and Graph Methods
Irina Nikishina (University of Hamburg)
 
Knowledge graphs such as DBpedia, Freebase or Wikidata always contain a taxonomic backbone that allows the arrangement and structuring of various concepts in accordance with the hypo-hypernym ("class-subclass") relationship. With the rapid growth of lexical resources for specific domains, the problem of automatically extending existing knowledge bases with new words is becoming more and more widespread. In this talk, we address the problem of Taxonomy Enrichment, which aims at adding new words to an existing taxonomy.
We formulate two task settings for automatic taxonomy extension. The first aims at predicting hypernyms ("parents", words with broader meanings) from the taxonomy, given a predefined list of new words with no definitions. The second setting considers taxonomy enrichment with no predefined candidates. We assume that the compressed information from pre-trained language models like BERT or T5 can be leveraged to predict new words missing in taxonomic resources.

January 19, 2023

Topic: Developing a Conversational Agent for Engineering Design Collaboration
Joseph Makokha (UiO, DIG group)

Intelligent systems incorporating artificial intelligence (AI) have been around for decades and have steadily gained capabilities approaching those of humans in several domains, ranging from medicine, where they aid in the detection of fractures and tumours, to data science, where they automate tasks, as well as art generation, robotics, self-driving cars, and many others. As digital technologies like artificial intelligence become increasingly pervasive around us, we find humans frequently collaborating with machines in Human-AI (HAI) teams in diverse contexts. There is therefore a need for researchers, practitioners, decision makers, and others to understand the ways that AI will influence everyday processes and outcomes. Many questions arise from prospective scenarios, such as what will happen when an AI outperforms humans on common tasks at work, and researchers are attempting to answer these questions regarding the technical, practical, philosophical and other aspects of AI. However, there exist few replicable examples of an AI outperforming humans beyond strategy games like Go, chess and similar contests. Thus we seek to narrow this gap.

We propose a conceptual model of an AI collaborative system comprising a human-AI team similar to how such teams might operate; we then demonstrate a way of developing a class of AI systems, the Disruptive Interjector (DI), which observes what a human is doing and then interjects with suggestions that aid idea generation or problem solving in human-AI (HAI) teams. We note that this kind of system goes beyond current creativity support systems by replacing one teammate in a human-human (HH) team with an AI, to create a HAI team. The proposed DI diverges from a solution by encouraging consideration of other possibilities, and is therefore distinct from tutors, chatbots, recommenders and other similar systems that seek to converge towards a "correct" solution. To this end, we develop a conceptual design of the DI system, then apply deep Convolutional Neural Networks (CNNs) - in the form of an LSTM network and two datasets - to generate new conversations that can be used in collaborating with a human. This talk will highlight barriers as well as possible solutions that will enable the successful development of such collaborative systems.

December 5, 2022

Topic: Entity-level sentiment analysis (3rd semester evaluation)
Slides (TBA)

November 21, 2022

Topic: Graph neural networks and how they are used in NLP

A lot of data fit nicely into a graph formalism. Social networks, biological systems, and maps can all be thought of as relationships between objects. The Graph Neural Network (GNN) architecture has become the most expressive way to do representation learning over such structures, providing a unified framework for generating embeddings for nodes, relations, and even entire graphs. These embeddings can be used for tasks like predicting fraudulent transactions in a financial network, drug discovery, and shortest path computation for services like Google Maps. The first half of this talk will introduce the most common GNN models, focusing on how they differ from "standard" NNs. The second half will present and discuss how GNNs are currently being used to solve tasks within NLP, particularly for integrating external knowledge for tasks like QA and language modeling.
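The core GNN idea mentioned in the first half can be sketched without any learned weights (a bare-bones toy, not any particular published architecture; real GNNs apply trained transformations at each step): one propagation step mixes a node's embedding with the mean of its neighbours' embeddings.

```python
def message_pass(embeddings, neighbours):
    """One propagation step: average each node's vector with its neighbours' mean."""
    new = {}
    for node, vec in embeddings.items():
        msgs = [embeddings[n] for n in neighbours.get(node, [])]
        # Mean-pool incoming messages; isolated nodes keep their own vector.
        pooled = [sum(vals) / len(msgs) for vals in zip(*msgs)] if msgs else vec
        new[node] = [(own + agg) / 2 for own, agg in zip(vec, pooled)]
    return new

emb = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.0, 0.0]}
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(message_pass(emb, adj)["a"])  # -> [0.5, 0.5]
```

Stacking k such steps lets information flow k hops across the graph, which is how node embeddings come to reflect their wider neighbourhood.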

November 7, 2022

EventGraph: Event Extraction as Semantic Graph Parsing

Event extraction involves the detection and extraction of both event triggers and their corresponding event arguments. Existing systems often decompose event extraction into multiple subtasks, without considering their possible interactions. We propose EventGraph, a joint framework for event extraction, which encodes events as graphs. We represent event triggers and arguments as nodes in a semantic graph. Event extraction therefore becomes a graph parsing problem, which provides the following advantages: 1) performing event detection and argument extraction jointly; 2) detecting and extracting multiple events from a piece of text; and 3) capturing the complicated interactions between event arguments and triggers. Experimental results on ACE2005 show that our model is competitive with state-of-the-art systems and substantially improves the results on argument extraction.
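As a toy sketch of the "events as graphs" encoding (our own illustrative data structure, not the EventGraph implementation; the example sentence and labels are invented, loosely following ACE-style event types):

```python
# Triggers and arguments as nodes; labelled edges attach arguments
# to their trigger. Example event: "rebels attacked the town".
event_graph = {
    "nodes": [
        {"id": 0, "span": "attacked", "type": "trigger:Conflict.Attack"},
        {"id": 1, "span": "rebels", "type": "argument:Attacker"},
        {"id": 2, "span": "the town", "type": "argument:Place"},
    ],
    "edges": [(0, 1, "Attacker"), (0, 2, "Place")],
}

# Argument extraction then amounts to predicting the labelled edges:
id2span = {n["id"]: n["span"] for n in event_graph["nodes"]}
args = [(id2span[tgt], label) for (_, tgt, label) in event_graph["edges"]]
print(args)  # -> [('rebels', 'Attacker'), ('the town', 'Place')]
```

A second event in the same sentence would simply add another trigger node with its own outgoing edges, which is why the graph view handles multiple events per text naturally.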

October 24, 2022

Towards more robust text anonymization through adversarial models

This PhD project aims at developing adversarial models for text anonymization (i.e. models that re-identify masked identifiers). To do this, we plan to use retrieval-based transformers/models and various databases that act as levels of background knowledge available to an attacker.

September 26, 2022

Introducing two datasets: NARC and a small PoS-tagged dataset for dialectal tweets

The completion of NARC (Norwegian Anaphora Resolution Corpus) is getting close. We present a corpus for coreference resolution and bridging for Norwegian. Our forthcoming publication focuses on the Bokmål part of the corpus, with basic corpus statistics and preliminary modelling results. The PoS-tagged tweets were made in order to evaluate commonly used PoS-taggers on informal Norwegian data in three categories: Bokmål, Nynorsk and dialectal tweets. The dataset is small, but allows us to highlight some problems seen when trying to PoS-tag informal text and written dialect.

June 13, 2022

Entity-Level Sentiment Analysis (ELSA): What we have found and where to go from here
Egil Rønningstad (LTG)
Work in progress

We have recently created an exploratory dataset to find similarities and differences between Entity-Level Sentiment Analysis (ELSA) and other sentiment analysis tasks, in particular Targeted Sentiment Analysis (TSA). We see that ELSA can (partially) be derived from NER, coreference and TSA, but error propagation is an issue. Our next steps are 1) to create annotation guidelines for a proper ELSA dataset, and 2) to explore alternative, possibly summarization-related, approaches to ELSA. After the presentation we will have room for discussion and brainstorming.

May 16, 2022

Language technology tools to support low-resource languages: case study of Sakha
Sardana Ivanova (University of Helsinki)
Paper 1, Paper 2
Slides

This presentation gives an overview of language technology tools for supporting low-resource languages, in particular the Sakha language. The tools include a morphological analyser, a computer-assisted language learning (CALL) platform, and two natural language generation (NLG) systems.

We extended an earlier, preliminary version of the morphological analyser, built on the Apertium rule-based machine translation platform. The transducer, developed using the Helsinki Finite-State Toolkit (HFST), has coverage well above 90% and high precision. Based on the morphological analyser, we implemented a language learning environment for Sakha in the Revita CALL platform. Revita is a freely available online platform for learners beyond the beginner level.

Currently we have implemented two NLG systems for Finnish and a few other languages: a transformer-based poetry generation system and a template-based news generation system. We plan to extend those systems to support Sakha.

May 9, 2022

Graph-Based Entity Models for Dialogue Management
Nicholas Walker (University of Oslo)

Modern task-oriented spoken dialogue systems often rely on dialogue management modules, which keep track of information for an autonomous dialogue agent to complete tasks. Dialogue systems are also frequently deployed with physical robotic agents for Human-Robot Interaction (HRI).

This seminar will detail completed and planned work in dialogue management and HRI which comprise the titular doctoral project as part of the third semester evaluation of the work. The present state-of-the-art and proposed graph-based approaches for dialogue management will be discussed in addition to methodological challenges. We will present an overview of the current theory, investigations into methodologies and prototypes, data collection, and an outline of future work for the project. Discussion will also include description of the work in the context of HRI and experiments with the project’s Pepper robot.

April 25, 2022

NorDiaChange: diachronic semantic change dataset for Norwegian

We describe NorDiaChange: the first diachronic semantic change dataset for Norwegian. NorDiaChange comprises two novel subsets, covering about 80 Norwegian nouns manually annotated with graded semantic change over time. Both datasets follow the same annotation procedure and can be used interchangeably as train and test splits for each other. NorDiaChange covers the time periods related to pre- and post-war events, oil and gas discovery in Norway, and technological developments. The annotation was done using the DURel framework and two large historical Norwegian corpora. NorDiaChange is published in full under a permissive license, complete with raw annotation data and inferred diachronic word usage graphs (DWUGs).

March 31, 2022 (12:30)

Parsing into semantic dependency graphs
Maja Buljan (University of Oslo)

Natural language processing encompasses a spectrum of tasks whose goal, on a superficial level, is to structure the information contained in "raw" human language input. In this talk, we focus on meaning representation parsing -- i.e. mapping from natural language utterances to graph-based encodings of semantic structure. As the halfway-point progress report (third semester evaluation) of the titular doctoral project, we will give an overview of the methodological challenges of the task, review the current state-of-the-art, and summarise completed and ongoing work that comprises the project. This includes an in-depth dive into different meaning representation frameworks, parsing architectures, diagnostic evaluation of systems, and framework-specific error analysis, as well as a look forward to (currently) unsolved challenges in model development, e.g. multitask learning for cross-framework and cross-lingual parsing.

March 14, 2022

Efficient Strategies of Language Production: An Information-Theoretic Analysis
Mario Giulianelli (University of Amsterdam)

Speakers are thought to use efficient information transmission strategies for effective communication. For example, they transmit information at a constant rate in written text and they use repetitions extensively in spoken dialogue. We analyze these strategies in monologue and dialogue datasets, combining information-theoretic measures with probability estimates obtained from Transformer-based language models.

We find (i) that information density decreases overall in spoken open domain and written task-oriented dialogues, while it remains uniform in written texts; (ii) that speakers’ choices are oriented towards global, rather than local, uniformity of information; (iii) that uniform information density strategies are at play in dialogue when we zoom in on topically and referentially coherent contextual units; (iv) and that repetitions of non-topical and non-referential expressions, too, can be interpreted as an efficient production strategy.

Besides providing new empirical evidence on written and spoken language production, we believe that our studies can directly inform the development of more human-like natural language generation models.
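The information-theoretic quantity underlying these analyses, surprisal, can be sketched with a toy unigram model (the studies themselves use Transformer-based, context-sensitive probability estimates, not unigram counts; the corpus here is invented):

```python
from collections import Counter
from math import log2

corpus = "the cat sat on the mat the cat ran".split()
counts = Counter(corpus)
total = sum(counts.values())

def surprisal(token):
    """Surprisal in bits under a unigram model: -log2 p(token)."""
    return -log2(counts[token] / total)

# Frequent tokens carry less information than rare ones:
print(round(surprisal("the"), 2))  # -> 1.58  (3 of 9 tokens)
print(round(surprisal("ran"), 2))  # -> 3.17  (1 of 9 tokens)
```

Uniform information density then amounts to speakers keeping per-token surprisal roughly constant over an utterance or dialogue, rather than bunching high-surprisal material together.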

February 28, 2022

Internet Protocol Standardization in the IETF: An Introduction to the Textual Archive and Potential NLP Applications
Michael Welzl (University of Oslo)

Internet protocols are standardized in the Internet Engineering Task Force (IETF). The standardization process in this organization involves a large amount of freely accessible textual artifacts: discussions in mailing lists, on GitHub, as well as recorded meeting minutes of online and in-person meetings. The results are “RFCs” - prose documents which lend themselves to NLP analysis just as well as the textual body that precedes their finalization. This talk will introduce the IETF process, the available means to access the text archives, and share some ideas on NLP analyses that might be useful to the Internet community or computer scientists in general.

February 14, 2022

Case Studies in BERTology: shallow heuristics or verbal reasoning?
Anna Rogers (University of Copenhagen)

What does BERT learn from the current Natural Language Understanding datasets - verbal reasoning skills, or shallow heuristics? This talk discusses the available evidence and presents a case study of generalization in NLI (from MNLI to the adversarially constructed HANS dataset) in a range of BERT-based architectures (adapters, Siamese Transformers, HEX debiasing), as well as with subsampling the data and increasing the model size. Most strategies are unsuccessful, but they all provide insights into how Transformer-based models learn to generalize.

January 31, 2022

The DECRYPT project - tools, methods and strategies in historical NLP
Crina Tudor (Uppsala University)

Historical ciphers and keys represent a rich source of information that can provide great insights into our past. The main drawback, however, is that such sources are spread out in archives all over the world, which makes it rather difficult to analyze and compare various manuscripts.

The DECRYPT project started out as an effort to address this issue by building a reliable and comprehensive system that aims to make such sources easily available to the general public. This interdisciplinary endeavor brings together historians, linguists, cryptographers and programmers who work together to develop tools that can facilitate the automatic analysis of encoded manuscripts.

January 17, 2022

Targeted Sentiment Analysis (TSA), and how to make the best use of your data

Targeted Sentiment Analysis aims to detect, for each sentence, the words representing what is spoken about positively or negatively. In the sentence "I admire my dog", "my dog" is spoken about positively. In TSA, we do not include finding the holder/source ("I") or the words that express this positivity or negativity, such as "admire".

I have done TSA on the Norwegian NoReC-fine dataset, comparing different word embeddings used with an LSTM, as well as some pretrained BERT-related models. NoReC-fine consists of newspaper reviews of topics from various domains. We look at the cross-domain effect on the results and compare it with cross-lingual experiments on same-domain data.

December 13, 2021

Multilingual Language Models for Fine-tuning and Feature Extraction in Word-in-Context Disambiguation

SemEval-2021 Task 2: Multilingual and Crosslingual Word-in-Context Disambiguation (MCL-WiC) is proposed as a benchmark to evaluate context-sensitive word representations. Our main interest is to investigate the usefulness of pre-trained multilingual language models (LMs) in this MCL-WiC task, without resorting to sense inventories, dictionaries, or other resources. As our main method, we fine-tune the language models with a span classification head. We also experiment with using the multilingual language models as feature extractors, extracting contextual embeddings for the target word. We compare three different LMs: XLM-RoBERTa (XLMR), multilingual BERT (mBERT) and multilingual distilled BERT (mDistilBERT).

We find that fine-tuning is better than feature extraction. XLMR performs better than mBERT in the cross-lingual setting both with fine-tuning and feature extraction, whereas these two models give a similar performance in the multilingual setting. mDistilBERT performs poorly with fine-tuning but gives similar results to the other models when used as a feature extractor.
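A minimal sketch of the feature-extraction setting described above: the target word's contextual embeddings from the two sentences are compared by cosine similarity (toy vectors here; in practice they would come from XLMR, mBERT, or mDistilBERT, and the threshold would be tuned on development data):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def same_sense(vec1, vec2, threshold=0.5):
    """Feature-extraction WiC decision: the target word's contextual
    embeddings from the two contexts are judged to carry the same sense
    iff their cosine similarity exceeds a threshold."""
    return cosine(vec1, vec2) >= threshold
```

Fine-tuning with a span classification head, by contrast, lets the model learn the comparison itself rather than relying on a fixed similarity threshold, which is consistent with the finding that fine-tuning wins.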

November 29, 2021

Analyzing public sentiment towards wind energy in Norway
The work was done within this project

With technological development and decreasing costs, wind power has recently become profitable in Norway even without subsidies, which has led to an increase in new developments across the country. These new developments have not always been welcomed, and as development continued to increase, so did the opposition. Beyond the requirement that the benefits and burdens of energy infrastructure and energy policies be distributed fairly and equitably, transitioning to a low-carbon society or energy system requires public support for political decisions and policies.

A traditional way to collect information on public opinion is via questionnaires, surveys or interviews. These methods may, however, be prone to selection bias and response bias, as well as missing data and incomplete information. There is therefore value in exploring alternative methods of acquiring information on public opinion. In this study, we follow the work of Kim et al. [2020] and assess public sentiment in Norway towards on- and offshore wind energy via a machine learning approach to natural language processing, based on data scraped from social media sites such as Twitter. We collected about 70,000 Norwegian tweets, manually annotated a subset, and used it to fine-tune NorBERT. We then used the fine-tuned model to classify the remaining tweets.

November 8, 2021

Text anonymization with explicit measures of disclosure risk

We present a new approach to text anonymization that moves beyond a NER task and incorporates disclosure risk into the process, combining NLP and privacy-preserving data publishing (PPDP). Making use of Wikipedia biographies and background knowledge from Wikidata, we propose an automatic annotation method based on k-anonymity that can produce large amounts of labeled data for sensitive information. We train two BERT models on these data, following two different approaches to picking sensitive terms to mask. We also manually annotate and release a sample of 1,000 article summaries, and use it to evaluate the performance of our models.
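As a toy illustration of the k-anonymity intuition (a deliberate simplification, not the paper's actual method over Wikidata): a value is considered safe to keep only if it is shared by at least k individuals, otherwise it is disclosive and gets masked:

```python
from collections import Counter

def k_anonymous_mask(records, k=2, mask="***"):
    """Mask any attribute value shared by fewer than k individuals.

    A toy version of using background-knowledge counts (here, just the
    corpus itself) to decide which terms carry disclosure risk.
    """
    counts = Counter(v for record in records for v in record)
    return [[v if counts[v] >= k else mask for v in record]
            for record in records]

people = [["Oslo", "teacher"], ["Oslo", "astronaut"], ["Bergen", "teacher"]]
print(k_anonymous_mask(people, k=2))
# [['Oslo', 'teacher'], ['Oslo', '***'], ['***', 'teacher']]
```

"Astronaut" and "Bergen" each identify a single person here and are masked, while "Oslo" and "teacher" are shared widely enough to keep - the same risk-based logic, scaled up, that drives the automatic annotation.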

October 25, 2021

What Quantifying Word Order Freedom Reveals about Dependency Corpora
Maja Buljan (LTG)
Slides
 
This is an overview of ongoing work on word order freedom and syntactic annotation, with the goal of differentiating between findings that reveal inherent properties of languages and features dependent on annotation styles. Following previous work on defining a quantifiable and linguistically interpretable measure of word order freedom in language, we take a closer look at the robustness of the basic measure (word order entropy) to variations in the dependency corpora used in the analysis. We compare measures at three levels of generality, applied to treebanks annotated according to the Universal Dependencies v1 and v2 guidelines, spanning 31 languages. Preliminary results show that certain measures, such as subject-object order freedom, are sensitive to changes in annotation guidelines, highlighting aspects of these metrics that should be taken into consideration when using dependency corpora for linguistic analysis.
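The basic measure, word order entropy, can be sketched directly. A minimal illustration over binary subject-object orders extracted from a treebank (the actual work operates at three levels of generality over UD relations):

```python
import math
from collections import Counter

def order_entropy(orders):
    """Entropy (in bits) of a distribution of observed word orders,
    e.g. 'SO' vs 'OS' pairs extracted from a dependency treebank.

    0.0 = completely fixed order; 1.0 = completely free binary order.
    """
    counts = Counter(orders)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values())

print(order_entropy(["SO"] * 95 + ["OS"] * 5))   # rigid order: close to 0
print(order_entropy(["SO"] * 50 + ["OS"] * 50))  # free order: exactly 1.0
```

Because the measure is computed from annotated subject and object relations, any change in how those relations are annotated (e.g. between UD v1 and v2) feeds directly into the entropy estimate - which is the sensitivity the talk examines.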

October 11, 2021

"Improving Multilingual Lexical Normalization by Fine-tuning ByT5"
David Samuel (LTG)

Slides, paper

We present the winning entry to the Multilingual Lexical Normalization (MultiLexNorm) shared task at W-NUT 2021, which evaluates lexical normalization systems on 12 social media datasets in 11 languages.

Our system is based on a pre-trained byte-level language model, ByT5, which we further pre-train on synthetic data and then fine-tune on authentic normalization data. It achieves the best performance by a wide margin in intrinsic evaluation, and also the best performance in extrinsic evaluation through dependency parsing. We release both the source code and the fine-tuned models.
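A minimal sketch of why a byte-level model suits noisy social-media text: ByT5 operates on raw UTF-8 bytes rather than a subword vocabulary, so creative spellings can never fall out of vocabulary (the real model offsets byte values by 3 to reserve ids for special tokens; that detail is omitted here):

```python
def byte_tokenize(text: str):
    """Tokenize into raw UTF-8 byte values, ByT5-style (toy version).

    A subword tokenizer might map 'gr8' to an <unk> or an awkward
    split; a byte-level model always sees a valid sequence.
    """
    return list(text.encode("utf-8"))

print(byte_tokenize("gr8"))  # [103, 114, 56]
```

Every input, however misspelled, maps to a well-defined byte sequence, which is what makes a byte-level model a natural fit for lexical normalization of social media text.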

September 27, 2021

"Grammatical Profiling for Semantic Change Detection"
Andrey Kutuzov (LTG)
https://arxiv.org/abs/2109.10397

Slides

Semantics, morphology and syntax are strongly interdependent. However, the majority of computational methods for semantic change detection use distributional word representations which encode mostly semantics. We investigate an alternative method, grammatical profiling, based entirely on changes in the morphosyntactic behaviour of words.

We demonstrate that it can be used for semantic change detection and even outperforms some distributional semantic methods. We present an in-depth qualitative and quantitative analysis of the predictions made by our grammatical profiling system, showing that they are plausible and interpretable.
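As a toy sketch of the idea (an illustration, not the paper's exact system): compare a word's grammatical-feature frequency vectors in two time periods, where a large distance flags a candidate for semantic change:

```python
import math
from collections import Counter

def profile_distance(feats_t1, feats_t2):
    """Distance between a word's morphosyntactic profiles in two
    periods: 1 minus the cosine similarity of its grammatical-feature
    frequency vectors. A larger distance suggests semantic change."""
    c1, c2 = Counter(feats_t1), Counter(feats_t2)
    keys = set(c1) | set(c2)
    dot = sum(c1[k] * c2[k] for k in keys)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return 1 - dot / (n1 * n2)

# Hypothetical example: a noun shifting from mostly-singular usage
# in older texts to mostly-plural usage in newer ones.
old = ["Number=Sing"] * 9 + ["Number=Plur"] * 1
new = ["Number=Sing"] * 2 + ["Number=Plur"] * 8
print(round(profile_distance(old, new), 3))
```

No word embeddings are involved at any point - the signal comes entirely from shifts in morphosyntactic behaviour, which is what makes the predictions interpretable.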

Tags: language technology, Natural Language Processing, Computational Linguistics, Seminar
Published Sep. 26, 2021 5:19 PM - Last modified June 10, 2024 11:45 AM