Tracing semantic shifts in Norwegian news texts


Words in human languages change their meaning over time. For example, the English word "cell" originally had only one sense: "solitary dwelling, as in a monastery or prison". Later, it acquired the additional sense of "a small, usually microscopic mass of protoplasm bounded externally by a semipermeable membrane". In the last two decades, it has increasingly been used in a third sense: "mobile phone". Similarly, the Norwegian word "stryk" has acquired a new "fail" sense, in addition to the older sense of "rapids".

Changes of word meaning over time are called diachronic semantic shifts. They can be captured automatically by analyzing changes in the behavior of large-scale neural language models, either static or contextualized ones. This is now a solid sub-field within NLP, with several survey papers on the topic, an ACL workshop in 2022 (and a forthcoming workshop in 2023), a SemEval shared task in 2020, and other shared tasks for various languages. Two manually annotated datasets of semantic change for Norwegian were published in 2022.
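For the static case, a standard recipe is: train a separate word2vec model per time slice, align the two vector spaces with orthogonal Procrustes, and rank words by how far they moved. A minimal numpy sketch on toy vectors (the vocabulary and the simulated drift of "stryk" towards "eksamen" are illustrative assumptions, not real corpus output):

```python
import numpy as np

def procrustes_align(X, Y):
    """Orthogonal Procrustes: rotation R minimizing ||X @ R - Y||_F."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def cosine_distance(a, b):
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(42)
vocab = ["stryk", "elv", "foss", "eksamen", "student", "avis", "regjering", "fjell"]

# Toy stand-in for word2vec embeddings trained on the 1998 slice
emb_1998 = rng.normal(size=(len(vocab), 50))

# The 2019 space is an arbitrary rotation of the old one ...
R_true, _ = np.linalg.qr(rng.normal(size=(50, 50)))
emb_2019 = emb_1998 @ R_true
# ... except "stryk", which drifts towards "eksamen" (simulating the new "fail" sense)
emb_2019[0] = 0.2 * emb_2019[0] + 0.8 * emb_2019[3]

# Align the 1998 space onto the 2019 space, then measure each word's displacement
R = procrustes_align(emb_1998, emb_2019)
aligned = emb_1998 @ R
shifts = {w: cosine_distance(aligned[i], emb_2019[i]) for i, w in enumerate(vocab)}
print(max(shifts, key=shifts.get))
```

Because the stable words anchor the alignment, the shifted word ends up with the largest post-alignment distance; on real data the same ranking surfaces candidate semantic shifts.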

Interestingly, cultural changes often manifest as shifts in the typical associations or attitudes towards a word (which can also be a proper name). Cf. the word "coronavirus", which definitely started triggering a very different set of associations a couple of years ago, although its dictionary definition arguably remained the same. This can also be related to different stances or points of view in mass media, which are of course also fluid over time.

In this Master's thesis, you will build a web service (or services) visualizing semantic/cultural changes in Norwegian media, using large language models under the hood. As our data source, we will start with the Norsk Aviskorpus (1998-2019), but other news sources can be added. The main research question of this thesis is: "What method of automatic semantic change detection yields the most insightful output on this data?"

In particular, you will compare more traditional methods based on word2vec-like static word embeddings (see a somewhat similar web service for Russian) with those using pre-trained contextualized language models like ELMo, BERT, T5, GPT, etc. Will it be possible to find more interesting "bursts" in news streams with these architectures? We'll see.
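For the contextualized side, a common baseline is to collect the model's token vectors for every occurrence of a word in each period and compare the two usage distributions, e.g. by prototype distance (PRT) or average pairwise distance (APD). A numpy sketch on synthetic "BERT-like" vectors (the Gaussian clusters stand in for real contextual embeddings, which you would extract with e.g. a Norwegian BERT model; the sense centers are illustrative assumptions):

```python
import numpy as np

def cos_dist(a, b):
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def prototype_distance(A, B):
    """PRT: cosine distance between the mean contextual vectors of two periods."""
    return cos_dist(A.mean(axis=0), B.mean(axis=0))

def average_pairwise_distance(A, B):
    """APD: mean cosine distance over all cross-period usage pairs."""
    return float(np.mean([cos_dist(a, b) for a in A for b in B]))

rng = np.random.default_rng(7)
sense_rapids = rng.normal(size=768)  # stand-in for the "rapids" sense region
sense_fail = rng.normal(size=768)    # stand-in for the new "fail" sense region

# "elv" is used the same way in both periods; "stryk" switches to the new sense
elv_1998 = sense_rapids + 0.1 * rng.normal(size=(30, 768))
elv_2019 = sense_rapids + 0.1 * rng.normal(size=(30, 768))
stryk_1998 = sense_rapids + 0.1 * rng.normal(size=(30, 768))
stryk_2019 = sense_fail + 0.1 * rng.normal(size=(30, 768))

print(f"elv   PRT={prototype_distance(elv_1998, elv_2019):.2f}")
print(f"stryk PRT={prototype_distance(stryk_1998, stryk_2019):.2f}")
```

Unlike the static-embedding recipe, this needs no space alignment, and APD in particular can pick up a word whose usages split between an old and a new sense rather than moving wholesale.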

Prerequisites: fluency in Norwegian, linguistic curiosity, basic Python skills.

Published Sep. 24, 2022 01:46 - Last modified Sep. 29, 2023 08:49

Supervisor(s)

Scope (credits)

60