This thesis project is no longer available

Topic Modeling at Schibsted

About Topic Modeling

Topic modeling, and in particular Latent Dirichlet Allocation (LDA, http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf), is a family of unsupervised algorithms for discovering topics in a document collection.

At Schibsted, we have already experimented with using LDA to structure newspaper articles from Aftenposten and VG, in order to develop tools for journalists and editors, recommender systems, etc. This has been fairly successful, but there is still room for improvement, particularly in two areas:

Determining the optimal input document set for a general topic model
Evaluating the resulting topic model

As all text is in Norwegian, the student should be fairly fluent in Norwegian.

Part 1: Document set

Like any clustering algorithm, topic modeling is sensitive to its input documents. If LDA is fed only medical texts, all the topics it discovers will be medical topics. Thus, to develop a general news topic model, one needs a balanced document set that represents the news domain, and possibly the general domain as well (e.g. by blending in Wikipedia data or another general data source). The effect of mixing in additional data needs to be measured, which leads us to the second area.
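The blending step can be illustrated with a minimal sketch. The function name blend_corpora, the general_fraction parameter and the sampling strategy are assumptions made for illustration only, not an existing Schibsted tool; documents are treated as opaque items.

```python
import random


def blend_corpora(news_docs, general_docs, general_fraction, seed=0):
    """Return a shuffled training set in which roughly `general_fraction`
    of the documents come from the general-domain source.

    Hypothetical helper: all news documents are kept, and general-domain
    documents are sampled without replacement.
    """
    if not 0.0 <= general_fraction < 1.0:
        raise ValueError("general_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve g / (n + g) = f for g, the number of general documents needed.
    wanted = round(len(news_docs) * general_fraction / (1.0 - general_fraction))
    # Never ask for more general documents than are available.
    wanted = min(wanted, len(general_docs))
    blended = list(news_docs) + rng.sample(list(general_docs), wanted)
    rng.shuffle(blended)
    return blended
```

One way to use such a helper would be to train one topic model per value of general_fraction (for example 0.0, 0.1, 0.3) and compare the resulting models with the same evaluation metric, so that the effect of the added data is measured rather than assumed.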

Part 2: Evaluation

Evaluating topic models is considered quite hard. Because topic modeling is unsupervised, like clustering and language modeling, one cannot naturally evaluate against a labeled test set. It is therefore common to evaluate a model by its perplexity on held-out documents. Perplexity and other evaluation metrics are studied here: http://dirichlet.net/pdf/wallach09evaluation.pdf.
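As a reminder of what perplexity measures, here is a minimal sketch. It assumes the model has already assigned a natural-log probability to each held-out token; how those probabilities are estimated for a topic model is the hard part and is not shown. The function name is illustrative.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities of held-out tokens.

    Computed as exp of the average negative log-likelihood per token;
    lower values mean the model is less 'surprised' by the held-out text.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_neg_log_lik = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_neg_log_lik)
```

For intuition: a model that assigns every token probability 1/4 has perplexity 4, i.e. it is as uncertain as a uniform choice among four words.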

Unfortunately, the work at http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2009_0125.pdf has shown that perplexity does not correspond well with human judgements. For Schibsted’s newspaper data, we have additional metadata (sections, categories, tags) that may serve as a “proxy” for topics in evaluation. Finding out how best to evaluate the topic models from part 1 would be a significant finding.
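One possible way to use metadata as a proxy, sketched under assumptions: reduce each document to its single most probable topic and measure how well those topic assignments agree with the editorial section labels, here with normalized mutual information (NMI). The choice of NMI, the geometric-mean normalization and the reduction to one topic per document are illustrative choices, not a prescribed method.

```python
import math
from collections import Counter


def normalized_mutual_information(topics, labels):
    """NMI between two labelings of the same documents.

    `topics` holds one (assumed dominant) topic id per document and
    `labels` the corresponding metadata value, e.g. a section name.
    Returns 0.0 for independent labelings and 1.0 when they agree up
    to renaming. Normalized by the geometric mean of the entropies.
    """
    if len(topics) != len(labels) or not topics:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(topics)
    topic_counts = Counter(topics)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(topics, labels))

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    mutual_info = sum(
        (c / n) * math.log(c * n / (topic_counts[t] * label_counts[l]))
        for (t, l), c in joint_counts.items()
    )
    h_topics, h_labels = entropy(topic_counts), entropy(label_counts)
    if h_topics == 0.0 or h_labels == 0.0:
        # Convention chosen here: a constant labeling carries no information.
        return 0.0
    return mutual_info / math.sqrt(h_topics * h_labels)
```

A score like this could be computed separately for sections, categories and tags, and then compared with perplexity and with the human annotations to see which proxy tracks human judgement best.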

Summary

To sum it up, the goals of the thesis are as follows:

Create a general topic model of decent quality over the last n years of Schibsted’s news, optionally blending in documents from other sources
Annotate a small part of the corpus with topics, and evaluate the models against these annotations
Experiment with using metadata as “proxies” for evaluation
Compare evaluation metrics

Keywords: Topic modeling

Published Oct. 11, 2016 17:44 - Last modified Oct. 25, 2019 12:17

Supervisor(s)

Jan Tore Lønning, Universitetet i Oslo
Fredrik Jørgensen, Schibsted