This project is no longer available

Parsing Scholarly Literature

Text mining of scholarly literature (e.g. research publications of various types, MSc and PhD theses, or conference articles) has a range of applications. For full-text search, for example, it is necessary to extract and index the exact linguistic content, repair any publication-related ‘damage’ (see below), tokenize, and possibly lemmatize or break the text into sentences. Advanced information retrieval (IR) may call for additional content analysis, for example so-called word sense disambiguation (teasing apart the different meanings of a word like motion in physics vs. law), the extraction of structural relations (‘Who did what to whom?’), or finding and tracking bibliographical references. Information extraction applications, finally, may aim to acquire structured knowledge from unstructured text.
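As a rough illustration of the preprocessing steps named above (repair, sentence splitting, tokenization), the following is a minimal Python sketch using only the standard library. The function names and the regular expressions are assumptions made for illustration, not part of any existing pipeline. The de-hyphenation rule in particular is naive: it would also join genuine compounds that happen to break at a line end, and NFKC normalization changes more characters than just ligatures.

```python
import re
import unicodedata


def repair(text: str) -> str:
    """Undo some typical extraction damage (illustrative heuristics only)."""
    # NFKC normalization expands typographic ligatures, e.g. U+FB01 -> "fi".
    text = unicodedata.normalize("NFKC", text)
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse remaining whitespace, including line breaks, into single spaces.
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> list[str]:
    """Split after ., ! or ? when followed by whitespace and a capital letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)


def tokenize(sentence: str) -> list[str]:
    """Separate runs of word characters from individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


# Hypothetical raw text with a ligature and an end-of-line hyphen.
raw = "The \ufb01rst experi-\nment failed. We then\nrepeated it."
sentences = split_sentences(repair(raw))
tokens = [tokenize(s) for s in sentences]
```

A real system would need far more careful rules (abbreviations such as "e.g." defeat this sentence splitter, for instance), which is part of what makes high-precision extraction difficult.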

Common to all these tasks is the need to access and analyse actual content, including some treatment of logical and rhetorical structure (an IR system may want to weight term occurrences in headings higher than ones in footnotes). This project will establish the software infrastructure for high-precision content extraction from PDF files, a surprisingly challenging digital format. Building on earlier work at UiO, at a collaboration partner (DFKI), and in the larger research community (Singapore; Michigan), the project will assemble an end-to-end extraction and parsing pipeline, evaluate system performance at various levels, and then identify critical sub-components to improve, whether in lower-level text extraction and correction or in higher-level processing (e.g. sentence segmentation or tracking references). Please see the links above and contact Stephan Oepen for details.
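To make the point about logical structure concrete, here is a small Python sketch of zone-weighted term counting, in which an occurrence in a heading counts for more than one in a footnote. The zone labels, the weight values, and the function name are all hypothetical, and the sketch assumes that the extraction step has already assigned each text span to a logical zone.

```python
from collections import Counter

# Hypothetical weights per logical zone; the values are illustrative only.
ZONE_WEIGHTS = {"heading": 3.0, "body": 1.0, "footnote": 0.5}


def weighted_term_frequencies(zones):
    """Sum zone-dependent weights for each term.

    `zones` is an iterable of (zone_label, text) pairs. Unknown zone
    labels fall back to a weight of 1.0.
    """
    scores = Counter()
    for zone, text in zones:
        weight = ZONE_WEIGHTS.get(zone, 1.0)
        # Whitespace tokenization keeps the sketch short; a real
        # system would use a proper tokenizer.
        for token in text.lower().split():
            scores[token] += weight
    return scores


# Hypothetical document broken into labelled zones.
document = [
    ("heading", "Parsing PDF"),
    ("body", "parsing is hard"),
    ("footnote", "pdf"),
]
scores = weighted_term_frequencies(document)
```

With these assumed weights, "parsing" (one heading and one body occurrence) outranks "pdf" (one heading and one footnote occurrence), even though both terms occur twice.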

Published 14 March 2011 11:27 - Last modified 1 Dec. 2017 10:55

Supervisor(s)

Student(s)

Scope (credits)

60