This project is no longer available

Domain-Specific Document Structure Analysis

In most current documents, published on the Web or otherwise, interpretation is driven in no small part by layout properties. Although recent years have seen efforts to structure on-line content formally and thus make it more easily accessible, modern Web pages still rely on a basic structure of paragraphs and headers that corresponds to the semantic content of the documents.

Automatically recognizing and analyzing document structures, especially section headings, paragraphs, and tabular data, is vital for the precise extraction of relevant information for use in other domain applications. The motivation of this project is to support knowledge extraction from unstructured documents, and in particular to exploit the actual semantics of the document structure to improve further linguistic interpretation.

In the domain of on-line news, for example, when extracting information from a Web page that describes a news item, we are interested in identifying the author(s), date of publication, headline, summary, and possibly readers' comments. Although this information is evident to the human reader of the page, it is often not explicitly coded for computational analysis. It is therefore useful for an information extraction system to be able to derive the logical structure of a document from layout analysis, e.g. to have the relevant parts of a Web page correctly segmented and semantically classified.

In the domain of recruitment, resumes (aka curricula vitae, CVs) can be relatively free in structure. CVs from people in the private sector, for example, tend to emphasize work experience and place it before the section that describes education. This is not the typical ordering in CVs from the academic sector. Correctly identified segments of a CV can help a classification system to align it with potential job offerings, as is the goal in a collaborative European project in which LTG currently participates.

The proposed MSc project should develop an intelligent approach to the automated identification and recognition of logical document structure, combining layout analysis and domain-specific knowledge. The project can build on an existing, in-house platform for low-level layout analysis and content extraction from PDF documents (implemented in Java) as well as on access to a very large domain ontology for the recruitment application. Techniques to be applied will likely combine heuristic, rule-based processing with supervised machine learning and statistical classification.
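
As a rough illustration of how layout cues and domain knowledge might be combined, the following minimal Python sketch segments a CV into labelled sections. Everything in it is an assumption made for illustration: the `Line` record, the font-size threshold, the six-word limit, and the small `SECTION_CUES` lexicon are hypothetical and do not reflect the in-house platform or the domain ontology mentioned above. A purely rule-based pass of this kind could also serve as a baseline, or as a source of features, for a supervised classifier.

```python
from dataclasses import dataclass

# Hypothetical domain lexicon mapping cue phrases to section labels.
# A real system would presumably draw these from a domain ontology.
SECTION_CUES = {
    "work experience": "WORK_EXPERIENCE",
    "employment": "WORK_EXPERIENCE",
    "education": "EDUCATION",
    "skills": "SKILLS",
    "publications": "PUBLICATIONS",
}


@dataclass
class Line:
    """One text line with the layout properties assumed to be available."""
    text: str
    font_size: float
    bold: bool


def is_heading(line, body_font_size):
    """Layout heuristic: a short line that is bold or clearly larger than body text."""
    short = len(line.text.split()) <= 6
    prominent = line.bold or line.font_size > body_font_size * 1.15
    return short and prominent


def classify_heading(text):
    """Domain heuristic: map a heading to a section label via cue phrases."""
    lowered = text.lower()
    for cue, label in SECTION_CUES.items():
        if cue in lowered:
            return label
    return "OTHER"


def segment(lines, body_font_size):
    """Group lines into (label, body_lines) segments, one per detected heading."""
    segments = []
    for line in lines:
        if is_heading(line, body_font_size):
            segments.append((classify_heading(line.text), []))
        elif segments:
            segments[-1][1].append(line.text)
        else:
            # Text before the first heading (e.g. name and contact details).
            segments.append(("PREAMBLE", [line.text]))
    return segments


cv = [
    Line("Jane Doe", 10.0, False),
    Line("Work Experience", 12.0, True),
    Line("Acme Corp, 2010-2013", 10.0, False),
    Line("Education", 12.0, True),
    Line("MSc Informatics", 10.0, False),
]
result = segment(cv, body_font_size=10.0)
```

Because the segments are returned in document order, the label sequence could itself be used as a signal, for instance whether work experience precedes education.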

Keywords: language technology
Published 2 Oct. 2013 04:40 - Last modified 1 Dec. 2017 11:02

Supervisor(s)

Student(s)

Scope (credits)

60