Resource Creation

Language technology R&D is inherently data-driven: most development and experimentation are organized around electronic language data. The project advances the fitness of semantic parsing technology for diverse types of user-generated Web content. Few existing corpora exemplify the linguistic complexity and variation of this kind of language. Hence, this track develops and applies tools to harvest and preprocess substantial samples of on-line content.

Materials will be drawn from hard- and software user forums, reviews of consumer electronics, science and technology blogs, Wikipedia, and open-access research literature. This selection reflects linguistic variation (ranging from edited, formal language to dynamic and informal language), establishes the broad domain of information technology, and connects to prior and ongoing work in the field. Perhaps unexpectedly at first glance, this task raises unsolved scientific questions related to (a) the interface between text-level markup and grammar, and (b) the ‘noise’ characteristic of some channels of on-line communication.

Markup elements like headings, bulleted lists, or italics can indicate specialized grammar, foreign words, or a so-called use–mention contrast. Linguistically relevant markup should be available to parsing, while purely design-oriented markup must be suppressed. Different markup standards are in use (e.g. (X)HTML, LaTeX, Wiki syntax, and various ASCII conventions), but the grammatical rules of English (or another language) should refer to the relevant text-level properties in logical terms, for example an abstract italics property. Extending earlier work by Ytrestøl, Flickinger, & Oepen (2009), the project will identify relevant markup elements in various types of Web content and define a suitably general markup language in which to annotate parser inputs. The resulting text collections will be automatically preprocessed (including sentence boundary detection) to provide inputs for our research on semantic parsing and interface corroboration; where possible, materials will be chosen that allow redistribution of the resulting corpora to the general public.
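
By way of illustration, the sketch below maps several format-specific italics conventions onto a single abstract property. The pattern set and the bracketed output notation are invented for exposition and are not the project's actual markup language; a minimal sketch only:

```python
import re

# Format-specific italics conventions; the pattern set is deliberately
# naive (e.g. it ignores nesting and Wiki bold ''' markers).
ITALICS_PATTERNS = [
    re.compile(r"<i>(.+?)</i>", re.DOTALL),    # (X)HTML
    re.compile(r"<em>(.+?)</em>", re.DOTALL),  # (X)HTML emphasis
    re.compile(r"\\textit\{(.+?)\}"),          # LaTeX
    re.compile(r"''(.+?)''"),                  # Wiki syntax
    re.compile(r"\*(.+?)\*"),                  # a common ASCII convention
]

def normalize_italics(text: str) -> str:
    """Rewrite source-specific italics markup as one abstract property,
    here rendered with the (invented) delimiters [i: ... ]."""
    for pattern in ITALICS_PATTERNS:
        text = pattern.sub(lambda m: "[i: " + m.group(1) + " ]", text)
    return text

print(normalize_italics("Use <i>grep</i> or \\textit{awk}, aka ''awk''."))
# -> Use [i: grep ] or [i: awk ], aka [i: awk ].
```

Once inputs are normalized in this way, grammar rules can condition on the abstract property without reference to any particular source format.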

When extracting text from PDF files, scraping a Web site, or analyzing the archives of a user forum, there is a certain level of non-linguistic ‘noise’, for example hyphenation at line breaks, intervening navigation elements or advertisements, and other artifacts of recovering linguistic content by ‘reversing’ its display presentation. Working together with DFKI, the project will investigate automated techniques for the detection, removal, or correction of such artifacts. Processing noisy language data is an HLT task of emerging importance, and the data and methods developed in this track should therefore attract immediate international interest.
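
The sketch below illustrates one such correction step, undoing hyphenation at line breaks in text extracted from PDF. The vocabulary-lookup heuristic is an illustrative assumption, not the project's actual method:

```python
import re

# Toy word list standing in for a real lexicon or corpus-derived vocabulary.
VOCABULARY = {"preprocessing", "annotation", "user-generated"}

def dehyphenate(text: str, vocabulary=VOCABULARY) -> str:
    """Rejoin words split by a hyphen at a line break, keeping the hyphen
    only when the hyphenated form is itself a known word."""
    def rejoin(match: re.Match) -> str:
        joined = match.group(1) + match.group(2)
        hyphenated = match.group(1) + "-" + match.group(2)
        if joined.lower() in vocabulary:
            return joined        # line-break hyphen: drop it
        if hyphenated.lower() in vocabulary:
            return hyphenated    # genuine compound: keep the hyphen
        return joined            # default guess: treat as a soft hyphen
    return re.sub(r"(\w+)-\n(\w+)", rejoin, text)

print(dehyphenate("semantic pre-\nprocessing of user-\ngenerated content"))
# -> semantic preprocessing of user-generated content
```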

Finally, for project-internal evaluation and interface corroboration, parts of our UGC corpora will be annotated with gold-standard syntactic and semantic structures, resulting in what are commonly referred to as treebanks (even though semantic structures need not take the form of trees). Although such annotation is generally expensive to construct, the project setup facilitates the use of semi-automatic treebanking techniques. The Redwoods approach (Oepen, Flickinger, Toutanova, & Manning, 2004) allows the construction of comparatively detailed treebanks at greatly reduced cost, leveraging the parsers themselves and a specialized annotation tool to identify the correct analysis within the space of possible parser outputs. Following an initial round of parser adaptation to the various text types, treebanks of sizable subsets of the data will be prepared and then maintained through semi-automated updates until project completion.
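
To give an intuition for discriminant-based annotation in the spirit of the Redwoods approach, the sketch below shows how a few yes/no decisions on local properties of candidate analyses (‘discriminants’) can single out the intended parse. The toy parses and property names are invented; this is a conceptual sketch, not the actual annotation tool:

```python
# Each candidate analysis is characterized by a set of discriminants,
# i.e. local properties on which analyses differ.
candidates = {
    "parse-1": {"PP-attaches-to-VP", "saw-as-verb"},
    "parse-2": {"PP-attaches-to-NP", "saw-as-verb"},
    "parse-3": {"PP-attaches-to-NP", "saw-as-noun"},
}

def decide(candidates, discriminant, holds):
    """Keep only the analyses consistent with one annotator decision."""
    return {name: props for name, props in candidates.items()
            if (discriminant in props) == holds}

remaining = decide(candidates, "saw-as-verb", True)        # drops parse-3
remaining = decide(remaining, "PP-attaches-to-NP", False)  # drops parse-2
print(sorted(remaining))  # ['parse-1']
```

Because each decision eliminates every analysis that disagrees with it, the number of decisions needed grows far more slowly than the size of the parse forest, which is what keeps annotation cost low.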

By Stephan Oepen, Lilja Øvrelid