Semantic Parsing for English Wikipedia

The WeScience project develops semantic parsing technology for Wikipedia, i.e. software that provides abstract meaning representations for all sentences of Wikipedia. This work builds on an existing computational grammar for English, the LinGO English Resource Grammar (ERG).

However, the preliminary experiments of Ytrestøl et al. (2009) suggest that extracting linguistically relevant content from Wikipedia sources is a non-trivial problem. Some mark-up (e.g. in-text comments or code blocks) is clearly irrelevant and would obstruct semantic parsing. At the same time, mark-up that could be important for linguistic analysis (e.g. (sub)headings or hyperlinks) needs to be preserved.

This project will re-design and re-implement the preliminary infrastructure of Ytrestøl et al. (2009), building on a standard wiki mark-up parser, for example mwlib. Transformations like removal of irrelevant mark-up or template expansion will then be applied at the level of wiki parse trees. A central aspect of this work will be the definition and implementation of a description language for these transformations. Finally, preprocessing needs to be applied to the full Wikipedia and interfaced with the existing semantic parser, taking advantage of the UiO high-performance computing facilities.
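
To make the idea of tree-level transformations concrete, the following is a minimal sketch in Python, not a proposed design. It assumes a toy parse tree: the Node class, the node kinds ("comment", "code", "template", and so on), and the rule helpers drop and expand_templates are all hypothetical names invented for illustration and do not reflect the mwlib API. Each rule is a small function over nodes, and a list of rules stands in for a program in the transformation description language.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    """Toy parse-tree node (hypothetical; not the mwlib node type)."""
    kind: str                      # e.g. "article", "heading", "link", "comment"
    text: str = ""
    children: List["Node"] = field(default_factory=list)


# A rule maps a node to a replacement node, or to None to delete it.
Rule = Callable[[Node], Optional[Node]]


def drop(kind: str) -> Rule:
    """Rule that deletes every node of the given kind."""
    def rule(node: Node) -> Optional[Node]:
        return None if node.kind == kind else node
    return rule


def expand_templates(table: Dict[str, str]) -> Rule:
    """Rule that replaces known template nodes by plain text nodes."""
    def rule(node: Node) -> Optional[Node]:
        if node.kind == "template" and node.text in table:
            return Node("text", table[node.text])
        return node
    return rule


def transform(node: Node, rules: List[Rule]) -> Optional[Node]:
    """Apply all rules to a node, then recurse into surviving children."""
    for rule in rules:
        result = rule(node)
        if result is None:
            return None
        node = result
    children = [transform(child, rules) for child in node.children]
    kept = [child for child in children if child is not None]
    return Node(node.kind, node.text, kept)


# Drop irrelevant mark-up, expand one (made-up) template, keep the rest.
RULES: List[Rule] = [
    drop("comment"),
    drop("code"),
    expand_templates({"nbsp": " "}),
]

tree = Node("article", children=[
    Node("heading", "Semantics"),
    Node("comment", "TODO: expand"),
    Node("paragraph", children=[
        Node("text", "See"),
        Node("template", "nbsp"),
        Node("link", "parsing"),
        Node("code", "x = 1"),
    ]),
])

cleaned = transform(tree, RULES)
top_kinds = [child.kind for child in cleaned.children]
para_kinds = [child.kind for child in cleaned.children[1].children]
```

In this sketch, headings and hyperlinks pass through untouched because no rule matches them, while comments and code blocks are removed. A real description language would likely need richer patterns (context, attributes, ordering constraints) than these per-node functions provide.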

The project requires good programming skills and preferably some experience with large-scale software systems. The implementation language is not pre-determined but should interface easily with the wiki mark-up parser chosen, and be suitable for efficient, large-scale use. Please discuss further details with Stephan Oepen.

