This project is no longer available

Processing the Common Crawl

The non-profit Common Crawl Foundation seeks to help restore ‘Internet democracy’ by making very large Web snapshots available to the public, i.e. tens to hundreds of terabytes of documents crawled off the World-Wide Web at regular intervals. Web-scale frequency lists, language models, and word embeddings are beginning to play a role in current NLP research, but even though the ‘raw’ data is freely available, a number of technological barriers keep the average university researcher from putting the Common Crawl to use.

Specifically, a large collection of Web documents needs to be ‘refined’ before the data can serve for statistical modeling in NLP, for example through language identification, extraction of relevant, ‘clean’ linguistic content, and removal of boilerplate and duplicates. Although there are known solutions to each of these pre-processing tasks, applying them at the scale of the Common Crawl requires non-trivial sophistication in scientific computing. The Language Technology Group at IFI has acquired storage and processing allocations on the national supercomputing infrastructure to work with the Common Crawl. Through this MSc project, we are seeking to develop the technological infrastructure and in-house expertise to distill various very large samples of linguistic data from the crawls, for example a Web-scale corpus of Norwegian text, or to scale up and evaluate various techniques for computing word embeddings.
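As a rough illustration of the refinement steps named above, the following is a minimal Python sketch that streams documents through boilerplate removal, a language filter, and exact-duplicate removal. It is not the project's pipeline: the function names, the tiny stopword list standing in for a real language identifier, the line-length heuristic, and the thresholds are all assumptions made for this example. At the scale of the Common Crawl, each step would presumably be replaced by a dedicated tool and distributed over many compute nodes.

```python
import hashlib
from typing import Iterable, Iterator

# Hypothetical, tiny stopword set standing in for a real language identifier.
NORWEGIAN_HINTS = {"og", "ikke", "det", "er", "som", "på", "til", "av"}


def strip_boilerplate(text: str, min_words: int = 5) -> str:
    """Keep only lines with at least `min_words` words (a crude heuristic)."""
    kept = [line.strip() for line in text.splitlines()
            if len(line.split()) >= min_words]
    return "\n".join(kept)


def looks_norwegian(text: str, threshold: float = 0.2) -> bool:
    """Toy language check: share of tokens found in NORWEGIAN_HINTS."""
    tokens = text.lower().split()
    if not tokens:
        return False
    hits = sum(1 for token in tokens if token in NORWEGIAN_HINTS)
    return hits / len(tokens) >= threshold


def refine(documents: Iterable[str]) -> Iterator[str]:
    """Stream documents through cleaning, language filtering, and exact deduplication."""
    seen = set()  # digests of documents already emitted
    for raw in documents:
        text = strip_boilerplate(raw)
        if not looks_norwegian(text):
            continue
        # Hash the case- and whitespace-normalized text to detect exact repeats.
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield text
```

Because `refine` is a generator, documents are processed one at a time and only the set of digests is held in memory. That set still grows with the number of distinct documents, so this design only suggests the shape of a streaming approach and does not catch near-duplicates.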

This project will be conducted on the national ABEL compute cluster and the NorStore storage infrastructure, possibly in collaboration with colleagues at the University of Turku (Finland), where a similar initiative is underway. Good Un*x and coding skills are prerequisites for this project, and some prior knowledge of distributed computing or very large-scale stream processing would be helpful. However, the project can be adapted in various ways to individual interests and background, so please see Stephan and Erik to discuss possible directions for this work.

Keywords: language technology

Published Dec. 1, 2017 09:46 - Last modified Dec. 1, 2017 09:48

Supervisor(s)

Stephan Oepen Universitetet i Oslo
Erik Velldal Universitetet i Oslo

Student(s)

Kjetil Bugge Kristoffersen
