Human-in-the-loop Text Sanitization


Text sanitization is the task of editing a text document in order to mask all text spans whose occurrence may lead (directly or indirectly) to the identification of the individual being mentioned in that text. It is a task of great practical importance for many types of documents that may include sensitive personal information, such as court cases or electronic patient records.

The problem of automated text sanitization is an active area of research (see here and there for two of our recent papers on this topic). However, because current sanitization models are imperfect and improper sanitization can have serious privacy implications, text sanitization is rarely performed in a fully automated fashion. Human experts are typically required to verify the output of the sanitization tool and correct it where needed.

What is the best way to frame this collaboration between software tool and human expert? One solution is to use text sanitization tools to provide masking suggestions to a human expert, who makes the final decision. But a single human decision may have consequences that propagate across multiple text spans: for instance, if the human expert decides to leave "P. Lison" in clear text, other text spans containing the same words (or similar sequences, such as "Pierre Lison") should also be left in clear text in the document.
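To make the propagation idea concrete, here is a minimal sketch in Python. It is only an illustration: the function names (same_entity, propagate) and the matching rule (same last word, with initials allowed to stand for full first names) are assumptions made for this example, not a method prescribed by the project. A real system would likely rely on coreference or entity linking instead.

```python
import re


def tokens(mention: str) -> list[str]:
    """Lowercased word tokens of a mention, with punctuation dropped."""
    return re.findall(r"\w+", mention.lower())


def compatible(x: str, y: str) -> bool:
    """True if two tokens are equal, or one is an initial of the other."""
    return (
        x == y
        or (len(x) == 1 and y.startswith(x))
        or (len(y) == 1 and x.startswith(y))
    )


def same_entity(a: str, b: str) -> bool:
    """Hypothetical heuristic: the last tokens must match, and every other
    token of the shorter mention must be compatible with some token of the
    longer one."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb or ta[-1] != tb[-1]:
        return False
    short, long_ = sorted((ta[:-1], tb[:-1]), key=len)
    return all(any(compatible(s, l) for l in long_) for s in short)


def propagate(decided: str, keep_clear: bool, candidates: list[str]) -> dict[str, bool]:
    """Copy one human decision to every candidate span judged to refer
    to the same entity as the decided span."""
    return {c: keep_clear for c in candidates if same_entity(decided, c)}


# The expert leaves "P. Lison" in clear text; the decision spreads to
# "Pierre Lison" and "Lison", but not to "M. Lison" or "Oslo".
result = propagate("P. Lison", True, ["Pierre Lison", "Oslo", "Lison", "M. Lison"])
print(result)
```

The heuristic is deliberately crude: it would also merge two different people who share a surname and an initial, which is exactly the kind of error a learned model with human feedback is meant to reduce.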

The master's thesis will explore how to adapt existing text sanitization techniques to operate with a "human-in-the-loop". One solution would be, for instance, to train a machine learning model that provides masking suggestions based on a document and a partial set of human decisions recorded so far. The model would then be rerun after each human decision to update the masking suggestions.
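The interaction loop described above could be sketched as follows. Everything here is hypothetical: dummy_suggester is a word-overlap stand-in for the trained model the thesis would actually develop, spans are identified by their text only (a simplification), and the names review, suggest and ask are chosen for this illustration.

```python
from typing import Callable

# Span text -> True if the expert masked it, False if left in clear text.
Decisions = dict[str, bool]
Suggester = Callable[[list[str], Decisions], dict[str, float]]


def dummy_suggester(spans: list[str], decisions: Decisions) -> dict[str, float]:
    """Stand-in for a trained model. Returns a masking score in [0, 1] for
    each undecided span: 0.5 by default, otherwise the share of 'mask'
    decisions among already decided spans that share a word with it."""
    scores = {}
    for span in spans:
        if span in decisions:
            continue
        words = set(span.lower().split())
        votes = [
            masked
            for text, masked in decisions.items()
            if words & set(text.lower().split())
        ]
        scores[span] = sum(votes) / len(votes) if votes else 0.5
    return scores


def review(spans: list[str], suggest: Suggester,
           ask: Callable[[str, float], bool]) -> Decisions:
    """Ask the expert about one span at a time, rerunning the suggester
    after every decision so later suggestions reflect earlier choices."""
    decisions: Decisions = {}
    while True:
        scores = suggest(spans, decisions)  # rerun on the updated decisions
        if not scores:
            return decisions
        span = next(iter(scores))  # first undecided span, in document order
        decisions[span] = ask(span, scores[span])


def scripted_expert(span: str, score: float) -> bool:
    """Simulated expert: fixed answers for two spans, otherwise accepts
    the suggestion (mask when the score is at least 0.5)."""
    fixed = {"P. Lison": False, "Oslo": True}
    return fixed.get(span, score >= 0.5)


final = review(["P. Lison", "Oslo", "Pierre Lison"], dummy_suggester, scripted_expert)
print(final)
```

In this run, the decision to leave "P. Lison" in clear text lowers the masking score of "Pierre Lison" on the next call to the suggester, so the simulated expert ends up leaving it in clear text as well.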

Prerequisites: Good programming skills and experience developing and evaluating NLP models. Interest in the topic of privacy-enhancing NLP. Knowledge of Norwegian is a plus (one possibility would be to collaborate with Lovdata on this topic).

Keywords: Natural Language Processing, privacy

Published Oct. 4, 2023 09:26 - Last modified Oct. 4, 2023 09:26

Supervisor(s)

Pierre Lison Universitetet i Oslo
