Developing a Tool for Annotating Threat Reports

Threat intelligence consists largely of collecting data from different sources. Many sources provide threat intelligence as English prose, which requires a human analyst to read it and extract the relevant information. There have been several attempts to use Natural Language Processing (NLP) to extract relevant information from larger text corpora, including in threat intelligence. However, training a machine learning model to extract the correct information requires annotated data. Such annotated data is time-consuming to create, and few datasets have been published.

ACT is a threat intelligence platform resulting from a BIA project funded by The Norwegian Research Council and mnemonic. The platform has a data model that consists of the most relevant concepts in threat intelligence and describes not only the object types but also the relationships between them. In the platform, all information is stored as relationships. Including data obtained through NLP therefore requires extracting not only the relevant objects from text but also the relationships between those objects.

Related to the ACT project, SCIO is an NLP engine that extracts relevant data from English prose. The engine does not currently extract relationships, and it could potentially be more effective if given more training data.

In this assignment, the goal is to provide training data for NLP in cyber threat intelligence (CTI) and, with the help of mnemonic, to investigate whether these training sets can improve the NLP engine in use today. The suggestion is to create a tool that eases the annotation process and to annotate 30-40 different threat reports.
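To make the annotation task more concrete, the following is a minimal sketch of how one annotated report could be represented, assuming a span-based format in which labelled text spans act as objects and typed links between spans act as relationships. The class names, the labels "threatActor" and "tool", the relationship type "uses", and the example sentence are all hypothetical illustrations; they are not taken from the ACT data model or from SCIO.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class EntitySpan:
    """A labelled character span in the report text (end is exclusive)."""
    start: int
    end: int
    label: str  # hypothetical object type, e.g. "threatActor" or "tool"


@dataclass(frozen=True)
class Relation:
    """A directed, typed link between two annotated spans."""
    source: EntitySpan
    kind: str  # hypothetical relationship type, e.g. "uses"
    target: EntitySpan


@dataclass
class AnnotatedReport:
    """A report text together with its object and relationship annotations."""
    text: str
    entities: List[EntitySpan] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

    def annotate(self, phrase: str, label: str) -> EntitySpan:
        """Label the first occurrence of `phrase` in the text."""
        start = self.text.find(phrase)
        if start < 0:
            raise ValueError(f"{phrase!r} not found in report text")
        span = EntitySpan(start, start + len(phrase), label)
        self.entities.append(span)
        return span

    def relate(self, source: EntitySpan, kind: str, target: EntitySpan) -> Relation:
        """Record a typed relationship between two existing spans."""
        relation = Relation(source, kind, target)
        self.relations.append(relation)
        return relation

    def triples(self) -> List[Tuple[str, str, str]]:
        """Render each relationship as (source text, type, target text)."""
        return [
            (
                self.text[r.source.start:r.source.end],
                r.kind,
                self.text[r.target.start:r.target.end],
            )
            for r in self.relations
        ]


# Invented example sentence; the names are placeholders, not real threat data.
report = AnnotatedReport("ExampleGroup used ExampleRAT against several targets.")
actor = report.annotate("ExampleGroup", "threatActor")
tool = report.annotate("ExampleRAT", "tool")
report.relate(actor, "uses", tool)
print(report.triples())
```

Storing character offsets rather than only the matched strings keeps each annotation tied to its exact position in the report, which is typically what sequence-labelling training pipelines need; the triples view shows how the same annotations could be read as relationships between objects.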

The results should be open-sourced and made available on GitHub.
 

Published Aug. 31, 2021 09:38 - Last modified Aug. 31, 2021 09:44

Supervisor(s)

Scope (credits)

60