AI-based Data Enrichment and Analysis for Hard-to-Treat Diseases

The main goal of the thesis is to investigate the potential of AI-based data enrichment and analysis techniques and technologies in the context of hard-to-treat diseases. An example of such a disease is primary sclerosing cholangitis (PSC), a chronic inflammatory liver disease without effective treatment and the leading indication for liver transplantation in Norway.

Note: this topic is offered as part of a collaboration project between SINTEF AS, Oslo University Hospital, and OsloMet – Oslo Metropolitan University.

To develop potential treatments for hard-to-treat diseases, a systemic understanding of the current knowledge on their pathogenesis is fundamental. AI-based data enrichment and analysis techniques could help in various ways. For example, collecting and analyzing publicly available research (e.g., scientific papers and datasets) using large language models (LLMs) could facilitate summarizing and categorizing existing theories for why such diseases occur or why specific features of the diseases develop. This information can then be integrated with information from extensive biological databases to refine disease pathway analysis and optimize the selection of druggable candidates. Work to be done includes:

Collect and integrate relevant data for hard-to-treat diseases from publicly available sources: Identify relevant publicly available sources such as published literature (e.g., PubMed, bioRxiv/medRxiv), grant databases (e.g., ERC dashboard, NIH projects, PSC partner grants), and clinical trial registries (clinicaltrials.gov). Then develop a data extraction strategy for hard-to-treat diseases, including a semantic schema (what kind of information is to be extracted), and identify techniques and technologies (such as LLMs with prompt engineering and data linking) that can facilitate the extraction. This involves developing a generic data extraction pipeline and a knowledge graph to integrate and enrich the collected data.
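
As a rough illustration of how a semantic schema and a knowledge graph could fit together in such a pipeline, the following minimal Python sketch stores extracted statements as subject-relation-object triples with their provenance. All names here (Statement, KnowledgeGraph, extract_statements) and the sample records are hypothetical placeholders, not part of the project; the extraction step, which would be an LLM prompt in a real pipeline, is replaced by reading pre-annotated triples.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Statement:
    """One extracted fact (subject --relation--> object) plus its provenance."""
    subject: str
    relation: str
    obj: str
    source: str  # identifier of the record the fact was extracted from


@dataclass
class KnowledgeGraph:
    """Minimal in-memory graph: deduplicated triples mapped to their sources."""
    triples: dict = field(default_factory=dict)  # (s, r, o) -> set of sources

    def add(self, st: Statement) -> None:
        # Normalise entity case so the same fact from two sources is merged.
        key = (st.subject.lower(), st.relation, st.obj.lower())
        self.triples.setdefault(key, set()).add(st.source)

    def about(self, entity: str) -> list:
        """Return all triples that mention the entity as subject or object."""
        e = entity.lower()
        return sorted(k for k in self.triples if e in (k[0], k[2]))


def extract_statements(record: dict) -> list:
    """Stand-in for the extraction step (assumed to be LLM-based in practice).

    Here it only reads pre-annotated triples from the record.
    """
    return [Statement(s, r, o, record["id"])
            for s, r, o in record.get("annotations", [])]


def build_graph(records) -> KnowledgeGraph:
    kg = KnowledgeGraph()
    for rec in records:
        for st in extract_statements(rec):
            kg.add(st)
    return kg


# Invented placeholder records, not real literature or trial data.
records = [
    {"id": "paper-1",
     "annotations": [("DiseaseX", "associated_with", "CellTypeA")]},
    {"id": "trial-7",
     "annotations": [("diseasex", "associated_with", "celltypea"),
                     ("DrugB", "tested_in", "DiseaseX")]},
]
kg = build_graph(records)
```

Keeping the source identifiers with each triple is one possible design choice: it lets later analysis steps trace a claim back to the papers, grants, or trials that support it.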

Identify and analyze various disease theories: Devise mechanisms to find explanations in the extracted data for why hard-to-treat diseases occur and why specific features of such diseases develop. This task will include the use of LLMs and data enrichment techniques to further extract and analyze data, and explainable AI approaches to reason about the identified theories.

Recommend potential druggable candidates for therapeutic options: Based on biological data relevant to the identified disease theories (e.g., specific cell types or molecules), use AI-based techniques to facilitate the filtering of disease pathways and to assess the potential relevance of existing drugs. This task will include identifying and assessing biological datasets such as protein-protein interaction databases (STRING) and pathway databases (KEGG, GO), extracting data from such databases, integrating it with the identified disease theories and drug databases, and using AI techniques to recommend possible druggable candidates based on the collected data.
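
One very simple way to picture the candidate-filtering idea is a neighbourhood score on a protein-protein interaction network: proteins that interact with many proteins implicated by a disease theory are ranked higher, and each is paired with any known drugs that target it. The sketch below assumes this scoring rule purely for illustration; it is not the project's method, the function name rank_candidates is hypothetical, and the protein and drug identifiers are invented placeholders rather than data from STRING, KEGG, GO, or any drug database.

```python
from collections import defaultdict


def rank_candidates(interactions, seeds, drug_targets):
    """Rank non-seed proteins by the number of seed proteins they interact with.

    interactions: iterable of (protein_a, protein_b) undirected edges
    seeds:        proteins implicated by a disease theory
    drug_targets: mapping protein -> list of drugs known to target it
    Returns a list of (protein, score, drugs), highest score first.
    """
    neighbors = defaultdict(set)
    for a, b in interactions:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seed_set = set(seeds)
    scores = {}
    for protein, nbrs in neighbors.items():
        if protein in seed_set:
            continue  # seeds are the starting point, not candidates
        hits = len(nbrs & seed_set)
        if hits:
            scores[protein] = hits

    # Sort by descending score, then by name for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(p, s, sorted(drug_targets.get(p, []))) for p, s in ranked]


# Invented placeholder network and drug mapping, for illustration only.
interactions = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"),
                ("P2", "P4"), ("P3", "P4"), ("P4", "P5")]
seeds = ["P1", "P2"]
drug_targets = {"P3": ["DrugA"], "P5": ["DrugB"]}

result = rank_candidates(interactions, seeds, drug_targets)
```

In this toy network, P3 interacts with both seed proteins and has a known drug, so it ranks first; P5 has a drug but no direct link to a seed, so it is not returned.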

Keywords: AI, health, data science, data enrichment

Published May 17, 2024 10:36 - Last modified May 17, 2024 12:29

Supervisor(s)

Ahmet Soylu, Universitetet i Oslo
Espen Melum, Universitetet i Oslo
Dumitru Roman
Xiaojun Jiang
Hui Song
