Developing potential treatments for hard-to-treat diseases requires a systematic understanding of current knowledge about their pathogenesis. AI-based data enrichment and analysis techniques could help in several ways. For example, using large language models to collect and analyze publicly available research (e.g., scientific papers and datasets) could help summarize and categorize existing theories of why such diseases occur or why specific disease features develop. This information can then be integrated with information from extensive biological databases to refine disease pathway analysis and improve the selection of druggable candidates. Work to be done includes:
Collect and integrate relevant data for hard-to-treat diseases from publicly available sources: Identify relevant publicly available sources such as published literature (e.g., PubMed, bioRxiv/medRxiv), grant databases (e.g., ERC dashboard, NIH projects, PSC partner grants), and clinical trial registries (e.g., clinicaltrials.gov). Then develop a data extraction strategy for hard-to-treat diseases, including a semantic schema (what kind of information is to be extracted), and identify techniques and technologies (such as LLMs with prompt engineering and data linking) that can support the extraction. This involves developing a generic data extraction pipeline and a knowledge graph to integrate and enrich the collected data.
Identify and analyze various disease theories: Devise mechanisms to find explanations in the extracted data for why hard-to-treat diseases occur and why specific features of such diseases develop. This task will use LLMs and data enrichment techniques to further extract and analyze data, and explainable AI approaches to reason about the identified theories.
Recommend potential druggable candidates for therapeutic options: Based on biological data relevant to the identified disease theories (e.g., specific cell types or molecules), use AI-based techniques to support the filtering of disease pathways and the assessment of the potential relevance of existing drugs. This task will include identifying and assessing biological datasets such as protein-protein interaction databases (e.g., STRING) and pathway and annotation databases (e.g., KEGG, GO), extracting data from them, integrating that data with the identified disease theories and drug databases, and using AI techniques to recommend druggable candidates based on the collected data.
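
As a rough illustration of how the three tasks could fit together, the following minimal Python sketch stores extracted statements as subject-relation-object triples with their source, keeps them in a small in-memory knowledge graph, and ranks drugs by how many disease theories implicate at least one of their targets. Everything in it is an assumption made for illustration: the Triple and KnowledgeGraph classes, the relation names "implicates" and "targets", the rank_drugs function, and the placeholder identifiers (theory:T1, gene:G1, drug:D1, doc:1, ...) do not come from any real database or from the project itself. The counting score is only a stand-in for whatever AI-based ranking the project eventually adopts.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One extracted statement (hypothetical schema): subject --relation--> obj."""
    subject: str
    relation: str
    obj: str
    source: str  # identifier of the record the statement was extracted from


class KnowledgeGraph:
    """Tiny in-memory triple store that also remembers where each edge came from."""

    def __init__(self) -> None:
        self._objects = defaultdict(set)
        self._sources = defaultdict(set)

    def add(self, triple: Triple) -> None:
        self._objects[(triple.subject, triple.relation)].add(triple.obj)
        key = (triple.subject, triple.relation, triple.obj)
        self._sources[key].add(triple.source)

    def objects(self, subject: str, relation: str) -> set:
        # Return a copy so callers cannot mutate the graph by accident.
        return set(self._objects.get((subject, relation), ()))

    def sources(self, subject: str, relation: str, obj: str) -> set:
        return set(self._sources.get((subject, relation, obj), ()))


def rank_drugs(kg: KnowledgeGraph, theories: list, drugs: list) -> list:
    """Score each drug by the number of theories implicating one of its targets."""
    scores = {}
    for drug in drugs:
        targets = kg.objects(drug, "targets")
        supported = [t for t in theories if targets & kg.objects(t, "implicates")]
        scores[drug] = len(supported)
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Placeholder data only; none of these identifiers refer to real entities.
kg = KnowledgeGraph()
for triple in [
    Triple("theory:T1", "implicates", "gene:G1", "doc:1"),
    Triple("theory:T1", "implicates", "gene:G2", "doc:1"),
    Triple("theory:T2", "implicates", "gene:G2", "doc:2"),
    Triple("drug:D1", "targets", "gene:G2", "doc:3"),
    Triple("drug:D2", "targets", "gene:G1", "doc:3"),
    Triple("drug:D3", "targets", "gene:G9", "doc:4"),
]:
    kg.add(triple)

ranking = rank_drugs(kg, ["theory:T1", "theory:T2"], ["drug:D1", "drug:D2", "drug:D3"])
print(ranking)  # [('drug:D1', 2), ('drug:D2', 1), ('drug:D3', 0)]
```

Keeping the source on every triple means a recommendation can be traced back to the records that support it, which is one simple way to serve the explainability goal described above.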