This project is no longer available

Developing a reproducible, flexible data analysis pipeline for Next Generation Sequencing (NGS) data with Snakemake

Background

High-throughput sequencing (HTS) technologies have changed not only molecular biology research but also several clinical areas that rely on genomic analysis. The rapid growth of genomic data has introduced new challenges, primarily in transforming raw sequencing data into results that researchers and clinicians can interpret. Beyond the computational challenges, analysts must contend with a wide range of bioinformatics tools that are frequently combined into computational pipelines. Developing and testing pipelines typically requires a substantial amount of time, and the process is prone to errors that are difficult to identify from the output files alone. A further concern is ensuring that an analysis is reproducible across multiple computing centers or platforms with inconsistent software versions.

Project Goals

Processing the large numbers of deeply sequenced samples produced by high-throughput sequencing methods demands efficient, flexible, and reproducible bioinformatics workflows. This makes it difficult to select the most effective tools and to arrange them so that they produce the desired output systematically. The goal of this project is to develop a pipeline in Snakemake (a workflow management system widely used for building bioinformatics pipelines, including for NGS data sets), packaged as a Docker container, for processing sequencing reads. The pipeline is to cover quality control, mapping, assembly, transcriptomics, metagenomics, methylation analysis, and ChIP-seq analysis. It should significantly reduce the manual work of executing commands and remove the ambiguity that arises when results from testing several parameter settings accumulate. It will be designed to run the analysis systematically on a large number of samples in a single pass, ideally for researchers who intend to deploy the pipeline on their local servers, e.g., Saga at UiO. The pipeline scripts and user guide will be publicly available on GitHub.
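One way to picture how a workflow can avoid ambiguity between results from different parameter settings is to encode the settings in every output path, so each sample and setting maps to exactly one file. The following is a minimal sketch in plain Python, not the project's pipeline: the function name `planned_outputs`, the `results` directory, the `.bam` suffix, and the parameter name `mismatches` are all hypothetical and chosen only for illustration. Snakemake offers its own `expand()` helper and wildcards for this kind of enumeration.

```python
from itertools import product
from pathlib import PurePosixPath


def planned_outputs(samples, params, root="results"):
    """Return one output path per (sample, parameter combination) pair.

    `params` maps a parameter name to the list of values to test.
    The directory layout is an assumption made for this sketch.
    """
    names = sorted(params)  # fixed order keeps the tags stable between runs
    paths = []
    for sample in samples:
        # Cartesian product over all tested values of all parameters
        for values in product(*(params[name] for name in names)):
            tag = "_".join(f"{name}-{value}" for name, value in zip(names, values))
            paths.append(str(PurePosixPath(root) / tag / f"{sample}.bam"))
    return paths


if __name__ == "__main__":
    # Hypothetical sample names and a hypothetical mapping parameter
    for path in planned_outputs(["sampleA", "sampleB"], {"mismatches": [0, 2]}):
        print(path)
```

Because the parameter values are part of the path, results from different settings cannot overwrite each other, and a workflow manager can tell from the file system which combinations still need to be computed.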

Practical information

The candidate will be co-supervised by Adnan Hashim (adnan.hashim@ncmm.uio.no), John Arne Dahl (j.a.dahl@medisin.uio.no) and Torbjørn Rognes (torognes@ifi.uio.no). The supervisors are affiliated with UiO and Oslo University Hospital and have strong expertise in bioinformatics, computer science and biology.

Requirements

Students must be proficient in Python and Bash and have experience with high-performance computing (HPC) in a Linux/Unix environment, all of which are widely used in bioinformatics. Knowledge of Snakemake is an advantage but not required. No prior knowledge of biology is required, though it may be useful depending on the task. The candidate must be hard-working and eager to learn new skills.

This master's project is offered as a long master's project (60 credits).

Published Sep. 27, 2022 12:38 - Last modified Nov. 28, 2023 13:14

Scope (credits)

60