This project is no longer available

Rapid DNA sequence comparison

The similarity of two DNA sequences can be accurately determined using optimal alignment methods, but these are usually very time-consuming. When comparing a huge number of sequences, or very long sequences, optimal alignment becomes computationally too expensive. Many heuristics have been developed to speed up this process and allow the similarity of two sequences to be estimated more quickly without a full alignment. These approaches are often called alignment-free methods. Such general comparison methods are important in many areas of bioinformatics where similar sequences must be identified quickly and accurately. They are used in clustering, taxonomic assignment, and various prediction tasks.

Identifying shared k-mers, i.e. words consisting of k consecutive nucleotides, is a common theme in many of these approaches. The number of shared k-mers and the distances between them are usually considered. K-mers with one or a few mismatching positions are sometimes used as well, and in some cases reduced alphabets are used. Several different metrics for measuring sequence similarity based on k-mers have been proposed. Different data structures are used to quickly identify the interesting sequences, including hash tables, bitmaps, suffix trees, suffix indices, FM-indices and more. These approaches are common in older programs like FASTA and BLAST as well as in more recent tools like BLAT, RAPsearch, LAST, Diamond, and USEARCH. Programs like USEARCH and VSEARCH try to identify the DNA sequences with the best global alignment by looking for the sequences with the highest number of shared k-mers (the default k is 8).
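As a rough illustration of the shared k-mer idea, the sketch below builds a hash-table index from k-mers to sequence identifiers and ranks database sequences by the number of distinct k-mers they share with a query. It is a minimal sketch and not the method used by any of the tools named above; the function names (kmers, build_index, rank_by_shared_kmers) and the dictionary-based database are assumptions made for this example, and ambiguous nucleotides and reverse complements are ignored.

```python
from collections import defaultdict


def kmers(seq, k=8):
    """Return the set of distinct k-mers in seq (hypothetical helper)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(database, k=8):
    """Map each k-mer to the set of database sequence ids that contain it."""
    index = defaultdict(set)
    for seq_id, seq in database.items():
        for word in kmers(seq, k):
            index[word].add(seq_id)
    return index


def rank_by_shared_kmers(query, index, k=8):
    """Count distinct query k-mers found in each database sequence.

    Returns (seq_id, shared_count) pairs, highest count first.
    """
    counts = defaultdict(int)
    for word in kmers(query, k):
        for seq_id in index.get(word, ()):
            counts[seq_id] += 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Toy data, invented for this example only.
    database = {"a": "ACGTACGTAC", "b": "TTTTTTTTTT"}
    index = build_index(database, k=4)
    print(rank_by_shared_kmers("ACGTAC", index, k=4))
```

Only the top-ranked candidates would then need to be passed on to a full alignment, which is where the time saving comes from.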

The aim of this project is to explore approaches to identifying similar sequences based on the k-mers shared between DNA sequences. What is the optimal k-mer length in various applications? How should repeated k-mers be handled? How well do the k-mer-based metrics correspond to the alignment-based metrics? Which data structures are efficient for storing the indices?
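To make the question about repeated k-mers concrete, the sketch below compares two possible similarity scores: a Jaccard index over distinct k-mers, which ignores repeats, and a multiset variant that takes k-mer multiplicities into account. Both are offered only as example metrics under the assumption that a Jaccard-style score is of interest; the project text does not prescribe them, and the function names are invented for this illustration.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every k-mer occurrence in seq, including repeats."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def jaccard_distinct(a, b, k):
    """Jaccard index over distinct k-mers; repeated k-mers count once."""
    ka, kb = set(kmer_counts(a, k)), set(kmer_counts(b, k))
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def jaccard_weighted(a, b, k):
    """Multiset Jaccard: sum of min counts divided by sum of max counts."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    shared = sum((ca & cb).values())  # Counter & takes the minimum count
    total = sum((ca | cb).values())   # Counter | takes the maximum count
    return shared / total if total else 0.0


if __name__ == "__main__":
    # A low-complexity toy pair where the two scores disagree.
    print(jaccard_distinct("AAAAAA", "AAAT", 2))
    print(jaccard_weighted("AAAAAA", "AAAT", 2))
```

On repeat-rich input the two scores can differ substantially, which is one way to see why the treatment of repeated k-mers affects how well a k-mer metric tracks alignment-based similarity.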

The project is suitable for anyone who is interested in bioinformatics and has considerable programming experience. Some background in statistics is recommended.

Supervisor: Torbjørn Rognes (BMI/IFI)

Published Oct. 3, 2016 15:55 - Last modified Oct. 24, 2016 10:46

Supervisor(s)

Scope (credits)

60