This project is no longer available

Understanding genetic variation with genome graphs

Every human cell has a DNA sequence consisting of approximately 3 billion base pairs, each of which we usually refer to as A, C, T or G, meaning that the genome can be represented as a very long string, e.g. ACCAACTGAG... and so on. While this string is quite similar for any two individuals, small differences (e.g. an A instead of a T at a single location) can have a huge impact. It is thus interesting and important to study the differences between genomes in order to understand how diseases and traits may be rooted in genetic variation. Luckily, we are today able to sequence the genome of an individual relatively cheaply, meaning that we can obtain the full DNA sequence of the individual from e.g. a blood sample. We are, however, not able to read the full 3-billion-character sequence directly. Instead, we typically get the genome sequence divided into millions of overlapping strings, each about 150 characters (nucleotides) long. These strings are usually called reads, and they need to be pieced together in order to make sense.

"Gluing" these short strings together (known as genome assembly) is a complicated task requiring a lot of computational resources. Thus, it is much more common to use a reference genome (a genome of a standard and typical human being) and compare each short read to this reference, by finding the position in the reference string where each read seems to match well. This works well when the reference is quite similar to the individual we have sequenced, and makes us able to detect differences (e.g. when the read has an A and the reference a T). However, when the individual differs a lot from the reference, matching reads from the individual to one standard reference does not always work well.
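The idea of matching a read against a reference can be illustrated with a minimal Python sketch. It is not how real read mappers work (they use indexes and handle insertions and deletions); it simply slides the read along the reference and counts mismatching characters. The names `hamming` and `map_read` and the `max_mismatches` parameter are hypothetical and chosen only for illustration.

```python
def hamming(a, b):
    """Count positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_read(reference, read, max_mismatches=2):
    """Find the reference position where the read matches best.

    Brute-force illustration only: every offset is tried, and only
    substitutions (no insertions or deletions) are considered.
    Returns (position, mismatches), or None if no offset has at most
    max_mismatches differences. Each mismatch is a tuple
    (reference_position, reference_base, read_base).
    """
    best_pos, best_dist = None, max_mismatches + 1
    for pos in range(len(reference) - len(read) + 1):
        dist = hamming(reference[pos:pos + len(read)], read)
        if dist < best_dist:
            best_pos, best_dist = pos, dist
    if best_pos is None:
        return None
    mismatches = [
        (best_pos + i, reference[best_pos + i], read[i])
        for i in range(len(read))
        if reference[best_pos + i] != read[i]
    ]
    return best_pos, mismatches


if __name__ == "__main__":
    reference = "ACCAACTGAGGTTCA"  # toy reference, far shorter than a genome
    read = "ACTGTGG"               # toy read with one substitution
    print(map_read(reference, read))
```

In this toy example the read aligns at position 4 of the reference, and the single reported mismatch (reference A, read T) is the kind of difference described above.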

An approach that has recently become more popular is to abandon the idea of comparing reads to a single human reference. As researchers have sequenced more and more individuals worldwide (there are thousands of data sets available online), we now know quite well where in the genome, and how, individuals typically differ from each other. This variation data can be represented by genome graphs, where nodes represent sequences and edges connect the sequences together into paths that represent individual genomes. This data structure can for instance represent all known human genomic variation, and when a new individual is sequenced the reads can be compared to this graph instead of being compared to a single human reference.
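A minimal sketch of such a graph in Python may make the structure concrete. The class `GenomeGraph` and its methods are hypothetical names for illustration and are not taken from any existing tool; the toy graph encodes one variant where an A or a T can occur.

```python
class GenomeGraph:
    """Toy genome graph: nodes hold sequences, directed edges link them."""

    def __init__(self):
        self.sequences = {}  # node id -> DNA string
        self.edges = {}      # node id -> set of successor node ids

    def add_node(self, node_id, sequence):
        self.sequences[node_id] = sequence
        self.edges.setdefault(node_id, set())

    def add_edge(self, from_id, to_id):
        self.edges[from_id].add(to_id)

    def path_sequence(self, path):
        """Spell out the genome sequence represented by a path of node ids."""
        for a, b in zip(path, path[1:]):
            if b not in self.edges[a]:
                raise ValueError(f"no edge from {a} to {b}")
        return "".join(self.sequences[n] for n in path)

    def all_paths(self, start, end):
        """Yield every path from start to end (assumes the graph is acyclic)."""
        if start == end:
            yield [start]
            return
        for successor in sorted(self.edges[start]):
            for rest in self.all_paths(successor, end):
                yield [start] + rest


if __name__ == "__main__":
    graph = GenomeGraph()
    graph.add_node(1, "ACCA")
    graph.add_node(2, "A")      # one variant at this location
    graph.add_node(3, "T")      # the alternative variant
    graph.add_node(4, "CTGAG")
    for a, b in [(1, 2), (1, 3), (2, 4), (3, 4)]:
        graph.add_edge(a, b)

    for path in graph.all_paths(1, 4):
        print(path, graph.path_sequence(path))
```

Each path through the graph corresponds to one possible genome, so a read carrying either variant can match some path without any mismatch.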

We are developing tools for building such graph structures and analysing genetic variation using these graphs. This master's project will continue this work by developing tools and/or algorithms for analysing genetic variation using real sequence data that is available online. The work will preferably be carried out using Python, and an interest in high-performance computing (using numpy and Cython) can be a benefit. The thesis may take one of several paths, depending on the interests of the students. Some possibilities are: 1) Focusing on algorithm optimization using Cython or other techniques for high-performance computing in Python (including multithreading), 2) Focusing on statistical methods for determining how likely genetic variants are in an individual given a read data set, or 3) Focusing more on tool development and method implementation in Python (one relevant direction may be to build on already developed code and generalize it to other similar problems). No prior knowledge of biology/genomics is needed.

Published Oct. 16, 2021 10:53 - Last modified Dec. 5, 2021 10:38

Supervisor(s)

Scope (credits)

60