Epilogue: Genome Assembly Faces Real Sequencing Data
Understand the complexities of assembling genomes from real sequencing data by learning about challenges like imperfect k-mer coverage, sequencing errors, and bubble removal in de Bruijn graphs. Discover how contigs are generated and the impact of read errors on genome assembly, equipping you to tackle practical sequencing problems.
Breaking reads into k-mers
Our discussion of genome assembly has thus far relied on several simplifying assumptions, so applying de Bruijn graphs to real sequencing data is not straightforward. Below, we describe practical challenges introduced by quirks of modern sequencing technologies, along with some computational techniques that have been devised to address them. For simplicity, we’ll first assume that reads are generated as contiguous substrings of a genome rather than as read-pairs.
Illumina sequencing technology
Given a k-mer substring of a genome, we define its coverage as the number of reads to which this k-mer belongs. We’ve taken for granted that a sequencing machine can generate all k-mers present in the genome, but this assumption of perfect k-mer coverage doesn’t hold in practice. For example, the popular Illumina sequencing technology generates reads that are approximately 300 nucleotides long, but this technology still misses many 300-mers present in the genome (even if the average coverage is very high), and nearly all the reads that it does generate have sequencing errors.
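The definition of coverage above can be made concrete with a short Python sketch. The toy genome and reads are illustrative assumptions rather than real sequencing data, the helper names (kmers, kmer_coverage, has_perfect_coverage) are hypothetical, and reads are assumed to be error-free substrings of the genome.

```python
from collections import Counter

def kmers(text, k):
    """Return all k-mer substrings of text, in order of position."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]

def kmer_coverage(genome, reads, k):
    """Map each k-mer of the genome to the number of reads that contain it."""
    counts = Counter()
    for read in reads:
        # set() so that a read is counted once per k-mer,
        # even if the k-mer occurs several times within that read
        counts.update(set(kmers(read, k)))
    return {kmer: counts[kmer] for kmer in kmers(genome, k)}

def has_perfect_coverage(genome, reads, k):
    """True if every k-mer of the genome occurs in at least one read."""
    return all(c > 0 for c in kmer_coverage(genome, reads, k).values())

# Toy data (an assumption for illustration): four 10-mer reads from a 20-nucleotide genome
genome = "ATGCCGTATGGACAACGACT"
reads = ["ATGCCGTATG", "GCCGTATGGA", "GTATGGACAA", "GACAACGACT"]

coverage = kmer_coverage(genome, reads, 10)
missed = [kmer for kmer, c in coverage.items() if c == 0]
print(len(missed), "of", len(coverage), "10-mers are missed")
print(has_perfect_coverage(genome, reads, 10))  # False
```

Real assemblers do not know the genome in advance, so this check is only possible on simulated data; it is meant to illustrate the definition, not to describe an assembly step.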
STOP and Think: Given a set of reads having imperfect k-mer coverage, can you find a parameter l < k so that the same reads have perfect l-mer coverage? What is the maximum value of this parameter?
The figure below (left) shows four 10-mer reads that capture some but not all of the 10-mers in an example genome. However, if we take the counterintuitive step of breaking these reads into shorter 5-mers (Figure 3.37, right), then these 5-mers exhibit perfect coverage. This read-breaking approach, in which we break reads into shorter k-mers, is used by many modern assemblers.
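Read breaking can be sketched in a few lines of Python. The genome and reads below are a toy example chosen for illustration (an assumption, not necessarily the data in the figure), and the function names break_reads and has_perfect_coverage are hypothetical.

```python
def kmers(text, k):
    """Return all k-mer substrings of text, in order of position."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]

def break_reads(reads, k):
    """Break every read into its k-mers and return the distinct k-mers."""
    return {kmer for read in reads for kmer in kmers(read, k)}

def has_perfect_coverage(genome, reads, k):
    """True if every k-mer of the genome appears among the broken reads."""
    return set(kmers(genome, k)) <= break_reads(reads, k)

# Toy data (an assumption for illustration): four error-free 10-mer reads
genome = "ATGCCGTATGGACAACGACT"
reads = ["ATGCCGTATG", "GCCGTATGGA", "GTATGGACAA", "GACAACGACT"]

print(has_perfect_coverage(genome, reads, 10))  # False: some 10-mers are missed
print(has_perfect_coverage(genome, reads, 5))   # True: the 5-mers cover the genome
```

Calling has_perfect_coverage for each candidate value between 1 and the read length is one simple way to explore the STOP and Think question above on simulated data.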
Read breaking must deal with a practical trade-off. On the one hand, the smaller the value of k, the larger the chance that the k-mer coverage is perfect. On the other hand, smaller values of k result in a more tangled de Bruijn graph, making it difficult to infer the genome from this graph.
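One rough way to observe this trade-off is to build the de Bruijn graph from the broken reads for several values of k and count the nodes with more than one outgoing edge. This is only a sketch: the toy reads are an assumption, the function names are hypothetical, and counting branching nodes is just one simple proxy for how tangled a graph is.

```python
from collections import defaultdict

def break_reads(reads, k):
    """Break every read into its k-mers and return the distinct k-mers."""
    return {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}

def de_bruijn(kmer_set):
    """Build a de Bruijn graph: each k-mer is an edge from its prefix to its suffix."""
    graph = defaultdict(set)
    for kmer in kmer_set:
        graph[kmer[:-1]].add(kmer[1:])
    return graph

def count_branching_nodes(graph):
    """Count nodes with more than one distinct outgoing edge."""
    return sum(1 for successors in graph.values() if len(successors) > 1)

# Toy data (an assumption for illustration): four error-free 10-mer reads
reads = ["ATGCCGTATG", "GCCGTATGGA", "GTATGGACAA", "GACAACGACT"]

for k in (3, 4, 5):
    graph = de_bruijn(break_reads(reads, k))
    print(k, count_branching_nodes(graph))
```

On this toy input, the number of branching nodes shrinks as k grows, which matches the intuition that shorter k-mers are more likely to repeat within the genome and therefore create ambiguous paths.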
Splitting the genome into contigs
Even after read breaking, most assemblies ...