Epilogue: Genome Assembly Faces Real Sequencing Data

Understand the complexities of assembling genomes from real sequencing data by learning about challenges like imperfect k-mer coverage, sequencing errors, and bubble removal in de Bruijn graphs. Discover how contigs are generated and the impact of read errors on genome assembly, equipping you to tackle practical sequencing problems.

We'll cover the following...

Breaking reads into k-mers
- Illumina sequencing technology
Splitting the genome into contigs
- Solution explanation
- Non-branching
Charging Station: Maximal Non-Branching Paths in a Graph
Assembling error-prone reads
- Bubble removal
Inferring multiplicities of edges in de Bruijn graphs
- Practical considerations

Breaking reads into k-mers

Our discussion of genome assembly has thus far relied on several idealized assumptions, so applying de Bruijn graphs to real sequencing data is not a straightforward procedure. Below, we describe practical challenges introduced by quirks in modern sequencing technologies and some computational techniques that have been devised to address these challenges. For simplicity, we’ll first assume that reads are generated as contiguous substrings of a genome rather than as read-pairs.

Illumina sequencing technology

Given a k-mer substring of a genome, we define its coverage as the number of reads to which this k-mer belongs. We’ve taken for granted that a sequencing machine can generate all k-mers present in the genome, but this assumption of perfect k-mer coverage doesn’t hold in practice. For example, the popular Illumina sequencing technology generates reads that are approximately 300 nucleotides long, but this technology still misses many 300-mers present in the genome (even if the average coverage is very high), and nearly all the reads that it does generate have sequencing errors.
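To make the definition of coverage concrete, here is a minimal Python sketch that counts, for each k-mer of a genome, the number of reads containing it. The function name `kmer_coverage`, the 20-nucleotide genome, and the four reads are illustrative assumptions, not real sequencing data. In practice the genome is unknown, so a check like this is only possible on toy examples.

```python
def kmer_coverage(genome, reads, k):
    """Map each k-mer of `genome` to the number of reads that contain it."""
    coverage = {}
    for i in range(len(genome) - k + 1):
        kmer = genome[i:i + k]
        if kmer not in coverage:
            # Count reads (not occurrences) in which this k-mer appears.
            coverage[kmer] = sum(kmer in read for read in reads)
    return coverage


# Hypothetical toy data: a short genome and four error-free 10-mer reads.
genome = "ATGCCGTATGGACAACGACT"
reads = ["ATGCCGTATG", "GCCGTATGGA", "GTATGGACAA", "GACAACGACT"]

coverage = kmer_coverage(genome, reads, 10)
missed = [kmer for kmer, count in coverage.items() if count == 0]
print(len(coverage), "distinct 10-mers;", len(missed), "have zero coverage")
```

A k-mer with coverage 0 is one that the reads miss entirely, which is exactly what imperfect k-mer coverage means.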

STOP and Think: Given a set of reads having imperfect k-mer coverage, can you find a parameter $l < k$ so that the same reads have perfect l-mer coverage? What is the maximum value of this parameter?

The figure below (Figure 3.37, left) shows four 10-mer reads that capture some but not all of the 10-mers in an example genome. However, if we take the counterintuitive step of breaking these reads into shorter 5-mers (Figure 3.37, right), then these 5-mers exhibit perfect coverage. This read breaking approach, in which we break reads into shorter k-mers, is used by many modern assemblers.
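Read breaking can be sketched in a few lines of Python. The helper names `break_reads` and `has_perfect_coverage`, as well as the genome and reads, are hypothetical choices for illustration and are not taken from the figure. The coverage check relies on knowing the genome, which is only possible for a toy example like this one.

```python
def break_reads(reads, k):
    """Return the set of all k-mers that occur in any of the reads."""
    return {
        read[i:i + k]
        for read in reads
        for i in range(len(read) - k + 1)
    }


def has_perfect_coverage(genome, reads, k):
    """True if every k-mer of `genome` occurs in at least one read."""
    genome_kmers = {genome[i:i + k] for i in range(len(genome) - k + 1)}
    return genome_kmers <= break_reads(reads, k)


# Hypothetical toy data: a short genome and four error-free 10-mer reads.
genome = "ATGCCGTATGGACAACGACT"
reads = ["ATGCCGTATG", "GCCGTATGGA", "GTATGGACAA", "GACAACGACT"]

print(has_perfect_coverage(genome, reads, 10))  # False
print(has_perfect_coverage(genome, reads, 5))   # True
```

The set returned by `break_reads` is what an assembler would use in place of the original reads when building the de Bruijn graph.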
