Search⌘ K
AI Features

The String Reconstruction Problem

Understand the complexities of reconstructing a genome string from its k-mer composition. Learn to apply computational strategies to assemble sequences while considering real-world challenges like repeats and overlaps in DNA reads.

Genome assembly is more difficult than you think

Before we introduce a computational problem modeling genome assembly, we’ll take a moment to discuss a few practical complications that make genome assembly more difficult than the Newspaper Problem.

  • First, DNA is double-stranded, and we have no way of knowing a priori which strand a given read derives from, meaning that we’ll not know whether to use a read or its reverse complement when assembling a particular strand of a genome.
  • Second, modern sequencing machines are not perfect, and the reads that they generate often contain errors. Sequencing errors complicate genome assembly because they prevent us from identifying all overlapping reads.
  • Third, some regions of the genome may not be covered by any reads, making it impossible to reconstruct the entire genome.

Since the reads generated by modern sequencers often have the same length, we may safely assume that reads are all k-mers for some value of k. The first part of this chapter will assume an ideal — and unrealistic — situation in which all reads come from the same strand, have no errors, and exhibit ...

String Composition Problem

Problem overview:
Generate the k-mer composition of a string.

Input:
A string Text and an integer k.
Output:
Compositionk​(Text), where the k-mers are arranged in lexicographic order.


Sample dataset:

5
CAATCCAAC

Sample output:

AATCC
ATCCA
CAATC
CCAAC
TCCAA

Python
from itertools import product
import copy
def composition(k, text):
# Write code here
return

Solution explanation

  • Line 5: We define kmers as an empty array.
  • Lines 6–7: We iterate over text.
    • Line 7: We add every k-mer with length k using the i:i+k to kmers.

Solving the String Composition Problem is a straightforward exercise, but in order to model genome assembly, we need to solve its inverse problem.

String Reconstruction Problem Problem overview:

Reconstruct a string from its k-mer composition.

Input: An integer k and a collection Patterns of k-mers.

Output: A string Text with k-mer composition equal to Patterns (if such a string exists).

Before we ask you to solve the String Reconstruction Problem, let’s consider the following example of a 3-mer composition:

AAT ATG GTT TAA TGT

The most natural way to solve the String Reconstruction Problem is to mimic ...