Solution: Repeated DNA Sequences

Let's solve the Repeated DNA Sequences problem using the Sliding Window pattern.

We'll cover the following

Statement
Solution
- Naive approach
- Optimized approach using sliding window

Statement

Given a string, dna, that represents a DNA subsequence, and a number $k$ , return all the contiguous subsequences (substrings) of length $k$ that occur more than once in the string. The order of the returned subsequences does not matter. If no repeated substring is found, the function should return an empty set.

The DNA sequence is composed of a series of nucleotides abbreviated as $A$ , $C$ , $G$ , and $T$ . For example, $ACGAATTCCG$ is a DNA sequence. When studying DNA, it is useful to identify repeated subsequences in it.

Constraints:

$1$ $\leq$ dna.length $\leq$ $10^3$
dna[i] is either A, C, G, or T.

$1 \leq k \leq 10$

Solution

So far, you’ve probably brainstormed some approaches on how to solve this problem. Let’s explore some of these approaches and figure out which one to follow while considering time complexity and any implementation constraints.

Naive approach

A naive approach would be to iterate through the input DNA sequence and add all the unique substrings of length $k$ to a set. If a substring is already present in a set, it is a repeated substring.

Here’s how the algorithm works:

We iterate over the string using a pointer $i$, ranging from $0$ to $(n - k)$ inclusive. This gives $(n - k + 1)$ starting positions, which is the number of $k$-length substrings present in the sequence.
At each iteration, we generate the current $k$ -length substring, i.e., dna[i]…dna[i + k - 1].
Next, we check if this substring is already present in the set.
- If it is, the current substring is a repeated sequence, so we add it to our output.
- Otherwise, the current substring has not yet been repeated, so we just add it to the set.
We repeat the above process for all $k$ -length substrings.
Once all $k$ -length substrings have been evaluated, we return the output.
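The steps above can be sketched in Python as follows. This is a minimal illustration, not the lesson's reference solution; the function name `find_repeated_sequences_naive` and the variable names are assumptions made for this example.

```python
def find_repeated_sequences_naive(dna, k):
    """Return the set of k-length substrings that occur more than once."""
    seen = set()    # substrings encountered so far
    output = set()  # substrings encountered at least twice

    # i runs over the n - k + 1 valid starting positions (none if k > n).
    for i in range(len(dna) - k + 1):
        window = dna[i:i + k]  # building this slice costs O(k)
        if window in seen:
            output.add(window)
        else:
            seen.add(window)
    return output


print(find_repeated_sequences_naive("ACTCT", 2))  # {'CT'}
```

Using a set for the output means a substring that appears three or more times is still reported only once.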

The time complexity of this approach is $O((n - k + 1) \times k)$, where $n$ is the length of the input sequence and $k$ is the size of each contiguous subsequence we consider. This is because we iterate over $(n - k + 1)$ substrings of length $k$, and at each iteration, the time taken to generate a $k$-length substring is $O(k)$.

The space complexity of this approach is $O((n - k + 1) \times k)$, since in the worst case, our set can contain $(n - k + 1)$ elements, each of which is a newly allocated substring of length $k$.

Optimized approach using sliding window

The problem can be optimized using a sliding window approach. We use the Rabin-Karp algorithm, which utilizes a sliding window with a rolling hash for pattern matching. A rolling hash is used to avoid rehashing the whole string while calculating the hash values of the substrings of a given string.

Here’s the basic idea of the algorithm:

We traverse the string by using a sliding window of length $k$ , which slides one character forward on each iteration.
On each iteration, we compute the hash of the current $k$ -length substring in the window.
We check if the hash is already present in the set.
- If it is, the substring is repeated, so we add it to the output.
- Otherwise, the substring has not yet been repeated, so we add the computed hash to the set.
We repeat the above process for all $k$ -length substrings by sliding the window one character forward on each iteration.
After all $k$ -length substrings have been evaluated, we return the output.

There are multiple approaches for computing hash values, and the choice of the hash function can impact the algorithm’s time complexity. Let’s look at some approaches below.

Hashing and comparison in linear time

Let’s use a simple hashing method that sums the ASCII codes of the characters present in a window.

Consider the sequence $ACTCT$ with $k = 2$ .

Initially, the sequence in the window is $AC$ and its hash value is:
$H(AC) = 65 + 67 = 132$
Since the above hash value has not been repeated yet, we add this hash value to the set and slide the window one character forward.
The sequence in the window is now $CT$ . To compute the hash value of $CT$ , the ASCII of $A$ will be removed from the previous hash value and the ASCII of $T$ will then be added:
$H(CT) = 132 - 65 + 84 = 151$
Since the above hash value has not been repeated yet, we add this hash value to the set and slide the window one character forward.
The sequence in the window is now $TC$. To compute the hash value of $TC$, the ASCII code of the outgoing $C$ is subtracted from the previous hash value and the ASCII code of the incoming $C$ is added:
$H(TC) = 151 - 67 + 67 = 151$
Here, we have the same hash value but different sequences— $CT$ and $TC$ . This means that if a hash value is already present in the set, we need to compare the corresponding sequences as well to confirm if they are identical. In this case, they are not, so we add this hash value to the set and slide the window one character forward.
The sequence in the window is now $CT$. To compute the hash value of $CT$, the ASCII code of the outgoing $T$ is subtracted from the previous hash value and the ASCII code of the incoming $T$ is added:
$H(CT) = 151 - 84 + 84 = 151$
Here we have the same hash value, so we compare the two sequences. Since they are identical, the sequence has been repeated and is therefore added to the output.

Computing the hash value and then comparing the strings if the hashes are equal will take linear time, $O(k)$ . In the worst case, the comparisons will occur after each slide, which will make the running time the same as that of the naive approach, which is $O((n-k+1)\times k)$ .
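The walkthrough above could be implemented roughly as shown below. This is a sketch under assumptions: the function name `find_repeated_sequences_sum_hash` and the choice to keep, for each hash value, the set of substrings already seen with that hash are illustrative and not taken from the lesson.

```python
def find_repeated_sequences_sum_hash(dna, k):
    """Sum-of-ASCII rolling hash, with substring comparison on collisions."""
    n = len(dna)
    if k > n:
        return set()

    seen = {}       # hash value -> substrings already seen with that hash
    output = set()

    # Hash of the first window: sum of the ASCII codes of its characters.
    window_hash = sum(ord(c) for c in dna[:k])

    for i in range(n - k + 1):
        if i > 0:
            # Slide the window: drop the outgoing character, add the incoming one.
            window_hash += ord(dna[i + k - 1]) - ord(dna[i - 1])

        window = dna[i:i + k]
        bucket = seen.setdefault(window_hash, set())
        # Equal hashes are not enough (CT and TC both hash to 151),
        # so the substrings themselves are compared, which costs O(k).
        if window in bucket:
            output.add(window)
        else:
            bucket.add(window)
    return output


print(find_repeated_sequences_sum_hash("ACTCT", 2))  # {'CT'}
```

The hash update itself is constant time, but because the substring must still be built and compared on every slide, the overall running time stays proportional to $(n - k + 1) \times k$.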

Hashing and comparison in constant time

We need a hash function that helps us achieve constant-time hashing. For this purpose, we use the polynomial rolling hash technique:

$H = c_1 a^{k-1} + c_2 a^{k-2} + \ldots + c_i a^{k-i} + \ldots + c_{k-1} a^{1} + c_k a^{0}$

Here, $a$ is a constant, $c_1, \ldots, c_k$ are the characters in a sequence, and $k$ is the substring length. Since we only have $4$ possible nucleotides, our $a$ would be $4$ . We also assign numeric values to the nucleotides, as shown in the table below:
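A minimal Python sketch of this polynomial rolling hash is given here. The numeric values used for the nucleotides (A = 0, C = 1, G = 2, T = 3) are an assumption chosen for illustration and may differ from the lesson's own table; the function name `find_repeated_sequences` and the variable names are likewise hypothetical. With these values, each $k$-length window is read as a $k$-digit number in base $4$, so two different windows never share a hash and no substring comparison is needed.

```python
def find_repeated_sequences(dna, k):
    """Polynomial rolling hash with base a = 4 (assumed nucleotide values)."""
    n = len(dna)
    if k > n:
        return set()

    # Assumed mapping for illustration only; not taken from the lesson.
    value = {"A": 0, "C": 1, "G": 2, "T": 3}
    a = 4
    high = a ** (k - 1)  # weight of the leftmost character, a^(k-1)

    seen = set()    # hash values of windows encountered so far
    output = set()

    # Hash of the first window: c_1*a^(k-1) + ... + c_k*a^0.
    window_hash = 0
    for c in dna[:k]:
        window_hash = window_hash * a + value[c]
    seen.add(window_hash)

    for i in range(1, n - k + 1):
        outgoing = value[dna[i - 1]]
        incoming = value[dna[i + k - 1]]
        # Remove the leftmost term, shift the rest by one power of a,
        # then add the new rightmost character: a constant-time update.
        window_hash = (window_hash - outgoing * high) * a + incoming

        if window_hash in seen:
            output.add(dna[i:i + k])  # slice only when a repeat is found
        else:
            seen.add(window_hash)
    return output


print(find_repeated_sequences("ACTCT", 2))  # {'CT'}
```

For `"ACTCT"` with $k = 2$, the successive hashes under the assumed mapping are $1$, $7$, $13$, and $7$, so the second occurrence of $CT$ is detected from the hash alone.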
