Statement

Given a string, s, that represents a DNA subsequence, and a number kk, return all the contiguous subsequences (substrings) of length kk that occur more than once in the string. The order of the returned subsequences does not matter. If no repeated substring is found, the function should return an empty set.

The DNA sequence is composed of a series of nucleotides abbreviated as AA, CC, GG, and TT. For example, ACGAATTCCGACGAATTCCG is a DNA sequence. When studying DNA, it is useful to identify repeated subsequences in it.

Constraints:

  • 11 ≤\leq s.length ≤\leq 10410^4
  • s[i] is either A, C, G, or T.
  • 1≤k≤101 \leq k \leq 10

Solution

So far, you’ve probably brainstormed some approaches on how to solve this problem. Let’s explore some of these approaches and figure out which one to follow while considering time complexity and any implementation constraints.

Naive approach

A naive approach would be to iterate through the input DNA sequence and add all the unique substrings of length kk to a set. If a substring is already present in a set, it is a repeated substring.

Here’s how the algorithm works:

  • We iterate the string using a pointer ii, ranging from 00 to (n−k+1)(n - k + 1). This is the number of kk-length substrings present in the sequence.

  • At each iteration, we generate the current kk-length substring, i.e., s[i]…s[i + k - 1].

  • Next, we check if this substring is already present in the set.

    • If it is, the current substring is a repeated sequence, so we add it to our output.

    • Otherwise, the current substring has not yet been repeated, so we just add it to the set.

  • We repeat the above process for all kk-length substrings.

  • Once all kk-length substrings have been evaluated, we return the output.

The time complexity of this approach is O((n−k)×k)O((n - k) \times k), where nn is the length of the input sequence and kk is the size of each contiguous subsequence we consider. This is because we iterate over (n−k+1)(n - k + 1) substrings of length kk, and at each iteration, the time taken to generate a kk-length substring is O(k)O(k).

The space complexity of this approach is O((n−k)×k)O((n - k) \times k), since in the worst case, our set can contain (n−k+1)(n - k + 1) elements, and at each iteration of the traversal, we are allocating memory to generate a new kk-length substring.

Optimized approach using sliding window

The problem can be optimized using a sliding window approach. We use the Rabin-Karp algorithm that utilizes a sliding window with rolling hashRolling hash is used to prevent rehashing the whole string while calculating hash values of the substrings of a given string. for pattern matching.

Here’s the basic idea of the algorithm:

  • We traverse the string by using a sliding window of length kk, which slides one character forward on each iteration.

  • On each iteration, we compute the hash of the current kk-length substring in the window.

  • We check if the hash is already present in the set.

    • If it is, the substring is repeated, so we add it to the output.

    • Otherwise, the substring has not yet been repeated, so we add the computed hash to the set.

  • We repeat the above process for all kk-length substrings by sliding the window one character forward on each iteration.

  • After all kk-length substrings have been evaluated, we return the output.

There are multiple approaches for computing hash values, and the choice of the hash function can impact the algorithm’s time complexity. Let’s look at some approaches below.

Hashing and comparison in linear time

Let’s use a simple hashing method that sums the ASCII code of characters present in a window.

Consider the sequence ACTCTACTCT with k=2k = 2.

  1. Initially, the sequence in the window is ACAC and its hash value is:

    H(AC)=65+67=132H(AC) = 65 + 67 = 132

    Since the above hash value has not been repeated yet, we add this hash value to the set and slide the window one character forward.

  2. The sequence in the window is now CTCT. To compute the hash value of CTCT, the ASCII of AA will be removed from the previous hash value and the ASCII of TT will then be added:

    H(CT)=132−65+84=151H(CT) = 132 - 65 + 84 = 151

    Since the above hash value has not been repeated yet, we add this hash value to the set and slide the window one character forward.

  3. The sequence in the window is now TCTC. To compute the hash value of TCTC, the ASCII of CC will be removed from the previous hash value and then again added:

    H(TC)=151−67+67=151H(TC) = 151 - 67 + 67 = 151

    Here, we have the same hash value but different sequences—CTCT and TCTC. This means that if a hash value is already present in the set, we need to compare the corresponding sequences as well to confirm if they are identical. In this case, they are not, so we add this hash value to the set and slide the window one character forward.

  4. The sequence in the window is now CTCT. To compute the hash value of CTCT, the ASCII of TT will be removed from the previous hash value and then again added:

    H(CT)=151−84+84=151H(CT) = 151 - 84 + 84 = 151

    Here we have the same hash value, so we compare the two sequences. Since they are identical, the sequence has been repeated and is therefore added to the output.

Computing the hash value and then comparing the strings if the hashes are equal will take linear time, O(k)O(k). In the worst case, the comparisons will occur after each slide, which will make the running time the same as that of the naive approach, which is O((n−k+1)×k)O((n-k+1)\times k).

Hashing and comparison in constant time

We need a hash function that helps us achieve constant-time hashing.For this purpose, we use the polynomial rolling hash technique:

H=c1ak−1+c2ak−2+...+ciak−i+...+ck−1a1+cka0H = c_1 a^{k-1} + c_2 a^{k-2} + ... + c_i a^{k-i} + ...+ c_{k-1} a^{1} + c_ka^{0}

Here, aa is a constant, c1,…,ckc_1, \ldots, c_k are the characters in a sequence, and kk is the substring length. Since we only have 44 possible nucleotides, our aa would be 44. We also assign numeric values to the nucleotides, as shown in the table below:

Level up your interview prep. Join Educative to access 70+ hands-on prep courses.