Real-world problems

A variant of the “longest substring with maximum K distinct characters” problem comes up while studying DNA samples and detecting viruses.

Detecting viruses

While studying different DNA samples, we observed that a certain virus consists of really long sequences of $k$ distinct nucleotides. The virus infects a species by embedding itself into the species’s DNA. We are working on devising a test to detect the virus. The idea is to analyze the longest string that consists of, at most, $k$ nucleotides from a species’s DNA.

We’ll be provided with a string representing a chromosome from the infected DNA and a $k$ value supplied from a hidden function. Our task will include calculating the longest subsequence from the chromosome string that has $k$ unique nucleotides.

Here is an illustration to better understand this process:

Consider the above characters as nucleotides in a DNA strand. Since $k = 3$ the longest substring with three distinct nucleotides is “ffzzeee”.

We will traverse over the given sequence and keep track of the encountered nucleotides. When the number of distinct nucleotides exceeds the given value of $k$ , we’ll drop the left-most one and move the window forward.

Some questions to think about:

How is the window moving forward?

Is the window size constant or variable?

Is it an individual property we are interested in or an aggregate one?

How can we compute that property?

Press + to interact

Python 3.5

def longest_substring_with_k_distinct(str1, k):
  window_start = 0
  max_length = 0
  char_frequency = {}
  # in the following loop we'll try to extend the range [window_start, window_end]
  for window_end in range(len(str1)):
    right_char = str1[window_end]
    if right_char not in char_frequency:
      char_frequency[right_char] = 0
    char_frequency[right_char] += 1
    # shrink the sliding window, until we are left with 'k' distinct characters in the char_frequency
    while len(char_frequency) > k:
      left_char = str1[window_start]
      char_frequency[left_char] -= 1
      if char_frequency[left_char] == 0:
        del char_frequency[left_char]
      window_start += 1  # shrink the window
    # remember the maximum length so far
    max_length = max(max_length, window_end - window_start + 1)
  return max_length
def main():
  print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 2)))
  print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 1)))
  print("Length of the longest substring: " + str(longest_substring_with_k_distinct("cbbebi", 3)))
main()

Step-by-step solution construction

We use a sliding window approach with two pointers to set the boundary of our window. Here’s what we’ll do:

Store the frequency of our window contents in a dictionary.
Shrink our sliding window until we’re left with k distinct characters.
Compare the max_length variable with our window size and update it with the maximum of the two.

Find more about a Python dictionary here.

When the distinct characters in our dictionary exceed k, an obvious approach would be deleting the elements from the start until we are left with the required number of distinct characters. Let’s see this in action below:

Press + to interact

Python 3.5

def longest_substring_with_k_distinct(str1, k):
  window_start = 0
  max_length = 0
  char_frequency = {}
  # in the following loop we'll try to extend the range [window_start, window_end]
  for window_end in range(len(str1)):
    right_char = str1[window_end]
    if right_char not in char_frequency:
      char_frequency[right_char] = 0
    char_frequency[right_char] += 1
    print("Current dictionary: " + str(char_frequency) + "\n")
    # shrink the sliding window, until we are left with 'k' distinct characters in the char_frequency
    while len(char_frequency) > k:
      left_char = str1[window_start]
      print("About to remove \"", left_char, "\" from dictionary: ", sep="") 
      del char_frequency[left_char]
      print("Dictionary after removal: " + str(char_frequency))
      window_start += 1  # shrink the window
  return max_length
def main():
  #print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 2)))
  print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 1)))
  #print("Length of the longest substring: " + str(longest_substring_with_k_distinct("cbbebi", 3)))
main()

When we run the code above, we find that it throws an error. This is because we can get into situations where we’re trying to delete an element that does not exist in the dictionary. Also, in this approach, we delete elements regardless of their frequency. For example, we might accidentally delete a character that has multiple instances in the given string.

The solution would be to reduce the frequency of the character dropping out of the window by 1 and deleting it from the dictionary only when it becomes 0. This way, we’ll avoid trying to delete a character that’s not present or accidentally deleting a character that has multiple instances in the string. Let’s try out these improvements in the code below:

Press + to interact

Python 3.5

def longest_substring_with_k_distinct(str1, k):
  window_start = 0
  max_length = 0
  char_frequency = {}
  # in the following loop we'll try to extend the range [window_start, window_end]
  for window_end in range(len(str1)):
    right_char = str1[window_end]
    if right_char not in char_frequency:
      char_frequency[right_char] = 0
    char_frequency[right_char] += 1
    print("Current dictionary: ", char_frequency, "\n")
    # shrink the sliding window, until we are left with 'k' distinct characters in the char_frequency
    while len(char_frequency) > k:
      left_char = str1[window_start]
      char_frequency[left_char] -= 1
      print("Updated frequencies as we shrink window: ", char_frequency)
      if char_frequency[left_char] == 0:
        del char_frequency[left_char]
        print("Deleting \"", left_char, "\" since its frequency is 0.", sep="")
      print("Updated dictionary: ", char_frequency, "\n")
      window_start += 1  # shrink the window
  return max_length
def main():
  #print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 2)))
  print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 1)))
  #print("Length of the longest substring: " + str(longest_substring_with_k_distinct("cbbebi", 3)))
main()

Press + to interact

Python 3.5

def longest_substring_with_k_distinct(str1, k):
  window_start = 0
  max_length = 0
  char_frequency = {}
  newstr = ""
  dummystr = ""
  # in the following loop we'll try to extend the range [window_start, window_end]
  for window_end in range(len(str1)):
    right_char = str1[window_end]
    if right_char not in char_frequency:
      char_frequency[right_char] = 0
    char_frequency[right_char] += 1
    newstr += right_char
    dummystr += right_char
    print("Current String: ", dummystr)
    print("Current dictionary: ", char_frequency, "\n")
    # shrink the sliding window, until we are left with 'k' distinct characters in the char_frequency
    while len(char_frequency) > k:
      left_char = str1[window_start]
      char_frequency[left_char] -= 1
      print("Updated frequencies as we shrink window: ", char_frequency)
      if char_frequency[left_char] == 0:
        del char_frequency[left_char]
        print("Deleting till \"", left_char, "\" since its frequency is 0.")
        updatedstring = newstr[window_start+1:]
        print("Updated String: ", updatedstring, "\n")
        dummystr = updatedstring
      window_start += 1  # shrink the window
      
  return max_length
def main():
  print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 2)))
  #print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 1)))
  #print("Length of the longest substring: " + str(longest_substring_with_k_distinct("cbbebi", 3)))
main()

Press + to interact

Python 3.5

def longest_substring_with_k_distinct(str1, k):
  window_start = 0
  max_length = 0
  char_frequency = {}
  newstr = ""
  dummystr = ""
  
  # in the following loop we'll try to extend the range [window_start, window_end]
  for window_end in range(len(str1)):
    star = ""
    right_char = str1[window_end]
    if right_char not in char_frequency:
      char_frequency[right_char] = 0
    char_frequency[right_char] += 1
    newstr += right_char
    dummystr += right_char
    # shrink the sliding window, until we are left with 'k' distinct characters in the char_frequency
    while len(char_frequency) > k:
      left_char = str1[window_start]
      char_frequency[left_char] -= 1
      if char_frequency[left_char] == 0:
        del char_frequency[left_char]
        updatedstring = newstr[window_start+1:]
        dummystr = updatedstring
      window_start += 1  # shrink the window
    # remember the maximum length so far
    if max_length < (window_end - window_start + 1): star = '*' 
    max_length = max(max_length, window_end-window_start + 1)
    print("Current String: ", dummystr + star)
    print("Max length: ", max_length, "\n")
  return max_length
def main():
  print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 2)))
  #print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 1)))
  #print("Length of the longest substring: " + str(longest_substring_with_k_distinct("cbbebi", 3)))
main()

Press + to interact

Python 3.5

def longest_substring_with_k_distinct(str1, k):
  print("Input string: ", str1, ", K=", str(k), sep="", end="\n\n")
  window_start = 0
  max_length = 0
  char_frequency = {}
  newstr = ""
  dummystr = ""
  # in the following loop we'll try to extend the range [window_start, window_end]
  for window_end in range(len(str1)):
    right_char = str1[window_end]
    if right_char not in char_frequency:
      char_frequency[right_char] = 0
    char_frequency[right_char] += 1
    newstr += right_char
    dummystr += right_char
    print("Current String: ", dummystr)
    print("Current dictionary: ", char_frequency)
    # shrink the sliding window, until we are left with 'k' distinct characters in the char_frequency
    while len(char_frequency) > k:
      left_char = str1[window_start]
      char_frequency[left_char] -= 1
      if char_frequency[left_char] == 0:
        del char_frequency[left_char]
        updatedstring = newstr[window_start+1:]
        dummystr = updatedstring
      window_start += 1  # shrink the window
      
    # remember the maximum length so far
    max_length = max(max_length, window_end-window_start + 1)
    print("Max length: ", max_length, "\n") 
  return max_length
def main():
  dna_strings = ["araaci", "araaci", "cbbebi"]
  k_criteria = [2, 1, 3]
  for i in range(len(dna_strings)):
    print("Length of the longest substring: " + str(longest_substring_with_k_distinct(dna_strings[i], k_criteria[i])))
    print(("-"*100), "\n", sep="")
  # print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 2)))
  # print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 1)))
  # print("Length of the longest substring: " + str(longest_substring_with_k_distinct("cbbebi", 3)))
main()

Outer Iteration	Inner Iteration	Line number	window_start	window_end	char_frequency	Window contents	max_length
-	-	2-4	0	-	{}	-	0
1	-	7-11, 21	0	0	{'a': 1}	a	1
2	-	7-11, 21	0	1	{'a': 1, 'r': 1}	a r	2
3	-	7-11, 21	0	2	{'a': 2, 'r': 1}	a r a	3
4	-	7-11, 21	0	3	{'a': 3, 'r': 1}	a r a a	4
5	-	7-11	0	4	{'c': 1, 'a': 3, 'r': 1}	a r a a c	4
5	1	14 - 19	1	4	{'c': 1, 'a': 2, 'r': 1}	r a a c	4
5	2	14 - 19	2	4	{'c': 1, 'a': 2}	a a c	4
6	-	7-11	2	5	{'i': 1, 'c': 1, 'a': 2}	a a c i	4
6	1	14 - 19	3	5	{'i': 1, 'c': 1, 'a': 1}	a c i	4
6	2	14 - 19	4	5	{'i': 1, 'c': 1}	c i	4

Outer Iteration	Inner Iteration	Line number	window_start	window_end	char_frequency	Window contents	max_length
-	-	2-4	0	-	{}	-	0
1	-	7-11, 21	0	0	{'c': 1}	c	1
2	-	7-11, 21	0	1	{'c': 1, 'b': 1}	c b	2
3	-	7-11, 21	0	2	{'c': 1, 'b': 2}	c b b	3
4	-	7-11, 21	0	3	{'c': 1, 'b': 2, 'e': 1}	c b b e	4
5	-	7-11, 21	0	4	{'c': 1, 'b': 3, 'e': 1}	c b b e b	5
6	-	7-11	0	5	{'c': 1, 'b': 3, 'i': 1, 'e': 1}	c b b e b i	5
6	1	14-19	1	5	{'b': 3, 'i': 1, 'e': 1}	b b e b i	5

Pattern: Sliding Window

Pattern: Two Pointers

Pattern: Fast & Slow pointers

Pattern: In-place Reversal of a LinkedList

Pattern: Tree Breadth First Search

Pattern: Tree Depth First Search

Pattern: Modified Binary Search

Pattern: Top 'K' Elements

Pattern: K-way merge

Pattern: Subsets

Pattern : 0/1 Knapsack (Dynamic Programming)

Pattern: Topological Sort

Pattern: Two Heaps

Pattern: Cyclic Sort

Pattern: Bitwise XOR

Pattern: Merge Intervals

Longest Substring with maximum K Distinct Characters (medium)

Real-world problems

Detecting viruses

Dry-run templates

Sample Input #1

Sample Input #2

Step-by-step solution construction