Longest Substring with maximum K Distinct Characters (medium)
Resources for finding the longest substring with maximum K distinct characters.
Real-world problems
A variant of the “longest substring with maximum K distinct characters” problem comes up while studying DNA samples and detecting viruses.
Detecting viruses
While studying different DNA samples, we observed that a certain virus consists of really long sequences of distinct nucleotides. The virus infects a species by embedding itself into the species’s DNA. We are working on devising a test to detect the virus. The idea is to analyze the longest string that consists of, at most, nucleotides from a species’s DNA.
We’ll be provided with a string representing a chromosome from the infected DNA and a value supplied from a hidden function. Our task will include calculating the longest subsequence from the chromosome string that has unique nucleotides.
Here is an illustration to better understand this process:
Consider the above characters as nucleotides in a DNA strand. Since the longest substring with three distinct nucleotides is “ffzzeee”.
We will traverse over the given sequence and keep track of the encountered nucleotides. When the number of distinct nucleotides exceeds the given value of , we’ll drop the left-most one and move the window forward.
Some questions to think about:
- How is the window moving forward?
- Is the window size constant or variable?
- Is it an individual property we are interested in or an aggregate one?
- How can we compute that property?
Dry-run templates
This is the implementation of the solution provided for the problem posed in the lesson:
def longest_substring_with_k_distinct(str1, k):window_start = 0max_length = 0char_frequency = {}# in the following loop we'll try to extend the range [window_start, window_end]for window_end in range(len(str1)):right_char = str1[window_end]if right_char not in char_frequency:char_frequency[right_char] = 0char_frequency[right_char] += 1# shrink the sliding window, until we are left with 'k' distinct characters in the char_frequencywhile len(char_frequency) > k:left_char = str1[window_start]char_frequency[left_char] -= 1if char_frequency[left_char] == 0:del char_frequency[left_char]window_start += 1 # shrink the window# remember the maximum length so farmax_length = max(max_length, window_end - window_start + 1)return max_lengthdef main():print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 2)))print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 1)))print("Length of the longest substring: " + str(longest_substring_with_k_distinct("cbbebi", 3)))main()
Students may be encouraged to run through the provided solution with the following sample inputs:
Sample Input #1
Input string: araaci
K = 2
Outer Iteration | Inner Iteration | Line number | window_start | window_end | char_frequency | Window contents | max_length |
- | - | 2-4 | 0 | - | {} | - | 0 |
1 | - | 7-11, 21 | 0 | 0 | {'a': 1} | a | 1 |
2 | - | 7-11, 21 | 0 | 1 | {'a': 1, 'r': 1} | a r | 2 |
3 | - | 7-11, 21 | 0 | 2 | {'a': 2, 'r': 1} | a r a | 3 |
4 | - | 7-11, 21 | 0 | 3 | {'a': 3, 'r': 1} | a r a a | 4 |
5 | - | 7-11 | 0 | 4 | {'c': 1, 'a': 3, 'r': 1} | a r a a c | 4 |
5 | 1 | 14 - 19 | 1 | 4 | {'c': 1, 'a': 2, 'r': 1} | r a a c | 4 |
5 | 2 | 14 - 19 | 2 | 4 | {'c': 1, 'a': 2} | a a c | 4 |
6 | - | 7-11 | 2 | 5 | {'i': 1, 'c': 1, 'a': 2} | a a c i | 4 |
6 | 1 | 14 - 19 | 3 | 5 | {'i': 1, 'c': 1, 'a': 1} | a c i | 4 |
6 | 2 | 14 - 19 | 4 | 5 | {'i': 1, 'c': 1} | c i | 4 |
Sample Input #2
Input string: cbbebi
K = 3
Outer Iteration | Inner Iteration | Line number | window_start | window_end | char_frequency | Window contents | max_length |
- | - | 2-4 | 0 | - | {} | - | 0 |
1 | - | 7-11, 21 | 0 | 0 | {'c': 1} | c | 1 |
2 | - | 7-11, 21 | 0 | 1 | {'c': 1, 'b': 1} | c b | 2 |
3 | - | 7-11, 21 | 0 | 2 | {'c': 1, 'b': 2} | c b b | 3 |
4 | - | 7-11, 21 | 0 | 3 | {'c': 1, 'b': 2, 'e': 1} | c b b e | 4 |
5 | - | 7-11, 21 | 0 | 4 | {'c': 1, 'b': 3, 'e': 1} | c b b e b | 5 |
6 | - | 7-11 | 0 | 5 | {'c': 1, 'b': 3, 'i': 1, 'e': 1} | c b b e b i | 5 |
6 | 1 | 14-19 | 1 | 5 | {'b': 3, 'i': 1, 'e': 1} | b b e b i | 5 |
Step-by-step solution construction
We use a sliding window approach with two pointers to set the boundary of our window. Here’s what we’ll do:
- Store the frequency of our window contents in a dictionary.
- Shrink our sliding window until we’re left with
k
distinct characters. - Compare the
max_length
variable with our window size and update it with the maximum of the two.
Find more about a Python dictionary here.
When the distinct characters in our dictionary exceed k
, an obvious approach would be deleting the elements from the start until we are left with the required number of distinct characters. Let’s see this in action below:
def longest_substring_with_k_distinct(str1, k):window_start = 0max_length = 0char_frequency = {}# in the following loop we'll try to extend the range [window_start, window_end]for window_end in range(len(str1)):right_char = str1[window_end]if right_char not in char_frequency:char_frequency[right_char] = 0char_frequency[right_char] += 1print("Current dictionary: " + str(char_frequency) + "\n")# shrink the sliding window, until we are left with 'k' distinct characters in the char_frequencywhile len(char_frequency) > k:left_char = str1[window_start]print("About to remove \"", left_char, "\" from dictionary: ", sep="")del char_frequency[left_char]print("Dictionary after removal: " + str(char_frequency))window_start += 1 # shrink the windowreturn max_lengthdef main():#print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 2)))print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 1)))#print("Length of the longest substring: " + str(longest_substring_with_k_distinct("cbbebi", 3)))main()
When we run the code above, we find that it throws an error. This is because we can get into situations where we’re trying to delete an element that does not exist in the dictionary. Also, in this approach, we delete elements regardless of their frequency. For example, we might accidentally delete a character that has multiple instances in the given string.
The solution would be to reduce the frequency of the character dropping out of the window by 1 and deleting it from the dictionary only when it becomes 0. This way, we’ll avoid trying to delete a character that’s not present or accidentally deleting a character that has multiple instances in the string. Let’s try out these improvements in the code below:
def longest_substring_with_k_distinct(str1, k):window_start = 0max_length = 0char_frequency = {}# in the following loop we'll try to extend the range [window_start, window_end]for window_end in range(len(str1)):right_char = str1[window_end]if right_char not in char_frequency:char_frequency[right_char] = 0char_frequency[right_char] += 1print("Current dictionary: ", char_frequency, "\n")# shrink the sliding window, until we are left with 'k' distinct characters in the char_frequencywhile len(char_frequency) > k:left_char = str1[window_start]char_frequency[left_char] -= 1print("Updated frequencies as we shrink window: ", char_frequency)if char_frequency[left_char] == 0:del char_frequency[left_char]print("Deleting \"", left_char, "\" since its frequency is 0.", sep="")print("Updated dictionary: ", char_frequency, "\n")window_start += 1 # shrink the windowreturn max_lengthdef main():#print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 2)))print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 1)))#print("Length of the longest substring: " + str(longest_substring_with_k_distinct("cbbebi", 3)))main()
Now, let’s take a look at the strings we’re considering in each iteration:
def longest_substring_with_k_distinct(str1, k):window_start = 0max_length = 0char_frequency = {}newstr = ""dummystr = ""# in the following loop we'll try to extend the range [window_start, window_end]for window_end in range(len(str1)):right_char = str1[window_end]if right_char not in char_frequency:char_frequency[right_char] = 0char_frequency[right_char] += 1newstr += right_chardummystr += right_charprint("Current String: ", dummystr)print("Current dictionary: ", char_frequency, "\n")# shrink the sliding window, until we are left with 'k' distinct characters in the char_frequencywhile len(char_frequency) > k:left_char = str1[window_start]char_frequency[left_char] -= 1print("Updated frequencies as we shrink window: ", char_frequency)if char_frequency[left_char] == 0:del char_frequency[left_char]print("Deleting till \"", left_char, "\" since its frequency is 0.")updatedstring = newstr[window_start+1:]print("Updated String: ", updatedstring, "\n")dummystr = updatedstringwindow_start += 1 # shrink the windowreturn max_lengthdef main():print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 2)))#print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 1)))#print("Length of the longest substring: " + str(longest_substring_with_k_distinct("cbbebi", 3)))main()
Lastly, let’s start tracking the length of the longest substring with k
distinct characters. We shall use the variable max_length
for this purpose. We compare max_length
with the size of the current window and update it with the larger of the two values.
Run the code below to see how max_length
is updated:
def longest_substring_with_k_distinct(str1, k):window_start = 0max_length = 0char_frequency = {}newstr = ""dummystr = ""# in the following loop we'll try to extend the range [window_start, window_end]for window_end in range(len(str1)):star = ""right_char = str1[window_end]if right_char not in char_frequency:char_frequency[right_char] = 0char_frequency[right_char] += 1newstr += right_chardummystr += right_char# shrink the sliding window, until we are left with 'k' distinct characters in the char_frequencywhile len(char_frequency) > k:left_char = str1[window_start]char_frequency[left_char] -= 1if char_frequency[left_char] == 0:del char_frequency[left_char]updatedstring = newstr[window_start+1:]dummystr = updatedstringwindow_start += 1 # shrink the window# remember the maximum length so farif max_length < (window_end - window_start + 1): star = '*'max_length = max(max_length, window_end-window_start + 1)print("Current String: ", dummystr + star)print("Max length: ", max_length, "\n")return max_lengthdef main():print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 2)))#print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 1)))#print("Length of the longest substring: " + str(longest_substring_with_k_distinct("cbbebi", 3)))main()
The above code shows how the maximum substring length is being updated as we iterate over the characters. The code on lines 11 and 29 does not affect the program execution. It simply allows us to detect when a new maximum length has been identified, so that we may indicate this in the output statement on line 31.
Let’s combine the steps above and visualize how the string, dictionary, and maximum length are being updated:
def longest_substring_with_k_distinct(str1, k):print("Input string: ", str1, ", K=", str(k), sep="", end="\n\n")window_start = 0max_length = 0char_frequency = {}newstr = ""dummystr = ""# in the following loop we'll try to extend the range [window_start, window_end]for window_end in range(len(str1)):right_char = str1[window_end]if right_char not in char_frequency:char_frequency[right_char] = 0char_frequency[right_char] += 1newstr += right_chardummystr += right_charprint("Current String: ", dummystr)print("Current dictionary: ", char_frequency)# shrink the sliding window, until we are left with 'k' distinct characters in the char_frequencywhile len(char_frequency) > k:left_char = str1[window_start]char_frequency[left_char] -= 1if char_frequency[left_char] == 0:del char_frequency[left_char]updatedstring = newstr[window_start+1:]dummystr = updatedstringwindow_start += 1 # shrink the window# remember the maximum length so farmax_length = max(max_length, window_end-window_start + 1)print("Max length: ", max_length, "\n")return max_lengthdef main():dna_strings = ["araaci", "araaci", "cbbebi"]k_criteria = [2, 1, 3]for i in range(len(dna_strings)):print("Length of the longest substring: " + str(longest_substring_with_k_distinct(dna_strings[i], k_criteria[i])))print(("-"*100), "\n", sep="")# print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 2)))# print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 1)))# print("Length of the longest substring: " + str(longest_substring_with_k_distinct("cbbebi", 3)))main()