...

/

Longest Substring with maximum K Distinct Characters (medium)

Longest Substring with maximum K Distinct Characters (medium)

Resources for finding the longest substring with maximum K distinct characters.

Real-world problems

A variant of the “longest substring with maximum K distinct characters” problem comes up while studying DNA samples and detecting viruses.

Detecting viruses

While studying different DNA samples, we observed that a certain virus consists of really long sequences of kk distinct nucleotides. The virus infects a species by embedding itself into the species’s DNA. We are working on devising a test to detect the virus. The idea is to analyze the longest string that consists of, at most, kk nucleotides from a species’s DNA.

We’ll be provided with a string representing a chromosome from the infected DNA and a kk value supplied from a hidden function. Our task will include calculating the longest subsequence from the chromosome string that has kk unique nucleotides.

Here is an illustration to better understand this process:

Consider the above characters as nucleotides in a DNA strand. Since k=3k = 3 the longest substring with three distinct nucleotides is “ffzzeee”.

We will traverse over the given sequence and keep track of the encountered nucleotides. When the number of distinct nucleotides exceeds the given value of kk, we’ll drop the left-most one and move the window forward.

Some questions to think about:

  • How is the window moving forward?
  • Is the window size constant or variable?
  • Is it an individual property we are interested in or an aggregate one?
  • How can we compute that property?

Dry-run templates

This is the implementation of the solution provided for the problem posed in the lesson:

Press + to interact
Python 3.5
def longest_substring_with_k_distinct(str1, k):
window_start = 0
max_length = 0
char_frequency = {}
# in the following loop we'll try to extend the range [window_start, window_end]
for window_end in range(len(str1)):
right_char = str1[window_end]
if right_char not in char_frequency:
char_frequency[right_char] = 0
char_frequency[right_char] += 1
# shrink the sliding window, until we are left with 'k' distinct characters in the char_frequency
while len(char_frequency) > k:
left_char = str1[window_start]
char_frequency[left_char] -= 1
if char_frequency[left_char] == 0:
del char_frequency[left_char]
window_start += 1 # shrink the window
# remember the maximum length so far
max_length = max(max_length, window_end - window_start + 1)
return max_length
def main():
print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 2)))
print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 1)))
print("Length of the longest substring: " + str(longest_substring_with_k_distinct("cbbebi", 3)))
main()

Students may be encouraged to run through the provided solution with the following sample inputs:

Sample Input #1

Input string: araaci
K = 2

Outer

Iteration

Inner

Iteration

Line number

window_start

window_end

char_frequency

Window contents

max_length

-

-

2-4

0

-

{}

-

0

1

-

7-11, 21

0

0

{'a': 1}

a

1

2

-

7-11, 21

0

1

{'a': 1, 'r': 1}

a r

2

3

-

7-11, 21

0

2

{'a': 2, 'r': 1}

a r a

3

4

-

7-11, 21

0

3

{'a': 3, 'r': 1}

a r a a

4

5

-

7-11

0

4

{'c': 1, 'a': 3, 'r': 1}

a r a a c

4

5

1

14 - 19

1

4

{'c': 1, 'a': 2, 'r': 1}

r a a c

4

5

2

14 - 19

2

4

{'c': 1, 'a': 2}

a a c

4

6

-

7-11

2

5

{'i': 1, 'c': 1, 'a': 2}

a a c i

4

6

1

14 - 19

3

5

{'i': 1, 'c': 1, 'a': 1}

a c i

4

6

2

14 - 19

4

5

{'i': 1, 'c': 1}

c i

4

Sample Input #2

Input string: cbbebi
K = 3

Outer

Iteration

Inner

Iteration

Line number

window_start

window_end

char_frequency

Window contents

max_length

-

-

2-4

0

-

{}

-

0

1

-

7-11, 21

0

0

{'c': 1}

c

1

2

-

7-11, 21

0

1

{'c': 1, 'b': 1}

c b

2

3

-

7-11, 21

0

2

{'c': 1, 'b': 2}

c b b

3

4

-

7-11, 21

0

3

{'c': 1, 'b': 2, 'e': 1}

c b b e

4

5

-

7-11, 21

0

4

{'c': 1, 'b': 3, 'e': 1}

c b b e b

5

6

-

7-11

0

5

{'c': 1, 'b': 3, 'i': 1, 'e': 1}

c b b e b i

5

6

1

14-19

1

5

{'b': 3, 'i': 1, 'e': 1}

b b e b i

5

Step-by-step solution construction

We use a sliding window approach with two pointers to set the boundary of our window. Here’s what we’ll do:

  • Store the frequency of our window contents in a dictionary.
  • Shrink our sliding window until we’re left with k distinct characters.
  • Compare the max_length variable with our window size and update it with the maximum of the two.

Find more about a Python dictionary here.

When the distinct characters in our dictionary exceed k, an obvious approach would be deleting the elements from the start until we are left with the required number of distinct characters. Let’s see this in action below:

Press + to interact
Python 3.5
def longest_substring_with_k_distinct(str1, k):
window_start = 0
max_length = 0
char_frequency = {}
# in the following loop we'll try to extend the range [window_start, window_end]
for window_end in range(len(str1)):
right_char = str1[window_end]
if right_char not in char_frequency:
char_frequency[right_char] = 0
char_frequency[right_char] += 1
print("Current dictionary: " + str(char_frequency) + "\n")
# shrink the sliding window, until we are left with 'k' distinct characters in the char_frequency
while len(char_frequency) > k:
left_char = str1[window_start]
print("About to remove \"", left_char, "\" from dictionary: ", sep="")
del char_frequency[left_char]
print("Dictionary after removal: " + str(char_frequency))
window_start += 1 # shrink the window
return max_length
def main():
#print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 2)))
print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 1)))
#print("Length of the longest substring: " + str(longest_substring_with_k_distinct("cbbebi", 3)))
main()

When we run the code above, we find that it throws an error. This is because we can get into situations where we’re trying to delete an element that does not exist in the dictionary. Also, in this approach, we delete elements regardless of their frequency. For example, we might accidentally delete a character that has multiple instances in the given string.

The solution would be to reduce the frequency of the character dropping out of the window by 1 and deleting it from the dictionary only when it becomes 0. This way, we’ll avoid trying to delete a character that’s not present or accidentally deleting a character that has multiple instances in the string. Let’s try out these improvements in the code below:

Press + to interact
Python 3.5
def longest_substring_with_k_distinct(str1, k):
window_start = 0
max_length = 0
char_frequency = {}
# in the following loop we'll try to extend the range [window_start, window_end]
for window_end in range(len(str1)):
right_char = str1[window_end]
if right_char not in char_frequency:
char_frequency[right_char] = 0
char_frequency[right_char] += 1
print("Current dictionary: ", char_frequency, "\n")
# shrink the sliding window, until we are left with 'k' distinct characters in the char_frequency
while len(char_frequency) > k:
left_char = str1[window_start]
char_frequency[left_char] -= 1
print("Updated frequencies as we shrink window: ", char_frequency)
if char_frequency[left_char] == 0:
del char_frequency[left_char]
print("Deleting \"", left_char, "\" since its frequency is 0.", sep="")
print("Updated dictionary: ", char_frequency, "\n")
window_start += 1 # shrink the window
return max_length
def main():
#print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 2)))
print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 1)))
#print("Length of the longest substring: " + str(longest_substring_with_k_distinct("cbbebi", 3)))
main()

Now, let’s take a look at the strings we’re considering in each iteration:

Press + to interact
Python 3.5
def longest_substring_with_k_distinct(str1, k):
window_start = 0
max_length = 0
char_frequency = {}
newstr = ""
dummystr = ""
# in the following loop we'll try to extend the range [window_start, window_end]
for window_end in range(len(str1)):
right_char = str1[window_end]
if right_char not in char_frequency:
char_frequency[right_char] = 0
char_frequency[right_char] += 1
newstr += right_char
dummystr += right_char
print("Current String: ", dummystr)
print("Current dictionary: ", char_frequency, "\n")
# shrink the sliding window, until we are left with 'k' distinct characters in the char_frequency
while len(char_frequency) > k:
left_char = str1[window_start]
char_frequency[left_char] -= 1
print("Updated frequencies as we shrink window: ", char_frequency)
if char_frequency[left_char] == 0:
del char_frequency[left_char]
print("Deleting till \"", left_char, "\" since its frequency is 0.")
updatedstring = newstr[window_start+1:]
print("Updated String: ", updatedstring, "\n")
dummystr = updatedstring
window_start += 1 # shrink the window
return max_length
def main():
print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 2)))
#print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 1)))
#print("Length of the longest substring: " + str(longest_substring_with_k_distinct("cbbebi", 3)))
main()

Lastly, let’s start tracking the length of the longest substring with k distinct characters. We shall use the variable max_length for this purpose. We compare max_length with the size of the current window and update it with the larger of the two values.

Run the code below to see how max_length is updated:

Press + to interact
Python 3.5
def longest_substring_with_k_distinct(str1, k):
window_start = 0
max_length = 0
char_frequency = {}
newstr = ""
dummystr = ""
# in the following loop we'll try to extend the range [window_start, window_end]
for window_end in range(len(str1)):
star = ""
right_char = str1[window_end]
if right_char not in char_frequency:
char_frequency[right_char] = 0
char_frequency[right_char] += 1
newstr += right_char
dummystr += right_char
# shrink the sliding window, until we are left with 'k' distinct characters in the char_frequency
while len(char_frequency) > k:
left_char = str1[window_start]
char_frequency[left_char] -= 1
if char_frequency[left_char] == 0:
del char_frequency[left_char]
updatedstring = newstr[window_start+1:]
dummystr = updatedstring
window_start += 1 # shrink the window
# remember the maximum length so far
if max_length < (window_end - window_start + 1): star = '*'
max_length = max(max_length, window_end-window_start + 1)
print("Current String: ", dummystr + star)
print("Max length: ", max_length, "\n")
return max_length
def main():
print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 2)))
#print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 1)))
#print("Length of the longest substring: " + str(longest_substring_with_k_distinct("cbbebi", 3)))
main()

The above code shows how the maximum substring length is being updated as we iterate over the characters. The code on lines 11 and 29 does not affect the program execution. It simply allows us to detect when a new maximum length has been identified, so that we may indicate this in the output statement on line 31.

Let’s combine the steps above and visualize how the string, dictionary, and maximum length are being updated:

Press + to interact
Python 3.5
def longest_substring_with_k_distinct(str1, k):
print("Input string: ", str1, ", K=", str(k), sep="", end="\n\n")
window_start = 0
max_length = 0
char_frequency = {}
newstr = ""
dummystr = ""
# in the following loop we'll try to extend the range [window_start, window_end]
for window_end in range(len(str1)):
right_char = str1[window_end]
if right_char not in char_frequency:
char_frequency[right_char] = 0
char_frequency[right_char] += 1
newstr += right_char
dummystr += right_char
print("Current String: ", dummystr)
print("Current dictionary: ", char_frequency)
# shrink the sliding window, until we are left with 'k' distinct characters in the char_frequency
while len(char_frequency) > k:
left_char = str1[window_start]
char_frequency[left_char] -= 1
if char_frequency[left_char] == 0:
del char_frequency[left_char]
updatedstring = newstr[window_start+1:]
dummystr = updatedstring
window_start += 1 # shrink the window
# remember the maximum length so far
max_length = max(max_length, window_end-window_start + 1)
print("Max length: ", max_length, "\n")
return max_length
def main():
dna_strings = ["araaci", "araaci", "cbbebi"]
k_criteria = [2, 1, 3]
for i in range(len(dna_strings)):
print("Length of the longest substring: " + str(longest_substring_with_k_distinct(dna_strings[i], k_criteria[i])))
print(("-"*100), "\n", sep="")
# print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 2)))
# print("Length of the longest substring: " + str(longest_substring_with_k_distinct("araaci", 1)))
# print("Length of the longest substring: " + str(longest_substring_with_k_distinct("cbbebi", 3)))
main()