Problem Challenge 3
Resources for finding the smallest substring with all character occurrences of a given pattern.
Outer iteration | Inner iteration | Line number | window_start | window_end | matched | substr_start | char_frequency | min_length |
- | - | 2 - 9 | 0 | - | 0 | 0 | {'a': 1, 'b': 1, 'c': 1} | 7 |
1 | - | 13 - 17 | 0 | 0 | 1 | 0 | {'a': 0, 'b': 1, 'c': 1} | 7 |
2 | - | 13 - 17 | 0 | 1 | 2 | 0 | {'a': 0, 'b': 0, 'c': 1} | 7 |
3 | - | 13 - 17 | 0 | 2 | 2 | 0 | {'a': 0, 'b': 0, 'c': 1} | 7 |
4 | - | 13 - 17 | 0 | 3 | 2 | 0 | {'a': 0, 'b': -1, 'c': 1} | 7 |
5 | - | 13 - 17 | 0 | 4 | 3 | 0 | {'a': 0, 'b': -1, 'c': 1} | 7 |
5 | 1 | 20 - 32 | 1 | 4 | 2 | 0 | {'a': 0, 'b': -1, 'c': 1} | 5 |
6 | - | 13 - 17 | 1 | 5 | 3 | 0 | {'a': 0, 'b': -1, 'c': 1} | 5 |
6 | 1 | 20 - 32 | 2 | 5 | 3 | 2 | {'a': 0, 'b': -1, 'c': 1} | 5 |
6 | 2 | 20 - 32 | 3 | 5 | 3 | 3 | {'a': 0, 'b': -1, 'c': 1} | 5 |
6 | 3 | 20 - 32 | 4 | 5 | 2 | 3 | {'a': 0, 'b': 1, 'c': 0} | 3 |
Problem statement
Given a string and a pattern, find the smallest substring in the given string which has all the character occurrences of the given pattern.
From the definition of the challenge, we can see that we need to find the same number of occurrences of each character in the string, as they occur in the pattern. For example, if the string to search is “aabdec” and the pattern is “abac”, what matters is not the sequence of the characters in the pattern, rather it’s their frequencies that matter.
So, we’re looking for the smallest substring that contains all the characters in the pattern, the exact number of times that they occur in the given pattern. In this case, we’re looking for two a
’s, one b
and one c
. The smallest substring that meets these criteria is “aabdec”, that is, the entire string. Looking at another example, where the string is “abdbca” and the pattern is “abcb”, the smallest substring that meets the matching criteria is “abdbc”.
Real-world problems
A variation of this problem can be used for pattern matching in plagiarism checkers.
Pattern matching
We are given a set of documents. Each document is submitted by a different individual. However, we suspect that some individuals may have copied from others. Given a plagiarised submitted document, we want to identify the number of documents with which there is a potential match. We have converted each document into a set of tokens based on their content.
The students could have added dummy statements between the copied content to avoid identification. We’ll have to match the tokens of two students while taking into account that there can be dummy tokens that might not match. A potential match can occur if a token being checked in the student’s document results in the subsequence of a token in another document. It is not a guarantee that every match is plagiarized content. In this scenario, we’ll discard the matched tokens that have a length less than two.
Dry-run templates
This is the implementation of the solution provided for the problem posed in the lesson:
def find_substring(str1, pattern):window_start, matched, substr_start = 0, 0, 0min_length = len(str1) + 1char_frequency = {}for chr in pattern:if chr not in char_frequency:char_frequency[chr] = 0char_frequency[chr] += 1# try to extend the range [window_start, window_end]for window_end in range(len(str1)):right_char = str1[window_end]if right_char in char_frequency:char_frequency[right_char] -= 1if char_frequency[right_char] >= 0: # Count every matching of a charactermatched += 1# Shrink the window if we can, finish as soon as we remove a matched characterwhile matched == len(pattern):if min_length > window_end - window_start + 1:min_length = window_end - window_start + 1substr_start = window_startleft_char = str1[window_start]window_start += 1if left_char in char_frequency:# Note that we could have redundant matching characters, therefore we'll decrement the# matched count only when a useful occurrence of a matched character is going out of the windowif char_frequency[left_char] == 0:matched -= 1char_frequency[left_char] += 1if min_length > len(str1):return ""return str1[substr_start:substr_start + min_length]def main():print(find_substring("aabdec", "abc"))print(find_substring("aabdec", "abac"))print(find_substring("abdbca", "abc"))print(find_substring("adcad", "abc"))main()
Sample input #1
Input string: aabdec
Pattern: abc
Lines 6 - 9 of the solution simply add the frequencies of characters in the given pattern in a dictionary called char_frequency
.
So, we’ll start tracking variables in our dry-run table from line 12 onwards.
The “Outer iteration” column in the table indicates the iterations of the second for
loop (that starts at line 12) and the “Inner iteration” column indicates the iterations of the nested while
loop (that starts at line 20).
Outer iteration | Inner iteration | Line number | window_start | window_end | matched | substr_start | char_frequency | min_length |
- | - | 2 - 9 | 0 | - | 0 | 0 | {'a': 1, 'b': 1, 'c': 1} | 7 |
1 | - | 13 - 17 | 0 | 0 | 1 | 0 | {'a': 0, 'b': 1, 'c': 1} | 7 |
2 | - | 13 - 17 | 0 | 1 | 2 | 0 | {'a': 0, 'b': 0, 'c': 1} | 7 |
3 | - | 13 - 17 | 0 | 2 | 2 | 0 | {'a': 0, 'b': 0, 'c': 1} | 7 |
4 | - | 13 - 17 | 0 | 3 | 2 | 0 | {'a': 0, 'b': -1, 'c': 1} | 7 |
5 | - | 13 - 17 | 0 | 4 | 3 | 0 | {'a': 0, 'b': -1, 'c': 1} | 7 |
5 | 1 | 20 - 32 | 1 | 4 | 2 | 0 | {'a': 0, 'b': -1, 'c': 1} | 5 |
6 | - | 13 - 17 | 1 | 5 | 3 | 0 | {'a': 0, 'b': -1, 'c': 1} | 5 |
6 | 1 | 20 - 32 | 2 | 5 | 3 | 2 | {'a': 0, 'b': -1, 'c': 1} | 5 |
6 | 2 | 20 - 32 | 3 | 5 | 3 | 3 | {'a': 0, 'b': -1, 'c': 1} | 5 |
6 | 3 | 20 - 32 | 4 | 5 | 2 | 3 | {'a': 0, 'b': 1, 'c': 0} | 3 |
Sample input #2
Input string: abdbca
Pattern: abc
Outer iteration | Inner iteration | Line number | window_start | window_end | matched | substr_start | char_frequency | min_length |
- | - | 2 - 9 | 0 | - | 0 | 0 | {'a': 1, 'b': 1, 'c': 1} | 7 |
1 | - | 13 - 17 | 0 | 0 | 1 | 0 | {'a': 0, 'b': 1, 'c': 1} | 7 |
2 | - | 13 - 17 | 0 | 1 | 2 | 0 | {'a': 0, 'b': 0, 'c': 1} | 7 |
3 | - | 13 - 17 | 0 | 2 | 2 | 0 | {'a': 0, 'b': 0, 'c': 1} | 7 |
4 | - | 13 - 17 | 0 | 3 | 2 | 0 | {'a': 0, 'b': -1, 'c': 1} | 7 |
5 | - | 13 - 17 | 0 | 4 | 3 | 0 | {'a': 0, 'b': -1, 'c': 1} | 7 |
5 | 1 | 20 - 32 | 1 | 4 | 2 | 0 | {'a': 0, 'b': -1, 'c': 1} | 5 |
6 | - | 13 - 17 | 1 | 5 | 3 | 0 | {'a': 0, 'b': -1, 'c': 1} | 5 |
6 | 1 | 20 - 32 | 2 | 5 | 3 | 2 | {'a': 0, 'b': -1, 'c': 1} | 5 |
6 | 2 | 20 - 32 | 3 | 5 | 3 | 3 | {'a': 0, 'b': -1, 'c': 1} | 5 |
6 | 3 | 20 - 32 | 4 | 5 | 2 | 3 | {'a': 0, 'b': 1, 'c': 0} | 3 |
Step-by-step solution construction
The first step in our solution is to calculate and store the frequency of each character in the input pattern in a dictionary. We need these frequencies for matching characters in later steps. Let’s see this in action below:
def find_substring(str1, pattern):print("Input string: ", str1, ", pattern: ", pattern, sep="", end="\n\n")window_start, matched, substr_start = 0, 0, 0min_length = len(str1) + 1char_frequency = {}i = 1for chr in pattern:print("Loop index: ", i)i+=1print("Character: ", chr)if chr not in char_frequency:char_frequency[chr] = 0char_frequency[chr] += 1print("Frequencies:", char_frequency, "\n")def main():find_substring("abdbca", "abc")main()
Above are the target frequencies for each character that we want to match. The next step is updating these frequencies as we traverse over our string. If a character is present in the dictionary, we reduce its frequency by 1. This indicates that we’ve found an occurrence and our target frequency is one less than before. For example, if the frequency of a character b
was and we find a match, we’ll decrement it to , hence, indicating that we now need only more b
's to match the pattern.
We’ll also use a variable matched
that counts every matching of a character. When the frequency of a character becomes 0, that is, we’ve found all the required occurrences of that character, we’ll increment matched
by to indicate our progress in the matching process.
This step is shown in the code below:
def find_substring(str1, pattern):print("Input string: ", str1, ", pattern: ", pattern, sep="", end="\n\n")window_start, matched, substr_start = 0, 0, 0min_length = len(str1) + 1char_frequency = {}for chr in pattern:if chr not in char_frequency:char_frequency[chr] = 0char_frequency[chr] += 1print("Initial target frequencies:", char_frequency, "\n")# try to extend the range [window_start, window_end]for window_end in range(len(str1)):print("Loop iteration: ", window_end)right_char = str1[window_end]print("Current character: ", right_char)if right_char in char_frequency:print("Current character is present in the dictionary.")char_frequency[right_char] -= 1if char_frequency[right_char] >= 0: # Count every matching of a charactermatched += 1else:print("Current character is not present in the dictionary.")print("Updated character frequency targets: ", char_frequency)print("Matched:", str(matched), "character(s) so far...\n")def main():find_substring("abdbca", "abc")main()
When matched
becomes equal to the pattern length, that is, all characters have been matched, we shrink the window from the start. This is to ensure we find the shortest possible substring containing our desired pattern.
A new variable min_length
is used to store the minimum length of a substring with all characters of a given pattern. We initialize it with the length of our input string + 1 and then decrease it as our window size changes until we find a substring where all the characters in the given pattern occur the required number of times. We also keep track of our substring’s start index using a new variable substr_start
. This will later be used to retrieve the matching part of the string.
Before moving our window forward, we store the character leaving the sliding window in the new variable left_char
. If this character is present in our frequency dictionary and if its target frequency has fallen to (meaning that the required number of occurrences of this character had been found), we decrement matched
by , indicating that we have one less character matched now that this character is falling outside our window. Also, we increase the target frequency of the character by , since removing it from our window means that we’re losing a match.
Let’s see how these changes work out in the code below:
def find_substring(str1, pattern):print("Input string: ", str1, ", pattern: ", pattern, sep="", end="\n\n")window_start, matched, substr_start = 0, 0, 0min_length = len(str1) + 1char_frequency = {}for chr in pattern:if chr not in char_frequency:char_frequency[chr] = 0char_frequency[chr] += 1print("Initial target frequencies:", char_frequency, "\n")# try to extend the range [window_start, window_end]for window_end in range(len(str1)):print("Loop iteration: ", window_end)right_char = str1[window_end]print("Current character: ", right_char)if right_char in char_frequency:char_frequency[right_char] -= 1if char_frequency[right_char] >= 0: # Count every matching of a charactermatched += 1# Shrink the window if we can, finish as soon as we remove a matched characterprint("Updated character frequency targets: ", char_frequency)print("Matched:", str(matched), "character(s) so far...")while matched == len(pattern):if min_length > window_end - window_start + 1:min_length = window_end - window_start + 1substr_start = window_startleft_char = str1[window_start]window_start += 1if left_char in char_frequency:print("\tMatched: ", matched)print("\tMatched is equal to pattern length.")print("\tLeft character: ", left_char)# Note that we could have redundant matching characters, therefore we'll decrement the# matched count only when a useful occurrence of a matched character is going out of the windowif char_frequency[left_char] == 0:matched -=1print("Updated Matched: ", matched)char_frequency[left_char] += 1print("Updated character frequency targets: ", char_frequency)print("\n")def main():find_substring("abdbca", "abc")main()
Now that we have the starting index and the minimum length of the substring that meets our criteria, the last step is to slice our string and return the matching portion. The chevrons enclose the matching slice. This is shown in the code below:
def find_substring(str1, pattern):print("Input string: ", str1, ", pattern: ", pattern, sep="", end="\n\n")window_start, matched, substr_start = 0, 0, 0min_length = len(str1) + 1char_frequency = {}for chr in pattern:if chr not in char_frequency:char_frequency[chr] = 0char_frequency[chr] += 1print("Initial target frequencies:", char_frequency, "\n")# try to extend the range [window_start, window_end]for window_end in range(len(str1)):print("Loop iteration: ", window_end)right_char = str1[window_end]print("Current character: ", right_char)if right_char in char_frequency:char_frequency[right_char] -= 1if char_frequency[right_char] >= 0: # Count every matching of a charactermatched += 1# Shrink the window if we can, finish as soon as we remove a matched characterprint("Updated character frequency targets: ", char_frequency)print("Matched:", str(matched), "character(s) so far...")while matched == len(pattern):if min_length > window_end - window_start + 1:min_length = window_end - window_start + 1substr_start = window_startleft_char = str1[window_start]window_start += 1if left_char in char_frequency:print("\tMatched: ", matched)print("\tMatched is equal to pattern length.")print("\tLeft character: ", left_char)# Note that we could have redundant matching characters, therefore we'll decrement the# matched count only when a useful occurrence of a matched character is going out of the windowif char_frequency[left_char] == 0:matched -= 1print("Updated Matched: ", matched)char_frequency[left_char] += 1print("Updated character frequency targets: ", char_frequency)print("")if min_length > len(str1):return ""print("Matching substring start index: ", substr_start)print("Length of smallest matching substring: ", min_length)print("Input string with match: ", str1[:substr_start]+'«'+str1[substr_start:substr_start + min_length] + '»' + str1[substr_start + min_length:])return str1[substr_start:substr_start + min_length]def main():docs = ["aabdec", "aabdec", "abdbca", "adcad"]patterns = ["abc", "abac", "abc", "abc"]for i in range(len(docs)):print("Shortest matching substring: " + str(find_substring(docs[i], patterns[i])))print(("-"*100), "\n", sep="")main()