...

/

Problem Challenge 3

Problem Challenge 3

Resources for finding the smallest substring with all character occurrences of a given pattern.

Outer

iteration

Inner

iteration

Line

number

window_start

window_end

matched

substr_start

char_frequency

min_length

-

-

2 - 9

0

-

0

0

{'a': 1, 'b': 1, 'c': 1}

7

1

-

13 - 17

0

0

1

0

{'a': 0, 'b': 1, 'c': 1}

7

2

-

13 - 17

0

1

2

0

{'a': 0, 'b': 0, 'c': 1}

7

3

-

13 - 17

0

2

2

0

{'a': 0, 'b': 0, 'c': 1}

7

4

-

13 - 17

0

3

2

0

{'a': 0, 'b': -1, 'c': 1}

7

5

-

13 - 17

0

4

3

0

{'a': 0, 'b': -1, 'c': 1}

7

5

1

20 - 32

1

4

2

0

{'a': 0, 'b': -1, 'c': 1}

5

6

-

13 - 17

1

5

3

0

{'a': 0, 'b': -1, 'c': 1}

5

6

1

20 - 32

2

5

3

2

{'a': 0, 'b': -1, 'c': 1}

5

6

2

20 - 32

3

5

3

3

{'a': 0, 'b': -1, 'c': 1}

5

6

3

20 - 32

4

5

2

3

{'a': 0, 'b': 1, 'c': 0}

3

Problem statement

Given a string and a pattern, find the smallest substring in the given string which has all the character occurrences of the given pattern.

From the definition of the challenge, we can see that we need to find the same number of occurrences of each character in the string, as they occur in the pattern. For example, if the string to search is “aabdec” and the pattern is “abac”, what matters is not the sequence of the characters in the pattern, rather it’s their frequencies that matter.

So, we’re looking for the smallest substring that contains all the characters in the pattern, the exact number of times that they occur in the given pattern. In this case, we’re looking for two a’s, one b and one c. The smallest substring that meets these criteria is “aabdec”, that is, the entire string. Looking at another example, where the string is “abdbca” and the pattern is “abcb”, the smallest substring that meets the matching criteria is “abdbc”.

Real-world problems

A variation of this problem can be used for pattern matching in plagiarism checkers.

Pattern matching

We are given a set of documents. Each document is submitted by a different individual. However, we suspect that some individuals may have copied from others. Given a plagiarised submitted document, we want to identify the number of documents with which there is a potential match. We have converted each document into a set of tokens based on their content.

The students could have added dummy statements between the copied content to avoid identification. We’ll have to match the tokens of two students while taking into account that there can be dummy tokens that might not match. A potential match can occur if a token being checked in the student’s document results in the subsequence of a token in another document. It is not a guarantee that every match is plagiarized content. In this scenario, we’ll discard the matched tokens that have a length less than two.

Dry-run templates

This is the implementation of the solution provided for the problem posed in the lesson:

Press + to interact
Python 3.5
def find_substring(str1, pattern):
window_start, matched, substr_start = 0, 0, 0
min_length = len(str1) + 1
char_frequency = {}
for chr in pattern:
if chr not in char_frequency:
char_frequency[chr] = 0
char_frequency[chr] += 1
# try to extend the range [window_start, window_end]
for window_end in range(len(str1)):
right_char = str1[window_end]
if right_char in char_frequency:
char_frequency[right_char] -= 1
if char_frequency[right_char] >= 0: # Count every matching of a character
matched += 1
# Shrink the window if we can, finish as soon as we remove a matched character
while matched == len(pattern):
if min_length > window_end - window_start + 1:
min_length = window_end - window_start + 1
substr_start = window_start
left_char = str1[window_start]
window_start += 1
if left_char in char_frequency:
# Note that we could have redundant matching characters, therefore we'll decrement the
# matched count only when a useful occurrence of a matched character is going out of the window
if char_frequency[left_char] == 0:
matched -= 1
char_frequency[left_char] += 1
if min_length > len(str1):
return ""
return str1[substr_start:substr_start + min_length]
def main():
print(find_substring("aabdec", "abc"))
print(find_substring("aabdec", "abac"))
print(find_substring("abdbca", "abc"))
print(find_substring("adcad", "abc"))
main()

Sample input #1

Input string: aabdec
Pattern: abc

Lines 6 - 9 of the solution simply add the frequencies of characters in the given pattern in a dictionary called char_frequency.

So, we’ll start tracking variables in our dry-run table from line 12 onwards.

The “Outer iteration” column in the table indicates the iterations of the second for loop (that starts at line 12) and the “Inner iteration” column indicates the iterations of the nested while loop (that starts at line 20).

Outer

iteration

Inner

iteration

Line

number

window_start

window_end

matched

substr_start

char_frequency

min_length

-

-

2 - 9

0

-

0

0

{'a': 1, 'b': 1, 'c': 1}

7

1

-

13 - 17

0

0

1

0

{'a': 0, 'b': 1, 'c': 1}

7

2

-

13 - 17

0

1

2

0

{'a': 0, 'b': 0, 'c': 1}

7

3

-

13 - 17

0

2

2

0

{'a': 0, 'b': 0, 'c': 1}

7

4

-

13 - 17

0

3

2

0

{'a': 0, 'b': -1, 'c': 1}

7

5

-

13 - 17

0

4

3

0

{'a': 0, 'b': -1, 'c': 1}

7

5

1

20 - 32

1

4

2

0

{'a': 0, 'b': -1, 'c': 1}

5

6

-

13 - 17

1

5

3

0

{'a': 0, 'b': -1, 'c': 1}

5

6

1

20 - 32

2

5

3

2

{'a': 0, 'b': -1, 'c': 1}

5

6

2

20 - 32

3

5

3

3

{'a': 0, 'b': -1, 'c': 1}

5

6

3

20 - 32

4

5

2

3

{'a': 0, 'b': 1, 'c': 0}

3

Sample input #2

Input string: abdbca
Pattern: abc

Outer

iteration

Inner

iteration

Line

number

window_start

window_end

matched

substr_start

char_frequency

min_length

-

-

2 - 9

0

-

0

0

{'a': 1, 'b': 1, 'c': 1}

7

1

-

13 - 17

0

0

1

0

{'a': 0, 'b': 1, 'c': 1}

7

2

-

13 - 17

0

1

2

0

{'a': 0, 'b': 0, 'c': 1}

7

3

-

13 - 17

0

2

2

0

{'a': 0, 'b': 0, 'c': 1}

7

4

-

13 - 17

0

3

2

0

{'a': 0, 'b': -1, 'c': 1}

7

5

-

13 - 17

0

4

3

0

{'a': 0, 'b': -1, 'c': 1}

7

5

1

20 - 32

1

4

2

0

{'a': 0, 'b': -1, 'c': 1}

5

6

-

13 - 17

1

5

3

0

{'a': 0, 'b': -1, 'c': 1}

5

6

1

20 - 32

2

5

3

2

{'a': 0, 'b': -1, 'c': 1}

5

6

2

20 - 32

3

5

3

3

{'a': 0, 'b': -1, 'c': 1}

5

6

3

20 - 32

4

5

2

3

{'a': 0, 'b': 1, 'c': 0}

3

Step-by-step solution construction

The first step in our solution is to calculate and store the frequency of each character in the input pattern in a dictionary. We need these frequencies for matching characters in later steps. Let’s see this in action below:

Press + to interact
Python 3.5
def find_substring(str1, pattern):
print("Input string: ", str1, ", pattern: ", pattern, sep="", end="\n\n")
window_start, matched, substr_start = 0, 0, 0
min_length = len(str1) + 1
char_frequency = {}
i = 1
for chr in pattern:
print("Loop index: ", i)
i+=1
print("Character: ", chr)
if chr not in char_frequency:
char_frequency[chr] = 0
char_frequency[chr] += 1
print("Frequencies:", char_frequency, "\n")
def main():
find_substring("abdbca", "abc")
main()

Above are the target frequencies for each character that we want to match. The next step is updating these frequencies as we traverse over our string. If a character is present in the dictionary, we reduce its frequency by 1. This indicates that we’ve found an occurrence and our target frequency is one less than before. For example, if the frequency of a character b was 33 and we find a match, we’ll decrement it to 22 , hence, indicating that we now need only 22 more b's to match the pattern.

We’ll also use a variable matched that counts every matching of a character. When the frequency of a character becomes 0, that is, we’ve found all the required occurrences of that character, we’ll increment matched by 11 to indicate our progress in the matching process.

This step is shown in the code below:

Press + to interact
Python 3.5
def find_substring(str1, pattern):
print("Input string: ", str1, ", pattern: ", pattern, sep="", end="\n\n")
window_start, matched, substr_start = 0, 0, 0
min_length = len(str1) + 1
char_frequency = {}
for chr in pattern:
if chr not in char_frequency:
char_frequency[chr] = 0
char_frequency[chr] += 1
print("Initial target frequencies:", char_frequency, "\n")
# try to extend the range [window_start, window_end]
for window_end in range(len(str1)):
print("Loop iteration: ", window_end)
right_char = str1[window_end]
print("Current character: ", right_char)
if right_char in char_frequency:
print("Current character is present in the dictionary.")
char_frequency[right_char] -= 1
if char_frequency[right_char] >= 0: # Count every matching of a character
matched += 1
else:
print("Current character is not present in the dictionary.")
print("Updated character frequency targets: ", char_frequency)
print("Matched:", str(matched), "character(s) so far...\n")
def main():
find_substring("abdbca", "abc")
main()

When matched becomes equal to the pattern length, that is, all characters have been matched, we shrink the window from the start. This is to ensure we find the shortest possible substring containing our desired pattern.

A new variable min_length is used to store the minimum length of a substring with all characters of a given pattern. We initialize it with the length of our input string + 1 and then decrease it as our window size changes until we find a substring where all the characters in the given pattern occur the required number of times. We also keep track of our substring’s start index using a new variable substr_start. This will later be used to retrieve the matching part of the string.

Before moving our window forward, we store the character leaving the sliding window in the new variable left_char. If this character is present in our frequency dictionary and if its target frequency has fallen to 00 (meaning that the required number of occurrences of this character had been found), we decrement matched by 11, indicating that we have one less character matched now that this character is falling outside our window. Also, we increase the target frequency of the character by 11, since removing it from our window means that we’re losing a match.

Let’s see how these changes work out in the code below:

Press + to interact
Python 3.5
def find_substring(str1, pattern):
print("Input string: ", str1, ", pattern: ", pattern, sep="", end="\n\n")
window_start, matched, substr_start = 0, 0, 0
min_length = len(str1) + 1
char_frequency = {}
for chr in pattern:
if chr not in char_frequency:
char_frequency[chr] = 0
char_frequency[chr] += 1
print("Initial target frequencies:", char_frequency, "\n")
# try to extend the range [window_start, window_end]
for window_end in range(len(str1)):
print("Loop iteration: ", window_end)
right_char = str1[window_end]
print("Current character: ", right_char)
if right_char in char_frequency:
char_frequency[right_char] -= 1
if char_frequency[right_char] >= 0: # Count every matching of a character
matched += 1
# Shrink the window if we can, finish as soon as we remove a matched character
print("Updated character frequency targets: ", char_frequency)
print("Matched:", str(matched), "character(s) so far...")
while matched == len(pattern):
if min_length > window_end - window_start + 1:
min_length = window_end - window_start + 1
substr_start = window_start
left_char = str1[window_start]
window_start += 1
if left_char in char_frequency:
print("\tMatched: ", matched)
print("\tMatched is equal to pattern length.")
print("\tLeft character: ", left_char)
# Note that we could have redundant matching characters, therefore we'll decrement the
# matched count only when a useful occurrence of a matched character is going out of the window
if char_frequency[left_char] == 0:
matched -=1
print("Updated Matched: ", matched)
char_frequency[left_char] += 1
print("Updated character frequency targets: ", char_frequency)
print("\n")
def main():
find_substring("abdbca", "abc")
main()

Now that we have the starting index and the minimum length of the substring that meets our criteria, the last step is to slice our string and return the matching portion. The chevrons enclose the matching slice. This is shown in the code below:

Press + to interact
Python 3.5
def find_substring(str1, pattern):
print("Input string: ", str1, ", pattern: ", pattern, sep="", end="\n\n")
window_start, matched, substr_start = 0, 0, 0
min_length = len(str1) + 1
char_frequency = {}
for chr in pattern:
if chr not in char_frequency:
char_frequency[chr] = 0
char_frequency[chr] += 1
print("Initial target frequencies:", char_frequency, "\n")
# try to extend the range [window_start, window_end]
for window_end in range(len(str1)):
print("Loop iteration: ", window_end)
right_char = str1[window_end]
print("Current character: ", right_char)
if right_char in char_frequency:
char_frequency[right_char] -= 1
if char_frequency[right_char] >= 0: # Count every matching of a character
matched += 1
# Shrink the window if we can, finish as soon as we remove a matched character
print("Updated character frequency targets: ", char_frequency)
print("Matched:", str(matched), "character(s) so far...")
while matched == len(pattern):
if min_length > window_end - window_start + 1:
min_length = window_end - window_start + 1
substr_start = window_start
left_char = str1[window_start]
window_start += 1
if left_char in char_frequency:
print("\tMatched: ", matched)
print("\tMatched is equal to pattern length.")
print("\tLeft character: ", left_char)
# Note that we could have redundant matching characters, therefore we'll decrement the
# matched count only when a useful occurrence of a matched character is going out of the window
if char_frequency[left_char] == 0:
matched -= 1
print("Updated Matched: ", matched)
char_frequency[left_char] += 1
print("Updated character frequency targets: ", char_frequency)
print("")
if min_length > len(str1):
return ""
print("Matching substring start index: ", substr_start)
print("Length of smallest matching substring: ", min_length)
print("Input string with match: ", str1[:substr_start]+'«'+str1[substr_start:substr_start + min_length] + '»' + str1[substr_start + min_length:])
return str1[substr_start:substr_start + min_length]
def main():
docs = ["aabdec", "aabdec", "abdbca", "adcad"]
patterns = ["abc", "abac", "abc", "abc"]
for i in range(len(docs)):
print("Shortest matching substring: " + str(find_substring(docs[i], patterns[i])))
print(("-"*100), "\n", sep="")
main()