Problem Challenge 3

Resources for finding the smallest substring with all character occurrences of a given pattern.

Outer iteration	Inner iteration	Line number	window_start	window_end	matched	substr_start	char_frequency	min_length
-	-	2 - 9	0	-	0	0	{'a': 1, 'b': 1, 'c': 1}	7
1	-	13 - 17	0	0	1	0	{'a': 0, 'b': 1, 'c': 1}	7
2	-	13 - 17	0	1	2	0	{'a': 0, 'b': 0, 'c': 1}	7
3	-	13 - 17	0	2	2	0	{'a': 0, 'b': 0, 'c': 1}	7
4	-	13 - 17	0	3	2	0	{'a': 0, 'b': -1, 'c': 1}	7
5	-	13 - 17	0	4	3	0	{'a': 0, 'b': -1, 'c': 1}	7
5	1	20 - 32	1	4	2	0	{'a': 0, 'b': -1, 'c': 1}	5
6	-	13 - 17	1	5	3	0	{'a': 0, 'b': -1, 'c': 1}	5
6	1	20 - 32	2	5	3	2	{'a': 0, 'b': -1, 'c': 1}	5
6	2	20 - 32	3	5	3	3	{'a': 0, 'b': -1, 'c': 1}	5
6	3	20 - 32	4	5	2	3	{'a': 0, 'b': 1, 'c': 0}	3

Problem statement

Given a string and a pattern, find the smallest substring in the given string which has all the character occurrences of the given pattern.

From the definition of the challenge, we can see that we need to find the same number of occurrences of each character in the string, as they occur in the pattern. For example, if the string to search is “aabdec” and the pattern is “abac”, what matters is not the sequence of the characters in the pattern, rather it’s their frequencies that matter.

So, we’re looking for the smallest substring that contains all the characters in the pattern, the exact number of times that they occur in the given pattern. In this case, we’re looking for two a’s, one b and one c. The smallest substring that meets these criteria is “aabdec”, that is, the entire string. Looking at another example, where the string is “abdbca” and the pattern is “abcb”, the smallest substring that meets the matching criteria is “abdbc”.

Real-world problems

A variation of this problem can be used for pattern matching in plagiarism checkers.

Pattern matching

We are given a set of documents. Each document is submitted by a different individual. However, we suspect that some individuals may have copied from others. Given a plagiarised submitted document, we want to identify the number of documents with which there is a potential match. We have converted each document into a set of tokens based on their content.

The students could have added dummy statements between the copied content to avoid identification. We’ll have to match the tokens of two students while taking into account that there can be dummy tokens that might not match. A potential match can occur if a token being checked in the student’s document results in the subsequence of a token in another document. It is not a guarantee that every match is plagiarized content. In this scenario, we’ll discard the matched tokens that have a length less than two.

Press + to interact

Python 3.5

def find_substring(str1, pattern):
  window_start, matched, substr_start = 0, 0, 0
  min_length = len(str1) + 1
  char_frequency = {}
  for chr in pattern:
    if chr not in char_frequency:
      char_frequency[chr] = 0
    char_frequency[chr] += 1
  # try to extend the range [window_start, window_end]
  for window_end in range(len(str1)):
    right_char = str1[window_end]
    if right_char in char_frequency:
      char_frequency[right_char] -= 1
      if char_frequency[right_char] >= 0:  # Count every matching of a character
        matched += 1
    # Shrink the window if we can, finish as soon as we remove a matched character
    while matched == len(pattern):
      if min_length > window_end - window_start + 1:
        min_length = window_end - window_start + 1
        substr_start = window_start
      left_char = str1[window_start]
      window_start += 1
      if left_char in char_frequency:
        # Note that we could have redundant matching characters, therefore we'll decrement the
        # matched count only when a useful occurrence of a matched character is going out of the window
        if char_frequency[left_char] == 0:
          matched -= 1
        char_frequency[left_char] += 1
  if min_length > len(str1):
    return ""
  return str1[substr_start:substr_start + min_length]
def main():
  print(find_substring("aabdec", "abc"))
  print(find_substring("aabdec", "abac"))
  print(find_substring("abdbca", "abc"))
  print(find_substring("adcad", "abc"))
main()

Sample input #1

Input string: aabdec
Pattern: abc

Lines 6 - 9 of the solution simply add the frequencies of characters in the given pattern in a dictionary called char_frequency.

So, we’ll start tracking variables in our dry-run table from line 12 onwards.

The “Outer iteration” column in the table indicates the iterations of the second for loop (that starts at line 12) and the “Inner iteration” column indicates the iterations of the nested while loop (that starts at line 20).

Outer iteration	Inner iteration	Line number	window_start	window_end	matched	substr_start	char_frequency	min_length
-	-	2 - 9	0	-	0	0	{'a': 1, 'b': 1, 'c': 1}	7
1	-	13 - 17	0	0	1	0	{'a': 0, 'b': 1, 'c': 1}	7
2	-	13 - 17	0	1	2	0	{'a': 0, 'b': 0, 'c': 1}	7
3	-	13 - 17	0	2	2	0	{'a': 0, 'b': 0, 'c': 1}	7
4	-	13 - 17	0	3	2	0	{'a': 0, 'b': -1, 'c': 1}	7
5	-	13 - 17	0	4	3	0	{'a': 0, 'b': -1, 'c': 1}	7
5	1	20 - 32	1	4	2	0	{'a': 0, 'b': -1, 'c': 1}	5
6	-	13 - 17	1	5	3	0	{'a': 0, 'b': -1, 'c': 1}	5
6	1	20 - 32	2	5	3	2	{'a': 0, 'b': -1, 'c': 1}	5
6	2	20 - 32	3	5	3	3	{'a': 0, 'b': -1, 'c': 1}	5
6	3	20 - 32	4	5	2	3	{'a': 0, 'b': 1, 'c': 0}	3

Outer iteration	Inner iteration	Line number	window_start	window_end	matched	substr_start	char_frequency	min_length
-	-	2 - 9	0	-	0	0	{'a': 1, 'b': 1, 'c': 1}	7
1	-	13 - 17	0	0	1	0	{'a': 0, 'b': 1, 'c': 1}	7
2	-	13 - 17	0	1	2	0	{'a': 0, 'b': 0, 'c': 1}	7
3	-	13 - 17	0	2	2	0	{'a': 0, 'b': 0, 'c': 1}	7
4	-	13 - 17	0	3	2	0	{'a': 0, 'b': -1, 'c': 1}	7
5	-	13 - 17	0	4	3	0	{'a': 0, 'b': -1, 'c': 1}	7
5	1	20 - 32	1	4	2	0	{'a': 0, 'b': -1, 'c': 1}	5
6	-	13 - 17	1	5	3	0	{'a': 0, 'b': -1, 'c': 1}	5
6	1	20 - 32	2	5	3	2	{'a': 0, 'b': -1, 'c': 1}	5
6	2	20 - 32	3	5	3	3	{'a': 0, 'b': -1, 'c': 1}	5
6	3	20 - 32	4	5	2	3	{'a': 0, 'b': 1, 'c': 0}	3

Press + to interact

Python 3.5

def find_substring(str1, pattern):
  print("Input string: ", str1, ", pattern: ", pattern, sep="", end="\n\n")
  window_start, matched, substr_start = 0, 0, 0
  min_length = len(str1) + 1
  char_frequency = {}
  i = 1
  for chr in pattern:
    print("Loop index: ", i)
    i+=1
    print("Character: ", chr)
    if chr not in char_frequency:
      char_frequency[chr] = 0
    char_frequency[chr] += 1
    print("Frequencies:", char_frequency, "\n")
def main():
  find_substring("abdbca", "abc")
main()

Above are the target frequencies for each character that we want to match. The next step is updating these frequencies as we traverse over our string. If a character is present in the dictionary, we reduce its frequency by 1. This indicates that we’ve found an occurrence and our target frequency is one less than before. For example, if the frequency of a character b was $3$ and we find a match, we’ll decrement it to $2$ , hence, indicating that we now need only $2$ more b's to match the pattern.

We’ll also use a variable matched that counts every matching of a character. When the frequency of a character becomes 0, that is, we’ve found all the required occurrences of that character, we’ll increment matched by $1$ to indicate our progress in the matching process.

This step is shown in the code below:

Press + to interact

Python 3.5

def find_substring(str1, pattern):
  print("Input string: ", str1, ", pattern: ", pattern, sep="", end="\n\n")
  window_start, matched, substr_start = 0, 0, 0
  min_length = len(str1) + 1
  char_frequency = {}
  for chr in pattern:
    if chr not in char_frequency:
      char_frequency[chr] = 0
    char_frequency[chr] += 1
  print("Initial target frequencies:", char_frequency, "\n")
  
  # try to extend the range [window_start, window_end]
  for window_end in range(len(str1)):
    print("Loop iteration: ", window_end)
    right_char = str1[window_end]
    print("Current character: ", right_char)
    if right_char in char_frequency:
      print("Current character is present in the dictionary.")
      char_frequency[right_char] -= 1
      if char_frequency[right_char] >= 0:  # Count every matching of a character
        matched += 1
    else:
      print("Current character is not present in the dictionary.")
    print("Updated character frequency targets: ", char_frequency)
    print("Matched:", str(matched), "character(s) so far...\n")
def main():
  find_substring("abdbca", "abc")
main()

When matched becomes equal to the pattern length, that is, all characters have been matched, we shrink the window from the start. This is to ensure we find the shortest possible substring containing our desired pattern.

A new variable min_length is used to store the minimum length of a substring with all characters of a given pattern. We initialize it with the length of our input string + 1 and then decrease it as our window size changes until we find a substring where all the characters in the given pattern occur the required number of times. We also keep track of our substring’s start index using a new variable substr_start. This will later be used to retrieve the matching part of the string.

Before moving our window forward, we store the character leaving the sliding window in the new variable left_char. If this character is present in our frequency dictionary and if its target frequency has fallen to $0$ (meaning that the required number of occurrences of this character had been found), we decrement matched by $1$ , indicating that we have one less character matched now that this character is falling outside our window. Also, we increase the target frequency of the character by $1$ , since removing it from our window means that we’re losing a match.

Let’s see how these changes work out in the code below:

Press + to interact

Python 3.5

def find_substring(str1, pattern):
  print("Input string: ", str1, ", pattern: ", pattern, sep="", end="\n\n")
  window_start, matched, substr_start = 0, 0, 0
  min_length = len(str1) + 1
  char_frequency = {}
  for chr in pattern:
    if chr not in char_frequency:
      char_frequency[chr] = 0
    char_frequency[chr] += 1
  print("Initial target frequencies:", char_frequency, "\n")
  # try to extend the range [window_start, window_end]
  for window_end in range(len(str1)):
    print("Loop iteration: ", window_end)
    right_char = str1[window_end]
    print("Current character: ", right_char)
    if right_char in char_frequency:
      char_frequency[right_char] -= 1
      if char_frequency[right_char] >= 0:  # Count every matching of a character
        matched += 1
    # Shrink the window if we can, finish as soon as we remove a matched character
    print("Updated character frequency targets: ", char_frequency)
    print("Matched:", str(matched), "character(s) so far...")
    while matched == len(pattern):
      if min_length > window_end - window_start + 1:
        min_length = window_end - window_start + 1
        substr_start = window_start
      left_char = str1[window_start]
      window_start += 1
      if left_char in char_frequency:
        print("\tMatched: ", matched)
        print("\tMatched is equal to pattern length.")
        print("\tLeft character: ", left_char)
        # Note that we could have redundant matching characters, therefore we'll decrement the
        # matched count only when a useful occurrence of a matched character is going out of the window
        if char_frequency[left_char] == 0:
          matched -=1
          print("Updated Matched: ", matched)
        char_frequency[left_char] += 1
        print("Updated character frequency targets: ", char_frequency)
    print("\n")
def main():
  find_substring("abdbca", "abc")
main()

Press + to interact

Python 3.5

def find_substring(str1, pattern):
  print("Input string: ", str1, ", pattern: ", pattern, sep="", end="\n\n")
  window_start, matched, substr_start = 0, 0, 0
  min_length = len(str1) + 1
  char_frequency = {}
  for chr in pattern:
    if chr not in char_frequency:
      char_frequency[chr] = 0
    char_frequency[chr] += 1
  print("Initial target frequencies:", char_frequency, "\n")
  # try to extend the range [window_start, window_end]
  for window_end in range(len(str1)):
    print("Loop iteration: ", window_end)
    right_char = str1[window_end]
    print("Current character: ", right_char)
    if right_char in char_frequency:
      char_frequency[right_char] -= 1
      if char_frequency[right_char] >= 0:  # Count every matching of a character
        matched += 1
    # Shrink the window if we can, finish as soon as we remove a matched character
    print("Updated character frequency targets: ", char_frequency)
    print("Matched:", str(matched), "character(s) so far...")
    while matched == len(pattern):
      if min_length > window_end - window_start + 1:
        min_length = window_end - window_start + 1
        substr_start = window_start
      left_char = str1[window_start]
      window_start += 1
      if left_char in char_frequency:
        print("\tMatched: ", matched)
        print("\tMatched is equal to pattern length.")
        print("\tLeft character: ", left_char)
        # Note that we could have redundant matching characters, therefore we'll decrement the
        # matched count only when a useful occurrence of a matched character is going out of the window
        if char_frequency[left_char] == 0:
          matched -= 1
          print("Updated Matched: ", matched)
        char_frequency[left_char] += 1
        print("Updated character frequency targets: ", char_frequency)
    print("")
  if min_length > len(str1):
    return ""
  print("Matching substring start index: ", substr_start)
  print("Length of smallest matching substring: ", min_length)
  print("Input string with match: ", str1[:substr_start]+'«'+str1[substr_start:substr_start + min_length] + '»' + str1[substr_start + min_length:])
  return str1[substr_start:substr_start + min_length]
def main():
  docs = ["aabdec", "aabdec", "abdbca", "adcad"]
  patterns = ["abc", "abac", "abc", "abc"]
  for i in range(len(docs)):
    print("Shortest matching substring: " + str(find_substring(docs[i], patterns[i])))
    print(("-"*100), "\n", sep="")
main()

Pattern: Sliding Window

Pattern: Two Pointers

Pattern: Fast & Slow pointers

Pattern: In-place Reversal of a LinkedList

Pattern: Tree Breadth First Search

Pattern: Tree Depth First Search

Pattern: Modified Binary Search

Pattern: Top 'K' Elements

Pattern: K-way merge

Pattern: Subsets

Pattern : 0/1 Knapsack (Dynamic Programming)

Pattern: Topological Sort

Pattern: Two Heaps

Pattern: Cyclic Sort

Pattern: Bitwise XOR

Pattern: Merge Intervals

Problem Challenge 3

Problem statement

Real-world problems

Pattern matching

Dry-run templates

Sample input #1

Sample input #2

Step-by-step solution construction