The Knuth-Morris-Pratt (KMP) algorithm is an algorithm that is used to search for a substring (W
), in a given string (S
), in time (where and are the lengths of W
and S
).
The key idea used to achieve this time complexity is to minimize the amount of backtracking when a character of W
does not match with that of S
. This can only be done if we know two things:
W
occurs more than once in S
​after at least one character has been correctly found; if it does, it can be skipped when resuming the process of matching after a mismatch.Before starting the actual algorithm, a one-dimensional array is initialized with the number of characters that can be skipped after a mismatch. arr[i]
represents the number of characters that can be skipped when W[i+1]
does not match with a character in S
.
In the example above, W[4]
(i.e., a
) did not match with the character c
in S
(slide 5), so, the matching was resumed from W[1]
instead of W[0]
(slide 6). This means that character 1 was skipped; therefore, arr[4-1]
= 1.
arr
is initialized using the failure function . This function is based on the fact that:
When a mismatch occurs, all the previous characters match correctly; ​this implies that if a prefix of W
occurred in this set of matching characters, then that prefix is also a suffix of W
.
In other words, arr[i]
will represent the length of the longest proper prefix of W
, which is also a proper suffix of W
(considering W
till the i
th index only).
The following is the code used to initialize arr
:
m = len(W)
i = 0
j = 1
# No proper prefix for string of length 1:
arr[0] = 0
while j < m:
if W[i] == W[j]:
i++
arr[j] = i
j++
# The first character didn't match:
else if i == 0:
arr[j] = 0
j++
# Mismatch after at least one matching character:
else:
i = arr[i - 1]
The following is the KMP algorithm pseudocode used for searching substring W
, in string S
, in :
# Initialising variables:
i = 0
j = 0
m = len(W)
n = len(S)
# Start search:
while (i < m && j < n){
# Last character matches -> Substring found:
if (W[i] == S[j] && i == m - 1):
return true
# Character matches:
else if (W[i] == S[j]):
i++
j++
# Character didn't match -> Backtrack:
else:
i = arr[i - 1]
if i == 0:
j++
}
# Substring not found:
return false
The algorithm above (given in the form of a pseudocode) along with the failure function, has been implemented in Python:
# Initialise array based on the failure function: def init_arr(w): m = len(w) i = 0 j = 1 # No proper prefix for string of length 1: global arr arr[0] = 0 while j < m: if w[i] == w[j]: i += 1 arr[j] = i j += 1 # The first character didn't match: elif i == 0: arr[j] = 0 j += 1 # Mismatch after at least one matching character: else: i = arr[i - 1] def kmp_search(w, s): # Initialise array: init_arr(w) # Initialising variables: i = 0 j = 0 m = len(w) n = len(s) # Start search: while i < m and j < n: # Last character matches -> Substring found: if w[i] == s[j] and i == m - 1: return True # Character matches: elif w[i] == s[j]: i += 1 j += 1 # Character didn't match -> Backtrack: else: if i != 0: i = arr[i-1] else: j += 1 # Substring not found: return False text = "hayhello" substring = "hell" # create array: arr = [None] * len(substring) if kmp_search(substring, text): print("Substring found!") else: print("Could not find substring.")
RELATED TAGS
View all Courses