What is the Knuth-Morris-Pratt algorithm?

The Knuth-Morris-Pratt (KMP) algorithm is an algorithm that is used to search for a substring (W), in a given string (S), in $O(m+n)$ time (where $m$ and $n$ are the lengths of W and S).

Key idea

The key idea used to achieve this time complexity is to minimize the amount of backtracking when a character of W does not match with that of S. This can only be done if we know two things:

Whether or not a proper prefix of W occurs more than once in S after at least one character has been correctly found; if it does, it can be skipped when resuming the process of matching after a mismatch.
Length of the proper prefix.

Using an array

Before starting the actual algorithm, a one-dimensional array is initialized with the number of characters that can be skipped after a mismatch. arr[i] represents the number of characters that can be skipped when W[i+1] does not match with a character in S.

In the example above, W[4] (i.e., a) did not match with the character b in S (slide 5), so, the matching was resumed from S[1] instead of S[0] (slide 6). This means that character 1 was skipped; therefore, arr[4-1] = 1.

Failure function

arr is initialized using the failure function $f(i)$ . This function is based on the fact that:

When a mismatch occurs, all the previous characters match correctly; this implies that if a prefix of W occurred in this set of matching characters, then that prefix is also a suffix of W.

In other words, arr[i] will represent the length of the longest proper prefix of W, which is also a proper suffix of W (considering W till the ith index only).

The following is the code used to initialize arr:

m = len(W)
i = 0
j = 1

# No proper prefix for string of length 1:
arr[0] = 0

while j < m:
    if W[i] == W[j]:
        i++
        arr[j] = i
        j++
    # The first character didn't match:
    else if i == 0:
        arr[j] = 0
        j++
    # Mismatch after at least one matching character:
    else:
        i = arr[i - 1]

Algorithm

The following is the KMP algorithm pseudocode used for searching substring W, in string S, in $O(m+n)$ :

# Initialising variables:
i = 0
j = 0
m = len(W)
n = len(S)

# Start search:
while (i < m && j < n){
    # Last character matches -> Substring found:
    if (W[i] == S[j] && i == m - 1):
        return true

    # Character matches:
    else if (W[i] == S[j]):
        i++
        j++
    
    # Character didn't match -> Backtrack:
    else:
        i = arr[i - 1]
        if i == 0:
            j++
}

# Substring not found:
return false

# Initialise array based on the failure function:
def init_arr(w):
    m = len(w)
    i = 0
    j = 1
    # No proper prefix for string of length 1:
    global arr
    arr[0] = 0
    while j < m:
        if w[i] == w[j]:
            i += 1
            arr[j] = i
            j += 1
        # The first character didn't match:
        elif i == 0:
            arr[j] = 0
            j += 1
        # Mismatch after at least one matching character:
        else:
            i = arr[i - 1]
def kmp_search(w, s):
    # Initialise array:
    init_arr(w)
    # Initialising variables:
    i = 0
    j = 0
    m = len(w)
    n = len(s)
    # Start search:
    while i < m and j < n:
        # Last character matches -> Substring found:
        if w[i] == s[j] and i == m - 1:
            return True
        # Character matches:
        elif w[i] == s[j]:
            i += 1
            j += 1
        # Character didn't match -> Backtrack:
        else:
            if i != 0:
                i = arr[i-1] 
            else:
                j += 1
    # Substring not found:
    return False
text = "hayhello"
substring = "hell"
# create array:
arr = [None] * len(substring)
if kmp_search(substring, text):
    print("Substring found!")
else:
    print("Could not find substring.")

Free Resources

What is the Knuth-Morris-Pratt algorithm?

Key idea

Using an array

Failure function

Algorithm

Implementation