What is the CYK algorithm?

The CYK algorithm is a parsing algorithm for context-free grammar. It is used to check if a particular string can be derived from the language generated by a given grammar. It is also called the membership algorithm as it tells whether the given string is a member of the given grammar or not. It was independently developed by three Russian scientists named Cocke, Younger, and Kasami, hence the name CYK!

Basics

In a CYK algorithm, the structure of grammar should be in Chomsky normal form.

In addition, the CYK algorithm uses the dynamic programming or table filling algorithm.

The grammar will be in CNFChomsky Normal Form if each rule has one of the following forms:

$A\rightarrow BC$ (at most two variables on the right-hand side)
$A\rightarrow$ a (a single terminal on the right-hand side)
$S\rightarrowØ$ (null string)

If the given Grammar is not in the CNF form, convert it to the CNF form before applying the CYK Algorithm.

The algorithm

In the CYK algorithm,

Construct a triangular table of the length of your given string.

Each row corresponds to the length of the substrings of the given word ‘w’:
- The bottom row will have strings of length 1.
- The second row from the bottom will have strings of length 2 and so on.
$X(i,i)$ is a set of variables A such that $A \rightarrow w_{i}$ is a production of grammar G, where $w_{i}$ is part of the given string ‘w’ or the whole string.
Compare at most n pairs of previously computed sets. For strings with a length of 2, compare two pairs, and for strings with a length of 3 compare three sets. In this way, the next sets (A) can be computed, which are derived from the given grammar as per the following formula:

$(X(i, i) , X((i+1), j ), (X(i, i+1) , X(i+2, j) ) … (X(i, j-1), Xj)$

In the equation, $X(i, i)$ refers to the set of variables derived from the production rules.

Rule: If the top row consists of the starting variable of the grammar, then the given string can be derived from it.

Example

Let’s look at an example to get a better understanding of the algorithm.

Consider the following CNF grammar $G$ :

$S \rightarrow AB \space | \space BC$
$A \rightarrow BA \space | \space a$
$B \rightarrow CC \space | \space b$
$C \rightarrow AB \space | \space a$

The given word $w=baaba$ .

Step 1: Constructing the triangular table

The first row is computed simply by looking at the grammar to see which production is producing the string of length 1 for the given word ‘w’.
The second row is computed by comparing the two strings that were computed previously. So we can have $'w'$ : $'ba'$ , $'aa'$ , $'ab'$ , and $'ba'$ .
For the third row, there are three possible substrings of length three in the given word, namely $'w'$ : $'baa'$ , $'aab'$ , and $'aba'.$ For the substring $'baa'$ , there are two possibilities: $['b', 'aa']$ and $['ba', 'a'].$

Finding the sets to compute these: $['b','aa'] = \{ B \}\{ B \}$
$['ba', 'a'] = \{ S, A \}$ $\{ A, C\}$

We take the union of these as follows:

\{ B \}\{ B \} U \{ S, A \}\{ A,C \} \{ BB \} U \{ SA, SC, AA, AC \}

\rightarrow \{ BB, SA, SC, AA, AC \},

Then, we look in the grammar rules for populating the table.

We’ll repeat this for all substrings with lengths greater than $3$ .

The illustration below shows how the table is populated.

What is the CYK algorithm?

Basics

The algorithm

Example

Step 1: Constructing the triangular table

Step 2: Populating the table

Step 3: Check the table

Time complexity