content

share

In the theory of computation, a language represents a computational problem. Languages are classified into different classes based on the inherent complexity and resources required to recognize/decide a language. These classes of languages include:

- Regular languages
- Context-free languages
- Turing-decidable languages
- Turing-recognizable languages
- Undecidable languages
- Unrecognizable languages

In this blog, we focus on context-free languages. A **context-free language (CFL)** is a language that can be recognized by context-free grammar (CFG) or non-deterministic push-down automata (NPDA). CFG and NPDA are equivalent in their computational capability and mutually convertible. This blog focuses only on the design of CFGs.

A context-free grammar (CFG) is defined as follows:

$G:(V, \Sigma, P, S),$

where the grammar $G$ is composed of the 4-tuple: $P$ is a set of productions, $V$ is a set of variables, $\Sigma$ is a set of alphabets, and $S$ is a designated start variable so that $S \in V$. Each production (or rule of replacement) is defined in the following format:

$X \rightarrow Y,$

where $X \in V$, and $Y$ can be any one of the following substitutions:

- Sequence of variables concatenated with each other
- Sequence of alphabets concatenated with each other
- Any finite combination of the above two formats that can be repeated if needed
- $\epsilon$ (empty string whose length is $0$)

Let’s design CFG for the following language:

$L_1=\{w \in \{0,1\}^* : \text{ number of 0's in }w \text{ is 2}\}.$

We decomposed the language into the following logical components:

- 0’s can appear anywhere in the string of this language. Their count needs to be restricted to 2.
- 1’s can appear anywhere in the string in any count (including zero number of 1’s).

The first production of the CFG, starting with the designated start symbol $S$ makes sure that two 0’s are part of any valid string of the language. At the same time, the production allows possible substitutions of 1’s before both 0’s, after both 0’s, and between two 0’s. The CFG for $L_1$ is composed of the following productions:

$S \rightarrow X0X0X$

$X \rightarrow 1X$

$X \rightarrow \epsilon.$

Let’s look at the derivation of an example string: $01101$. A **derivation** is a sequence of substitutions of the productions starting from the designated start symbol $S$. The following steps show the derivation of this string using the CFG:

$S \Rightarrow X0X0X \Rightarrow \epsilon 0X0X \Rightarrow \epsilon 01X0X$

$\Rightarrow \epsilon 011X0X \Rightarrow \epsilon 011 \epsilon 0X$

$\Rightarrow \epsilon011 \epsilon 01X \Rightarrow \epsilon011 \epsilon 01 \epsilon \Rightarrow 01101.$

Note:Derivation is a logical sequence of productions of the CFG being applied to derive a given string. There can be more than one possible derivation for a given string if it belongs to the language of the CFG.

The derivation of a string can be visualized as a derivation tree. In a derivation tree, the root rode is the starting symbol of the CFG, i.e., $S$. If $S$ has to be substituted with a production whose length is $k$, then there are $k$ child nodes of $S$. All terminals, including $\epsilon$, become the leaf nodes. All internal nodes of the tree are variables of the CFG. Reading the leaf nodes from left to right outputs the derived string. A derivation tree for the string $01101$ is shown below:

Let’s design CFG for the following language:

$L_2=\{a^nb^m: n \geq m \text{ and } n,m \geq0\}.$

We decompose the language into the following parts:

- We should maintain the order of the alphabet: $a$’s precede $b$’s.
- With every $b$, there must be a preceding $a$.
- When all $b$’s are done, there should be a provision to add more $a$’s if needed.

The first production of the CFG has two variables. The first variable reads any number of $a$’s including zero number of $a$’s. The second variable maintains a balance and order between $a$’s and $b$’s. The CFG of $L_2$ is composed of the following productions:

$S \rightarrow XY$

$X \rightarrow aX | \epsilon$

$Y \rightarrow aYb | \epsilon.$

Note:This CFG is one possible solution for the given language. Think of a variation of the solution such that your CFG uses only one variable to recognize this language.

Let’s look at the derivation of an example string: $aaabb$. The following steps show the derivation of this string using the CFG:

$S \Rightarrow XY \Rightarrow aXY$

$\Rightarrow a\epsilon Y \Rightarrow a\epsilon aYb \Rightarrow a\epsilon aaYbb$

$\Rightarrow a\epsilon aa\epsilon bb \Rightarrow aaabb.$

Derivation tree of $aaabb$ is shown below:

Let’s design CFG for the following language:

$L_3=\{w=w^R: w\in \{a,b\}^*\}.$

The strings that are the same as that of their reversal, i.e., $w=w^R$, are called **palindromes**. Example strings belonging to this language include $aabaa$, $baab$, $b$, etc. We decompose the language into two types of string:

- Strings of even length
- Strings of odd length

For strings of even length, both ends of the string should be reading the same character. The middle variable should be substituted with $\epsilon$ at the end to maintain even length. For strings of odd length, both ends of the string should be reading the same character and the middle variable should be substituted with a character of length 1 at the end to maintain an odd length. The middle character is going to stay common in $w$ and $w^R$. We use the same method as that of even-length strings and, in the end, insert the required character at the midpoint to make it an odd-length string.

In the CFG, the start symbol $S$ will have five possibilities: two will be recursive possibilities to extend even-length strings. The last three possibilities will substitute $a$, $b$, or $\epsilon$ to conclude the derivation.

$S \rightarrow aSa | bSb|a|b|\epsilon.$

Let’s derive an example string $babab$ from this CFG:

$S \Rightarrow bSb \Rightarrow baSab\Rightarrow babab.$

Derivation tree of $babab$ is shown below:

Let’s design CFG for the following language:

$L_4=\{w: w\in \{0,1\}^* \text{ with 0's and 1's appearing alternatively }\}.$

In this language, all those strings are included that don’t contain consecutive 0’s or 1’s. Example strings include $0$, $101$, $0101$, etc. We decompose the language into four types of strings:

- Strings of odd length starting with a 0
- Strings of odd length starting with a 1
- Strings of even length starting with a 0
- Strings of even length starting with a 1

The CFG takes care of all these possibilities. In each possibility, it has to make sure that 0’s and 1’s alternate. Each possibility is represented by a variable as shown in the following CFG:

$S \rightarrow A | B | C | D$

$A \rightarrow 0W | 0$

$W \rightarrow 1A$

$B \rightarrow 1X | 1$

$X \rightarrow 0B$

$C \rightarrow 0Y | \epsilon$

$Y \rightarrow 1C$

$D \rightarrow 1Z | \epsilon$

$Z \rightarrow 0D.$

Let’s derive an example string using this CFG. We derive $10101$. This is an odd-length string starting with a $1$.

$S \Rightarrow B \Rightarrow 1X \Rightarrow 10B$

$\Rightarrow 101X \Rightarrow 1010B \Rightarrow 10101$

Derivation tree of $10101$ is shown below:

Let’s design a CFG for the following language:

$L_5=\{a^ib^jc^k: k=|i-j|, i, j, k \geq 0\}.$

Example strings belonging to this language include $aabc$, $aabbbc$, $abbbcc$, $bc$, etc.

There are two types of strings in this language:

- $a^ib^jc^k: k=i-j$
- $a^ib^jc^k: k=j-i$

The CFG for this language takes care of these possibilities. Each possibility is represented by a variable:

$S \rightarrow X|Y$

$X \rightarrow aXc | W$

$W \rightarrow aWb|\epsilon$

$Y \rightarrow WZ$

$Z \rightarrow bZc|\epsilon.$

Let’s derive an example string $aabbbc$ from this CFG:

$S \Rightarrow Y \Rightarrow WZ \Rightarrow aWbZ \Rightarrow aaWbbZ$

$\Rightarrow aa\epsilon bbZ \Rightarrow aa\epsilon bbbZc \Rightarrow aa\epsilon bbb\epsilon c \Rightarrow aabbbc.$

The derivation tree of $aabbbc$ is shown below:

CFGs recognize context-free languages. Intuitively, all those languages that require a limited amount of memory in the form of a stack can be represented by a CFG. Stack limits the number of simultaneous comparisons due to its operations restricted to push and pop. At one time, two entities can be compared using a stack. One item is pushed temporarily on the stack. At the time of comparison with the next item, it has to be popped from the stack. Such comparisons are represented by CFG productions of the form $P \rightarrow aPb$. It sets the upper limit on the problem-solving capacity of a CFG.

We hope you enjoyed exploring the CFG with different examples. For a deep dive, please consider the following course:

Theory of Computation

Theory of Computation

What are the mathematics behind computers? What are the theoretical foundations of computer languages? Such questions can be explored by understanding formal languages and automata models. In this comprehensive course, you’ll explore formal languages, regular languages, and how to model them with their associated automata models. Next, you’ll cover regular expressions and grammar, and their equivalents. Then, you’ll explore context-free languages (the foundations of programming languages), pushdown automata, and the relationship between them. Toward the end, you’ll learn about recursively enumerable languages, Turing machines with their variations, context-sensitive languages, and unrestricted grammars. By the end of the course, you’ll have gained a deep understanding of several formal languages and their associated automata. Also, since there are plenty of exercises throughout the course, your understanding and problem-solving skills will be thoroughly reinforced.

Free Resources

TRENDING TOPICS