# Formal Languages

Get an introduction to the concept of formal languages, grammars, and fundamental operations.

## Introduction to formal languages

A **formal language** is a set of strings formed with symbols from some finite alphabet. When studying formal languages, we are interested in the many ways symbols can be validly combined to form valid strings for a given context. We talk about *tokens (*or* words), sentences (*or* phrases)*, which are sequences of tokens, and **language** **structure**, which defines how strings may be properly classified. We do not concern ourselves here with the semantics of the words—only the structure of the strings.

Remember:A formal language is a set of strings formed with symbols from some finite alphabet.

To illustrate this, let’s consider the simple alphabet $\Sigma=\{a,b\}$. Strings such as $bab$ and $abbaab$ are among the infinite number of strings that can be formed with these two symbols. To represent an *empty string* (which has length 0), we can choose *none* of the symbols. It can also be represented by the string literal `""`

in programming languages, and in the theory of computation, we denote it by the lowercase Greek letter lambda: $\lambda$.
The infinite set of all possible strings using this alphabet is depicted like this:

$\{ \lambda,a,b,aa,ab,ba,bb,aaa,aab,aba,abb,baa,bab,bba,bbb,aaaa,...\}$

Observe the natural way to enumerate this set, which we call **canonical order**. Strings are grouped according to length, the groups are arranged in increasing length, and the strings within each group are alphabetized, i.e., they appear in lexicographical order based on the alphabet used ($a$ comes before $b$, etc.).

### Operations in formal languages

The set of all possible strings is formed by considering all possible ways of concatenating any number of $a$’s and $b$’s together. The set of all possible **concatenations** of elements of an arbitrary set of symbols, $S$, is called the **Kleene star** of that set and is denoted by $S^*$. Concatenation of strings is a fundamental operation in formal languages. Since a formal language is a set of strings over some alphabet, a language is, therefore, a subset of its alphabet’s Kleene star.

Note:A formal language over the alphabet $\Sigma$ is a subset of $\Sigma^*$.

Suppose the variable $x$ represents the string $\bold{bab}$ and $y$ stands for $\bold{abbaab}$. The table below uses these variables to illustrate common operations on strings in a formal language.

Name |
Operation |
Example |
---|---|---|

Length | $\vert x \vert$ | $\vert x \vert = 3$ |

Concatenation | $xy$ | $xy=bababbaab$ |

Repetition | $x^n$ | $x^3=babbabbab$, $x^0=\lambda$ |

Kleene star | $x^*$ | $x^*=\{\lambda,bab,$ $babbab,...\}$ |

Reversal | $x^R$ | $x^R=aba$ |

Since languages are sets, the usual set operations, such as union, intersection, difference, complement, etc., also apply.

Suppose $S = \{bab, abbaab\}$, and we want to form $\{bab, abbaab\}^*$, which is the Kleene star of the strings denoted by $x$ and $y$ above. As we always have the option to choose none, lambda is the first element of a Kleene star. This is followed by all possible concatenations of the original strings in canonical order:

$S^* = \{ \lambda, bab, abbaab, babbab, abbaabbab, bababbaab, babbabbab, \cdots \}$

The string $bab$ precedes $abbaab$ in $S^*$ because of its length, not because it appears first in the original set (sets, generally, are unordered, but a canonical ordering is by nature an ordered set). We use the alphabet to determine the lexicographical ordering within each length-group.

### Generating strings in a language using formal grammar

Consider the language, $L_e$, consisting of all even-length strings over the alphabet $\Sigma=\{a,b\}$:

$L_e=\{ \lambda, aa, ab, ba, bb, aaaa, aaab, aaba, aabb, abaa,\cdots \}$

Another way to describe this language is the Kleene star of the strings from the set $S_e=\{aa,ab,ba,bb\}$. In other words, $L_e=S_e^*$. It is obvious that any string in $S_e^*$ has an even length. Take a moment to verify that any even-length string of $a$’s and $b$’s comes from repeated concatenations of elements of $S_e$.

It is possible to generate all the strings in a formal language by using a set of “substitution rules” (aka “rewrite rules”) known as a **formal grammar**. A formal grammar consists of substitution rules that stand for substrings that form part of a complete string in a language. A grammar for the language $L_e$ is:

Since the variable $E$ appears first, it is the **start variable**. The vertical bar is an alternation symbol that separates the various ways to substitute for the variable on the left-hand side, so $E$ has three substitution rules and $O$ has two. To generate a string, begin by choosing a right-hand side to replace the start variable, $E$, and then continue substituting for variables at will, finally ending by choosing a rule containing no variables (in this case, the rule $E\to \lambda$). We can, for example, use this grammar to derive the string $baaa$ as

$E \Rightarrow bO \Rightarrow baE \Rightarrow baaO \Rightarrow baaaE \Rightarrow baaa \lambda \Rightarrow baaa$

Since lambda is the empty string, it disappears in concatenation. When all we have left are symbols of the alphabet, we are done generating a string.

Another grammar for $L_e$ comes from using the elements of the set $S_e$ from the first example, $\{aa,ab,ba,bb\}$:

$S_e \to aaS_e \: | \: abS_e \: | \: baS_e \: | \: bbS_e \: | \: \lambda$

Using this grammar, we can derive the string $baaa$ as:

$S_e \Rightarrow baS_e \Rightarrow baaaS_e \Rightarrow baaa$

## Quiz on formal languages

Test your knowledge of formal languages.

Take $\Sigma =\{a,b\}$.

Take two strings $x$ and $y$ from $\Sigma^*$ such that $x = abbab$ and $y= bbaba$. Which option is the string $xy$?

$bbabaabbab$

$abbabbbaba$

$babbaababb$

$ababbbabba$