Formal Languages

Get an introduction to the concept of formal languages, grammars, and fundamental operations.

Introduction to formal languages

A formal language is a set of strings formed with symbols from some finite alphabet. When studying formal languages, we are interested in the many ways symbols can be validly combined to form valid strings for a given context. We talk about tokens (or words), sentences (or phrases), which are sequences of tokens, and language structure, which defines how strings may be properly classified. We do not concern ourselves here with the semantics of the words—only the structure of the strings.

Remember: A formal language is a set of strings formed with symbols from some finite alphabet.

To illustrate this, let’s consider the simple alphabet Σ={a,b}\Sigma=\{a,b\}. Strings such as babbab and abbaababbaab are among the infinite number of strings that can be formed with these two symbols. To represent an empty string (which has length 0), we can choose none of the symbols. It can also be represented by the string literal "" in programming languages, and in the theory of computation, we denote it by the lowercase Greek letter lambda: λ\lambda. The infinite set of all possible strings using this alphabet is depicted like this:

{λ,a,b,aa,ab,ba,bb,aaa,aab,aba,abb,baa,bab,bba,bbb,aaaa,...}\{ \lambda,a,b,aa,ab,ba,bb,aaa,aab,aba,abb,baa,bab,bba,bbb,aaaa,...\}

Observe the natural way to enumerate this set, which we call canonical order. Strings are grouped according to length, the groups are arranged in increasing length, and the strings within each group are alphabetized, i.e., they appear in lexicographical order based on the alphabet used (aa comes before bb, etc.).

Arrange the first seven strings in canonical order.

Drag and drop the cards in the blank spaces.

Operations in formal languages

The set of all possible strings is formed by considering all possible ways of concatenating any number of aa’s and bb’s together. The set of all possible concatenations of elements of an arbitrary set of symbols, SS, is called the Kleene star of that set and is denoted by SS^*. Concatenation of strings is a fundamental operation in formal languages. Since a formal language is a set of strings over some alphabet, a language is, therefore, a subset of its alphabet’s Kleene star.

Note: A formal language over the alphabet Σ\Sigma is a subset of Σ\Sigma^*.

Suppose the variable xx represents the string bab\bold{bab} and yy stands for abbaab\bold{abbaab}. The table below uses these variables to illustrate common operations on strings in a formal language.

Name Operation Example
Length x\vert x \vert x=3\vert x \vert = 3
Concatenation xyxy xy=bababbaabxy=bababbaab
Repetition xnx^n x3=babbabbabx^3=babbabbab, x0=λx^0=\lambda
Kleene star xx^* x={λ,bab,x^*=\{\lambda,bab, babbab,...}babbab,...\}
Reversal xRx^R xR=abax^R=aba

Since languages are sets, the usual set operations, such as union, intersection, difference, complement, etc., also apply.

Suppose S={bab,abbaab}S = \{bab, abbaab\}, and we want to form {bab,abbaab}\{bab, abbaab\}^*, which is the Kleene star of the strings denoted by xx and yy above. As we always have the option to choose none, lambda is the first element of a Kleene star. This is followed by all possible concatenations of the original strings in canonical order:

S={λ,bab,abbaab,babbab,abbaabbab,bababbaab,babbabbab,}S^* = \{ \lambda, bab, abbaab, babbab, abbaabbab, bababbaab, babbabbab, \cdots \}

The string babbab precedes abbaababbaab in SS^* because of its length, not because it appears first in the original set (sets, generally, are unordered, but a canonical ordering is by nature an ordered set). We use the alphabet to determine the lexicographical ordering within each length-group.

Generating strings in a language using formal grammar

Consider the language, LeL_e, consisting of all even-length strings over the alphabet Σ={a,b}\Sigma=\{a,b\}:

Le={λ,aa,ab,ba,bb,aaaa,aaab,aaba,aabb,abaa,}L_e=\{ \lambda, aa, ab, ba, bb, aaaa, aaab, aaba, aabb, abaa,\cdots \}

Another way to describe this language is the Kleene star of the strings from the set Se={aa,ab,ba,bb}S_e=\{aa,ab,ba,bb\}. In other words, Le=SeL_e=S_e^*. It is obvious that any string in SeS_e^* has an even length. Take a moment to verify that any even-length string of aa’s and bb’s comes from repeated concatenations of elements of SeS_e.

It is possible to generate all the strings in a formal language by using a set of “substitution rules” (aka “rewrite rules”) known as a formal grammar. A formal grammar consists of substitution rules that stand for substrings that form part of a complete string in a language. A grammar for the language LeL_e is:

Since the variable EE appears first, it is the start variable. The vertical bar is an alternation symbol that separates the various ways to substitute for the variable on the left-hand side, so EE has three substitution rules and OO has two. To generate a string, begin by choosing a right-hand side to replace the start variable, EE, and then continue substituting for variables at will, finally ending by choosing a rule containing no variables (in this case, the rule EλE\to \lambda). We can, for example, use this grammar to derive the string baaabaaa as followsThese intermediate substrings are analogous to phrases of a sentence, so the grammars we study are also known as phrase-structure grammars.:

EbObaEbaaObaaaEbaaaλbaaaE \Rightarrow bO \Rightarrow baE \Rightarrow baaO \Rightarrow baaaE \Rightarrow baaa \lambda \Rightarrow baaa

Since lambda is the empty string, it disappears in concatenation. When all we have left are symbols of the alphabet, we are done generating a string.

Another grammar for LeL_e comes from using the elements of the set SeS_e from the first example, {aa,ab,ba,bb}\{aa,ab,ba,bb\}:

SeaaSeabSebaSebbSeλS_e \to aaS_e \: | \: abS_e \: | \: baS_e \: | \: bbS_e \: | \: \lambda

Using this grammar, we can derive the string baaabaaa as:

SebaSebaaaSebaaaS_e \Rightarrow baS_e \Rightarrow baaaS_e \Rightarrow baaa

Quiz on formal languages

Test your knowledge of formal languages.


Take Σ={a,b}\Sigma =\{a,b\}.

Take two strings xx and yy from Σ\Sigma^* such that x=abbabx = abbab and y=bbabay= bbaba. Which option is the string xyxy?









Question 1 of 40 attempted