# Regular Expressions

Learn how to use regular expressions to describe regular languages.

We'll cover the following

Regular expressions are a convenient notation for representing strings that match simple text patterns. The expression $a^*ac^*b^*$ illustrates two of the four regular expression operations, namely concatenation (via juxtaposition) and Kleene star. The others appear in the following recursive definition.

### Formal definition

1. These are regular expressions:

a) $\phi$, representing the empty language/set, $\{ \}$

b) $\lambda$, representing the one-string language/set, $\{\lambda\}$

c) $c$, for each character, $c$ in the alphabet $\Sigma$, representing the language/set $\{c\}$

2. For regular expressions $r, r_1, r_2$, the following are also regular expressions:

a) $r_1 + r_2$ (union)

b) $r_1r_2$ (concatenation)

c) $r^*$ (Kleene star)

d) $(r)$ (grouping)

Note: Some authors (and most programming environments) use the notation $r_1 \mid r_2$ for union.

Each regular expression represents a language, which is the set of all strings matching its pattern. Formally, we denote the language associated with a regular expression, $r$, as $L(r)$, so $L(\phi)=\{ \}$, $L(\lambda)=\{\lambda\}$, and $L(c)=\{c\}$ for each symbol $c$ from the alphabet in use. The following expressions also hold:

• $L(r_1+r_2)=L(r_1)\cup L(r_2)$
• $L(r_1r_2)=L(r_1)L(r_2)$
• $L(r^*)=(L(r))^*$
• $L((r))=L(r)$

The above definition is a constructive definition. For example, using $\Sigma =\{a,b,c\}$, we can construct the regular expression $aa^*c^*b^*$ as:

• $a$, $b$, and $c$ are regular expressions by rule 1–c.
• $a^*$, $b^*$, and $c^*$ are regular expressions by rule 2–c.
• $aa^*c^*b^*$ is a regular expression by rule 2–b (applied three times).

We now take $\Sigma =\{a,b\}$ and construct a regular expression representing the language of $a$’s and $b$’s that contain the substring $aba$. We don’t care what precedes or follows the substring $aba$, so a regular expression for this language is $(a+b)^*aba(a+b)^*$.

The formal construction satisfying the definition is:

• $a$ and $b$ are regular expressions by rule 1–c.
• $aba$ is a regular expression by rule 2–b (applied twice).
• $a+b$ is a regular expression by rule 2–a.
• $(a+b)$ is a regular expression by rule 2–d.
• $(a+b)^*$ is a regular expression by rule 2–c.
• $(a+b)^*aba(a+b)^*$ is a regular expression by rule 2–b (applied twice).

Here are some more examples of regular expressions over the alphabet $\Sigma=\{a,b\}$.

Regular Language Regular Expression
Strings ending in $ab$ $(a+b)^*ab$
Strings that begin and end with different symbols $a(a+b)^*b+b(a+b)^*a$
String of even length $((a+b)(a+b))^*$
Strings with an even number of $a$’s $(ab^*a+b)^*$

The precedence of regular expression operators parallels that of algebra:

1. Grouping is the highest-precedence operator.
2. Followed by Kleene star (like exponentiation).
3. Followed by concatenation (like multiplication).
4. Followed by union (like addition).

Note the subexpression $ab^*a$ in the last example in the above table. We know that two $a$’s must be introduced at a time to have an even number, but they don’t have to be consecutive, so we insert a $b^*$ between the two $a$’s.

Remember: Regular expressions constitute a pattern language for text processing.

While we are accustomed to using exponents as a shorthand for repetition, using variables as exponents is not allowed in regular expressions. For example, $a^3$ as a shorthand for $aaa$ is a regular expression because $aaa$ is a regular expression, but $a^n$ is not a regular expression.

Remember: The expression $a^k$, where $k$ is a constant, is a regular expression; but $a^n$, where $n$ is a variable, is not a regular expression!

Regular expressions also observe certain algebraic rules, such as a distributive law, wherein $aa+ab=a(a+b)$ and $b+ba=b(\lambda+a)$. Order matters here, because what looks like multiplication is actually concatenation, so $a(a+b) \neq (a+b)a$. The following table summarizes common rules for rewriting regular expressions.

Note: The quality of regular expressions is used to represent the equality of the corresponding languages. When we write $r_1 = r_2$, instead of $r_1\equiv r_2$, we mean that $L(r_1) = L(r_2)$.

Rule Description
$r+s=s+r$ Union is commutative
$(r+s)+t=r+(s+t)$ Union is associative
$r+r=r$ Union is idempotent
$r+\phi=r$ Union’s identity element is $phi$
$(rs)t=r(st)$ Concatenation is associative
$r\lambda=\lambda r=r$ Concatenation’s identity element is $\lambda$
$r\phi=\phi r=\phi$ Concatenation’s nullity element is $\phi$
$r(s+t)=rs+rt$ Concatenation over union is distributive
$(r+s)t=rt+rs$ Concatenation over union is distributive
$(r^*)^*=r^*$ Kleene star is idempotent

Apart from the above-mentioned rules, there are some additional rules for the regular expressions mentioned below:

Get hands-on with 1200+ tech skills courses.