Some Real Grammars

Learn about real grammars in ANTLR g4.

This lesson explores how real-world languages are defined in ANTLR, using regular expressions, SQL, and JSON as examples. These languages are widely used in text processing, databases, and data exchange, so parsing and validating them is a common task. By studying their grammars, you will learn how to define, analyze, and implement structured language rules in ANTLR.

We begin with regular expressions (regex), which are used for pattern matching in text. Understanding their grammar helps in building parsers for text validation and search operations. Next, we study SQL (Structured Query Language), which defines how queries are structured for database operations. Finally, we examine JSON (JavaScript Object Notation), a widely used data exchange format, to see how structured data is parsed and validated.

For each language, we present its ANTLR grammar, break it down into key components, and provide example code to demonstrate its usage. This structured approach ensures a clear understanding of how to define and work with a formal grammar. Let’s start with regular expressions.

Regular expressions

Regular expressions describe string patterns in a concise syntax and are used to find or validate text that matches those patterns.
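As a quick illustration of what pattern matching looks like in practice, the sketch below uses Python's standard re module. The helper name matches is our own, and the patterns are limited to the operators covered in this lesson (alternation, sequencing, repetition, character sets, and grouping).

```python
import re


def matches(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


print(matches("a|b", "b"))        # alternation: either a or b
print(matches("ab*", "abbb"))     # a followed by zero or more b
print(matches("[abc]+", "cab"))   # one or more characters from the set
print(matches("(ab)?c", "c"))     # optional group followed by c
print(matches("a|b", "ab"))       # False: neither alternative covers "ab"
```

Here re.fullmatch is used instead of re.search so that the entire string must match, which is closer to how a grammar validates a complete input.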

Regular expressions grammar example

The following ANTLR grammar defines a simplified regular expression syntax. It covers alternation, concatenation, repetition, character sets, and grouping over alphanumeric characters:

grammar Regex;
// Start rule
expression : alternation;
// Alternation
alternation : concatenation ('|' concatenation)*;
// Concatenation
concatenation : repetition*;
// Repetition
repetition : atom ('*' | '+' | '?')?;
// Atom
atom : CHAR | '[' CHAR* ']' | '(' expression ')';
// Tokens
CHAR : [a-zA-Z0-9];
WS : [ \t\n\r]+ -> skip;
Regex.g4
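To make the rule structure concrete, here is a minimal sketch of a hand-written recursive-descent recognizer in Python that mirrors Regex.g4 one rule per method. It is not the parser ANTLR generates; the names RegexRecognizer, tokenize, and recognizes are hypothetical. One assumption to note: the start rule has no EOF, so an ANTLR-generated parser may stop at the first token it cannot use, whereas this sketch requires the whole input to be consumed.

```python
import string

CHAR = set(string.ascii_letters + string.digits)  # CHAR : [a-zA-Z0-9];
WS = set(" \t\n\r")                                # WS   : [ \t\n\r]+ -> skip;
PUNCT = set("|*+?[]()")                            # literal tokens used by the rules


def tokenize(text):
    """Skip whitespace; reject characters the grammar has no token for."""
    tokens = []
    for ch in text:
        if ch in WS:
            continue
        if ch not in CHAR and ch not in PUNCT:
            raise SyntaxError(f"unexpected character {ch!r}")
        tokens.append(ch)
    return tokens


class RegexRecognizer:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, tok):
        if self.peek() != tok:
            raise SyntaxError(f"expected {tok!r} at token {self.pos}")
        self.pos += 1

    def expression(self):     # expression : alternation;
        self.alternation()

    def alternation(self):    # alternation : concatenation ('|' concatenation)*;
        self.concatenation()
        while self.peek() == "|":
            self.pos += 1
            self.concatenation()

    def concatenation(self):  # concatenation : repetition*;
        # Keep going while the next token can start an atom.
        while self.peek() in CHAR or self.peek() in ("[", "("):
            self.repetition()

    def repetition(self):     # repetition : atom ('*' | '+' | '?')?;
        self.atom()
        if self.peek() in ("*", "+", "?"):
            self.pos += 1

    def atom(self):           # atom : CHAR | '[' CHAR* ']' | '(' expression ')';
        tok = self.peek()
        if tok == "[":
            self.pos += 1
            while self.peek() in CHAR:
                self.pos += 1
            self.expect("]")
        elif tok == "(":
            self.pos += 1
            self.expression()
            self.expect(")")
        elif tok in CHAR:
            self.pos += 1
        else:
            raise SyntaxError(f"unexpected token {tok!r} at token {self.pos}")


def recognizes(text):
    """Return True if the entire input is a valid expression."""
    try:
        parser = RegexRecognizer(text)
        parser.expression()
    except SyntaxError:
        return False
    return parser.pos == len(parser.tokens)


print(recognizes("(a|b)+c"))  # True
print(recognizes("*a"))       # False: a quantifier needs a preceding atom
```

Because each method corresponds to one grammar rule, the call chain expression, alternation, concatenation, repetition, atom also shows operator precedence: rules closer to atom bind more tightly.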

Breakdown of the grammar

In this section, we provide a detailed breakdown of the ANTLR grammar for regular expressions. We analyze its structure, explaining the key components, rules, and how they contribute to parsing and recognizing string patterns.

  1. Start rule (expression): The expression rule is the entry point of the grammar and represents a full regex pattern. It is defined as an alternation, the lowest-precedence operation in regex syntax, which is why alternation sits at the top of the rule hierarchy.

  2. Alternation (alternation): Alternation represents the “or” operation, where a pattern can match one of multiple options. It consists of one or more concatenation expressions, separated by the | symbol. For example, in regex a|b, the alternation allows matching either a or b.

  3. Concatenation (concatenation): Concatenation places elements in sequence without any explicit operator: patterns written next to each other must match one after another (e.g., ab matches a followed by b). The rule consists of zero or more repetition elements, so an empty sequence is also valid. ...