Modeling Gherkin

Learn how to write a Gherkin parser by first modeling Gherkin's keywords and language constructs.

Applied techniques: Writing a Gherkin parser

Gherkin is an indentation-based language that allows developers to write software tests in a way that reads like a natural language, such as English or French. We will not be looking to explain how to use Gherkin to run tests but rather explore the structure of the language and write a parser in PHP that will handle it. While we do not want to get into deep discussions on how tests written in Gherkin are eventually used, we need to look at quite a few language examples to get a sense of what we are dealing with before we start writing our parser. Let’s take a look at the following code example from the Gherkin reference:

Feature: Guess the word

  # The first example has two steps
  Scenario: Maker starts a game
    When the Maker starts a game
    Then the Maker waits for a Breaker to join

  # The second example has three steps
  Scenario: Breaker joins a game
    Given the Maker has started a game with the word "silky"
    When the Breaker joins the Maker's game
    Then the Breaker must guess a word with 5 characters

We can already see a few notable things in the code snippet above before we get into our parser implementation. The first thing to note is that a Gherkin file begins with a Feature block, which can contain multiple children. In this example, the feature has two child blocks, and both of them are scenarios.

Gherkin uses a scenario to express a testable behavior within the feature test. Every Gherkin scenario belongs to a feature, and a scenario can have many steps.

Gherkin’s parent-child relationships are indicated by the indentation of the file. In the code snippet above, we have the following program structure:

  1. Feature
    1. Scenario
      1. Step
      2. Step
    2. Scenario
      1. Step
      2. Step
      3. Step
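The outline above maps naturally onto a tree of nodes. The following is a minimal sketch in PHP (8.0 or later is assumed), not the parser we will eventually build. The `GherkinNode` class and the `outline()` helper are hypothetical names introduced only for illustration. Here, the tree is assembled by hand to mirror the structure of the example feature:

```php
<?php
declare(strict_types=1);

// Hypothetical node type used only for this illustration.
final class GherkinNode
{
    /** @var GherkinNode[] */
    public array $children = [];

    public function __construct(
        public string $type,
        public string $text = ''
    ) {
    }

    // Appends a child and returns it so that nested blocks can be built up.
    public function add(GherkinNode $child): GherkinNode
    {
        $this->children[] = $child;

        return $child;
    }
}

// Renders the tree as an indented outline, one node type per line.
function outline(GherkinNode $node, int $depth = 0): string
{
    $out = str_repeat('  ', $depth) . $node->type . "\n";
    foreach ($node->children as $child) {
        $out .= outline($child, $depth + 1);
    }

    return $out;
}

$feature = new GherkinNode('Feature', 'Guess the word');

$first = $feature->add(new GherkinNode('Scenario', 'Maker starts a game'));
$first->add(new GherkinNode('Step', 'When the Maker starts a game'));
$first->add(new GherkinNode('Step', 'Then the Maker waits for a Breaker to join'));

$second = $feature->add(new GherkinNode('Scenario', 'Breaker joins a game'));
$second->add(new GherkinNode('Step', 'Given the Maker has started a game with the word "silky"'));
$second->add(new GherkinNode('Step', "When the Breaker joins the Maker's game"));
$second->add(new GherkinNode('Step', 'Then the Breaker must guess a word with 5 characters'));

echo outline($feature);
```

A single node type with a `type` string keeps the sketch short. A real parser might prefer a dedicated class per block type instead.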

The empty lines on lines 2 and 7 can largely be ignored for our purposes. Lines 3 and 8 contain an interesting construct we need to consider: the Gherkin comment.

The Gherkin comment

Gherkin comments occupy a line of their own and can appear anywhere in the file, with any amount of leading whitespace, but they always start with the # character. These few facts will make it relatively painless for us to parse them later.
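As a rough sketch of how little work comment detection needs, the two helpers below strip the leading whitespace and check for the # character. The function names `isCommentLine()` and `commentText()` are hypothetical, and PHP 8.0 or later is assumed for `str_starts_with()`:

```php
<?php
declare(strict_types=1);

// True when the first non-whitespace character of the line is "#".
function isCommentLine(string $line): bool
{
    return str_starts_with(ltrim($line), '#');
}

// Returns the comment body without the leading "#" and surrounding whitespace.
// Assumes the caller has already checked isCommentLine().
function commentText(string $line): string
{
    return trim(substr(ltrim($line), 1));
}

var_dump(isCommentLine('  # The first example has two steps')); // bool(true)
var_dump(isCommentLine('  Scenario: Maker starts a game'));     // bool(false)
```

Because only the start of the line is inspected, a # that appears later in a step's text is not mistaken for a comment.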

We can update our mental program structure to:

  1. Feature
    1. Comment
    2. Scenario
      1. Step
      2. Step
    3. Comment
    4. Scenario
      1. Step
      2. Step
      3. Step

The next question is whether these structures are identified exclusively by their indentation or whether Gherkin provides a different way to distinguish a scenario from a step. The Gherkin documentation answers this: every non-blank line must begin with a Gherkin keyword. The one exception it calls out is free-form descriptions, which we will get into later.

Gherkin keywords

The list of keywords that we need to be aware of is as follows:

Gherkin Keywords

| Keyword | Description |
| --- | --- |
| Feature: | Provides a high-level description of a testable software feature |
| Rule: | Provides an organizational abstraction around a single business rule to be tested |
| Background: | Provides a way to define steps that apply to all scenarios within a Feature test suite |
| Scenario: | Alias of Example |
| Scenario Outline: | Used to define a Scenario/Example that can be repeated with dynamic values defined in an attached data table |
| Scenario Template: | Alias of Scenario Outline |
| Example: | An example application of a business rule; defines a series of steps that can be used to test a behavior |
| Examples: | A container for a data table that appears below a Scenario Outline |
| Scenarios: | Alias of Examples |
| Given | Identifies the initial state of a software system before a test is executed |
| When | Describes an action or event that takes place within the software system |
| Then | Describes an expected outcome after actions are taken within the software system |
| And | Continues the preceding Given, When, or Then step |
| But | Continues the preceding Given, When, or Then step, typically to express a contrasting condition |
| * | Can be used in place of any of the step keywords |
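To make the keyword list concrete, here is a minimal sketch of a keyword lookup in PHP (8.0 or later is assumed). The constant names and the `identifyKeyword()` function are hypothetical, and the sketch covers only the English keywords listed above. Aliases are mapped to a single canonical name so that later stages only have to deal with one spelling:

```php
<?php
declare(strict_types=1);

// Block keywords mapped to a canonical name so that aliases collapse together.
const BLOCK_KEYWORDS = [
    'Feature:'           => 'Feature',
    'Rule:'              => 'Rule',
    'Background:'        => 'Background',
    'Scenario Outline:'  => 'Scenario Outline',
    'Scenario Template:' => 'Scenario Outline',
    'Scenario:'          => 'Example',
    'Example:'           => 'Example',
    'Examples:'          => 'Examples',
    'Scenarios:'         => 'Examples',
];

// Step keywords; each must be followed by a space and the step text.
const STEP_KEYWORDS = ['Given', 'When', 'Then', 'And', 'But', '*'];

// Returns the kind, keyword, and remaining text, or null when the line does
// not start with a known keyword (for example, a free-form description).
function identifyKeyword(string $line): ?array
{
    $trimmed = ltrim($line);

    foreach (BLOCK_KEYWORDS as $keyword => $canonical) {
        if (str_starts_with($trimmed, $keyword)) {
            return [
                'kind'    => 'block',
                'keyword' => $canonical,
                'text'    => trim(substr($trimmed, strlen($keyword))),
            ];
        }
    }

    foreach (STEP_KEYWORDS as $keyword) {
        if (str_starts_with($trimmed, $keyword . ' ')) {
            return [
                'kind'    => 'step',
                'keyword' => $keyword,
                'text'    => trim(substr($trimmed, strlen($keyword) + 1)),
            ];
        }
    }

    return null;
}

print_r(identifyKeyword('  Scenario: Maker starts a game'));
print_r(identifyKeyword('    When the Maker starts a game'));
```

Requiring the trailing colon on block keywords and a trailing space on step keywords is a design choice of this sketch. It keeps `Example:` from matching `Examples:` and keeps a description that begins with a word such as "Andrew" from being read as an `And` step.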

The Gherkin language constructs

In addition to the keywords, we need to be mindful of the following language constructs:

Gherkin Language Constructs

| Construct | Description |
| --- | --- |
| Free-form description | Optional descriptions that can appear below a Feature, Example, Scenario, Background, Scenario Outline, or Rule block |
| Comment | A non-blank line that will not be parsed as any other Gherkin block type |
| Doc strings | Used to pass large pieces of text to a step definition; can be multiline, and can be defined in two different ways |
| Data tables | Provide a syntax to define tables of data, similar to Markdown |
| Parameters | Placeholders that can be embedded in steps and are dynamically replaced with values from rows defined in a data table |
| Tags | Provide a way to group related Gherkin scenarios |
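Several of these constructs can be recognized from the first characters of a line, much like comments. The sketch below is a simplified illustration in PHP (8.0 or later is assumed), and the function names are hypothetical. The markers it checks come from the Gherkin reference rather than from the tables above: tags start with @, data table rows start with |, and doc strings are delimited by three double quotes or three backticks.

```php
<?php
declare(strict_types=1);

// Classifies a line by its leading marker. Keyword lines and free-form
// descriptions both fall through to "other" in this simplified sketch.
function classifyConstruct(string $line): string
{
    $trimmed = trim($line);
    $backticks = str_repeat('`', 3);

    if ($trimmed === '') {
        return 'blank';
    }
    if (str_starts_with($trimmed, '#')) {
        return 'comment';
    }
    if (str_starts_with($trimmed, '@')) {
        return 'tags';
    }
    if (str_starts_with($trimmed, '|')) {
        return 'table_row';
    }
    if (str_starts_with($trimmed, '"""') || str_starts_with($trimmed, $backticks)) {
        return 'doc_string_delimiter';
    }

    return 'other';
}

// Splits a data table row into trimmed cells.
// Escaped pipes and empty edge cells are not handled here.
function tableCells(string $row): array
{
    $inner = trim(trim($row), '|');

    return array_map('trim', explode('|', $inner));
}

echo classifyConstruct('  @smoke @slow'), "\n";
print_r(tableCells('    | silky | 5 |'));
```

A full parser would also need to track whether it is currently inside a doc string, since lines between the delimiters are plain text and must not be classified at all.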

The number of keywords and constructs can ...