A Simple PHP Tokenizer

Explore how to tokenize PHP code by combining PHP's built-in tokenizer with Laravel's collection features. Learn to implement cursors such as VariableCursor, NumberCursor, and OpenTagCursor that iterate over a string and parse PHP variables, numbers, and opening tags. This lesson guides you through building a simple lexer that produces token arrays similar to those returned by PHP's token_get_all function, applying the string-splitting techniques covered so far in Laravel.

Tokenizing PHP code

Consider the example below. It passes a string of PHP code to PHP’s token_get_all function, which returns the tokens that PHP’s tokenizer finds in the input. We then use Laravel’s collection features to attach a friendlier name to each of the returned array tokens:

PHP
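<?php
// Sample source to tokenize; the nowdoc keeps $value and $total literal.
$code = <<<'PHP'
<?php
$value = 5 + (532_323) - $total;
PHP;

// collect() is Laravel's collection helper.
$tokens = collect(token_get_all($code))->map(function ($token) {
    // Single characters such as "=" are plain strings; other tokens are arrays.
    if (is_array($token)) {
        $token['name'] = token_name($token[0]);
    }

    return $token;
})->all();

print_r($tokens);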

Printing the result with print_r produces output like the following (the numeric token identifiers vary between PHP versions):

PHP
Array
(
[0] => Array
(
[0] => 389
[1] => <?php

[2] => 1
[name] => T_OPEN_TAG
)
[1] => Array
(
[0] => 266
[1] => $value
[2] => 2
[name] => T_VARIABLE
)
[2] => Array
(
[0] => 392
[1] =>
[2] => 2
[name] => T_WHITESPACE
)
[3] => =
[4] => Array
(
[0] => 392
[1] =>
[2] => 2
[name] => T_WHITESPACE
)
[5] => Array
(
[0] => 260
[1] => 5
[2] => 2
[name] => T_LNUMBER
)
[6] => Array
(
[0] => 392
[1] =>
[2] => 2
[name] => T_WHITESPACE
)
[7] => +
[8] => Array
(
[0] => 392
[1] =>
[2] => 2
[name] => T_WHITESPACE
)
[9] => (
[10] => Array
(
[0] => 260
[1] => 532_323
[2] => 2
[name] => T_LNUMBER
)
[11] => )
[12] => Array
(
[0] => 392
[1] =>
[2] => 2
[name] => T_WHITESPACE
)
[13] => -
[14] => Array
(
[0] => 392
[1] =>
[2] => 2
[name] => T_WHITESPACE
)
[15] => Array
(
[0] => 266
[1] => $total
[2] => 2
[name] => T_VARIABLE
)
[16] => ;
)
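The output above mixes two shapes: single characters such as = and ( come back as bare strings, while every other token is an array. As a rough sketch of one way to smooth that over, the hypothetical helper below (normalizeToken is our own name, not part of PHP or Laravel, and the id/name/text/line keys are our own choice) converts every token to the same associative shape. It uses plain array_map instead of a collection so it runs outside Laravel, and it assumes PHP 7.4 or later so that 532_323 is read as a single number.

```php
<?php
// Hypothetical helper (not from the lesson): give every token the same shape.
function normalizeToken($token): array
{
    if (is_string($token)) {
        // Single-character tokens such as "=" or ";" carry no id or line number.
        return ['id' => null, 'name' => null, 'text' => $token, 'line' => null];
    }

    // Array tokens are always [identifier, matched text, line number].
    [$id, $text, $line] = $token;

    return ['id' => $id, 'name' => token_name($id), 'text' => $text, 'line' => $line];
}

// Same input as the lesson's example, written as a double-quoted string.
$code = "<?php\n\$value = 5 + (532_323) - \$total;";
$tokens = array_map('normalizeToken', token_get_all($code));

// Print one line per token: the name (or "char") and the JSON-escaped text.
foreach ($tokens as $token) {
    printf("%-14s %s\n", $token['name'] ?? 'char', json_encode($token['text']));
}
```

With a uniform shape, later code can read $token['text'] or $token['line'] without first checking whether the token is a string or an array.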

Each element in the resulting array corresponds to a piece of the input text. For instance, the second element corresponds to the $value variable on line 2. In each nested array, the first value is the token identifier, the second value holds the contents of the match, and the ...