Lexer
TLDR
Run and Debug the source code.
Introduction
In this tutorial we will implement a Lexer for a simple SQL Select statement language:
SELECT column1 FROM table2
SELECT name, age FROM persons WHERE age > 100
...
A Lexer transforms a string input into a Token vector. Chevrotain has a built-in Lexer engine based on JavaScript Regular Expressions.
Our First Token
To use the Chevrotain lexer, the Tokens must first be defined. Let's examine the definition for a "FROM" Token:
const createToken = chevrotain.createToken;
// using createToken API
const From = createToken({ name: "From", pattern: /FROM/ });
There is not much to it. We simply use the createToken API to define the token, providing it with a name property and a pattern property. The pattern is a RegExp that will be used when splitting up the input string into separate Tokens.
More complex Tokens
How can we define Tokens for Identifiers or Integers?
const Identifier = createToken({ name: "Identifier", pattern: /[a-zA-Z]\w*/ });
const Integer = createToken({ name: "Integer", pattern: /0|[1-9]\d*/ });
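As a quick aside (not part of the tutorial's final code), we can sanity-check these patterns with plain anchored RegExp tests; note that the Integer pattern deliberately rejects leading zeros:
// Quick sanity checks of the patterns above, using plain anchored
// RegExp tests (no chevrotain involved):
console.log(/^[a-zA-Z]\w*$/.test("column1")); // -> true
console.log(/^(0|[1-9]\d*)$/.test("100")); // -> true
console.log(/^(0|[1-9]\d*)$/.test("042")); // -> false (no leading zeros)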
Skipping Tokens
The obvious use case in this language (and many others) is whitespace. Skipping certain Tokens is easily accomplished by marking them with the SKIPPED group.
const WhiteSpace = createToken({
name: "WhiteSpace",
pattern: /\s+/,
group: chevrotain.Lexer.SKIPPED,
});
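To illustrate the effect, here is a minimal sketch (assuming chevrotain is in scope): skipped tokens are matched but never appear in the output token vector.
// A minimal sketch: a lexer with only the Identifier and WhiteSpace tokens.
const miniLexer = new chevrotain.Lexer([WhiteSpace, Identifier]);
const miniResult = miniLexer.tokenize("hello   world");
// The whitespace was matched but discarded; only the Identifiers remain.
console.log(miniResult.tokens.map((t) => t.image)); // -> ["hello", "world"]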
All Our Tokens
Let's examine all the needed Token definitions:
const Identifier = createToken({ name: "Identifier", pattern: /[a-zA-Z]\w*/ });
// We specify the "longer_alt" property to resolve keywords vs identifiers ambiguity.
// See: https://github.com/chevrotain/chevrotain/blob/master/examples/lexer/keywords_vs_identifiers/keywords_vs_identifiers.js
const Select = createToken({
name: "Select",
pattern: /SELECT/,
longer_alt: Identifier,
});
const From = createToken({
name: "From",
pattern: /FROM/,
longer_alt: Identifier,
});
const Where = createToken({
name: "Where",
pattern: /WHERE/,
longer_alt: Identifier,
});
const Comma = createToken({ name: "Comma", pattern: /,/ });
const Integer = createToken({ name: "Integer", pattern: /0|[1-9]\d*/ });
const GreaterThan = createToken({ name: "GreaterThan", pattern: />/ });
const LessThan = createToken({ name: "LessThan", pattern: /</ });
const WhiteSpace = createToken({
name: "WhiteSpace",
pattern: /\s+/,
group: chevrotain.Lexer.SKIPPED,
});
Creating The Lexer
We now have Token definitions, but how do we create a Lexer from these?
// Note: we are placing WhiteSpace first as it is very common, thus it will speed up the lexer.
let allTokens = [
WhiteSpace,
// "keywords" appear before the Identifier
Select,
From,
Where,
Comma,
// The Identifier must appear after the keywords because all keywords are valid identifiers.
Identifier,
Integer,
GreaterThan,
LessThan,
];
const Lexer = chevrotain.Lexer;
let SelectLexer = new Lexer(allTokens);
Note that:
- The order of Token definitions passed to the Lexer is important: the first pattern to match will be chosen, not the longest (see the sketch after this list).
  - See how to resolve Keywords vs Identifiers.
- The lexer's tokenize method is a pure function, thus only a single Lexer (per grammar) is needed.
- The lexer is context unaware; it lexes each token (pattern) individually.
  - If you need to distinguish between different contexts during the lexing phase, take a look at Lexer Modes.
- For patterns requiring more complex constraints than a regular expression can express, take a look at Custom Token Patterns.
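To see why the order matters, here is a hypothetical mis-ordered lexer (a sketch, not the tutorial's code) where Identifier is listed before the keywords:
// A mis-ordered sketch: Identifier appears before the Select keyword.
const misorderedLexer = new Lexer([WhiteSpace, Identifier, Select, From, Where]);
const badResult = misorderedLexer.tokenize("SELECT");
// "SELECT" is a valid match for the Identifier pattern, and Identifier is
// tried first, so no Select token is ever produced.
console.log(badResult.tokens[0].tokenType.name); // -> "Identifier"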
Using The Lexer
let inputText = "SELECT column1 FROM table2";
let lexingResult = SelectLexer.tokenize(inputText);
The lexing result will contain:
- A Token vector.
- The lexing errors (if any were encountered).
- Other Token groups (if grouping was used).
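For example, here is a short sketch of inspecting the result (property names as in chevrotain's ILexingResult):
// Print each token as "TokenTypeName(image)".
console.log(lexingResult.tokens.map((t) => `${t.tokenType.name}(${t.image})`));
// -> ["Select(SELECT)", "Identifier(column1)", "From(FROM)", "Identifier(table2)"]

// Fail fast if the input could not be fully tokenized.
if (lexingResult.errors.length > 0) {
  throw new Error("Lexing errors detected: " + lexingResult.errors[0].message);
}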