Custom Token Patterns


See: Runnable example for a quick start.


Normally a Token's pattern is defined using a JavaScript regular expression:

import { createToken } from "chevrotain";

const IntegerToken = createToken({ name: "IntegerToken", pattern: /\d+/ });

However, in some circumstances a custom pattern matching implementation may be required, for example when a token cannot be conveniently expressed using a regular expression alone, or when matching depends on previously lexed tokens (see Lexing Context below).


A custom pattern has an API similar to that of the RegExp.prototype.exec function, but with one small constraint:

  • A custom pattern should behave as though the RegExp sticky flag has been set. This means that attempted matches must begin at the offset argument, not at the start of the input.

The basic syntax for supplying a custom pattern is defined by the ICustomPattern interface. Example:

function matchInteger(text, startOffset) {
  let endOffset = startOffset;
  let charCode = text.charCodeAt(endOffset);
  // 0-9 digits
  while (charCode >= 48 && charCode <= 57) {
    endOffset++;
    charCode = text.charCodeAt(endOffset);
  }

  // No match, must return null to conform with the RegExp.prototype.exec signature
  if (endOffset === startOffset) {
    return null;
  } else {
    let matchedString = text.substring(startOffset, endOffset);
    // according to the RegExp.prototype.exec API the first item in the returned array must be the whole matched string.
    return [matchedString];
  }
}

const IntegerToken = createToken({
  name: "IntegerToken",
  pattern: { exec: matchInteger },

  // Optional property that will enable optimizations in the lexer
  start_chars_hint: ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"],
});

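Because the matcher conforms to the exec signature, it can also be exercised on its own, independent of the lexer (using the matchInteger function defined above):

```javascript
// matchInteger follows RegExp.prototype.exec semantics:
console.log(matchInteger("abc123", 3)); // ["123"] - digits found starting at offset 3
console.log(matchInteger("abc123", 0)); // null - "a" is not a digit
```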
Using an object literal with only a single property is still a little verbose, so an even more concise syntax is also supported:

// pattern is passed the matcher function directly.
createToken({ name: "IntegerToken", pattern: matchInteger });

Lexing Context

A custom token matcher has two additional optional arguments which allow accessing the current lexing context. This context can be used to allow or disallow lexing certain Token Types depending on the previously lexed tokens.

Let's expand the previous example to only allow lexing integers if the previous token was not an identifier (a contrived example).

import { tokenMatcher } from "chevrotain";
import _ from "lodash";

function matchInteger(text, offset, matchedTokens, groups) {
  let lastMatchedToken = _.last(matchedTokens);

  // An Integer may not follow an Identifier
  if (tokenMatcher(lastMatchedToken, Identifier)) {
    // No match, must return null to conform with the RegExp.prototype.exec signature
    return null;
  }

  // rest of the code from the example above...
}

A larger, non-contrived example can be seen here: Lexing Python-like indentation using Chevrotain.

It is important to note that the matchedTokens and groups arguments match the tokens and groups properties of the tokenize output (ILexingResult). These arguments reflect the current state of the lexing result, so even if the lexer has performed error recovery, any tokens found in those arguments are still guaranteed to be in the final result.

Custom Payloads

Sometimes we want to collect additional properties on an IToken object, for example:

  • Save RegExp capturing groups on the token object.
  • Subsets of the matched text, e.g: strip away the quotes from string literals.
  • Computed values from the matched text, e.g: Integer values of Date parts (day/month/year).

This can be done by attaching a payload property to our custom token matcher returned value, for example:

// We define the regExp only **once** (outside) to avoid performance issues.
const stringLiteralPattern = /"[^"]*"/y;

function matchStringLiteral(text, startOffset) {
  // using 'y' sticky flag (Note it is not supported on IE11...)
  stringLiteralPattern.lastIndex = startOffset;

  // Note that just because we are using a custom token pattern
  // Does not mean we cannot implement it using JavaScript Regular Expressions...
  const execResult = stringLiteralPattern.exec(text);
  if (execResult !== null) {
    const fullMatch = execResult[0];
    // compute the payload
    const matchWithOutQuotes = fullMatch.substr(1, fullMatch.length - 2);
    // attach the payload
    execResult.payload = matchWithOutQuotes;
  }

  return execResult;
}

const StringLiteral = createToken({
  name: "StringLiteral",
  pattern: matchStringLiteral,
  // custom patterns should explicitly specify the line_breaks option.
  line_breaks: false,
});

// When we lex a StringLiteral text a "payload" property will now exist on the resulting token object.


  • A custom pattern may be implemented using Regular Expressions; these concepts are not mutually exclusive.
  • The payload property may be anything e.g:
    • A single value (as in the example above).
    • A JavaScript object with multiple properties.
    • Capturing groups from a regExp exec method's results.
    • The "groups" property of an regExp exec method's result (If Named Capturing Groups are usedopen in new window).

Additional examples can be found here: Runnable example for custom payloads.