string/tokenizer_pipeline_utils library

Customizable tokenizer pipeline: ordered regex rules with keep/skip (roadmap #434).

A reusable lexer core. Given an ordered list of TokenRules, it walks the input and, at each cursor position, takes the first rule whose pattern matches as a prefix at that position. Rules marked skip (whitespace, comments) advance the cursor without emitting a token. Unlike a one-off, hand-rolled split or RegExp.allMatches pass, this approach resolves ambiguity deterministically through rule order and treats an unmatched position as a hard error rather than silently dropping the text.
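To illustrate the "silently dropped text" problem mentioned above, here is a small sketch of the one-off approach using only dart:core. The helper name looseScan and the sample input are made up for illustration and are not part of this library.

```dart
// Hypothetical helper (not part of this library): a one-off scan that
// collects identifiers and integers wherever they occur in the input.
List<String> looseScan(String input) {
  return RegExp(r'[A-Za-z_]\w*|\d+')
      .allMatches(input)
      .map((m) => m.group(0)!)
      .toList();
}

void main() {
  // allMatches steps over anything the pattern does not cover, so the
  // '=' and the stray '@' vanish without any error being raised.
  print(looseScan('x = 4 @ 2')); // [x, 4, 2]
}
```

A cursor-based tokenizer that must match a rule at every position would stop at the '@' instead of skipping it.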

Classes

Token
A produced token: its rule type, the matched value, and the start offset into the original input.
TokenRule
One tokenizer rule: a type label, the pattern to match at the cursor, and whether matches are dropped (skip) instead of emitted.

Functions

tokenize(String input, List<TokenRule> rules) List<Token>
Tokenizes input by trying rules in order at each cursor position; the first rule whose pattern matches as a prefix wins. A match from a skip rule advances the cursor without emitting a token. Throws a FormatException (carrying the offset) at any position where no rule matches. A zero-length match is treated as a non-match, so a rule like \d* can never leave the cursor spinning in place.
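The behavior described for tokenize can be sketched roughly as follows. This is a minimal illustration, not the library's source: the Token and TokenRule classes below are hypothetical stand-ins, and their constructor shapes and field names (type, value, start, pattern, skip) are assumptions based only on the descriptions on this page.

```dart
// Hypothetical stand-ins for the library's classes; the real constructors
// and field names may differ.
class Token {
  final String type;
  final String value;
  final int start;
  const Token(this.type, this.value, this.start);

  @override
  String toString() => '$type("$value")@$start';
}

class TokenRule {
  final String type;
  final Pattern pattern;
  final bool skip;
  const TokenRule(this.type, this.pattern, {this.skip = false});
}

List<Token> tokenize(String input, List<TokenRule> rules) {
  final tokens = <Token>[];
  var pos = 0;
  while (pos < input.length) {
    Match? hit;
    TokenRule? winner;
    for (final rule in rules) {
      // matchAsPrefix only succeeds if the match begins exactly at pos.
      final m = rule.pattern.matchAsPrefix(input, pos);
      // A zero-length match counts as no match, so the cursor always moves.
      if (m != null && m.end > pos) {
        hit = m;
        winner = rule;
        break; // first matching rule wins
      }
    }
    if (hit == null || winner == null) {
      throw FormatException('No rule matches', input, pos);
    }
    if (!winner.skip) {
      tokens.add(Token(winner.type, hit.group(0)!, pos));
    }
    pos = hit.end;
  }
  return tokens;
}

void main() {
  final rules = [
    TokenRule('ws', RegExp(r'\s+'), skip: true),
    TokenRule('number', RegExp(r'\d+')),
    TokenRule('ident', RegExp(r'[A-Za-z_]\w*')),
    TokenRule('op', RegExp(r'[-+*/=]')),
  ];

  print(tokenize('x = 42', rules));
  // [ident("x")@0, op("=")@2, number("42")@4]

  try {
    tokenize('x @ 1', rules);
  } on FormatException catch (e) {
    print(e.offset); // 2
  }
}
```

In this sketch the zero-length guard (m.end > pos) is what keeps a rule such as \d* from stalling: at a non-digit position it matches the empty string, is treated as a miss, and the next rule in the list gets its turn.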