rumil_tokens 0.1.0
Lossless source code tokenizer built on Rumil parser combinators. Classified token spans for syntax highlighting. Built-in grammars for Dart, Scala, YAML, JSON, and shell.
rumil_tokens #
Lossless source code tokenizer built on
Rumil parser combinators. Classifies
source text into typed token spans: keywords, strings, comments,
numbers, types, annotations, operators, variables, and punctuation.
Token streams are lossless; concatenating token.text across a stream
reconstructs the input exactly. Built-in grammars for Dart, Scala,
YAML, JSON, and shell.
Usage #
import 'package:rumil_tokens/rumil_tokens.dart';
final tokens = tokenize('final x = 42; // answer', dart);
for (final token in tokens) {
print('${token.runtimeType}: ${token.text}');
}
Use a built-in grammar (dart, scala, yaml, json, shell) or define
your own:
const rust = LangGrammar(
name: 'rust',
keywords: ['fn', 'let', 'mut', 'if', 'else', 'match', 'impl', 'struct'],
types: ['i32', 'u64', 'String', 'Vec', 'Option', 'Result', 'bool'],
lineComment: '//',
blockComment: ('/*', '*/'),
stringDelimiters: ['"'],
annotationPrefix: '#',
);
final tokens = tokenize(source, rust);
Look up a grammar by name:
final grammar = grammarFor('dart'); // returns null for unknown languages
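Because grammarFor returns null for unknown names, callers usually need a fallback path. The following is a minimal sketch, assuming grammarFor returns a nullable grammar and that passing the source through as one untokenized chunk is an acceptable fallback. tokenTextsOrPlain is a hypothetical helper for illustration, not part of the package.

```dart
import 'package:rumil_tokens/rumil_tokens.dart';

/// Hypothetical helper: returns the text of each token in [source], or the
/// whole source as a single chunk when no grammar is registered for [language].
List<String> tokenTextsOrPlain(String source, String language) {
  final grammar = grammarFor(language);
  if (grammar == null) {
    // Unknown language: fall back to one plain chunk.
    return [source];
  }
  return [for (final token in tokenize(source, grammar)) token.text];
}

void main() {
  print(tokenTextsOrPlain('{"a": 1}', 'json'));
  // Assumes 'sql' is not a registered grammar name.
  print(tokenTextsOrPlain('SELECT 1', 'sql'));
}
```

Either branch keeps the lossless property: joining the returned list gives back the original source.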
Enumerate the built-in grammars:
for (final g in builtinGrammars) {
print(g.name);
}
For hot paths (REPL highlighting, large files), build the parser once and reuse it across calls:
final dartTokenizer = buildTokenizer(dart);
for (final source in sources) {
final result = dartTokenizer.run(source);
// ...
}
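One way to apply the build-once advice across several languages is to keep a map of prebuilt tokenizers keyed by grammar name. This is only a sketch: it assumes each built-in grammar has a unique name and that the Dart grammar is registered as 'dart'. The tokenizers map is a hypothetical name, and the shape of the value returned by run is not documented here, so the example only prints its runtime type.

```dart
import 'package:rumil_tokens/rumil_tokens.dart';

// Top-level finals are initialized lazily, so every parser is built once
// on first access and then reused for all later calls.
final tokenizers = {
  for (final g in builtinGrammars) g.name: buildTokenizer(g),
};

void main() {
  final sources = ['final x = 42;', 'var y = "hi"; // note'];
  final dartTokenizer = tokenizers['dart'];
  if (dartTokenizer == null) return; // no grammar registered under this name
  for (final source in sources) {
    final result = dartTokenizer.run(source);
    // Inspect the result type before relying on its fields.
    print(result.runtimeType);
  }
}
```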
Lossless property #
Concatenating token.text for every token reconstructs the original source:
assert(tokens.map((t) => t.text).join() == source);
Positions #
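For tooling that needs source offsets, use tokenizeSpans. Offsets index into the Dart string (UTF-16 code units, as used by substring and length), not UTF-8 bytes: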
final spans = tokenizeSpans(source, dart);
for (final s in spans) {
print('[${s.start}, ${s.end}) ${s.token}');
assert(source.substring(s.start, s.end) == s.token.text);
}
Spanned<Token> is an extension type over (Token, int, int).
The [start, end) interval is half-open; spans are contiguous
(spans[i].end == spans[i+1].start) and anchored (spans.first.start == 0,
spans.last.end == source.length).
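The contiguity and anchoring invariants can be checked in a single pass over the spans. The sketch below uses only start and end as described above. spansCoverSource is a hypothetical helper, and treating an empty span list as valid only for empty input is an assumption.

```dart
import 'package:rumil_tokens/rumil_tokens.dart';

/// Hypothetical helper: returns true when the spans for [source] start at 0,
/// leave no gaps or overlaps, and end at source.length.
bool spansCoverSource(String source) {
  final spans = tokenizeSpans(source, dart);
  if (spans.isEmpty) return source.isEmpty; // assumption for empty input
  var expectedStart = 0;
  for (final s in spans) {
    if (s.start != expectedStart) return false; // gap or overlap
    expectedStart = s.end;
  }
  return expectedStart == source.length;
}

void main() {
  print(spansCoverSource('final x = 42; // answer'));
}
```

A check like this is useful in tests for editors or linters that map spans back onto the source buffer.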
Grammar coverage #
Known limitations (see CHANGELOG.md):
- YAML block scalars (|, >) tokenize the indented body as regular YAML content rather than one string literal.
- Dart string interpolation ("$x", "${expr}") remains one StringLit.
- Shell braced expansions do not balance nested braces.
- Heredoc body is one StringLit.
- Nested generic close renders the outer >> as right-shift.
Part of the rumil-dart monorepo.