rumil_parsers 0.10.0
Format parsers built on Rumil: JSON, CSV, XML, TOML, YAML, Proto3, HCL, and CommonMark Markdown, plus typed AST decoders with ObjectAccessor pattern.
0.10.0 #
The value layer is now stack-safe to memory, and serializers can stream. The parser interpreter was already stack-safe; this release extends that to the operations that run after a parse (native conversion, serialization, and composed decoding), which had been recursive on nesting depth, so a document that parsed fine could overflow the stack on the next step. All converters, serializers, and composite decoders are now iterative. Additive on the public surface; the version jumps 0.8.1 → 0.10.0 to rejoin the rumil family in lockstep.
Fixed — value-layer stack safety #
- Native converters (`jsonToNative`, `yamlToNative`/`resolveAnchors`, `tomlToNative`, `hclToNative`, `xmlToNative`) now walk the AST over an explicit worklist instead of recursing, so converting a deeply-nested document cannot overflow the Dart call stack.
- Serializers (`serializeJson`, `serializeYaml`, `serializeToml`, `serializeXml`, `serializeHcl`) are likewise iterative.
- Composite decoders (`jsonListOf`/`jsonMapOf`/nullable and the YAML/TOML equivalents) drain their nesting over a worklist. The one remaining host-recursion boundary is the user build-callback of `fromJsonObject`/`fromYamlMapping`/`fromTomlTable` (and `.map`), which recurses once per schema level, not per value level, and cannot be trampolined without a breaking `AstDecoder.decode` signature change, so it is documented as a known boundary.
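The worklist idea can be illustrated with a minimal sketch. The `Node`, `Leaf`, `Branch`, `_Frame`, and `toNative` names below are hypothetical stand-ins written for this example, not the package's AST or converters; the point is only that an explicit stack of frames replaces host recursion, so nesting depth is bounded by heap rather than by the call stack.

```dart
/// Hypothetical two-variant AST used only for this illustration.
sealed class Node {}

final class Leaf extends Node {
  final int value;
  Leaf(this.value);
}

final class Branch extends Node {
  final List<Node> children;
  Branch(this.children);
}

/// One pending unit of work: the children still to visit for a branch,
/// plus the native list that receives their converted values.
final class _Frame {
  final Iterator<Node> pending;
  final List<Object?> out;
  _Frame(this.pending, this.out);
}

/// Converts a tree to nested Dart lists and ints without recursion.
Object? toNative(Node root) {
  if (root is Leaf) return root.value;
  final top = root as Branch;
  final rootOut = <Object?>[];
  final stack = <_Frame>[_Frame(top.children.iterator, rootOut)];
  while (stack.isNotEmpty) {
    final frame = stack.last;
    if (!frame.pending.moveNext()) {
      // Every child of this branch has been handled.
      stack.removeLast();
      continue;
    }
    final child = frame.pending.current;
    switch (child) {
      case Leaf(:final value):
        frame.out.add(value);
      case Branch(:final children):
        // Attach the child list now and fill it in later iterations.
        final childOut = <Object?>[];
        frame.out.add(childOut);
        stack.add(_Frame(children.iterator, childOut));
    }
  }
  return rootOut;
}
```

A tree built in a loop to several hundred thousand levels converts without touching the host stack, because each level costs one heap-allocated `_Frame`.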
Added — streaming serialization #
`serialize{Json,Yaml,Toml,Xml,Hcl,HclValue}To(StringSink, …)`: each serializer now has a streaming primitive that writes into a `StringSink`; the existing `String`-returning functions are byte-for-byte-identical wrappers over it. This decouples stack-safety from output size: an indented pretty-printer emits `indent × depth` whitespace per level (Θ(depth²) total, inherent to pretty-printing, as in `jq` or `JSON.stringify(_, null, 2)`), and streaming to a sink keeps peak memory bounded even when the total output is large.
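A small sketch of the sink-plus-wrapper shape, using only `dart:core`. The names `writeNestedListTo`, `nestedList`, and `CountingSink` are invented for this illustration and are not part of the package; the toy printer emits an empty nested list so the quadratic indentation cost is easy to see.

```dart
/// Streaming primitive: pretty-prints `depth` nested empty lists into [sink].
/// Level k is indented by `indent * k` spaces, so total whitespace grows
/// quadratically with depth.
void writeNestedListTo(StringSink sink, int depth, {int indent = 2}) {
  for (var level = 0; level < depth; level++) {
    sink.write(' ' * (indent * level));
    sink.writeln('[');
  }
  for (var level = depth - 1; level >= 0; level--) {
    sink.write(' ' * (indent * level));
    sink.writeln(']');
  }
}

/// String-returning wrapper over the streaming primitive.
String nestedList(int depth, {int indent = 2}) {
  final buffer = StringBuffer();
  writeNestedListTo(buffer, depth, indent: indent);
  return buffer.toString();
}

/// A sink that discards text and only counts UTF-16 code units, standing in
/// for a file or socket sink that never holds the whole output in memory.
final class CountingSink implements StringSink {
  int length = 0;

  @override
  void write(Object? object) {
    length += '$object'.length;
  }

  @override
  void writeAll(Iterable<dynamic> objects, [String separator = '']) {
    var first = true;
    for (final object in objects) {
      if (!first) write(separator);
      first = false;
      write(object);
    }
  }

  @override
  void writeCharCode(int charCode) {
    length += String.fromCharCode(charCode).length;
  }

  @override
  void writeln([Object? object = '']) {
    write(object);
    length += 1; // the trailing newline
  }
}
```

With `indent = 2` the output length is `2 * depth * (depth - 1) + 4 * depth` code units. Writing into `CountingSink` keeps only one line in memory at a time, while the `String` wrapper has to hold the full quadratic result.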
0.8.1 #
Two additive parsers: parseMarkdownWithFrontmatter and parseNdJson.
Motivated by lambé's input pipeline (markdown frontmatter currently
leaks into document body; NDJSON line splitting was hand-rolled
downstream) and by rem's markdownWithFrontmatter helper, which can
collapse to a thin re-export of the upstream API.
Added #
- `parseMarkdownWithFrontmatter(String input) → Result<ParseError, MarkdownDocument>`. Parses Markdown that may have a leading YAML frontmatter block delimited by `---` lines. Returns a `MarkdownDocument` carrying both the optional `YamlDocument` frontmatter and the `MdDocument` body. Detection rules: the opening `---` must sit at offset 0 and be followed by a newline; the closing fence is the first line containing exactly `---`; CRLF is tolerated; an unclosed block falls back to plain Markdown without raising an error; an empty block (`---\n---\n`) yields `YamlNull`. YAML parse errors inside a well-formed block surface as the result's failure. `parseMarkdown` is byte-unchanged.
- `parseNdJson(String input, {NdJsonConfig config}) → Result<ParseError, List<JsonValue>>`. Parses newline-delimited JSON (NDJSON / JSON Lines). A `\r` immediately before `\n` is stripped so CRLF input parses identically to LF. Per-line errors are accumulated as `Partial` rather than aborting the stream, so callers see every parsed value and every error in one pass. `ErrorLocation`s reference the original input, with line/column precomputed in O(log n) via the new `rumil.LineIndex`. `parseJson` is byte-unchanged.

  Strict by default. Blank lines are parse errors, matching jsonlines.org. The opt-in `NdJsonConfig(lenient: true)` skips blank lines for log-file consumers and stanza-style inputs. Strict mode is the right default: silently tolerating blank lines makes one parser quietly different from another and lets bugs in upstream producers go unnoticed.
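The frontmatter detection rules listed above can be sketched as a plain string splitter. This is an illustration under assumptions, not the package's implementation: `splitFrontmatter` is a hypothetical helper that returns raw text and does no YAML or Markdown parsing, and it reports an empty block as an empty string where the real API yields `YamlNull`.

```dart
/// Splits [input] into raw frontmatter text and the Markdown body.
/// `frontmatter` is null when there is no opening fence at offset 0 or the
/// block is never closed; in both cases `body` is the unchanged input.
({String? frontmatter, String body}) splitFrontmatter(String input) {
  // Opening fence: `---` at offset 0, immediately followed by LF or CRLF.
  final int start;
  if (input.startsWith('---\n')) {
    start = 4;
  } else if (input.startsWith('---\r\n')) {
    start = 5;
  } else {
    return (frontmatter: null, body: input);
  }

  var lineStart = start;
  while (lineStart <= input.length) {
    var lineEnd = input.indexOf('\n', lineStart);
    final hasNewline = lineEnd != -1;
    if (!hasNewline) lineEnd = input.length;

    var line = input.substring(lineStart, lineEnd);
    // Tolerate CRLF by dropping a trailing carriage return.
    if (line.endsWith('\r')) line = line.substring(0, line.length - 1);

    if (line == '---') {
      // First line that is exactly `---` closes the block.
      return (
        frontmatter: input.substring(start, lineStart),
        body: hasNewline ? input.substring(lineEnd + 1) : '',
      );
    }
    if (!hasNewline) break;
    lineStart = lineEnd + 1;
  }
  // Unclosed block: fall back to treating everything as Markdown.
  return (frontmatter: null, body: input);
}
```

Returning the untouched input for the unclosed case mirrors the documented fallback: a stray leading `---` is then handled as ordinary Markdown.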
0.8.0 #
JSON parser, principled and fast. The HCL number AST follows the same
split. Five logical chunks ship together: the HCL decoder fix
originally scoped as 0.7.1, a JSON AST split (JsonNumber →
JsonInt | JsonDouble), the matching HCL AST split (HclNumber →
HclInt | HclDouble), a JSON parser perf overhaul, and two latent
correctness fixes (common.floatingPoint precision, YAML integer
overflow).
Changed (breaking) #
- `JsonNumber` is now a sealed sum of `JsonInt(int)` and `JsonDouble(double)`. The previous single-`JsonNumber(double)` representation flattened integer-shaped and float-shaped tokens at the AST layer, silently losing precision for integers above 2^53 and denying downstream consumers the type discrimination they need to specialize integer-vs-float paths.

  The new shape matches the discrimination already present in `dart:convert` (where `jsonDecode` returns `int` or `double` based on token shape), serde_json's `Number` enum (`PosInt`/`NegInt`/`Float`), simdjson's `number_type` (`signed_integer`/`unsigned_integer`/`floating_point_number`), and Jackson's `NumericNode` hierarchy. Pattern matching on `JsonNumber` becomes pattern matching on `JsonInt` or `JsonDouble`. Equality across the variants is `false`: `JsonInt(1) != JsonDouble(1.0)`.

  Big integers exceeding Dart's `int` range fall back to `JsonDouble`, matching `dart:convert`'s rule. Adding an explicit `JsonBigInt` variant is reserved for a future release if real consumers need it.

  Round-trip fidelity is improved as a side effect: `parseJson('1.0')` now serializes back as `'1.0'` rather than `'1'`. The source token shape is preserved.

  Decoders are tolerant of either variant: `jsonInt.decode(JsonDouble)` narrows via `value.toInt()`, `jsonDouble.decode(JsonInt)` widens via `value.toDouble()`. Documented on each decoder.
- `HclNumber` is now a sealed sum of `HclInt(int)` and `HclDouble(double)`, mirroring the JSON AST split. The previous single-`HclNumber(num)` representation forced consumers to dispatch on `value is int` at every read; the new shape preserves the discrimination at parse time. Integer-shaped tokens that overflow Dart's `int` fall back to `HclDouble`, matching JSON's rule. Equality across variants is `false`. Pattern matching on `HclNumber` becomes pattern matching on `HclInt` or `HclDouble`. Round-trip preserves source token shape: `1` parses as `HclInt(1)` and serializes as `'1'`; `1.0` parses as `HclDouble(1.0)` and serializes as `'1.0'` (was `'1'` under the flattened representation).
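To make the int/double split concrete, here is a minimal sketch with stand-in types. `NumberNode`, `IntNode`, `DoubleNode`, `parseNumberToken`, `render`, and `asInt` are hypothetical names for this example, not the package's `JsonInt`/`JsonDouble` or its decoders, and the behaviour described assumes the native Dart VM, where `int` is 64-bit and `1.0.toString()` is `'1.0'`.

```dart
/// Hypothetical sealed number node with one variant per token shape.
sealed class NumberNode {}

final class IntNode extends NumberNode {
  final int value;
  IntNode(this.value);

  @override
  bool operator ==(Object other) => other is IntNode && other.value == value;

  @override
  int get hashCode => Object.hash(IntNode, value);
}

final class DoubleNode extends NumberNode {
  final double value;
  DoubleNode(this.value);

  @override
  bool operator ==(Object other) =>
      other is DoubleNode && other.value == value;

  @override
  int get hashCode => Object.hash(DoubleNode, value);
}

/// Classifies an already-validated numeric token by its shape. Integer-shaped
/// tokens that do not fit in `int` fall back to the double variant.
NumberNode parseNumberToken(String token) {
  final integerShaped = !token.contains(RegExp(r'[.eE]'));
  if (integerShaped) {
    final asInteger = int.tryParse(token);
    if (asInteger != null) return IntNode(asInteger);
  }
  return DoubleNode(double.parse(token));
}

/// Serializes a node; the variant decides whether a fraction is printed.
String render(NumberNode node) => switch (node) {
      IntNode(:final value) => '$value',
      DoubleNode(:final value) => '$value',
    };

/// Tolerant integer read: accepts either variant, narrowing doubles.
int asInt(NumberNode node) => switch (node) {
      IntNode(:final value) => value,
      DoubleNode(:final value) => value.toInt(),
    };
```

Because the hierarchy is sealed, both `switch` expressions are checked for exhaustiveness, so adding a third variant later would surface every site that needs updating at compile time.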
Fixed #
- HCL decoder is now consistent across N=1 vs N≥2 same-labeled blocks. `hclDocToNative` previously returned a single block as a non-list (`{...}`) and multiple blocks as a list. Now blocks always return as lists, regardless of count, using the `HclBlock` discriminator already present in the AST. Attributes are unchanged. Consumers that pattern-matched on `result['variable'] is Map` for the N=1 case must switch to `result['variable'] is List` (always). The previous behavior threw away structural information from the parser AST and made common Terraform patterns (one `terraform`, one `provider`, single `variable`) require defensive shape checks.
- `common.floatingPoint()` precision. The helper previously computed `value * math.pow(10, exp)` for tokens with an exponent; that multiplication rounded before assembly and dropped the smallest positive subnormal (`5e-324`) to `0.0`. Now delegates to `double.parse` on the captured source slice, which uses the platform's correctly-rounded conversion. YAML inherits the fix automatically since it consumes `floatingPoint()`.
- YAML integer overflow. `_yamlInteger` previously called `int.parse(...)` (via `common.signedInt()`), which throws on tokens exceeding Dart's `int` range. Now uses `int.tryParse` + fallback to `YamlFloat`, matching JSON's big-integer rule. Affects YAML documents with very large integer literals (e.g. `2^63` or beyond).
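The floating-point precision issue is easy to reproduce with the standard library alone. The two helpers below are hypothetical names for this illustration: one follows the scale-by-power-of-ten approach described above, the other hands the whole token to `double.parse`.

```dart
import 'dart:math' as math;

/// Scale-then-multiply: parse the mantissa, then multiply by 10^exp.
/// For very negative exponents 10^exp underflows to 0.0 before the
/// multiplication happens, so the final product is 0.0 as well.
double viaMultiplication(String mantissa, int exp) =>
    double.parse(mantissa) * math.pow(10, exp);

/// Whole-token conversion: let double.parse see mantissa and exponent
/// together, so rounding happens once on the exact decimal value.
double viaSourceSlice(String mantissa, int exp) =>
    double.parse('${mantissa}e$exp');
```

For the token `5e-324`, `viaMultiplication('5', -324)` evaluates to `0.0`, while `viaSourceSlice('5', -324)` returns `double.minPositive`, the smallest positive subnormal.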
Performance #
The JSON parser is now substantially faster on every workload, with the largest wins under Wasm where the JsonInt/JsonDouble split unlocks i64-vs-f64 specialization that the flattened representation forced into a single homogeneous f64 path.
Mean μs/op across 100 measured iterations + 100 warmup, Linux x86_64,
Dart SDK 3.11.4. Each pass run separately on a quiet system. Full
table and per-byte MB/s in BENCHMARKS.md.
| Workload | 0.7.0 AOT | 0.8.0 AOT | AOT speedup | 0.7.0 Wasm | 0.8.0 Wasm | Wasm speedup |
|---|---|---|---|---|---|---|
| integer_heavy | 162.1 ms | 154.5 ms | 1.05× | 86.8 ms | 64.7 ms | 1.34× |
| float_heavy | 189.2 ms | 179.5 ms | 1.05× | 96.0 ms | 76.9 ms | 1.25× |
| mixed | 1368 ms | 1115 ms | 1.23× | 609.4 ms | 430.3 ms | 1.42× |
Wins come from three changes: capture-based number parsing (one
allocation per token instead of a per-character interpolation chain),
capture-based string runs (one substring slice in the unescaped fast
path instead of O(n) per-character allocations), and elimination of
the redundant leading _ws in _lex (every token paid a leading skip
that the previous token's trailing skip had already consumed). The
combinator architecture's affinity for Wasm codegen surfaces in the
Wasm column — the mixed workload composes all four optimizations
(numbers, strings, dispatch, lex) and shows the largest relative win.
Reproduce via rumil_bench's bench_json_perf_pass:
```sh
cd rumil_bench
dart compile exe bin/bench_json_perf_pass.dart -o /tmp/perf.aot
/tmp/perf.aot
```
For the Wasm column, see BENCHMARKS.md for the full instructions.
0.7.0 #
- JSON: value-dispatch parser migrated from a 6-way `Or` chain to rumil's new `firstCharChoice` combinator. JSON values have cleanly disjoint leading chars (`n`, `t`/`f`, digits/`-`, `"`, `[`, `{`), so the O(1) dispatch replaces the linear scan. Bench numbers (AOT native, 6 runs): json-small 24.0 µs → 18.4 µs (-23%), json-medium 35.0 ms → 25.7 ms (-27%), json-large 429 ms → 312 ms (-27%). The ratio vs petitparser improves from 13× to ~10× small / ~9× large. All RFC 8259 conformance tests pass unchanged.
- HCL: operator precedence parser migrated from a six-layered `chainl1` ladder + recursive `_unary` to a single `pratt(...)` call using rumil's new `cFamilyPrecedence` preset. Functionally equivalent: same operators, same binding powers, same AST. Bench numbers: hcl-config 253 µs → 225 µs (-11%), hcl-50res 10.7 ms → 9.32 ms (-13%) on AOT native. All HCL conformance tests (specsuite, fuzz corpus, terraform-provider-aws .tf files) pass unchanged.
- Other format parsers (CSV, TOML, XML, YAML, Proto3, Markdown) are unchanged at the source level. They benefit transparently from rumil 0.7's `Many(StringMatch)`/`SkipMany(simple)` fast paths (CSV measured 10–22% faster) and from the FIRST-set Or dispatch optimization (small wins on alternation-heavy grammars).
- Depends on `rumil: ^0.7.0`.
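The first-character dispatch described in the JSON entry can be pictured with a plain `switch` on the leading code unit. This is a generic sketch of the idea, not rumil's `firstCharChoice` API; `JsonKind` and `kindAt` are hypothetical names, and whether the compiler lowers the `switch` to a jump table is an implementation detail this example does not depend on.

```dart
/// The six value families a JSON value can start as.
enum JsonKind { nullLiteral, boolLiteral, number, string, array, object }

/// Picks the value family from the single code unit at [offset], or returns
/// null when no JSON value can start there. Because the leading characters
/// are disjoint, one look at the input replaces trying each alternative.
JsonKind? kindAt(String input, int offset) {
  if (offset >= input.length) return null;
  final c = input.codeUnitAt(offset);
  return switch (c) {
    0x6E => JsonKind.nullLiteral, // n
    0x74 || 0x66 => JsonKind.boolLiteral, // t, f
    0x2D => JsonKind.number, // -
    >= 0x30 && <= 0x39 => JsonKind.number, // 0-9
    0x22 => JsonKind.string, // "
    0x5B => JsonKind.array, // [
    0x7B => JsonKind.object, // {
    _ => null,
  };
}
```

A real parser would follow the classification by running the matching sub-parser; an ordered chain of alternatives would instead attempt each one until a match is found.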
0.6.0 #
- Depends on `rumil: ^0.6.0`. Version aligned with the rumil-dart monorepo 0.6.0 release. No functional changes in this package.
0.5.0 #
CommonMark Markdown parser. Architecture audit. 7376 tests.
- Markdown: 652/652 CommonMark 0.31.2 spec conformance. Typed `MdNode` AST with structured fields (`MdHeading.level`, `MdLink.href`, `MdImage.alt`), separating parsing from rendering. Public API: `parseMarkdown(String) → Result<ParseError, MdDocument>`.
- TOML: Replace `throw`/`try-catch` with `Result`-based error flow. Zero exceptions in the parser.
- XML: Replace manual `indexOf`/`substring` with combinators for QName parsing, entity reference validation, and attribute value expansion.
- Delimited: Replace `while`-loop field splitter and `RegExp` with combinator parsers.
- All formats: Apply `.capture` optimization (12 sites); each benefits from the fused `Capture(Many)` interpreter fast path.
- TOML: Deduplicate unicode escape parsers into parameterized `_unicodeEscape(marker, count)`.
- Depends on rumil ^0.5.0.
0.4.0 #
All parsers to spec conformance. 6724 tests, zero analyzer warnings.
- HCL full spec: expression tower (operators, ternary, for-expressions, function calls), string templates `${expr}`, heredocs `<<EOF`/`<<-EOF`, template directives `%{if}`/`%{for}`, index/splat `[*]`/`.*`, scientific notation, Unicode identifiers, parenthesized object keys, object element commas. 2760/2760 including 2717 terraform-provider-aws `.tf` files.
- XML 1.0 5e: W3C conformance suite, 1506/1506. DOCTYPE/DTD parsing, external entity resolution, namespace validation, Unicode names, attribute uniqueness, `--` restriction in comments.
- Delimited overhaul: three-tier architecture (explicit config / auto-detect dialect / per-row robust), BOM stripping, ragged row policies, `detectDialect()`, `parseDelimitedRobust()`. 100 tests.
- YAML 1.2: anchors, aliases, merge keys, block scalars, multi-document, full escape set, `resolveAnchors()`, `YamlParseConfig`. 333/333.
- JSON: 318/318. TOML 1.1: 681/681. Proto3: 101/101.
- Conformance test runners for all formats in `test/conformance/`.
0.3.1 #
- Doc on `ObjectBuilder` constructor.
- Depends on rumil ^0.3.0.
0.3.0 #
- AST encoders + serializers for JSON, TOML, YAML, XML, CSV, Proto3, HCL.
- AstBuilder with nativeToAst for JSON, YAML, TOML, XML, HCL.
- Native decoders: jsonToNative, yamlToNative, tomlToNative, xmlToNative, hclToNative.
- Shared escape utilities.
- operator == and hashCode on all AST classes.
- YAML indentation-based nested block parsing.
- HCL parser (attributes, blocks, comments, references).
- 278 tests.
0.2.0 #
- Doc comments on all public API elements.
- Depends on rumil ^0.2.0 (`fail` renamed to `failure`).
0.1.0 #
- Core parser combinators: sealed Parser ADT with 26 subtypes, external interpreter, defunctionalized trampoline
- Warth seed-growth left recursion via `rule()`
- Stack-safe to 10M+ operations
- Typed errors with source location (line, column, offset)
- Lazy error construction via `late final` thunks
- RadixNode O(m) string matching
- Full combinator DSL: `.zip()`, `.thenSkip()`, `.skipThen()`, `|`, `.map`, `.flatMap`, `.many`, `.sepBy`, `.chainl1`, `.chainr1`, `.between`, `.capture`, `.memoize`
- Format parsers: JSON (RFC 8259), CSV (RFC 4180), XML, TOML (v1.0.0), YAML (simplified 1.2), Proto3 schema
- AST decoders for JSON, TOML, YAML with `ObjectAccessor` pattern
- Formula evaluator with operator precedence via `chainl1`, variables, custom functions
- Binary codec: ZigZag, LEB128 Varint, BinaryCodec with `xmap` + `product2`–`product6` composition
- build_runner codegen for `@binarySerializable` classes and sealed hierarchies