betto_icu 0.1.0-dev.1
Unicode text tokenization for Dart — Tokenizer interface, IcuTokenizer (system ICU FFI, UAX #29), and RegExpTokenizer (pure Dart, Latin fallback).
Changelog
0.1.0-dev.1
Initial release.
Added
- Tokenizer: abstract interface for text segmentation. Implementations return only word-like tokens (letters and digits); punctuation and whitespace are discarded.
- IcuTokenizer: UAX #29 word-boundary segmentation via the system ICU library (FFI). Handles non-Latin scripts: CJK, Arabic, Thai, Devanagari, and more. Supported platforms:

  | Platform | Library |
  | --- | --- |
  | macOS / iOS | libicucore.dylib (ships with the OS) |
  | Android | libicuuc.so (NDK) |
  | Linux | libicuuc.so.NN (install libicu-dev / icu) |
  | Windows | icu.dll (Windows 10+) |

  Probes for ICU symbol renaming (e.g. ubrk_open_76 on Debian Trixie+ and Fedora) so no manual configuration is needed.

  Uses RegExp-based span classification instead of ubrk_getRuleStatus() for portability across all supported platforms, including Apple's libicucore, which does not export UAX #29 rule-status tags.
- RegExpTokenizer: pure-Dart tokenizer using RegExp with Unicode property escapes (\p{L}, \p{N}). Zero FFI dependencies; suitable for English prose and common technical identifiers (mTLS, 0x8004210B, etc.). Works on every platform, including web.
- BrowserTokenizer: UAX #29 word-boundary segmentation via the browser's Intl.Segmenter API (dart:js_interop). Web-only; zero bundle cost. Requires Chrome 87+, Firefox 125+, or Safari 16.4+. On native targets the conditional import resolves to a stub that throws UnsupportedError.
- Conditional export pattern in lib/betto_icu.dart: the correct IcuTokenizer implementation (real FFI or stub) and BrowserTokenizer implementation (real JS interop or stub) are selected at compile time via dart.library.ffi and dart.library.js_interop.
- IcuTokenizer.forPlatform(String platform): named constructor that loads the ICU library for an explicit platform string. Enables tests to exercise each library-loading branch without requiring the native OS.
- bin/tokenize.dart: command-line tool that tokenizes a string with either or both tokenizers (--icu, --regexp).
- Flutter integration test app (integration_test_app/): runs the full contract and UAX #29 test suite on Android and iOS via package:integration_test.
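
The RegExp-based approach described for RegExpTokenizer can be illustrated with a minimal sketch. This is not the package's implementation: the function name wordLikeTokens and the exact pattern are assumptions made for illustration. It relies only on Dart's built-in RegExp with the unicode flag, which enables the \p{L} and \p{N} property escapes.

```dart
// Illustrative sketch only; `wordLikeTokens` is a hypothetical helper,
// not part of the betto_icu API.

// Matches maximal runs of Unicode letters (\p{L}) and numbers (\p{N}).
// The `unicode: true` flag is required for property escapes in Dart.
final RegExp _wordLike = RegExp(r'[\p{L}\p{N}]+', unicode: true);

/// Returns the word-like tokens in [text], discarding punctuation
/// and whitespace.
List<String> wordLikeTokens(String text) =>
    _wordLike.allMatches(text).map((m) => m.group(0)!).toList();

void main() {
  // Technical identifiers survive as single tokens.
  print(wordLikeTokens('mTLS failed: 0x8004210B!'));
  // prints: [mTLS, failed, 0x8004210B]
}
```

A pattern like this does not match combining marks (\p{M}), so scripts that depend on them can be split mid-word, and it has no dictionary for scripts written without spaces. That is consistent with the changelog positioning this tokenizer as a Latin fallback and IcuTokenizer as the choice for other scripts.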