dart_wordpiece 1.1.0

Pure Dart BERT-compatible WordPiece tokenizer. Outputs input_ids, attention_mask, and token_type_ids. No Flutter or native dependencies.

dart_wordpiece #

A pure Dart implementation of the BERT-compatible WordPiece tokenizer.

Converts raw text into the three integer sequences expected by BERT-style ONNX models — input_ids, attention_mask, and token_type_ids — with zero dependencies and no Flutter requirement.


Features #

  • WordPiece algorithm — greedy longest-match-first subword encoding
  • Single & pair encoding — encode() and encodePair() for QA / NLI tasks
  • Batch encoding — encodeAll(List<String>)
  • Text normalization — lowercase, punctuation removal, stopword filtering
  • Three vocab loading strategies — from File, from String, from Map
  • Configurable — max length, special tokens, stopwords, normalization toggle
  • ONNX-ready — Int64List getters for direct tensor creation
  • Pure Dart — works in Flutter, CLI, and server-side Dart

Getting started #

Add to your pubspec.yaml:

dependencies:
  dart_wordpiece: ^1.1.0

Usage #

1. Load vocabulary #

import 'dart:io';

import 'package:dart_wordpiece/dart_wordpiece.dart';

// From a file (dart:io)
final vocab = await VocabLoader.fromFile(File('/path/to/vocab.txt'));

// From a Flutter asset string (rootBundle comes from package:flutter/services.dart)
final raw = await rootBundle.loadString('assets/vocab.txt');
final vocab = VocabLoader.fromString(raw);

// From an in-memory map (useful for tests)
final vocab = VocabLoader.fromMap({'[PAD]': 0, '[UNK]': 1, ...});

2. Create tokenizer #

// Default BERT configuration (maxLength=64, no stopwords)
final tokenizer = WordPieceTokenizer(vocab: vocab);

// Custom configuration
final tokenizer = WordPieceTokenizer(
  vocab: vocab,
  config: TokenizerConfig(
    maxLength: 128,
    stopwords: {'what', 'is', 'the', 'a', 'an'},
    normalizeText: true,   // lowercase + remove punctuation
  ),
);

3. Encode a single sequence #

final output = tokenizer.encode('What is Flutter?');

print(output.inputIds);      // [101, 2054, 2003, 14246, 2102, 1029, 102, 0, …]
print(output.attentionMask); // [1, 1, 1, 1, 1, 1, 1, 0, …]
print(output.tokenTypeIds);  // [0, 0, 0, 0, 0, 0, 0, 0, …]
print(output.realLength);    // 7  (non-padding positions)

4. Encode a sentence pair #

Use for BERT-based question answering or natural language inference:

final output = tokenizer.encodePair(
  'Flutter is a cross-platform UI toolkit.',  // segment A
  'What is Flutter?',                          // segment B
);

// Format:  [CLS] <A> [SEP] <B> [SEP] [PAD]…
// typeIds:   0    0    0    1    1    0…

5. Batch encoding #

final outputs = tokenizer.encodeAll([
  'What is Flutter?',
  'Dart is fast',
  'How to use isolates?',
]);
// Each output has the same length (maxLength) → stack into a batch tensor.
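
Because every output shares the same length, the per-sequence ID lists can be flattened row-major into one buffer that backs a [N, maxLength] tensor. A minimal sketch of that stacking step — stackRows is a hypothetical helper, standing in for something like `outputs.map((o) => o.inputIdsInt64)`:

```dart
import 'dart:typed_data';

// Flatten N fixed-length rows into one Int64List, row-major, so it can
// back a [N, maxLength] batch tensor. Hypothetical helper, not part of
// the package API.
Int64List stackRows(List<List<int>> rows) {
  final rowLen = rows.first.length;
  final flat = Int64List(rows.length * rowLen);
  for (var i = 0; i < rows.length; i++) {
    assert(rows[i].length == rowLen, 'all rows must share the same length');
    flat.setAll(i * rowLen, rows[i]);
  }
  return flat;
}
```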

6. Inspect raw token strings #

tokenizer.tokenize('unaffable');
// → ['[CLS]', 'un', '##aff', '##able', '[SEP]']

7. Feed directly to an ONNX model #

// package:onnxruntime integration
final out = tokenizer.encode(query);
final inputs = {
  'input_ids':      OrtValueTensor.createTensorWithDataList(out.inputIdsInt64,       [1, out.length]),
  'attention_mask': OrtValueTensor.createTensorWithDataList(out.attentionMaskInt64,  [1, out.length]),
  'token_type_ids': OrtValueTensor.createTensorWithDataList(out.tokenTypeIdsInt64,   [1, out.length]),
};
final results = session.run(OrtRunOptions(), inputs);

API reference #

WordPieceTokenizer #

| Member | Description |
| --- | --- |
| `WordPieceTokenizer({vocab, config})` | Main constructor |
| `WordPieceTokenizer.fromFile(file, {config})` | Async factory from a `dart:io` `File` |
| `WordPieceTokenizer.fromString(content, {config})` | Sync factory from a `String` |
| `encode(text)` → `TokenizerOutput` | Encode a single sequence |
| `encodePair(textA, textB)` → `TokenizerOutput` | Encode a sentence pair |
| `encodeAll(texts)` → `List<TokenizerOutput>` | Batch encode |
| `tokenize(text)` → `List<String>` | Raw token strings (no padding) |
| `tokenToId(token)` → `int?` | Look up a token ID |
| `idToToken(id)` → `String?` | Look up a token string (O(1)) |
| `vocabSize` | Number of tokens in the vocabulary |

TokenizerOutput #

| Member | Type | Description |
| --- | --- | --- |
| `inputIds` | `List<int>` | Vocabulary IDs |
| `attentionMask` | `List<int>` | 1 = real token, 0 = padding |
| `tokenTypeIds` | `List<int>` | 0 = segment A, 1 = segment B |
| `length` | `int` | Always equals `maxLength` |
| `realLength` | `int` | Non-padding positions |
| `inputIdsInt64` | `Int64List` | Ready for an ONNX tensor |
| `attentionMaskInt64` | `Int64List` | Ready for an ONNX tensor |
| `tokenTypeIdsInt64` | `Int64List` | Ready for an ONNX tensor |

TokenizerConfig #

| Parameter | Default | Description |
| --- | --- | --- |
| `maxLength` | `64` | Output sequence length (includes special tokens) |
| `specialTokens` | `SpecialTokens.bert()` | `[CLS]`, `[SEP]`, `[PAD]`, `[UNK]`, `##` |
| `stopwords` | `{}` | Words removed before tokenization |
| `normalizeText` | `true` | Lowercase + remove punctuation |

VocabLoader #

| Method | Description |
| --- | --- |
| `VocabLoader.fromFile(File)` | Async load from a file |
| `VocabLoader.fromString(String)` | Sync parse from vocab text |
| `VocabLoader.fromMap(Map)` | Wrap a pre-built map |
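
The vocab.txt format is one token per line, with the token's ID given by its zero-based line index. A sketch of the parsing that VocabLoader.fromString presumably performs (illustrative only; the package's handling of blank lines may differ):

```dart
// Parse HuggingFace-style vocab.txt content: one token per line,
// ID = zero-based line index. Illustrative sketch, not the package's code.
Map<String, int> parseVocab(String content) {
  final vocab = <String, int>{};
  var id = 0;
  for (final line in content.split('\n')) {
    final token = line.trim();
    if (token.isEmpty) continue; // skip a trailing newline / blank lines
    vocab[token] = id++;
  }
  return vocab;
}
```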

WordPiece algorithm #

  1. Split text into whitespace-delimited words.
  2. For each word, find the longest prefix present in the vocabulary.
  3. Emit the prefix as a token; prepend ## to the remaining suffix and repeat.
  4. If no single-character match exists, emit [UNK] for the whole word.
"unaffable"  →  ["un", "##aff", "##able"]
"playing"    →  ["play", "##ing"]
"xyz123"     →  ["[UNK]"]   (if none of the sub-strings are in vocab)

Where to get vocab.txt #

Download pre-trained BERT vocabularies from HuggingFace:

| Model | Link | Language | Size |
| --- | --- | --- | --- |
| BERT Base Uncased | vocab.txt | English | ~30K tokens |
| BERT Base Cased | vocab.txt | English (cased) | ~28K tokens |
| BERT Multilingual | vocab.txt | 104 languages | ~120K tokens |
| RuBERT (Russian) | vocab.txt | Russian | ~119K tokens |
| DistilBERT | vocab.txt | English | ~30K tokens |

Quick download (in a terminal):

curl -o vocab.txt https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt

For Flutter apps, place vocab.txt in assets/ and update pubspec.yaml:

flutter:
  assets:
    - assets/vocab.txt

Multilingual support #

The tokenizer works with any language — just use a multilingual vocab:

// Load multilingual BERT vocab
final raw = await rootBundle.loadString('assets/multilingual_vocab.txt');
final vocab = VocabLoader.fromString(raw);
final tokenizer = WordPieceTokenizer(vocab: vocab);

// Works with English, Chinese, Russian, Arabic, etc.
tokenizer.encode('Hello World');  // English
tokenizer.encode('你好世界');      // Chinese
tokenizer.encode('Привет мир');  // Russian
tokenizer.encode('مرحبا بالعالم');  // Arabic

The TextNormalizer uses Unicode-aware regex (\p{L}, \p{N}), so punctuation removal and letter detection work correctly for all writing systems.
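
A sketch of what such Unicode-aware normalization can look like (hypothetical helper; the package's TextNormalizer may differ in detail):

```dart
// Lowercase, then keep only letters (\p{L}), digits (\p{N}), and whitespace.
// Works across scripts because the character classes are Unicode-aware.
String normalize(String text) {
  final keep = RegExp(r'[\p{L}\p{N}\s]', unicode: true);
  return text
      .toLowerCase()
      .split('')
      .where((ch) => keep.hasMatch(ch))
      .join();
}
```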


Compatibility #

Compatible with vocabularies from:

  • bert-base-uncased
  • bert-base-cased
  • distilbert-base-uncased
  • bert-base-multilingual-cased
  • cointegrated/rubert-tiny2
  • Any model that follows the HuggingFace vocab.txt format

Performance #

  • Single tokenization: ~1–5 ms (on modern hardware)
  • Batch of 100 queries: ~50–200 ms
  • Memory footprint: Vocab size × ~8 bytes (e.g., 30K tokens = ~240 KB)
  • No external dependencies — pure Dart with optimized regex patterns

Contributing #

Issues and pull requests are welcome! Please:

  1. Run dart analyze and ensure no warnings
  2. Run dart test and ensure all tests pass
  3. Follow the Dart style guide

License #

MIT

Published on pub.dev by verified publisher korshun.xyz.