DonutTokenizer class

Tokenizer for the Donut model.

Supports:

  • Loading vocabulary from tokenizer.json (HuggingFace format)
  • BPE tokenization
  • Special tokens (BOS, EOS, PAD, UNK, SEP)
  • Dynamic addition of task-specific special tokens
  • Encoding text to token IDs
  • Decoding token IDs back to text

Constructors

DonutTokenizer({required Map<String, int> vocab, required List<(String, String)> merges, String bosToken = '<s>', String eosToken = '</s>', String padToken = '<pad>', String unkToken = '<unk>', Set<String>? specialTokens})
DonutTokenizer.fromFile(String path)
Load tokenizer from a file path.
factory
DonutTokenizer.fromJson(String jsonString)
Load tokenizer from HuggingFace tokenizer.json format.
factory
DonutTokenizer.fromVocab(Map<String, int> vocab)
Create a simple tokenizer with just a vocabulary mapping.
factory
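
The default constructor takes a `vocab` map and an ordered `merges` list, and `fromJson` reads the same information from a HuggingFace tokenizer.json string. The sketch below shows one way such a file could be reduced to those two structures. It is not the library's implementation: `parseBpeModel` is a hypothetical helper, and it assumes the file holds a BPE model with `model.vocab`, `model.merges`, and an optional `added_tokens` list.

```dart
import 'dart:convert';

/// Hypothetical helper (not part of DonutTokenizer): extracts the vocabulary
/// and the ordered merge rules from a HuggingFace-style tokenizer.json string.
(Map<String, int>, List<(String, String)>) parseBpeModel(String jsonString) {
  final root = jsonDecode(jsonString) as Map<String, dynamic>;
  final model = root['model'] as Map<String, dynamic>;

  // "vocab" maps each token string to its integer ID.
  final vocab = (model['vocab'] as Map<String, dynamic>)
      .map((token, id) => MapEntry(token, id as int));

  // Each merge is either a "left right" string or a [left, right] pair,
  // depending on which version of the tokenizers library wrote the file.
  final merges = <(String, String)>[];
  final rawMerges = (model['merges'] as List<dynamic>?) ?? const <dynamic>[];
  for (final entry in rawMerges) {
    if (entry is String) {
      final space = entry.indexOf(' ');
      if (space <= 0) continue; // skip malformed entries
      merges.add((entry.substring(0, space), entry.substring(space + 1)));
    } else if (entry is List && entry.length == 2) {
      merges.add((entry[0] as String, entry[1] as String));
    }
  }

  // Tokens under "added_tokens" carry their own IDs; fold them into vocab.
  final added = (root['added_tokens'] as List<dynamic>?) ?? const <dynamic>[];
  for (final item in added) {
    final map = item as Map<String, dynamic>;
    vocab[map['content'] as String] = map['id'] as int;
  }
  return (vocab, merges);
}
```

The returned record can be destructured as `final (vocab, merges) = parseBpeModel(text);` and the two values passed to the default constructor's `vocab` and `merges` parameters.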

Properties

bosToken String
BOS token string.
getter/setter pair
bosTokenId int
BOS token ID.
getter/setter pair
eosToken String
EOS token string.
getter/setter pair
eosTokenId int
EOS token ID.
getter/setter pair
hashCode int
The hash code for this object.
no setter, inherited
idToToken Map<int, String>
ID to token mapping.
final
merges List<(String, String)>
BPE merge rules (ordered).
final
padToken String
PAD token string.
getter/setter pair
padTokenId int
PAD token ID.
getter/setter pair
runtimeType Type
A representation of the runtime type of the object.
no setter, inherited
specialTokens Set<String>
Set of special tokens.
final
unkToken String
UNK token string.
getter/setter pair
unkTokenId int
UNK token ID.
getter/setter pair
vocab Map<String, int>
Token to ID mapping.
final
vocabSize int
Vocabulary size.
no setter
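
The `merges` property is described as an ordered list of BPE merge rules. As a rough illustration of what the ordering means, the sketch below applies such a list to a single word: earlier rules have higher priority, and the best-ranked adjacent pair is merged until no rule applies. `applyMerges` is a hypothetical standalone function, not the code behind `tokenize`, and it ignores pre-tokenization, special tokens, and unknown-token handling.

```dart
/// Hypothetical sketch: applies ordered BPE merge rules to one word.
List<String> applyMerges(String word, List<(String, String)> merges) {
  // Rank each rule by its position; a lower rank means higher priority.
  final rank = <(String, String), int>{};
  for (var i = 0; i < merges.length; i++) {
    rank.putIfAbsent(merges[i], () => i);
  }

  // Start from single UTF-16 code units (sufficient for a simple sketch).
  final symbols = word.split('');

  while (symbols.length > 1) {
    // Find the adjacent pair with the lowest rank.
    int? bestIndex;
    var bestRank = merges.length;
    for (var i = 0; i < symbols.length - 1; i++) {
      final r = rank[(symbols[i], symbols[i + 1])];
      if (r != null && r < bestRank) {
        bestRank = r;
        bestIndex = i;
      }
    }
    if (bestIndex == null) break; // no applicable rule is left

    // Replace the pair with its concatenation.
    symbols[bestIndex] = symbols[bestIndex] + symbols[bestIndex + 1];
    symbols.removeAt(bestIndex + 1);
  }
  return symbols;
}
```

With the rules `[('l', 'o'), ('lo', 'w'), ('e', 'r')]`, the word `lower` is merged step by step into the two symbols `low` and `er`.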

Methods

addSpecialTokens(List<String> tokens) → int
Add special tokens to the vocabulary.
batchDecode(List<List<int>> batchIds, {bool skipSpecialTokens = true}) → List<String>
Decode multiple sequences of token IDs back to text.
decode(List<int> ids, {bool skipSpecialTokens = true}) → String
Decode a list of token IDs back to text.
encode(String text, {bool addSpecialTokens = true, int? maxLength}) → List<int>
Encode text to a list of token IDs.
getAddedVocab() → Map<String, int>
Get all added vocabulary entries (special tokens that were dynamically added).
noSuchMethod(Invocation invocation) → dynamic
Invoked when a nonexistent method or property is accessed.
inherited
pad(List<List<int>> sequences, {int? maxLength, bool padToMaxLength = false}) → (List<List<int>>, List<List<int>>)
Pad a list of token sequences to the same length.
save(String path) → void
Save tokenizer to a file.
toJson() → Map<String, dynamic>
Convert tokenizer to a JSON-serializable map.
tokenize(String text) → List<String>
Tokenize text into a list of token strings using BPE.
toString() → String
A string representation of this object.
inherited
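
The `pad` method returns a record of two nested lists. The page does not say what each element holds; a common convention is padded IDs followed by attention masks, and the sketch below illustrates that convention only. `padSequences` and its `padId` parameter are hypothetical, the `padToMaxLength` flag is not modelled, and truncating sequences longer than `maxLength` is an assumption of this sketch.

```dart
/// Hypothetical sketch: pads sequences to a common length and builds
/// attention masks (1 for real tokens, 0 for padding).
(List<List<int>>, List<List<int>>) padSequences(
  List<List<int>> sequences, {
  required int padId,
  int? maxLength,
}) {
  // Target length: the explicit maxLength, otherwise the longest sequence.
  final longest =
      sequences.fold<int>(0, (m, s) => s.length > m ? s.length : m);
  final target = maxLength ?? longest;

  final padded = <List<int>>[];
  final masks = <List<int>>[];
  for (final seq in sequences) {
    // Assumption: sequences longer than the target are truncated.
    final kept = seq.length > target ? seq.sublist(0, target) : seq;
    final fill = target - kept.length;
    padded.add([...kept, ...List.filled(fill, padId)]);
    masks.add([...List.filled(kept.length, 1), ...List.filled(fill, 0)]);
  }
  return (padded, masks);
}
```

For example, `padSequences([[5, 6, 7], [8]], padId: 1)` yields the padded IDs `[[5, 6, 7], [8, 1, 1]]` and the masks `[[1, 1, 1], [1, 0, 0]]`.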

Operators

operator ==(Object other) → bool
The equality operator.
inherited