DonutTokenizer class
Tokenizer for the Donut model.
Supports:
- Loading vocabulary from tokenizer.json (HuggingFace format)
- BPE tokenization
- Special tokens (BOS, EOS, PAD, UNK, SEP)
- Dynamic addition of task-specific special tokens
- Encoding text to token IDs
- Decoding token IDs back to text
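As a rough illustration of how ordered BPE merge rules of type `List<(String, String)>` can be applied, here is a minimal, self-contained sketch. The helper `applyBpeMerges` is hypothetical and is not part of `DonutTokenizer`. It assumes the word has already been isolated and splits it into UTF-16 code units, whereas the actual tokenizer may pre-tokenize or normalize text differently.

```dart
/// Applies ordered BPE merge rules to a single word.
///
/// Hypothetical helper for illustration only; not part of DonutTokenizer.
List<String> applyBpeMerges(String word, List<(String, String)> merges) {
  // Rank each merge rule by position: earlier rules have higher priority.
  final ranks = <(String, String), int>{};
  for (var i = 0; i < merges.length; i++) {
    ranks.putIfAbsent(merges[i], () => i);
  }

  // Start from single UTF-16 code units (a simplification).
  var symbols = word.split('');

  while (symbols.length > 1) {
    // Find the adjacent pair with the lowest (best) rank.
    int? bestRank;
    var bestIndex = -1;
    for (var i = 0; i < symbols.length - 1; i++) {
      final rank = ranks[(symbols[i], symbols[i + 1])];
      if (rank != null && (bestRank == null || rank < bestRank)) {
        bestRank = rank;
        bestIndex = i;
      }
    }
    if (bestIndex < 0) break; // No applicable merge rule remains.

    // Merge the chosen pair into a single symbol.
    symbols = [
      ...symbols.sublist(0, bestIndex),
      symbols[bestIndex] + symbols[bestIndex + 1],
      ...symbols.sublist(bestIndex + 2),
    ];
  }
  return symbols;
}
```

With the merge rules `[('l', 'o'), ('lo', 'w'), ('e', 'r')]`, the word `lower` reduces to the two symbols `low` and `er`.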
Constructors
- DonutTokenizer({required Map<String, int> vocab, required List<(String, String)> merges, String bosToken = '<s>', String eosToken = '</s>', String padToken = '<pad>', String unkToken = '<unk>', Set<String>? specialTokens})
- DonutTokenizer.fromFile(String path)
-
Load tokenizer from a file path.
factory
- DonutTokenizer.fromJson(String jsonString)
-
Load tokenizer from HuggingFace tokenizer.json format.
factory
- DonutTokenizer.fromVocab(Map<String, int> vocab)
-
Create a simple tokenizer with just a vocabulary mapping.
factory
Properties
- bosToken ↔ String
-
BOS token string.
getter/setter pair
- bosTokenId ↔ int
-
ID of the BOS token.
getter/setter pair
- eosToken ↔ String
-
EOS token string.
getter/setter pair
- eosTokenId ↔ int
-
ID of the EOS token.
getter/setter pair
- hashCode → int
-
The hash code for this object.
no setter, inherited
- idToToken → Map<int, String>
-
ID to token mapping.
final
- merges → List<(String, String)>
-
BPE merge rules (ordered).
final
- padToken ↔ String
-
PAD token string.
getter/setter pair
- padTokenId ↔ int
-
ID of the PAD token.
getter/setter pair
- runtimeType → Type
-
A representation of the runtime type of the object.
no setter, inherited
- specialTokens → Set<String>
-
Set of special tokens.
final
- unkToken ↔ String
-
UNK token string.
getter/setter pair
- unkTokenId ↔ int
-
ID of the UNK token.
getter/setter pair
- vocab → Map<String, int>
-
Token to ID mapping.
final
- vocabSize → int
-
Vocabulary size.
no setter
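The `vocab` and `idToToken` maps are inverses of each other. The sketch below shows one way such mappings can be built and used for ID lookup with an unknown-token fallback. All three helpers (`invertVocab`, `tokensToIds`, `idsToTokens`) are hypothetical names introduced for illustration, and the fallback and skipping behavior shown here is an assumption rather than the documented behavior of `DonutTokenizer`.

```dart
/// Builds the inverse (ID to token) mapping from a token-to-ID vocabulary.
/// Hypothetical helper; not part of DonutTokenizer.
Map<int, String> invertVocab(Map<String, int> vocab) =>
    {for (final entry in vocab.entries) entry.value: entry.key};

/// Looks up token IDs, falling back to [unkId] for tokens not in [vocab].
List<int> tokensToIds(List<String> tokens, Map<String, int> vocab, int unkId) =>
    [for (final t in tokens) vocab[t] ?? unkId];

/// Maps IDs back to tokens, optionally dropping special tokens.
List<String> idsToTokens(
  List<int> ids,
  Map<int, String> idToToken,
  Set<String> specialTokens, {
  bool skipSpecialTokens = true,
  String unkToken = '<unk>',
}) {
  final out = <String>[];
  for (final id in ids) {
    // Unknown IDs are rendered as the UNK token (an assumption).
    final token = idToToken[id] ?? unkToken;
    if (skipSpecialTokens && specialTokens.contains(token)) continue;
    out.add(token);
  }
  return out;
}
```

In this sketch, a token missing from the vocabulary maps to the UNK ID, and special tokens (including UNK itself, if it is in the set) are dropped on the way back when `skipSpecialTokens` is true.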
Methods
- addSpecialTokens(List<String> tokens) → int
-
Add special tokens to the vocabulary.
- batchDecode(List<List<int>> batchIds, {bool skipSpecialTokens = true}) → List<String>
-
Batch decode: decode multiple sequences.
- decode(List<int> ids, {bool skipSpecialTokens = true}) → String
-
Decode a list of token IDs back to text.
- encode(String text, {bool addSpecialTokens = true, int? maxLength}) → List<int>
-
Encode text to a list of token IDs.
- getAddedVocab() → Map<String, int>
-
Get all added vocabulary entries (special tokens that were dynamically added).
- noSuchMethod(Invocation invocation) → dynamic
-
Invoked when a nonexistent method or property is accessed.
inherited
- pad(List<List<int>> sequences, {int? maxLength, bool padToMaxLength = false}) → (List<List<int>>, List<List<int>>)
-
Pad a list of token sequences to the same length.
- save(String path) → void
-
Save tokenizer to a file.
- toJson() → Map<String, dynamic>
-
Convert tokenizer to a JSON-serializable map.
- tokenize(String text) → List<String>
-
Tokenize text into a list of token strings using BPE.
- toString() → String
-
A string representation of this object.
inherited
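The `pad` method returns a record of two nested lists. This reference does not say what the second list holds. The sketch below assumes it is an attention mask (1 for real tokens, 0 for padding), which is a common convention but is not confirmed here. The function `padSequences` and its `padId` parameter are hypothetical, truncation to `maxLength` is an assumption, and the `padToMaxLength` flag is not modeled.

```dart
/// Pads sequences to a common length and builds matching masks.
///
/// Hypothetical helper for illustration only; not part of DonutTokenizer.
(List<List<int>>, List<List<int>>) padSequences(
  List<List<int>> sequences, {
  required int padId,
  int? maxLength,
}) {
  // Target length: the caller's maxLength if given, else the longest sequence.
  var target = maxLength ?? 0;
  if (maxLength == null) {
    for (final s in sequences) {
      if (s.length > target) target = s.length;
    }
  }

  final padded = <List<int>>[];
  final masks = <List<int>>[];
  for (final s in sequences) {
    // Truncate sequences that exceed the target length (an assumption).
    final kept = s.length > target ? s.sublist(0, target) : s;
    final fill = target - kept.length;
    padded.add([...kept, ...List.filled(fill, padId)]);
    // Assumed mask convention: 1 marks a real token, 0 marks padding.
    masks.add([...List.filled(kept.length, 1), ...List.filled(fill, 0)]);
  }
  return (padded, masks);
}
```

The record can be destructured at the call site, for example `final (ids, masks) = padSequences([[5, 6, 7], [8]], padId: 0);`, after which every inner list in `ids` and `masks` has length 3.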
Operators
- operator ==(Object other) → bool
-
The equality operator.
inherited