encode method
Uint32List encode(
  String text, {
  SpecialTokensSet allowedSpecial = const SpecialTokensSet.empty(),
  SpecialTokensSet disallowedSpecial = const SpecialTokensSet.all(),
})
Encodes a string into tokens.
Special tokens are artificial tokens used to unlock capabilities from a model, such as fill-in-the-middle. Because they can be used to trick a model into doing something we don't want it to do, we need to be careful not to encode them accidentally.
Hence, by default, encode raises an error if it encounters text that corresponds to a special token. This can be controlled on a per-token level using the allowedSpecial and disallowedSpecial parameters. In particular:
- Setting disallowedSpecial to SpecialTokensSet.empty() prevents this function from raising errors and causes all text corresponding to special tokens to be encoded as natural text.
- Setting allowedSpecial to SpecialTokensSet.all() causes all text corresponding to special tokens to be encoded as special tokens.
Example:
final enc = getEncoding("gpt2"); // Get instance of encoder
enc.encode("hello world"); // [31373, 995]
enc.encode("<|endoftext|>", allowedSpecial: SpecialTokensSet.custom({"<|endoftext|>"})); // [50256]
enc.encode("<|endoftext|>", allowedSpecial: SpecialTokensSet.all()); // [50256]
enc.encode("<|endoftext|>"); // Throws
enc.encode("<|endoftext|>", disallowedSpecial: SpecialTokensSet.empty()); // [27, 91, 437, 1659, 5239, 91, 29]
Implementation
Uint32List encode(
  String text, {
  SpecialTokensSet allowedSpecial = const SpecialTokensSet.empty(),
  SpecialTokensSet disallowedSpecial = const SpecialTokensSet.all(),
}) {
  final allowedSpecialSet =
      allowedSpecial.isAll ? specialTokensSet : allowedSpecial.set;
  final disallowedSpecialSet = disallowedSpecial.isAll
      ? specialTokensSet.difference(allowedSpecialSet)
      : disallowedSpecial.set;
  _verifyDisallowed(text, disallowedSpecialSet);
  return _coreBPE.encodeNative(text, allowedSpecialSet).i1;
}
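The way the two parameters interact in the implementation above can be modelled with plain Dart sets. The following is a minimal sketch, not the package's code: resolveAllowed, resolveDisallowed and firstDisallowed are hypothetical helpers, null stands in for SpecialTokensSet.all(), and the substring scan is only an assumption about what _verifyDisallowed might check.

```dart
/// Hypothetical stand-in for SpecialTokensSet: `null` plays the role of
/// `SpecialTokensSet.all()`, any other value is an explicit set.
Set<String> resolveAllowed(Set<String> known, Set<String>? allowed) =>
    allowed ?? known;

/// "All" disallowed means every known special token that is not allowed.
Set<String> resolveDisallowed(
  Set<String> known,
  Set<String> allowedSet,
  Set<String>? disallowed,
) =>
    disallowed ?? known.difference(allowedSet);

/// Returns the first disallowed token found in [text], or null if none.
/// Assumption: a plain substring scan; the real check may work differently.
String? firstDisallowed(String text, Set<String> disallowedSet) {
  for (final token in disallowedSet) {
    if (text.contains(token)) return token;
  }
  return null;
}

void main() {
  const known = {'<|endoftext|>'};
  const text = 'hi <|endoftext|>';

  // Defaults: allowed is empty, disallowed is "all".
  var allowed = resolveAllowed(known, const <String>{});
  var disallowed = resolveDisallowed(known, allowed, null);
  print(firstDisallowed(text, disallowed)); // <|endoftext|>

  // allowedSpecial = all: nothing is left to disallow.
  allowed = resolveAllowed(known, null);
  disallowed = resolveDisallowed(known, allowed, null);
  print(firstDisallowed(text, disallowed)); // null

  // disallowedSpecial = empty: the disallowed set is empty, so nothing matches.
  allowed = resolveAllowed(known, const <String>{});
  disallowed = resolveDisallowed(known, allowed, const <String>{});
  print(firstDisallowed(text, disallowed)); // null
}
```

The point of the sketch is that allowing a token implicitly removes it from the default disallowed set, which is why passing only allowedSpecial is enough to stop encode from throwing for that token.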