encode method

Uint32List encode(
  String text, {
  SpecialTokensSet allowedSpecial = const SpecialTokensSet.empty(),
  SpecialTokensSet disallowedSpecial = const SpecialTokensSet.all(),
})

Encodes a string into tokens.

Special tokens are artificial tokens used to unlock capabilities from a model, such as fill-in-the-middle. Because they can be used to trick a model into doing something unintended, ordinary text that happens to match a special token should not be encoded as one by accident.

Hence, by default, encode throws an error if it encounters text that corresponds to a special token. This can be controlled on a per-token level using the allowedSpecial and disallowedSpecial parameters. In particular:

  • Setting disallowedSpecial to SpecialTokensSet.empty() prevents this function from throwing and causes all text corresponding to special tokens to be encoded as natural text.
  • Setting allowedSpecial to SpecialTokensSet.all() causes all text corresponding to special tokens to be encoded as special tokens.
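
The interaction between the two parameters can be modeled without the library. The following is a minimal sketch, not the package's code: `specialTokens`, `Outcome`, and `classify` are hypothetical names, plain `Set<String>` values stand in for SpecialTokensSet, and `null` is used here to mean "all". It only reports what would happen to a piece of text that matches a special token.

```dart
// Hypothetical stand-in for the encoder's special-token vocabulary.
const specialTokens = {'<|endoftext|>', '<|fim_prefix|>'};

/// What happens to text that matches a special token.
enum Outcome { throwsError, encodedAsSpecial, encodedAsText }

/// Models the default of "allow none, disallow all".
/// Assumption: `null` plays the role of SpecialTokensSet.all().
Outcome classify(
  String token, {
  Set<String>? allowed = const <String>{},
  Set<String>? disallowed,
}) {
  final allowedSet = allowed ?? specialTokens;
  // "Disallow all" means every special token that was not explicitly allowed.
  final disallowedSet = disallowed ?? specialTokens.difference(allowedSet);

  if (disallowedSet.contains(token)) return Outcome.throwsError;
  if (allowedSet.contains(token)) return Outcome.encodedAsSpecial;
  return Outcome.encodedAsText;
}

void main() {
  const eot = '<|endoftext|>';
  print(classify(eot)); // Outcome.throwsError
  print(classify(eot, allowed: {eot})); // Outcome.encodedAsSpecial
  print(classify(eot, allowed: null)); // Outcome.encodedAsSpecial
  print(classify(eot, disallowed: <String>{})); // Outcome.encodedAsText
}
```

The disallowed check runs first, so in this model a token listed in both sets is rejected rather than encoded.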

Example:

final enc = getEncoding("gpt2"); // Get instance of encoder
enc.encode("hello world"); // [31373, 995]
enc.encode("<|endoftext|>", allowedSpecial: SpecialTokensSet.custom({"<|endoftext|>"})); // [50256]
enc.encode("<|endoftext|>", allowedSpecial: SpecialTokensSet.all()); // [50256]
enc.encode("<|endoftext|>"); // Throws
enc.encode("<|endoftext|>", disallowedSpecial: SpecialTokensSet.empty()); // [27, 91, 437, 1659, 5239, 91, 29]

Implementation

Uint32List encode(
  String text, {
  SpecialTokensSet allowedSpecial = const SpecialTokensSet.empty(),
  SpecialTokensSet disallowedSpecial = const SpecialTokensSet.all(),
}) {
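  // Expand "all" into the encoder's full set of special tokens.
  final allowedSpecialSet =
      allowedSpecial.isAll ? specialTokensSet : allowedSpecial.set;

  // "Disallow all" means every special token that is not explicitly allowed.
  final disallowedSpecialSet = disallowedSpecial.isAll
      ? specialTokensSet.difference(allowedSpecialSet)
      : disallowedSpecial.set;

  // Throws if the text contains any disallowed special token.
  _verifyDisallowed(text, disallowedSpecialSet);

  // Allowed special tokens are encoded as such; all other text is encoded as natural text.
  return _coreBPE.encodeNative(text, allowedSpecialSet).i1;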
}
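
The body of `_verifyDisallowed` is not shown on this page. As a rough sketch of what such a check could look like, assuming it simply scans the text for each disallowed token: `verifyDisallowedSketch` is a hypothetical name, and `ArgumentError` is an assumption, since the library may throw a different type.

```dart
/// Hypothetical stand-in for the private `_verifyDisallowed` helper.
/// Throws if [text] contains any token from [disallowed].
void verifyDisallowedSketch(String text, Set<String> disallowed) {
  for (final token in disallowed) {
    if (text.contains(token)) {
      // ArgumentError is an assumption; the real helper may throw another type.
      throw ArgumentError.value(
        text,
        'text',
        'contains disallowed special token $token',
      );
    }
  }
}

void main() {
  const special = {'<|endoftext|>'};

  // Text without special tokens passes silently.
  verifyDisallowedSketch('hello world', special);

  try {
    verifyDisallowedSketch('hi <|endoftext|>', special);
  } on ArgumentError catch (e) {
    print('rejected: ${e.message}');
  }
}
```

Because the check runs before encoding, a disallowed token anywhere in the input rejects the whole call instead of producing a partial result.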