Tiktoken Tokenizer for GPT-4o, GPT-4, and o1 OpenAI models #

This is an implementation of the Tiktoken tokenizer, a byte-pair encoding (BPE) tokenizer used by OpenAI's models. It is a partial Dart port of the original tiktoken library from OpenAI, with a friendlier API.

Although there are other tokenizers available on pub.dev, as of November 2024, none of them support the GPT-4o and o1 model families. This package was created to fill that gap.

The supported models are these:

GPT-4
GPT-4o
GPT-4o-mini
o1
o1-mini
o1-preview

This is also a Dart-only package (it does not require any platform channels to work), and tokenization is done synchronously.

Splitting text strings into tokens is useful because GPT models see text in the form of tokens. Knowing how many tokens are in a text string can tell you:

Whether the text is too long for a text model to process.
How much an OpenAI API call costs (as usage is priced by token).
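
As a rough illustration of these two checks, here is a minimal sketch in plain Dart. The helper names `fitsInContext` and `estimateCostUsd`, the context limit, and the price are all hypothetical and not part of this package; in practice the token count could come from this package's `count` method, and the limit and price from OpenAI's documentation.

```dart
/// Returns true when [tokenCount] fits within [contextLimit] tokens.
/// (Hypothetical helper, not part of this package.)
bool fitsInContext(int tokenCount, int contextLimit) =>
    tokenCount <= contextLimit;

/// Estimated cost in US dollars of [tokenCount] tokens, given a price
/// per one million tokens. (Hypothetical helper, not part of this package.)
double estimateCostUsd(int tokenCount, double usdPerMillionTokens) =>
    tokenCount / 1000000 * usdPerMillionTokens;

void main() {
  // Illustrative values only: substitute a real token count, the real
  // context limit of your model, and the current price.
  const tokenCount = 1500;
  const contextLimit = 8192;
  const usdPerMillionTokens = 10.0;

  print(fitsInContext(tokenCount, contextLimit)); // true
  print(estimateCostUsd(tokenCount, usdPerMillionTokens)
      .toStringAsFixed(4)); // 0.0150
}
```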

Example #

To see it in action, run the example app.

How to use it #

// Import the package (assuming the conventional library path for this package name).
import 'package:tiktoken_tokenizer_gpt4o_o1/tiktoken_tokenizer_gpt4o_o1.dart';

// Create a Tiktoken instance for the model you want to use.
var tiktoken = Tiktoken(OpenAiModel.gpt_4);

// Encode a text string into tokens.
var encoded = tiktoken.encode("hello world");

// Decode the tokens back into text.
var decoded = tiktoken.decode(encoded);

// Count the number of tokens in a text string.
int numberOfTokens = tiktoken.count("hello world");

Advanced use #

Alternatively, you can use the static helper functions getEncoder and getEncoderForModel to get a TiktokenEncoder instance first:

// Get an encoder by encoding type...
var encoder = Tiktoken.getEncoder(TiktokenEncodingType.o200k_base);

// ...or get the encoder that a given model uses.
var encoderForModel = Tiktoken.getEncoderForModel(OpenAiModel.gpt_4o);

The TiktokenEncoder instance gives you more fine-grained control over the encoding process, as you now have access to more advanced methods:

Uint32List encode(
  String text, {
  SpecialTokensSet allowedSpecial = const SpecialTokensSet.empty(),
  SpecialTokensSet disallowedSpecial = const SpecialTokensSet.all(),
});

Uint32List encodeOrdinary(String text);

(List<int>, Set<List<int>>) encodeWithUnstable(
  String text, {
  SpecialTokensSet allowedSpecial = const SpecialTokensSet.empty(),
  SpecialTokensSet disallowedSpecial = const SpecialTokensSet.all(),
});

int encodeSingleToken(List<int> bytes);

Uint8List decodeBytes(List<int> tokens);

String decode(List<int> tokens, {bool allowMalformed = true});

Uint8List decodeSingleTokenBytes(int token);

List<Uint8List> decodeTokenBytes(List<int> tokens);

int? get eotToken;

Online tokenizer #

I've added many tests to make sure this Dart implementation is correct, but you can also compare the output of this package yourself with the output of the original implementation by visiting the online Tiktokenizer.

Counting words #

What's the relationship between words and tokens? Every language has a different word-to-token ratio. Here are a few general rules:

For English: 1 word is about 1.3 tokens
For Spanish and French: 1 word is about 2 tokens

How many tokens are punctuation marks, special characters, and emojis? Each punctuation mark (like ,:;?!) counts as 1 token. Special characters (like ∝√∅°¬) range from 1 to 3 tokens, and emojis (like 😁🙂🤩) range from 2 to 3 tokens.
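
The rules of thumb above can be turned into a quick estimate. The following is a minimal sketch in plain Dart; the `tokensPerWord` map, its language codes, and the `estimateTokens` function are hypothetical names used for illustration and are not part of this package. The result is only an approximation, and counting real tokens with the tokenizer is always more accurate.

```dart
/// Approximate tokens per word, taken from the rules of thumb above.
/// The language codes are illustrative keys, not a package API.
const tokensPerWord = <String, double>{
  'en': 1.3,
  'es': 2.0,
  'fr': 2.0,
};

/// Estimates the number of tokens for [wordCount] words in the given
/// language, rounded to the nearest whole token.
int estimateTokens(int wordCount, String languageCode) {
  final ratio = tokensPerWord[languageCode];
  if (ratio == null) {
    throw ArgumentError('No ratio known for language "$languageCode"');
  }
  return (wordCount * ratio).round();
}

void main() {
  print(estimateTokens(100, 'en')); // 130
  print(estimateTokens(100, 'es')); // 200
}
```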

In this package I provide a word counter. Here is how you can use it:

var wordCounter = WordCounter();

// Prints 0
print(wordCounter.count(''));

// Prints 1
print(wordCounter.count('hello'));

// Prints 2
print(wordCounter.count('hello world!'));

Counting words is complex because each language has its own rules for what constitutes a word. For this reason, the provided word counter is only an approximation and will give reasonable results only for languages written in the Latin alphabet.
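
To illustrate why such a counter is limited to Latin-script languages, here is a minimal sketch of one possible approach in plain Dart. This is not the implementation used by this package's WordCounter; the `approximateWordCount` function and its regular expression are assumptions made for illustration. It treats every run of Latin letters, digits, and apostrophes as one word, so text in scripts that do not separate words this way is not counted correctly.

```dart
/// Matches runs of ASCII letters, digits, apostrophes, and accented
/// Latin-1 letters (U+00C0 to U+00FF, excluding the × and ÷ signs).
final _latinWord =
    RegExp(r"[A-Za-z0-9\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF']+");

/// Counts word-like runs in [text]. An approximation for Latin-script
/// languages only; hypothetical helper, not the package's WordCounter.
int approximateWordCount(String text) => _latinWord.allMatches(text).length;

void main() {
  print(approximateWordCount('')); // 0
  print(approximateWordCount('hello world!')); // 2
  print(approximateWordCount('café au lait')); // 3

  // Japanese text contains no Latin letters, so nothing is counted.
  print(approximateWordCount('こんにちは世界')); // 0
}
```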

Credits #

The code in this package was mostly adapted from https://pub.dev/packages/langchain_tiktoken, by publisher dragonx.cloud. I added more encodings and tests, and made the API more user-friendly.

By Marcelo Glasberg

glasberg.dev
github.com/marcglasberg
linkedin.com/in/marcglasberg/
twitter.com/glasbergmarcelo
stackoverflow.com/users/3411681/marcg
medium.com/@marcglasberg

The Flutter packages I've authored:

My Medium Articles:

Async Redux: Flutter’s non-boilerplate version of Redux (versions: Português)
i18n_extension (versions: Português)
Flutter: The Advanced Layout Rule Even Beginners Must Know (versions: русский)
The New Way to create Themes in your Flutter App

My article in the official Flutter documentation:

Understanding constraints

tiktoken_tokenizer_gpt4o_o1 1.0.2