tiktoken_tokenizer_gpt4o_o1 1.1.0

OpenAI's Tiktoken tokenizer for models: GPT-4, GPT-4o, GPT-4o-mini, o1, o1-mini, and o1-preview.


Tiktoken Tokenizer for GPT-4o, GPT-4, and o1 OpenAI models #

This is an implementation of the Tiktoken tokenizer, a byte pair encoding (BPE) used by OpenAI's models. It's a partial Dart port of OpenAI's original tiktoken library, with a much nicer API.

Although there are other tokenizers available on pub.dev, as of November 2024, none of them support the GPT-4o and o1 model families. This package was created to fill that gap.

The supported models are:

  • GPT-4
  • GPT-4o
  • GPT-4o-mini
  • o1
  • o1-mini
  • o1-preview

Also important: this is a Dart-only package (it does not require any platform channels to work), and tokenization is performed synchronously.

Splitting text strings into tokens is useful because GPT models see text in the form of tokens. Knowing how many tokens are in a text string tells you:

  • Whether the text is too long for a model to process.
  • How much an OpenAI API call will cost (usage is priced per token).
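For instance, once you have a token count, estimating the cost of a call is simple arithmetic. The sketch below is illustrative only: the rate used is a hypothetical placeholder, not a real OpenAI price — check OpenAI's pricing page for current rates.

```dart
/// Estimates the USD cost of a request from its token count and a
/// price per million tokens. The price here is a hypothetical
/// placeholder; real prices vary by model and by input vs. output.
double estimateCostUsd(int tokenCount, double pricePerMillionTokens) {
  return tokenCount / 1000000 * pricePerMillionTokens;
}

void main() {
  // Assume a prompt that encodes to 1200 tokens, at a hypothetical
  // rate of $2.50 per million input tokens:
  print(estimateCostUsd(1200, 2.50)); // estimated cost in dollars
}
```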

Example #

To see it in action, run the example app included with the package.

How to use it #

// Create a Tiktoken instance for the model you want to use.
var tiktoken = Tiktoken(OpenAiModel.gpt_4);

// Encode a text string into tokens.
var encoded = tiktoken.encode("hello world");

// Decode a token string back into text.
var decoded = tiktoken.decode(encoded);

// Count the number of tokens in a text string.
int numberOfTokens = tiktoken.count("hello world");

Advanced use #

Alternatively, you can use the static helper functions getEncoder and getEncoderForModel to get a TiktokenEncoder instance first:

// From an encoding type:
var encoder = Tiktoken.getEncoder(TiktokenEncodingType.o200k_base);

// Or, equivalently, from a model:
var encoder = Tiktoken.getEncoderForModel(OpenAiModel.gpt_4o);

The TiktokenEncoder instance gives you more fine-grained control over the encoding process, as you now have access to more advanced methods:

Uint32List encode(
    String text, {
    SpecialTokensSet allowedSpecial = const SpecialTokensSet.empty(),
    SpecialTokensSet disallowedSpecial = const SpecialTokensSet.all(),
  });

Uint32List encodeOrdinary(String text);

(List<int>, Set<List<int>>) encodeWithUnstable(
    String text, {
    SpecialTokensSet allowedSpecial = const SpecialTokensSet.empty(),
    SpecialTokensSet disallowedSpecial = const SpecialTokensSet.all(),
  });
  
int encodeSingleToken(List<int> bytes);

Uint8List decodeBytes(List<int> tokens); 

String decode(List<int> tokens, {bool allowMalformed = true}); 

Uint8List decodeSingleTokenBytes(int token);

List<Uint8List> decodeTokenBytes(List<int> tokens);

int? get eotToken;
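As a sketch of how these methods fit together (based only on the signatures above, assuming the package's default import):

```dart
import 'package:tiktoken_tokenizer_gpt4o_o1/tiktoken_tokenizer_gpt4o_o1.dart';

void main() {
  // GPT-4o uses the o200k_base encoding.
  var encoder = Tiktoken.getEncoderForModel(OpenAiModel.gpt_4o);

  // encodeOrdinary encodes the text with no special-token handling.
  var tokens = encoder.encodeOrdinary('hello world');

  // decodeBytes returns the raw UTF-8 bytes; decode returns a String.
  var bytes = encoder.decodeBytes(tokens);
  var text = encoder.decode(tokens);

  print(bytes.length); // number of UTF-8 bytes in the decoded text
  print(text); // hello world
}
```

Encoding and then decoding should round-trip the original text exactly, which is an easy sanity check when trying out an encoder.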

Online tokenizer #

I've added many tests to make sure this Dart implementation is correct, but you can also compare the output of this package against the reference implementation yourself by visiting the online Tiktokenizer.

Counting words #

What's the relationship between words and tokens? Every language has a different word-to-token ratio. Here are a few general rules:

  • For English: 1 word is about 1.3 tokens.
  • For Spanish and French: 1 word is about 2 tokens.
  • Each punctuation mark (like ,:;?!) counts as 1 token. Special characters (like ∝√∅°¬) range from 1 to 3 tokens, and emojis (like 😁🙂🤩) range from 2 to 3 tokens.
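These ratios make for a quick back-of-the-envelope estimator. The helper below is my own illustration of the rules above, not part of the package API; for exact counts, encode the text with the tokenizer instead.

```dart
/// Rough token estimate from a word count, using a language-dependent
/// words-to-tokens ratio (about 1.3 for English, about 2 for Spanish
/// and French). An approximation only; encode the text for exact counts.
int estimateTokens(int wordCount, {double tokensPerWord = 1.3}) {
  return (wordCount * tokensPerWord).round();
}

void main() {
  print(estimateTokens(100)); // 130  (English)
  print(estimateTokens(100, tokensPerWord: 2.0)); // 200  (Spanish/French)
}
```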

This package also provides a word counter. Here is how to use it:

var wordCounter = WordCounter();

// Prints 0
print(wordCounter.count(''));

// Prints 1
print(wordCounter.count('hello'));

// Prints 2
print(wordCounter.count('hello world!'));

Counting words is complex because each language has its own rules for what constitutes a word. For this reason, the provided word counter is only an approximation and will give reasonable results only for languages written in the Latin alphabet.

Credits #

This package's code was mostly adapted from the langchain_tiktoken package (https://pub.dev/packages/langchain_tiktoken) by publisher dragonx.cloud. I've added more encodings, added tests, and made the API more user-friendly.



By Marcelo Glasberg

glasberg.dev
github.com/marcglasberg
linkedin.com/in/marcglasberg/
twitter.com/glasbergmarcelo
stackoverflow.com/users/3411681/marcg
medium.com/@marcglasberg

My article in the official Flutter documentation:

The Flutter packages I've authored:

My Medium Articles:



License

BSD-2-Clause

Dependencies

characters, flutter
