A simple tokenizer in the style of tiktoken that uses UTF-8 bytes as its base vocabulary.
This is a simplified version: its output will not match tiktoken's exactly, but it provides a reasonable approximation.
For production use, use a proper tiktoken port or implement one.
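
As an illustration of what "UTF-8 bytes as a base" means, here is a minimal sketch in Python. It is not this module's actual code: the function names `encode` and `decode` are hypothetical, and the sketch assumes the simplest possible scheme, where each UTF-8 byte becomes one token id in the range 0-255. The learned merges that tiktoken applies on top of the byte level are omitted.

```python
def encode(text: str) -> list[int]:
    """Map text to token ids, one id (0-255) per UTF-8 byte.

    Hypothetical helper for illustration; no merge step is applied.
    """
    return list(text.encode("utf-8"))


def decode(tokens: list[int]) -> str:
    """Invert encode(); invalid byte sequences are replaced with U+FFFD."""
    return bytes(tokens).decode("utf-8", errors="replace")


if __name__ == "__main__":
    ids = encode("héllo")
    print(ids)          # one id per byte; "é" takes two bytes in UTF-8
    print(decode(ids))  # round-trips back to the original string
```

Because no merges are applied, this scheme yields one token per byte, so it produces more tokens than tiktoken would for the same text. That is one reason the output can only approximate tiktoken's.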