The Hidden Costs of Tokenization: A Comparative Analysis
Different AI model families use different tokenizers, but there has been little analysis of how the tokenization process itself varies between them. Do all tokenizers produce the same number of tokens for a given input text? This article examines that question and its practical implications, focusing on OpenAI’s GPT-4o and Anthropic’s Claude 3.5 Sonnet.
API Pricing Comparison
As of June 2024, both models have competitive pricing structures. Claude 3.5 Sonnet’s input tokens cost 40% less than GPT-4o’s, while their output token prices are identical. Yet experiments on a fixed set of prompts revealed that running them through GPT-4o was cheaper overall than running them through Claude 3.5 Sonnet.
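A back-of-the-envelope sketch makes the effect concrete. The prices below are the June 2024 list prices (USD per million tokens), but the token counts and the 30% overhead on both input and output are hypothetical figures chosen for illustration, not measured results:

```python
# Illustrative cost comparison: a lower per-token price vs. a higher token count.
# Prices are June 2024 list prices (USD per 1M tokens); verify against current pricing.
PRICES = {
    "gpt-4o":            {"input": 5.00, "output": 15.00},
    "claude-3-5-sonnet": {"input": 3.00, "output": 15.00},  # 40% cheaper input
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of a single request, given measured token counts."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Hypothetical scenario: the same prompt tokenizes to 30% more tokens under
# Anthropic's tokenizer, and an equivalent response is likewise counted as
# 30% more output tokens.
print(request_cost("gpt-4o", 1_000, 500))             # 0.0125
print(request_cost("claude-3-5-sonnet", 1_300, 650))  # 0.01365
```

In this hypothetical case, the extra tokens on both sides of the exchange outweigh the 40% input discount, and the Claude request comes out slightly more expensive.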
The Hidden ‘Tokenizer Inefficiency’
The cost discrepancy stems from Anthropic’s tokenizer breaking the same input into more tokens than OpenAI’s. For identical prompts, Anthropic’s models bill for considerably more input tokens, which offsets the savings from the lower per-token price and, in practical use cases, leads to higher overall costs. This ‘tokenizer inefficiency’ significantly impacts both cost and context window utilization.
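One way to observe the gap directly is to count tokens for the same text with each provider’s tooling. The sketch below assumes the tiktoken package on the GPT-4o side and the token-counting endpoint exposed by recent versions of the anthropic Python SDK (older SDK versions may not offer messages.count_tokens); note that the Anthropic count includes a few tokens of message framing, so it slightly overstates the raw text count:

```python
import tiktoken   # pip install tiktoken
import anthropic  # pip install anthropic; needs ANTHROPIC_API_KEY set

text = "def add(a, b):\n    return a + b\n"

# GPT-4o side: the o200k_base encoding is public and runs locally.
gpt_tokens = len(tiktoken.encoding_for_model("gpt-4o").encode(text))

# Claude side: the tokenizer is not public, so ask the API to count for us
# (assumes an SDK version that exposes messages.count_tokens).
client = anthropic.Anthropic()
claude_tokens = client.messages.count_tokens(
    model="claude-3-5-sonnet-20240620",
    messages=[{"role": "user", "content": text}],
).input_tokens

print(f"GPT-4o: {gpt_tokens}  Claude: {claude_tokens}  "
      f"difference: {claude_tokens / gpt_tokens - 1:+.0%}")
```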
Domain-Dependent Tokenization Inefficiency
The size of the overhead depends on the type of content: Anthropic’s tokenizer expands some domains far more than others relative to OpenAI’s. Tests on English articles, Python code, and mathematical content showed:
- English articles: Claude’s tokenizer produced approximately 16% more tokens than GPT-4o’s
- Python code: approximately 30% more tokens
- Mathematical content: approximately 21% more tokens
The variation occurs because technical content, such as code and mathematical equations, contains patterns and symbols that Anthropic’s tokenizer fragments into smaller pieces, resulting in higher token counts.
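The same counting approach can be swept across content types to estimate the per-domain overhead. The sketch below uses short placeholder samples chosen purely for illustration (they are not the article’s test set) and the same assumed messages.count_tokens endpoint as above:

```python
import tiktoken
import anthropic

enc = tiktoken.encoding_for_model("gpt-4o")
client = anthropic.Anthropic()

def claude_tokens(text: str) -> int:
    # Assumed token-counting endpoint (recent anthropic SDK versions);
    # the result includes a few tokens of message framing overhead.
    return client.messages.count_tokens(
        model="claude-3-5-sonnet-20240620",
        messages=[{"role": "user", "content": text}],
    ).input_tokens

# Placeholder samples; substitute full articles, source files, and math notes.
samples = {
    "english": "Tokenizers split text into subword units before the model ever sees it.",
    "python":  "def mean(xs):\n    return sum(xs) / len(xs)\n",
    "math":    "f(x) = 3x^2 + 2x - 1, so f'(x) = 6x + 2",
}

for domain, text in samples.items():
    g, c = len(enc.encode(text)), claude_tokens(text)
    print(f"{domain:8s} GPT-4o={g:3d}  Claude={c:3d}  overhead={c / g - 1:+.0%}")
```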
Practical Implications Beyond Cost
The tokenizer inefficiency also affects context window utilization. Although Anthropic advertises a larger context window (200K tokens) than OpenAI (128K tokens), the effective usable space is smaller than the headline number suggests, because the same text consumes more tokens under Anthropic’s tokenizer.
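A quick calculation, assuming a hypothetical 30% overhead for code-heavy input, shows how the nominal window advantage shrinks once both windows are measured by the amount of identical source text they can hold:

```python
# Context capacity in "GPT-4o-equivalent" tokens for the same source text,
# assuming a hypothetical 30% token overhead for code-heavy input.
claude_window, gpt_window, overhead = 200_000, 128_000, 0.30

effective_claude = claude_window / (1 + overhead)  # ~153,800 tokens' worth of text
print(f"Nominal advantage:   {claude_window / gpt_window:.2f}x")     # 1.56x
print(f"Effective advantage: {effective_claude / gpt_window:.2f}x")  # ~1.20x
```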
Tokenizer Implementation
GPT models use Byte Pair Encoding (BPE); GPT-4o specifically uses the o200k_base encoding. Details of Anthropic’s tokenizer, by contrast, are not publicly available, though it is known to use a smaller vocabulary (around 65,000 tokens) than GPT-4’s cl100k_base encoding (100,261 tokens).
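The GPT side can be inspected locally because the encoding is published through the tiktoken package; Anthropic offers no equivalent public package, so its tokenizer can only be observed indirectly via API token counts. A minimal sketch of inspecting o200k_base:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("o200k_base")  # the BPE encoding used by GPT-4o
print(enc.n_vocab)                         # vocabulary size (~200k entries)

ids = enc.encode("f'(x) = 6x + 2")
print(ids)                                 # token IDs
print([enc.decode([i]) for i in ids])      # the text fragment behind each ID
```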
Key Takeaways
- Anthropic’s competitive pricing comes with hidden costs due to tokenizer inefficiency.
- The degree of tokenizer inefficiency varies significantly across content domains, with technical content being more affected.
- The effective context window size may differ from advertised sizes due to tokenizer verbosity.
For businesses processing large volumes of text, understanding these differences is crucial when evaluating the true cost of deploying AI models.