LLMs struggle with something as simple as counting letters in “strawberry.” Sounds absurd, right? But this “strawberry problem” isn’t just a random quirk. It’s a fundamental flaw in how today’s models process text.
The culprit? Tokenization.
Tokenization — especially methods like byte pair encoding (BPE), which is used by most LLMs — introduces fragmentation that can distort how LLMs process text.
Instead of treating words as whole units, BPE splits them into subword pieces based on their frequency in the training data. Some words, like “banana,” are common enough to remain a single token, while others, like “strawberry,” get broken into multiple tokens, and the splits are often inconsistent across contexts, casing, and surrounding whitespace. As a result, when an LLM tries to count letters within a word, it isn’t seeing the full word at once; it’s reasoning over fragmented pieces.
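To see this fragmentation for yourself, you can inspect how an off-the-shelf BPE vocabulary splits different words. Here is a minimal sketch using the tiktoken library; the exact splits depend on which encoding you load, so treat the printed pieces as illustrative rather than definitive:

```python
# Sketch: inspect how a BPE vocabulary fragments different words.
# Requires `pip install tiktoken`; exact splits depend on the encoding,
# so the output is illustrative, not definitive.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a widely used BPE vocabulary

for word in ["banana", "strawberry"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode_single_token_bytes(t).decode("utf-8", "replace")
              for t in token_ids]
    print(f"{word!r} -> {pieces}")

# A word that maps to several pieces is never "seen" as one unit,
# which is why letter-counting questions can trip a model up.
```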
This inefficiency in tokenization has persisted for years, but Meta’s new Byte Latent Transformer (BLT) might change everything.
Unlike traditional models that rely on fixed tokenization, BLT processes raw bytes directly, dynamically grouping them into variable-length patches based on complexity. This eliminates the rigid token splits that cause issues like the “strawberry problem” while improving efficiency and generalization.
Could BLT finally spell the end of tokenization-based models? Let’s explore what BLT brings to the table.
The role of entropy
Entropy measures uncertainty or unpredictability in a dataset. Low entropy (close to 0) indicates high predictability, meaning the outcome is almost certain, while high entropy means many outcomes are plausible and the prediction is far less certain.
Consider a sentence: “Cats are great pets.”
When predicting the first letter of this sentence without any context, the entropy is high because it could be any letter from the alphabet, making it unpredictable.
However, once the initial letter “C” is established, predicting the next few letters has lower entropy.
Following the “C,” the likelihood of specific letters forming recognizable words like “Cats” increases, reducing uncertainty.
As you move further into the sentence and more context is available, predicting additional words like “are” and “great” becomes even more predictable, lowering entropy further with each new word.
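To make the idea concrete, here is a minimal sketch of Shannon entropy over a next-character distribution. The two example distributions below are invented purely to mirror the “no context” versus “after seeing the first letters” situations described above:

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a next-character distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# No context: roughly uniform over 26 letters -> high entropy (log2(26) ~ 4.7 bits).
no_context = [1 / 26] * 26
print(round(entropy(no_context), 2))

# After seeing "Cat": probability mass concentrated on a few continuations
# such as "s", " ", or ",". These numbers are invented for illustration.
after_cat = [0.80, 0.15, 0.05]
print(round(entropy(after_cat), 2))  # well under 1 bit
```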
In the context of BLT, this entropy principle is used to intelligently allocate computational resources:
High-entropy areas (like the beginning of a sentence where predictions are uncertain) require more processing power to handle the complexity and variability.
Low-entropy areas with certain predictions can be processed with less computational effort.
This dynamic allocation allows BLT to enhance efficiency by focusing resources according to the predictability of different parts of the data.
BLT defines a global entropy threshold to identify where uncertainty in predicting the next byte is high. Whenever the entropy measured for a byte exceeds this threshold, it signals a transition from a more predictable region to a less predictable one, prompting the start of a new patch (vertical gray lines). This way, BLT dynamically adjusts the size of each patch based on how uncertain the model is, allocating more compute to bytes that are harder to predict while grouping more predictable bytes.
In the figure, the letter “D” at the start of “Daenerys” has high entropy because the model is uncertain which letter might appear at the beginning of the word.
Once the model identifies that the word is “Daenerys,” letters like “a,” “e,” “r,” “y,” and “n” have much lower entropy, indicating greater predictability. As a result, no additional patch boundaries show up for those letters, reflecting the model’s confidence in predicting them.
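The boundary rule itself is simple enough to sketch in a few lines of Python. Both the `entropy_of_next_byte` callable and the threshold value below are placeholders standing in for BLT’s actual entropy model and tuned threshold:

```python
def segment_into_patches(byte_seq, entropy_of_next_byte, threshold):
    """Group a byte sequence into patches, opening a new patch whenever
    the estimated next-byte entropy exceeds the global threshold.
    (A simplified stand-in for BLT's patching rule, not the paper's code.)"""
    patches, current = [], []
    for i, b in enumerate(byte_seq):
        # High uncertainty about this byte -> start a new patch before it.
        if current and entropy_of_next_byte(byte_seq, i) > threshold:
            patches.append(bytes(current))
            current = []
        current.append(b)
    if current:
        patches.append(bytes(current))
    return patches
```

Predictable runs of bytes end up grouped into long patches, while surprising bytes (like the start of a rare name) open new, shorter patches that receive more compute downstream.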
BLT’s architecture is composed of three main parts:
A lightweight local encoder transforms sequences of raw bytes into compact “patch” representations. It uses entropy-based grouping to decide where one patch ends and the next begins.
A large latent transformer operates on these patch representations instead of raw bytes. Because it receives fewer (but more information-rich) inputs, it can focus its expensive computations where they matter most (i.e., on uncertain or complex input parts).
A lightweight local decoder then expands the patch representations to byte-level detail, enabling the final next-byte prediction. This step ensures the model has full access to the finer details of the original text sequence.
By grouping high-entropy regions into patches, BLT allocates more capacity to handle the less predictable parts of the text. Meanwhile, simpler or more predictable regions remain in fewer, condensed patches, reducing unnecessary computation.
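A highly simplified forward pass through these three parts might look like the sketch below. The module sizes, the mean-pooling of bytes into patches, and the way patch representations are broadcast back to bytes are schematic stand-ins, not the paper’s actual encoder and decoder designs:

```python
import torch
import torch.nn as nn

class ToyBLT(nn.Module):
    """Schematic only: byte embeddings -> patch pooling -> latent
    transformer over patches -> byte-level next-byte prediction."""
    def __init__(self, d_model=256):
        super().__init__()
        self.byte_embed = nn.Embedding(256, d_model)              # lightweight local encoder (simplified)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.latent = nn.TransformerEncoder(layer, num_layers=2)  # "large" latent transformer (tiny here)
        self.decoder = nn.Linear(d_model, 256)                    # lightweight local decoder -> next-byte logits

    def forward(self, byte_ids, patch_bounds):
        # byte_ids: [batch, seq_len]; patch_bounds: list of (start, end) covering the sequence.
        x = self.byte_embed(byte_ids)                              # [batch, seq_len, d_model]
        # Pool each entropy-defined patch into a single latent "token".
        patches = torch.stack([x[:, s:e].mean(dim=1) for s, e in patch_bounds], dim=1)
        h = self.latent(patches)                                   # [batch, num_patches, d_model]
        # Broadcast each patch representation back to its bytes and predict.
        expanded = torch.cat(
            [h[:, i:i + 1].expand(-1, e - s, -1) for i, (s, e) in enumerate(patch_bounds)],
            dim=1)
        return self.decoder(expanded)                              # [batch, seq_len, 256] next-byte logits
```

The real model is considerably more sophisticated (the paper uses cross-attention between bytes and patches and hash n-gram embeddings, among other things), but the division of labor is the same: cheap byte-level modules on the outside, an expensive transformer that only ever sees patches in the middle.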
BLT’s entropy-based patching uses a separately trained small language model to compute entropy for patch boundaries.
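In practice, the patch boundaries come from the per-position entropy of that small model’s next-byte distribution. Here is a sketch of the computation, assuming a hypothetical `small_byte_lm` that returns logits over the 256 possible byte values at each position:

```python
import math
import torch
import torch.nn.functional as F

def per_byte_entropy(byte_ids, small_byte_lm):
    """Entropy (in bits) of the small model's next-byte distribution at
    every position. `small_byte_lm` is a hypothetical causal byte-level LM
    that maps a [seq_len] tensor of byte ids to [seq_len, 256] logits."""
    with torch.no_grad():
        logits = small_byte_lm(byte_ids)               # [seq_len, 256]
        probs = F.softmax(logits, dim=-1)
        log_probs = F.log_softmax(logits, dim=-1)
        return -(probs * log_probs).sum(dim=-1) / math.log(2)
```

These per-byte entropies are exactly what a thresholding rule like the earlier patching sketch would consume.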
Meta compared an 8‑billion‑parameter baseline model (Llama 3) trained on one trillion tokens against two BLT variants: BLT-Space (using a space-based patching scheme) and BLT-Entropy. The latter achieves higher overall scores, indicating that entropy‑based patching allows the model to learn more effectively and efficiently than the token‑based baseline.
Meta also compared a basic Llama 3 model (trained on 1 trillion tokens), a more heavily trained Llama 3.1 (16 trillion tokens), and BLT (1 trillion tokens) on their ability to handle noisy or linguistically tricky tasks.
Despite being trained on fewer tokens, BLT performs at or above the level of its counterparts in many of these tests. In particular, BLT shows strong performance on tasks involving character‑level changes (like deleting or inserting characters), suggesting that its byte‑level approach provides a deeper awareness of text structure that cannot simply be matched by training on more tokenized data.
But let's circle back to our strawberry problem. Does BLT work better?
Let’s see. This table shows how BLT and Llama 3 handle various text manipulation tasks, from substituting words to matching semantic similarity. BLT moves beyond token boundaries and often gives more accurate or relevant responses (for instance, correctly inserting, removing, or replacing characters), suggesting it handles tricky byte-level edits more effectively than the token-based Llama 3.
So, is BLT a flawless replacement for tokenization, then? ...No.
Because BLT’s patching is guided by the entropy of the next byte prediction, it can struggle in situations where entropy isn’t a reliable indicator of the text’s actual complexity or importance.
Here's why:
Highly repetitive patterns can lead to consistently low entropy, even when the model should pay close attention to rare but important deviations.
Example: In DNA sequences (which are highly repetitive), a crucial mutation might not raise entropy enough to trigger a patch boundary; in source code, a frequent keyword may always have low entropy even though it carries significant meaning.
Specialized domains (like chemical notation or source code) can feature symbols whose meaning depends on context in ways that byte-level predictability doesn't capture.
Texts where low-entropy sequences carry critical information, even though their structure appears predictable.
In these cases, relying on a fixed or domain-specific tokenizer might be more effective. This would ensure that crucial symbols or sequences remain discrete units rather than being overlooked because of their low measured entropy.
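As a rough illustration of that alternative, a hand-written, rule-based tokenizer can guarantee that domain-critical symbols stay intact regardless of how predictable they look. The toy sketch below does this for a snippet of source code; the token categories and regular expression are invented for illustration only:

```python
import re

# Toy rule-based tokenizer: keywords, identifiers, numbers, and operators
# are always kept as whole units, no matter how "predictable" they are.
CODE_TOKEN = re.compile(
    r"\b(?:def|return|if|else|for|while)\b"   # keywords stay whole
    r"|[A-Za-z_]\w*"                          # identifiers
    r"|\d+"                                   # numbers
    r"|[()+\-*/=:,]"                          # operators and punctuation
)

def tokenize_code(src: str):
    return CODE_TOKEN.findall(src)

print(tokenize_code("def add(a, b): return a + b"))
# ['def', 'add', '(', 'a', ',', 'b', ')', ':', 'return', 'a', '+', 'b']
```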
BLT isn’t just an incremental improvement. It’s a paradigm shift.
BLT moves away from fixed tokenization and instead dynamically segments text into variable-length patches based on entropy.
So far, the results are promising, but some questions remain:
Can entropy alone always define meaningful patches?
Will domain-specific knowledge still require tokenization?
The future of NLP might not be fully token-free, but BLT makes one thing clear: token-based models have real limits. And the next generation of LLMs will have to move beyond tokens.
Are you ready for the next wave of LLMs?
Here are 2 courses to help you prepare:
Essentials of Large Language Models for a strong foundation in how models process text, why tokenization matters, and how fine-tuning shapes their capabilities.
Fine-Tuning LLMs Using LoRA and QLoRA to get hands-on with fine-tuning LLMs. No matter the LLM's architecture, fine-tuning is key. As models evolve beyond tokenization, efficiency will be more critical than ever.
Every new model is a signal for us to keep upskilling to prepare for the next generation of AI development. I hope you respond to it.