Large language models (LLMs) struggle with long inputs because their attention layers scale with the square of the sequence length. A model that handles a moderately long input well may slow down or fail outright once the text grows much longer.
In simple terms, doubling the number of tokens roughly quadruples the compute. Suppose a sequence has 10,000 tokens: attention compares every token with every other, on the order of 10,000 × 10,000 = 100 million pairs per layer; at 100,000 tokens that becomes 10 billion.
This makes long-context processing slow and expensive. Imagine forcing a model to read every word of a 300-page novel one token at a time, paying that attention cost at every step. Even models that advertise large context windows often degrade on very long inputs. This long-context problem is a major roadblock for AI systems that could otherwise retain entire books, lengthy conversations, or complex legal contracts.
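To make the scaling concrete, here is a small back-of-the-envelope sketch in plain Python. It simply counts the query-key score entries that naive attention materializes at different sequence lengths; the head count of 32 is an assumed example, not any particular model's configuration.

```python
# A rough back-of-the-envelope sketch: naive self-attention scores every token
# against every other token, so the score matrix has n x n entries per head.
# The head count of 32 below is an assumed example, not any specific model.

def attention_score_entries(n_tokens: int, n_heads: int = 32) -> int:
    """Entries in the query-key score matrices one attention layer materializes."""
    return n_heads * n_tokens * n_tokens

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7,} tokens -> {attention_score_entries(n):>22,} score entries per layer")

# Doubling the sequence length multiplies that count (and the matmul work
# behind it) by about four:
print(attention_score_entries(20_000) / attention_score_entries(10_000))  # 4.0
```

The constants vary from model to model, but the four-fold jump from doubling the input is what makes very long prompts so costly.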
So, how can we give LLMs a longer memory without exhausting compute?
One idea is to change what we feed the model. Instead of pouring in thousands of text tokens and making the transformer examine them all, we can hand it a condensed version that captures the essentials in far fewer tokens. This is where DeepSeek-OCR comes in, changing how LLMs handle long context.
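As a rough illustration of why a condensed input helps, the sketch below compares the attention cost of feeding raw text tokens against feeding a smaller condensed representation. The per-page figures (800 text tokens, 100 condensed tokens) are hypothetical placeholders chosen for the arithmetic, not DeepSeek-OCR's actual numbers.

```python
# A minimal sketch of the bookkeeping behind "fewer tokens in, same content".
# The per-page figures (800 text tokens, 100 condensed tokens) are hypothetical
# placeholders chosen for the arithmetic, not DeepSeek-OCR's actual numbers.

def attention_cost(n_tokens: int) -> int:
    """Proxy for attention cost: grows with the square of the token count."""
    return n_tokens * n_tokens

pages = 300                       # roughly a full novel
text_tokens = pages * 800         # feeding the raw text tokens
condensed_tokens = pages * 100    # feeding a condensed representation instead

print(f"raw text : {text_tokens:,} tokens, relative cost {attention_cost(text_tokens):,}")
print(f"condensed: {condensed_tokens:,} tokens, relative cost {attention_cost(condensed_tokens):,}")
print(f"-> {text_tokens // condensed_tokens}x fewer tokens, "
      f"{attention_cost(text_tokens) // attention_cost(condensed_tokens)}x less attention work")
```

The exact ratio depends entirely on how aggressively the input is condensed; the point is that because attention cost is quadratic, every reduction in token count pays off twice over.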