Cross-Attention
Learn how cross-attention empowers models to align and integrate information across sequences, enabling tasks like translation, multimodal fusion, and retrieval-augmented generation.
A classic follow-up to self-attention questions is: What is cross-attention, and why do we need it? In roles ranging from machine-translation engineer to multimodal-AI researcher or retrieval-augmented generation (RAG) developer, you’ll often be asked how one sequence (or modality) can query another—and that’s exactly what cross-attention does. Explaining its importance shows you understand how encoders and decoders—or text and images, or questions and knowledge bases—truly interact in modern architectures.
This lesson will demystify cross-attention and explain why it matters in encoder–decoder models, multimodal pipelines, and memory-query systems. We will also learn how it works and how to implement and test a minimal cross-attention block in NumPy.
What is cross-attention?
Cross-attention is an attention mechanism in which the queries come from one sequence (or data source) and the keys and values come from a different sequence. In other words, it “crosses” information between two sequences. This contrasts with self-attention, where queries, keys, and values come from the same sequence.
In self-attention, each token looks at other tokens in its own sequence (Q, K, and V are all computed from that one sequence).
In cross-attention, each token in the query sequence looks at tokens from a different sequence: Q is derived from one source (e.g., the decoder), while K and V are derived from another (e.g., the encoder), as the sketch below shows.
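To make the difference in sources concrete, here is a minimal NumPy sketch of just the projections, using illustrative shapes and random matrices in place of learned weights (the names enc_out and dec_states are placeholders, not from any specific library):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8                                   # shared model dimension (illustrative)

enc_out = rng.normal(size=(6, d_model))       # encoder output: 6 source tokens
dec_states = rng.normal(size=(4, d_model))    # decoder states: 4 target tokens

# Separate projection matrices (random here; learned in a real model)
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q = dec_states @ W_q    # queries come from the decoder sequence
K = enc_out @ W_k       # keys come from the encoder sequence
V = enc_out @ W_v       # values also come from the encoder sequence

print(Q.shape, K.shape, V.shape)  # (4, 8) (6, 8) (6, 8)
```

The only structural change from self-attention is which tensor feeds each projection; the projections themselves are unchanged.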
Think of cross-attention like a Q&A process:
The query (Q) is a question from one sequence (e.g., a decoder asking “What information from the encoder should I use for this step?”).
The keys (K) are labels or pointers in the other sequence that index its information (e.g., positions in the encoder output).
The values (V) are the actual pieces of information in that other sequence.
The attention mechanism scores how well each key matches the query (via dot-product similarity) and blends the corresponding values, weighted by those scores, to produce the output.
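Putting the analogy into code, the sketch below is a minimal, single-head NumPy version (no masking, no multiple heads, random inputs standing in for real encoder/decoder states): it scores each query against every key and blends the values by those scores.

```python
import numpy as np

def cross_attention(Q, K, V):
    """Scaled dot-product attention where Q comes from one sequence
    and K, V come from another (single-head, no masking)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key similarity
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the key axis
    return weights @ V, weights                     # weighted blend of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 query tokens (e.g., decoder positions)
K = rng.normal(size=(6, 8))   # 6 key tokens (e.g., encoder positions)
V = rng.normal(size=(6, 8))   # 6 value vectors, aligned with the keys
out, attn = cross_attention(Q, K, V)
print(out.shape)              # (4, 8): one context vector per query token
print(attn.sum(axis=-1))      # each row of attention weights sums to 1
```

Each row of attn holds one query token’s weights over the other sequence’s positions, which is exactly the cross-sequence lookup described above.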
Mathematically, cross-attention uses the same scaled dot-product formula as self-attention for computing attention weights and output:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
Where the only difference is that