Cross-Attention

Learn how cross-attention empowers models to align and integrate information across sequences, enabling tasks like translation, multimodal fusion, and retrieval-augmented generation.

A classic follow-up to self-attention questions is: What is cross-attention, and why do we need it? In roles ranging from machine-translation engineer to multimodal-AI researcher or retrieval-augmented generation (RAG) developer, you'll often be asked how one sequence (or modality) can query another, and that is exactly what cross-attention does. Explaining its importance shows you understand how encoders and decoders (or text and images, or questions and knowledge bases) truly interact in modern architectures.

This lesson will demystify cross-attention and explain why it matters in encoder–decoder models, multimodal pipelines, and memory-query systems. We will also learn how it works and how to implement and test a minimal cross-attention block in NumPy.

What is cross-attention?

Cross-attention is an attention mechanism in which the queries come from one sequence (or data source) and the keys and values come from a different sequence. In other words, it “crosses” information between two sequences. This contrasts with self-attention, where queries, keys, and values come from the same sequence.

  • In self-attention, each token looks at other tokens in its own sequence (Q, K, and V are all projections of the same sequence).

  • In cross-attention, each token in a query sequence looks at tokens in a different sequence for its keys and values. Q is derived from one source (e.g., the decoder), while K and V are derived from another (e.g., the encoder output), as sketched in the code below.
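The difference is easiest to see in code. Below is a minimal NumPy sketch, using made-up toy dimensions and random matrices in place of learned projection weights, that shows where Q, K, and V are taken from in each case.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions, chosen only for illustration
d_model, d_k, d_v = 8, 8, 8
encoder_out = rng.normal(size=(6, d_model))    # 6 source tokens (e.g., encoder output)
decoder_state = rng.normal(size=(4, d_model))  # 4 target tokens (e.g., decoder states)

# Random stand-ins for the learned projection matrices
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

# Self-attention: Q, K, and V are all projections of the same sequence
Q_self = decoder_state @ W_Q
K_self = decoder_state @ W_K
V_self = decoder_state @ W_V

# Cross-attention: Q comes from the decoder, K and V from the encoder output
Q_cross = decoder_state @ W_Q   # shape (4, d_k): one query per target token
K_cross = encoder_out @ W_K     # shape (6, d_k): one key per source token
V_cross = encoder_out @ W_V     # shape (6, d_v): one value per source token
```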

Think of cross-attention like a Q&A process:

  • The query (Q) is a question from one sequence (e.g., a decoder asking “What information from the encoder should I use for this step?”).

  • The keys (K) are labels or pointers in the other sequence that index its information (e.g., positions in the encoder output).

  • The values (V) are the actual pieces of information in that other sequence.

  • The attention mechanism scores how well each key matches the query (via dot-product similarity) and blends the corresponding values according to those scores to produce the output, as the toy example below illustrates.
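To make the matching step concrete, here is a tiny, purely illustrative NumPy example: one query is compared against three hand-picked keys, and the resulting weights decide how much of each value flows into the output.

```python
import numpy as np

def softmax(x):
    x = x - x.max()   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()

query = np.array([2.0, 0.0])                 # the "question" from sequence A
keys = np.array([[0.9, 0.1],                 # key 0: closely matches the query
                 [0.0, 1.0],                 # key 1: orthogonal to the query
                 [-1.0, 0.0]])               # key 2: points away from the query
values = np.array([10.0, 20.0, 30.0])        # information stored in sequence B

scores = keys @ query        # dot-product similarity between the query and each key
weights = softmax(scores)    # soft match: the weights sum to 1
output = weights @ values    # weighted blend of the values

print(weights)  # roughly [0.84, 0.14, 0.02]: key 0 dominates
print(output)   # roughly 11.8: pulled strongly toward value 0's 10.0
```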

Mathematically, cross-attention uses the same formula as self-attention for computing attention weights and output:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where the only difference is that Q comes from one sequence while K and V come from the other.
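
Putting the pieces together, the sketch below turns the formula into a minimal NumPy cross-attention block. The shapes and variable names are illustrative assumptions rather than a reference implementation, but they show queries from one sequence attending over keys and values from another.

```python
import numpy as np

def cross_attention(X_q, X_kv, W_Q, W_K, W_V):
    """Scaled dot-product attention in which the queries come from X_q
    and the keys/values come from a different sequence X_kv."""
    Q = X_q @ W_Q    # (len_q, d_k)
    K = X_kv @ W_K   # (len_kv, d_k)
    V = X_kv @ W_V   # (len_kv, d_v)

    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (len_q, len_kv) similarity matrix

    # Row-wise softmax: each query's weights over the other sequence sum to 1
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    return weights @ V   # (len_q, d_v): one blended context vector per query

# Toy usage: 4 decoder positions querying 6 encoder positions
rng = np.random.default_rng(0)
d_model, d_k, d_v = 8, 8, 8
decoder_state = rng.normal(size=(4, d_model))
encoder_out = rng.normal(size=(6, d_model))
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

out = cross_attention(decoder_state, encoder_out, W_Q, W_K, W_V)
print(out.shape)  # (4, 8): one context vector per decoder position
```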