
The Feedforward Network and Final Assembly

Learn about the inner workings of LLMs.

In our last lesson, we built the sophisticated multi-head attention mechanism. Its output is a matrix of deeply context-aware vectors, where each token has absorbed relevant information from its neighbors, viewed from multiple expert perspectives. The “communication” phase of our process is now complete.

However, after a productive meeting, you need to return to your desk to process what you’ve learned. Communication is not enough; you also need time to “think.” Our tokens are in the same position. They are rich with new context, but they haven’t had a chance to process it individually. How does the model perform this deep, individual processing to truly digest the information it just gathered?

This is the job of the final core component in our block: the feedforward network (FFN).

The “thinking” component

The FFN is a simple but powerful transformation. After the multi-head attention step, each vector in our matrix is passed independently through the same small, two-layer neural network (the weights are shared across positions, but each token is processed on its own). This is an important distinction: attention is an “all-to-all” communication step where tokens interact, while the FFN is a “one-to-one” processing step where each token reflects on its new context by itself.

This is the primary “thinking” or processing phase of the block. The FFN is where the model applies a significant portion of its learned knowledge (a large number of its parameters reside here) to the context-rich vectors it just received from the attention step. It allows the model to identify and transform complex patterns within each vector, adding a layer of computational depth. Without this individual processing step, the model would be proficient at gathering information but struggle to comprehend it deeply.

A primer on the FFN

So far, we have focused on the clever architecture of attention, which is all about communication. However, we’ve used the term feedforward network (FFN) to refer to the “thinking” part. What exactly does this mean, and how is it different from the matrix math we’ve already done?

What is a neural network layer?

At its heart, a single layer of a neural network is a simple two-step mathematical operation that you are already familiar with, plus one new ingredient. For any given input vector x:

  1. Linear transformation: We multiply the input by a learned weight matrix (W) and add a ...