The Output Layer and Final Prediction
Learn how to implement the final steps of the inference process: how the model translates its final, abstract representation into a concrete prediction for the next token.
We have reached the end of our assembly line. We’ve successfully built a complete transformer decoder block, and we understand that an LLM is a deep stack of these blocks. After our prompt, "Twinkle, twinkle, little", travels through this entire stack, we are left with a final, highly refined matrix of vectors.
But we still don’t have a new word. How do we bridge the gap from the model’s internal world of high-dimensional vectors back to the human world of language and choose the single next token? This is the final step in our inference journey.
The bridge from thought to language
The process of speaking is a two-stage translation: first, from a single abstract thought to a universe of possibilities, and second, from those possibilities to a single, definitive choice.
Out of the entire final matrix of vectors, we only care about one: the vector corresponding to the very last token in our sequence (“little”). Why? Because of the causal mask we built into our attention mechanism. The mask ensures that this final vector is the only one that has gathered context from the entire preceding sequence. It is the single point containing all the information needed to predict the immediate future.
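In code, this reduction is a single slicing operation. Below is a minimal PyTorch sketch; the tensor `final_hidden_states` and the toy shapes are illustrative assumptions standing in for the real output of the decoder stack:

```python
import torch

# Illustrative shapes only: 1 sequence of 4 tokens, a 128-dimensional model.
batch_size, seq_len, d_model = 1, 4, 128

# Stand-in for the output of the full decoder stack.
final_hidden_states = torch.randn(batch_size, seq_len, d_model)

# Keep only the vector at the LAST position: thanks to the causal mask,
# it is the only one that has attended to the entire sequence.
last_token_vector = final_hidden_states[:, -1, :]  # shape: (1, 128)
```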
Next, we must translate this vector’s “thought space” (a dense vector of, say, 128 dimensions) into the “word space” of our entire vocabulary (a list of over 50,000 possible tokens). This is done with a final linear projection. A particularly elegant and parameter-efficient way to do this is to reuse the same weight matrix we used for our initial embeddings, just transposed. This is called weight tying.
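Here is a minimal PyTorch sketch of weight tying; the names `embedding` and `lm_head`, and the sizes, are illustrative assumptions rather than any particular model’s actual layers:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50_257, 128  # illustrative sizes only

# The same embedding table used to map input token IDs to vectors.
embedding = nn.Embedding(vocab_size, d_model)

# Weight tying: the output head shares the embedding's weight tensor.
# nn.Linear computes x @ W.T, so sharing the (vocab_size, d_model)
# table directly gives us exactly the transposed projection we want.
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = embedding.weight  # same tensor, not a copy

last_token_vector = torch.randn(1, d_model)  # from the slicing step above
logits = lm_head(last_token_vector)          # shape: (1, vocab_size)
# Each logit is an unnormalized score for one candidate next token.
```

Because both layers point at one shared tensor, the model learns a single table instead of two, and gradients from the input embedding and the output projection update the same parameters.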
Think of the embedding matrix as ...