The Output Layer and Final Prediction
Explore the final step of LLM inference by learning how the last token’s vector is transformed into vocabulary probabilities using weight tying and softmax. Understand decoding strategies such as greedy, temperature, top-k, and top-p sampling, which shape the creativity and coherence of the model’s output.
We have reached the end of our assembly line. We’ve successfully built a complete transformer decoder block, and we understand that an LLM is a deep stack of these blocks. After our prompt, “Twinkle, twinkle, little”, travels through this entire stack, we are left with a final, highly refined matrix of vectors.
But we still don’t have a new word. How do we bridge the gap from the model’s internal world of high-dimensional vectors back to the human world of language and choose the single next token? This is the final step in our inference journey.
The bridge from thought to language
The process of speaking is a two-stage translation: first, from a single abstract thought to a universe of possibilities, and second, from those possibilities to a single, definitive choice.
Out of the entire final matrix of vectors, we only care about one: the vector corresponding to the very last token in our sequence (“little”). Why? Because of the causal mask we built into our attention mechanism. The mask ensures that this final vector is the only one that has gathered context from the entire preceding sequence. It is the single point containing all the information needed to predict the immediate future.
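To make this concrete, here is a minimal sketch in PyTorch. The tensor names and shapes are assumptions for illustration only: a batch of one prompt, a placeholder sequence length, and the toy hidden size of 128 used in this series.

```python
import torch

# Assumed output of the final decoder block for our prompt:
# shape (batch_size, seq_len, hidden_size); 5 is just a placeholder prompt length.
hidden_states = torch.randn(1, 5, 128)

# Thanks to the causal mask, only the last position has attended to the entire
# preceding sequence, so its vector alone is used to predict the next token.
last_vector = hidden_states[:, -1, :]   # shape: (1, 128)
```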
Next, we must translate this vector’s “thought space” (a dense vector of, say, 128 dimensions) into the “word space” of our entire vocabulary (a list of over 50,000 possible tokens). This is done with a final linear projection. The most common and parameter-efficient approach is weight tying: the same embedding matrix that mapped tokens into vectors on the way in is reused, transposed, as the output projection, producing one raw score (a logit) for every token in the vocabulary.
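Continuing the sketch, and assuming a tied embedding matrix named `embedding_weight` with shape (vocab_size, hidden_size) (the names and sizes here are illustrative, not the model’s actual parameters), the projection, the softmax, and the decoding strategies previewed above might look like this:

```python
import torch
import torch.nn.functional as F

hidden_size, vocab_size = 128, 50_000          # illustrative sizes only
last_vector = torch.randn(1, hidden_size)      # the last-token vector from the sketch above
embedding_weight = torch.randn(vocab_size, hidden_size)  # tied with the input embedding matrix

# 1. Final linear projection ("unembedding"): thought space -> one logit per vocabulary token
logits = last_vector @ embedding_weight.T      # shape: (1, vocab_size)

# 2. Softmax turns the logits into a probability distribution over the vocabulary
probs = F.softmax(logits, dim=-1)

# 3a. Greedy decoding: always pick the single most probable token
greedy_token = torch.argmax(probs, dim=-1)

# 3b. Temperature sampling: rescale logits to sharpen (<1) or flatten (>1) the
#     distribution, then draw a token at random according to it
temperature = 0.8
temp_probs = F.softmax(logits / temperature, dim=-1)
sampled_token = torch.multinomial(temp_probs, num_samples=1)

# 3c. Top-k sampling: keep only the k most probable tokens, renormalize, then sample
k = 50
top_logits, top_indices = torch.topk(logits, k, dim=-1)
top_probs = F.softmax(top_logits / temperature, dim=-1)
top_k_token = top_indices.gather(-1, torch.multinomial(top_probs, num_samples=1))
```

Top-p (nucleus) sampling works in the same spirit as top-k, but instead keeps the smallest set of tokens whose cumulative probability exceeds p before sampling.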