Neural Networks Refresher
Explore the core principles of feedforward neural networks, their components like weights, biases, and activation functions, and how these fundamentals underpin the transformer layers used in modern large language models such as GPT-4. Learn how depth and nonlinearity enable hierarchical linguistic feature extraction essential for effective language processing.
Every time a language model like GPT-4 or Claude generates a sentence, billions of numerical parameters activate in sequence across dozens of layers. These parameters, organized as weights and biases, perform the same fundamental computation repeated billions of times: multiply inputs by weights, add an offset, and apply a nonlinear function. Understanding this computation is not optional background for working with large language models. It is the prerequisite for grasping how transformers process and generate text.
This lesson walks through feedforward neural networks from the ground up, then explicitly connects each concept to the deep architectures used in modern LLMs. By the end, you will understand the computational flow from input to output in a neural network and see why depth, nonlinearity, and parameter scale are central to language modeling.
Anatomy of a feedforward neural network
A feedforward neural network is the simplest deep learning architecture: information flows in one direction only, from an input layer, through one or more hidden layers, to an output layer. There are no loops or cycles; each layer's outputs feed only the layer after it.
Think of it like an assembly line in a factory. Raw materials enter at one end, each station transforms them in some way, and a finished product comes out the other end. No station sends materials backward.
Neurons, weights, and biases
Each neuron in the network is a small computational unit that performs three steps. First, it receives numerical inputs from the previous layer. Second, it computes a weighted sum of those inputs and adds a bias term. Third, it passes the result through an activation function to produce its output.
The key learnable components in this process are:
Weights: These parameters determine the strength of the connection between two neurons. A large weight amplifies the input signal, while a weight near zero effectively silences it. During training, the network adjusts these weights to minimize prediction errors.
Biases: These are offset values added to the weighted sum before the activation function. They allow the neuron to shift its activation threshold, making the model more flexible. Without biases, every neuron’s decision boundary would be forced to pass through the origin.
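The three-step computation and the roles of weights and biases described above can be sketched in a few lines of Python. The specific input values, weights, and the choice of ReLU as the activation function are illustrative assumptions, not values from this lesson:

```python
def relu(z):
    """A common activation function: passes positive values, zeroes out negatives."""
    return max(0.0, z)

def neuron_output(inputs, weights, bias):
    # Step 1: receive numerical inputs from the previous layer.
    # Step 2: weighted sum of inputs, plus the bias offset.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Step 3: pass the result through the activation function.
    return relu(z)

inputs = [0.5, -1.0, 2.0]
weights = [0.8, 0.2, -0.5]  # a large weight amplifies a signal; one near zero silences it
bias = 1.0                  # shifts the activation threshold

print(neuron_output(inputs, weights, bias))
```

Note what happens without the bias: if all inputs are zero, the weighted sum is necessarily zero no matter what the weights are, which is the sense in which the decision boundary is forced through the origin.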
The formula inside a single neuron looks like this:
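In standard notation (consistent with the three steps above), a neuron with inputs $x_1, \dots, x_n$, weights $w_1, \dots, w_n$, bias $b$, and activation function $f$ computes:

$$a = f\left(\sum_{i=1}^{n} w_i x_i + b\right)$$

The sum is the weighted combination of the inputs, $b$ shifts that sum before activation, and $f$ supplies the nonlinearity.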