Neural Networks Training

Learn how to implement, explain, and debug forward/backward passes in neural networks for technical interviews.

A staple query in modern AI/ML interviews revolves around manually implementing and explaining the forward and backward passes in a basic neural network. Although it’s often considered a straightforward initial tech-screen question, interviewers aren’t just testing whether you’ve memorized definitions. They want to know if you can articulate why each computational step—matrix multiplication, activation functions, or gradient calculation—matters.

How to show you understand neural networks

A classic way to illustrate how neural networks work is by using the analogy of a “busy kitchen.” Imagine the following scenario:

  • Input data: Think of raw data as your ingredients. Just as fresh, well-measured components are essential to a good meal, your input features must be high quality.

  • Weights and biases: These are analogous to the trainable parameters that adjust the input. Just as different spices bring out various flavors, weights and biases tweak the data differently to improve the outcome.

  • Forward pass: The forward pass is like a chef mixing everything into a dish. Here, the input data is transformed via matrix multiplications and activations to produce an output—a final prediction.

  • Backward pass: The backward pass is akin to tasting the dish. The chef adjusts the spices and other ingredients if the flavor isn’t right, say, too salty or bland. In the network, backpropagation calculates how each parameter contributed to the error and updates it accordingly.

This analogy helps you remember the process and is a natural starting point for more technical discussions when an interviewer probes further.


How to deal with data preparation

Just as you wouldn’t expect a gourmet meal from unwashed, unmeasured ingredients, raw data must be cleaned and transformed before it’s fed into the neural network. Interviewers may ask if you have experience scaling or normalizing features, handling missing values, or encoding categorical data. These questions are designed to test your familiarity with data preprocessing.

At this stage, interviewers want to see that you understand data preprocessing is not just an optional add-on but a critical foundation for successful model training. They expect you to be familiar with a few fundamental techniques without necessarily listing them exhaustively from memory. It helps to mention that scaling features—either through standardization (zero mean, unit variance) or min-max normalization (bringing values into a specific range)—ensures that all input features contribute evenly to the model, particularly when using gradient-based methods.

Also, state that you typically inspect the data for missing values and then decide on an appropriate strategy—filling in the gaps (imputation) using mean, median, or mode values, or removing incomplete records if they are few in number or non-critical. Emphasize that the choice depends on the context and size of the dataset, and this would be a good time to ask what types of data they deal with in their company.
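
If the interviewer asks you to make this concrete, a minimal NumPy sketch along these lines can help. It is only an illustration under simple assumptions: a tiny made-up feature matrix X with one missing value, mean imputation, and then standardization and min-max scaling applied column-wise.

import numpy as np

# Toy feature matrix: 4 samples, 2 features, one missing value (np.nan).
X = np.array([[50.0, 0.2],
              [60.0, np.nan],
              [55.0, 0.4],
              [65.0, 0.6]])

# 1) Mean imputation: replace NaNs with the column mean of the observed values.
col_means = np.nanmean(X, axis=0)
X_imputed = np.where(np.isnan(X), col_means, X)

# 2) Standardization: zero mean, unit variance per feature.
X_standardized = (X_imputed - X_imputed.mean(axis=0)) / X_imputed.std(axis=0)

# 3) Min-max normalization: scale each feature into [0, 1].
X_minmax = (X_imputed - X_imputed.min(axis=0)) / (X_imputed.max(axis=0) - X_imputed.min(axis=0))

print(X_standardized)
print(X_minmax)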

Note: Tailor your answer according to the role you are applying for. Mentioning that you know advanced techniques like embeddings when relevant shows you can adapt your approach based on context.

How to initialize parameters

Just as every great chef selects the right blend of spices and ingredients before starting the cooking process, a neural network’s performance heavily depends on how its trainable parameters—weights and biases—are initialized.

  • Weights determine how input features are combined and transformed as they pass through the network. Rather than initializing weights to zero, you should set them to small random values. This randomness is critical because it breaks symmetry. Without it, every neuron in a layer would begin with the same values and learn identical features, severely limiting the network’s ability to capture diverse patterns.

  • Biases are added to each neuron’s weighted input and help shift the activation function. Unlike weights, it is common to initialize biases to zero or a small constant. This is generally acceptable because biases do not suffer from the symmetry problem.

Interviewers expect you to note that initializing weights to small random values prevents neurons from learning identical features. They want to see that you understand this foundational concept in neural network training. While you don’t need to list every method, a brief mention of common approaches, such as sampling from a normal distribution, can illustrate that you know techniques to set your network up for efficient learning.
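
For example, a minimal initialization sketch (layer sizes chosen purely for illustration) might sample weights from a scaled normal distribution and start biases at zero, mirroring the setup used in the code examples later in this lesson:

import numpy as np

np.random.seed(0)  # For reproducibility

n_features, n_hidden, n_output = 2, 3, 1  # Illustrative layer sizes

# Small random weights break symmetry so neurons in a layer learn different features.
W1 = np.random.randn(n_features, n_hidden) * 0.01
W2 = np.random.randn(n_hidden, n_output) * 0.01

# Biases can safely start at zero; they do not suffer from the symmetry problem.
b1 = np.zeros((1, n_hidden))
b2 = np.zeros((1, n_output))

print(W1, b1, W2, b2, sep="\n")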

Remember, the interviewer isn’t just after a rote response—they want to see that you know the reasoning behind these choices!

How to explain forward propagation effectively

Once your data is prepped and your parameters are in place, it’s time to start cooking. The forward pass is where the neural network mixes the ingredients to produce the final dish—the output prediction. Let’s take a look at what happens step-by-step:

  • The forward pass begins with a linear combination of inputs, $z = XW + b$, where $X$ is the input data matrix and $W$ is the weight matrix. This operation is a matrix multiplication followed by adding the bias, $b$. It creates a new representation of the data that highlights various features learned by the network.

  • The next step involves applying a nonlinear activation function (such as Sigmoid, ReLU, or Tanh) to the linear output, $a = \text{activation}(z)$. Without a nonlinear activation, no matter how many layers you stack, the model would still act as a single linear transformation. Nonlinearity empowers the network to learn and model complex, nonlinear relationships in the data. For example, using ReLU (Rectified Linear Unit) introduces nonlinearity by converting all negative values in $z$ to zero, while letting positive values pass through unchanged. This simple operation enables the network to capture intricate patterns that a purely linear model would miss.

  • In deeper networks, the output of one layer ($a$) becomes the input to the next. This layered approach allows the network to progressively extract higher-level features from the raw input data.

Educative byte: In a basic neural network designed for binary classification, the final layer typically applies a Sigmoid activation to produce an output between 0 and 1. This output represents the probability that a given input belongs to a certain class.


In contrast, for multi-class classification problems—where the input could belong to one of several classes—the final layer usually applies a Softmax activation. Softmax outputs a probability distribution across all possible classes, ensuring the probabilities sum to 1. The predicted class is the one with the highest probability.

For regression tasks, where the goal is to predict continuous values, the output layer might not use any activation function at all (i.e., linear activation), depending on the range and nature of the target variable.
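
If you want to back this up with a quick snippet during the discussion, a minimal sketch of the three output choices could look like the following; the logits are made-up numbers, and the Softmax subtracts the maximum logit for numerical stability:

import numpy as np

z = np.array([2.0, -1.0, 0.5])  # Hypothetical logits from the final layer

# Binary classification: Sigmoid squashes each logit into (0, 1).
sigmoid_out = 1 / (1 + np.exp(-z))

# Multi-class classification: Softmax turns the logits into a distribution that sums to 1.
exp_z = np.exp(z - np.max(z))  # Subtract the max for numerical stability.
softmax_out = exp_z / exp_z.sum()

# Regression: a linear (identity) output leaves the values unchanged.
linear_out = z

print(sigmoid_out, softmax_out, softmax_out.sum(), linear_out, sep="\n")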

Interviewers want to see that you can explain how data flows through the network and whether you understand how each layer’s transformation builds on the previous layer’s output. An interviewer might ask, “Why choose ReLU over Sigmoid in hidden layers?” They’re looking for an explanation that covers practical considerations such as preventing vanishing gradients and the efficiency of computation. It’s important to articulate that the forward pass isn’t just a series of equations—it’s the process by which the network “cooks” the raw data into a final prediction that can be compared against the ground truth.
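
One way to ground the ReLU-versus-Sigmoid answer is to compare their gradients directly. The short sketch below (with arbitrary sample values) shows that Sigmoid’s derivative never exceeds 0.25 and shrinks rapidly for large inputs, which is what drives vanishing gradients in deep stacks of Sigmoid layers, whereas ReLU’s gradient stays at 1 for any positive input and is also cheaper to compute:

import numpy as np

z = np.array([-4.0, -1.0, 0.5, 4.0])  # Sample pre-activation values

# Sigmoid and its derivative: sigma(z) * (1 - sigma(z)), which is at most 0.25.
sig = 1 / (1 + np.exp(-z))
sig_grad = sig * (1 - sig)

# ReLU and its (sub)gradient: 0 for negative inputs, 1 for positive inputs.
relu = np.maximum(0, z)
relu_grad = (z > 0).astype(float)

print("Sigmoid gradients:", sig_grad)   # Small everywhere, tiny for large |z|
print("ReLU gradients:   ", relu_grad)  # Exactly 1 wherever the unit is active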

How to implement forward propagation

By articulating these steps clearly, you not only show that you know the technical details but also that you understand the underlying purpose of each operation. Also, before moving on to backpropagation, it’s important to note that at this stage an interviewer might ask you to implement forward propagation from scratch—without using deep learning libraries like PyTorch. They typically expect you to write simple Python code using a library like NumPy. The goal is to see if you can translate the conceptual understanding you’ve just discussed into code.

Try coding a forward propagation function for a simple neural network in Python using NumPy. The network should have:

  • One hidden layer

  • A linear transformation (matrix multiplication plus bias) for each layer

  • A nonlinear activation (e.g., Sigmoid) after the linear step

  • Final output activation using a Sigmoid function (assuming binary classification)

Try it yourself

In the following code skeleton, we’ve included clear comments indicating where you should implement or modify your own code.

import numpy as np

# Define the Sigmoid activation function.
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Forward propagation function for a 2-layer neural network.
def forward_propagation(X, W1, b1, W2, b2):
    """
    Arguments:
    X -- Input data of shape (n_samples, n_features)
    W1 -- Weights for the first layer (shape: [n_features, n_hidden])
    b1 -- Biases for the first layer (shape: [1, n_hidden])
    W2 -- Weights for the output layer (shape: [n_hidden, 1])
    b2 -- Bias for the output layer (shape: [1, 1])

    Returns:
    a2 -- Output of the network (predictions, shape: [n_samples, 1])
    a1 -- Activation from the hidden layer (for potential use in backpropagation)
    z1 -- Linear component for the first layer (for debugging or backprop)
    z2 -- Linear component for the output layer (for debugging or backprop)
    """
    # TODO: Compute the first layer linear transformation:
    # TODO: Compute the activation for the first layer:
    # TODO: Compute the second layer linear transformation:
    # TODO: Compute the output layer activation:
    # TODO: Return a2, a1, z1, z2
    pass

# Example usage:
if __name__ == "__main__":
    # TODO: Set the random seed for reproducibility.
    # TODO: Define sample input data.
    # TODO: Initialize the parameters W1, b1, W2, b2 for your neural network.
    # For instance, for a network with 2 input features, 3 hidden units, and 1 output:
    # W1 = ...
    # b1 = ...
    # W2 = ...
    # b2 = ...
    # TODO: Run forward propagation by calling forward_propagation(X, W1, b1, W2, b2)
    # TODO: Print the output (a2) of the forward pass.
    pass

Did it work? Great job if you got it right! If you didn’t, don’t be anxious: these things can be tricky to get right on the first attempt, and that’s why we’re here to help.

Solution

You can find a valid solution below to compare your answer with:

import numpy as np

# Define the Sigmoid activation function.
# You can modify or expand this if you want to experiment with other activations.
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Boilerplate function for forward propagation.
# Your task is to correctly implement the steps for the forward pass.
def forward_propagation(X, W1, b1, W2, b2):
    """
    Performs forward propagation for a 2-layer neural network.

    Arguments:
    X -- Input data of shape (n_samples, n_features)
    W1 -- Weights for the first layer (shape: [n_features, n_hidden])
    b1 -- Biases for the first layer (shape: [1, n_hidden])
    W2 -- Weights for the output layer (shape: [n_hidden, 1])
    b2 -- Bias for the output layer (shape: [1, 1])

    Returns:
    a2 -- Output of the network (predictions, shape: [n_samples, 1])
    a1 -- Activation from the hidden layer (for potential use in backpropagation)
    z1 -- Linear component for the first layer (for debugging or backprop)
    z2 -- Linear component for the output layer (for debugging or backprop)
    """

    # --- Implement the first layer linear transformation ---
    # Compute z1 = X * W1 + b1
    # (Ensure the dimensions of X, W1, and b1 are compatible)
    z1 = np.dot(X, W1) + b1  # Candidate: You might write this code yourself.

    # --- Apply the activation function for the hidden layer ---
    # Here we're using Sigmoid, but you could choose ReLU or another activation.
    a1 = sigmoid(z1)  # Candidate: Verify the activation function meets the requirements.

    # --- Implement the second layer linear transformation ---
    # Compute z2 = a1 * W2 + b2
    z2 = np.dot(a1, W2) + b2  # Candidate: Implement as per the above formula.

    # --- Apply the activation function for the output layer ---
    # Using Sigmoid for binary classification; again adjust if necessary.
    a2 = sigmoid(z2)  # Candidate: Ensure this step produces output probabilities.

    return a2, a1, z1, z2

# Example usage to test your forward propagation function.
if __name__ == '__main__':
    np.random.seed(42)  # For reproducibility

    # Sample input: 2 samples, each with 2 features.
    X = np.array([[0.5, -0.2], [0.1, 0.4]])

    # Initialize parameters for a network with 2 input features, 3 hidden units, and 1 output unit.
    W1 = np.random.randn(2, 3) * 0.01  # 2x3 matrix for the first layer.
    b1 = np.zeros((1, 3))  # 1x3 bias vector for the first layer.
    W2 = np.random.randn(3, 1) * 0.01  # 3x1 matrix for the output layer.
    b2 = np.zeros((1, 1))  # 1x1 bias for the output layer.

    # Run forward propagation
    a2, a1, z1, z2 = forward_propagation(X, W1, b1, W2, b2)
    print("Output probabilities (a2):")
    print(a2)

In the code above:

  • Lines 5–6: Defines the Sigmoid activation function, which takes an input z and returns a value between 0 and 1. This function will be applied to the outputs of each layer to introduce nonlinearity into the model.

  • Line 31: Computes the linear transformation for the first layer by performing a matrix multiplication of the input data X and the weight matrix W1, then adding the bias b1. This produces the linear combination z1.

  • Line 35: Applies the Sigmoid activation function to the output z1 of the first layer. This transforms the linear combination into a nonlinear activation a1, which is necessary to capture complex patterns in the data.

  • Line 39: Performs a similar linear transformation for the second layer. Here, the activated output a1 from the first layer is multiplied by the weights W2 of the second layer, and the bias b2 is added. This produces the linear combination z2 at the output layer.

  • Line 43: Applies the Sigmoid function to z2 to generate the final output a2 of the network. For binary classification tasks, this output represents the predicted probabilities that an input belongs to a particular class.

  • Line 45: Returns the final output a2, as well as intermediate values a1, z1, and z2. These intermediate values can be useful later for backpropagation or debugging purposes.

  • Line 49: The random seed is set for reproducibility, ensuring that random operations (like parameter initialization) yield the same results each time.

  • Line 52: Creates a sample input dataset X as a NumPy array with 2 samples, each having 2 features. This data serves as an example input for testing the forward propagation function.

  • Lines 55–58: Randomly initializes the weight matrices W1 and W2 with small values (scaled by 0.01) and initializes the bias vectors b1 and b2 to zeros. This setup is for a network with 2 input features, a hidden layer containing 3 units, and an output layer with 1 unit.

  • Lines 61–63: Calls the forward_propagation function with the sample input data and parameters, and then prints the output probabilities (a2). This demonstrates the full end-to-end forward pass of the network.

You can see how each block of code fits into the overall structure of the neural network's forward pass. This understanding is essential to implement the code and articulate the process effectively during an interview.

How to explain backpropagation effectively

Backpropagation is the mechanism that computes the gradient (i.e., the derivative) of the loss function with respect to each trainable parameter (weights and biases). These gradients tell you how much each parameter influenced the error—the off flavor in your dish. By adjusting the parameters in the opposite direction of these gradients (scaled by a learning rate), you iteratively improve the network’s performance.

Let’s take a look at how this happens:

  • After performing the forward pass (the “cooking”), the network produces a prediction. You then compare this prediction to the actual values using a loss function—such as binary cross-entropy for classification or mean squared error for regression. This loss represents how “off” the dish is, or in other words, how far the network's prediction is from the truth.

  • Backpropagation uses the chain rule from calculus to figure out how much each parameter (like weights and biases) influenced the error. Imagine you’re trying to determine how a small change in salt affects the taste of your dish; you’d measure that sensitivity with a derivative. In the network, you calculate these derivatives—or gradients—starting at the output layer and moving back to the input layer. Essentially, you multiply the local gradients at each layer to see how a small change in a parameter affects the final loss.

  • Once you know these gradients, you update each parameter in the opposite direction of its gradient. This is like reducing the amount of salt if the dish is too salty. Mathematically, every parameter $\theta$ (which could be a weight or a bias) is updated as follows:

$$\theta \leftarrow \theta - \eta \frac{\partial L}{\partial \theta}$$

Where $\eta$ is the learning rate, which controls how big a step you take in adjusting the parameter, and $\partial L / \partial \theta$ is the gradient of the loss with respect to that parameter.

By following these steps—calculating the error, using the chain rule to compute gradients, and then updating the parameters—you iteratively reduce the network’s error, much like a chef fine-tuning a recipe over multiple tastings. This systematic approach is what allows neural networks to learn and improve over time.
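
To make the last two steps concrete, here is a minimal sketch of a binary cross-entropy loss and a single gradient-descent update. The predictions, labels, parameter, gradient, and the learning rate of 0.1 are all made-up values for illustration:

import numpy as np

# Hypothetical values for illustration only.
a2 = np.array([[0.8], [0.3]])     # Network outputs from the forward pass
y = np.array([[1.0], [0.0]])      # Ground-truth labels
W = np.array([[0.5], [-0.2]])     # Some trainable parameter theta
dW = np.array([[0.04], [-0.01]])  # Its gradient from backpropagation

# Binary cross-entropy loss, averaged over the batch ("how off is the dish?").
eps = 1e-12  # Avoid log(0)
loss = -np.mean(y * np.log(a2 + eps) + (1 - y) * np.log(1 - a2 + eps))

# Gradient-descent update: theta <- theta - eta * dL/dtheta.
eta = 0.1
W = W - eta * dW

print("Loss:", loss)
print("Updated parameter:\n", W)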

How to implement backpropagation

After mastering forward propagation, interviewers may ask you to implement backpropagation from scratch. This step is crucial: it shows that you understand how to compute the gradient (or derivative) of the loss function with respect to each trainable parameter so you can adjust them in order to reduce the error. In our “busy kitchen” analogy, after tasting the dish (calculating the loss), backpropagation tells you how much each ingredient (parameter) affected the overall flavor, and then you tweak them accordingly.


The interviewer will expect you to know that backpropagation uses the chain rule from calculus to compute gradients for each layer starting from the output and moving backward. You should also be able to explain that you first compute how “off” the output is (the error), then determine how sensitive this error is to each parameter by propagating the error backward. Consider the following problem:

Implement backpropagation for a simple two-layer neural network using NumPy. Your network should already have performed a forward pass (with one hidden layer and a final Sigmoid output). Now, your task is to compute the gradients for:

  • The output layer, using the derivative of the loss with respect to its linear output.

  • The hidden layer, by “backpropagating” the error using the chain rule.

  • Then, return the gradients for both layers.

Try it yourself

In the following code skeleton, we’ve included clear comments indicating where you should implement or modify your own code.

import numpy as np

# Define the Sigmoid activation function.
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Define the derivative of the Sigmoid activation function.
def sigmoid_deriv(a):
    # TODO: Implement the derivative of the Sigmoid function.
    pass

# Forward propagation function (for context).
def forward_propagation(X, W1, b1, W2, b2):
    z1 = np.dot(X, W1) + b1
    a1 = sigmoid(z1)
    z2 = np.dot(a1, W2) + b2
    a2 = sigmoid(z2)
    return a2, a1, z1, z2

# Backward propagation function for a 2-layer neural network.
def backward_propagation(X, y, W2, a1, a2, z1):
    """
    Arguments:
    X -- Input data of shape (n_samples, n_features)
    y -- True labels (vector of shape (n_samples,))
    W2 -- Weights for the output layer (shape: [n_hidden, 1])
    a1 -- Activation from the hidden layer (shape: [n_samples, n_hidden])
    a2 -- Output of the network (shape: [n_samples, 1])
    z1 -- Linear component for the first layer

    Returns:
    dW1 -- Gradient with respect to W1
    db1 -- Gradient with respect to b1
    dW2 -- Gradient with respect to W2
    db2 -- Gradient with respect to b2
    """
    # TODO: Determine the number of samples (m).
    # TODO: Compute the gradient of the loss with respect to z2.
    # Hint: For binary classification with a Sigmoid output and BCE loss, dz2 = a2 - y.
    # TODO: Compute the gradients for the output layer:
    # dW2 = (a1^T dot dz2) / m, and db2 = sum(dz2) / m.
    # TODO: Backpropagate the error to the hidden layer:
    # Compute dz1 = (dz2 dot W2^T) * sigmoid_deriv(a1).
    # TODO: Compute the gradients for the hidden layer:
    # dW1 = (X^T dot dz1) / m, and db1 = sum(dz1) / m.
    # TODO: Return dW1, db1, dW2, db2.
    pass

# Example usage:
if __name__ == "__main__":
    np.random.seed(42)  # For reproducibility

    # Sample input data: 2 samples with 2 features each.
    X = np.array([[0.5, -0.2],
                  [0.1, 0.4]])

    # Sample true labels.
    y = np.array([1, 0])

    # Initialize parameters for a network with 2 input features, 3 hidden units, and 1 output unit.
    W1 = np.random.randn(2, 3) * 0.01
    b1 = np.zeros((1, 3))
    W2 = np.random.randn(3, 1) * 0.01
    b2 = np.zeros((1, 1))

    # Run forward propagation.
    a2, a1, z1, z2 = forward_propagation(X, W1, b1, W2, b2)

    # TODO: Run backward propagation by calling backward_propagation with the appropriate arguments.
    # TODO: Print the gradients (dW1, db1, dW2, db2).
    pass

Did it work? Great job if you got it right! If you didn’t, don’t be anxious: these things can be tricky to get right on the first attempt, and that’s why we’re here to help.

Solution

You can find a valid solution below to compare your answer with:

import numpy as np

# Define the Sigmoid activation function.
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Define the derivative of the Sigmoid function.
def sigmoid_deriv(a):
    return a * (1 - a)

# Forward propagation function (for context).
def forward_propagation(X, W1, b1, W2, b2):
    z1 = np.dot(X, W1) + b1
    a1 = sigmoid(z1)
    z2 = np.dot(a1, W2) + b2
    a2 = sigmoid(z2)
    return a2, a1, z1, z2

# Backward propagation function for a 2-layer neural network.
def backward_propagation(X, y, W2, a1, a2):
    """
    Arguments:
    X -- Input data of shape (n_samples, n_features)
    y -- True labels (vector of shape (n_samples,))
    W2 -- Weights for the output layer (shape: [n_hidden, 1])
    a1 -- Activation from the hidden layer (shape: [n_samples, n_hidden])
    a2 -- Output from the network (shape: [n_samples, 1])

    Returns:
    dW1 -- Gradient of the loss with respect to W1
    db1 -- Gradient of the loss with respect to b1
    dW2 -- Gradient of the loss with respect to W2
    db2 -- Gradient of the loss with respect to b2
    """

    m = X.shape[0]  # Number of samples

    # Compute gradient of loss with respect to z2.
    # For binary classification with sigmoid and BCE loss, this simplifies to:
    dz2 = a2 - y.reshape(-1, 1)

    # Compute gradients for the output layer parameters.
    dW2 = np.dot(a1.T, dz2) / m
    db2 = np.sum(dz2, axis=0, keepdims=True) / m

    # Backpropagate to the hidden layer.
    dz1 = np.dot(dz2, W2.T) * sigmoid_deriv(a1)
    dW1 = np.dot(X.T, dz1) / m
    db1 = np.sum(dz1, axis=0, keepdims=True) / m

    return dW1, db1, dW2, db2

# Example usage:
if __name__ == "__main__":
    np.random.seed(42)  # For reproducibility

    # Sample input data: 2 samples with 2 features each.
    X = np.array([[0.5, -0.2],
                  [0.1, 0.4]])

    # Sample true labels.
    y = np.array([1, 0])

    # Initialize parameters for a network with 2 input features, 3 hidden units, and 1 output unit.
    W1 = np.random.randn(2, 3) * 0.01
    b1 = np.zeros((1, 3))
    W2 = np.random.randn(3, 1) * 0.01
    b2 = np.zeros((1, 1))

    # Run forward propagation.
    a2, a1, z1, z2 = forward_propagation(X, W1, b1, W2, b2)
    # Run backward propagation.
    dW1, db1, dW2, db2 = backward_propagation(X, y, W2, a1, a2)

    # Print the computed gradients.
    print("Gradient for W1 (dW1):")
    print(dW1)
    print("Gradient for b1 (db1):")
    print(db1)
    print("Gradient for W2 (dW2):")
    print(dW2)
    print("Gradient for b2 (db2):")
    print(db2)

In the code above:

  • Lines 8–9: Defines the derivative of the Sigmoid function. This derivative is crucial in backpropagation for determining how sensitive the activation is to changes in its input. It computes the gradient of the Sigmoid function given its activated output.

  • Line 36: Determines the number of samples m from the input data X.

  • Line 40: Computes dz2, the gradient of the loss with respect to z2. For binary classification with a Sigmoid output and binary cross-entropy loss, this simplifies to a2 − y.

  • Lines 43–44: Calculates gradients for the output layer. dW2 is computed by taking the dot product of the transpose of a1 (the hidden activations) and dz2, then dividing by m. Also, db2 is computed by summing dz2 over all samples and dividing by m.

  • Line 47: Backpropagates the error to the hidden layer. Here, dz1 is computed by multiplying the dot product of dz2 and the transpose of W2 with the derivative of the Sigmoid function applied to a1.

  • Lines 48–49: Computes gradients for the hidden layer. dW1 is the dot product of the transpose of the input X and dz1, averaged over all samples. db1 is the sum of dz1 across the batch, divided by m.

  • Line 51: Returns all the computed gradients: dW1, db1, dW2, and db2.

  • Line 73: The backward_propagation function is called with the input data, true labels, output layer weights, and intermediate activations. The gradients computed here can be used to update the weights and biases during training.

Interviewers want to see that you can clearly articulate how the model learns by adjusting its parameters based on the error from the predictions. They’ll check if you understand how backpropagation uses the chain rule to propagate error gradients from the output backward through each layer, updating every weight and bias. An interviewer might ask, “How does the error computed at the output layer get propagated back to the earlier layers?” They’re looking for a detailed explanation that covers how the gradients are computed—for example, why the derivative of the Sigmoid function matters—and how these gradients are averaged over the dataset and used for parameter updates. This shows that you grasp not only the individual equations but also the overall learning process.
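
To tie the two halves together, interviewers sometimes ask how these pieces combine into a training loop. A minimal sketch, reusing the forward_propagation and backward_propagation functions defined in the solutions above and the same toy data (the learning rate and number of epochs are arbitrary choices), might look like this:

import numpy as np

# Assumes forward_propagation and backward_propagation are defined as in the
# solutions above; the data and initialization below mirror that example.
np.random.seed(42)
X = np.array([[0.5, -0.2], [0.1, 0.4]])
y = np.array([1, 0])
W1 = np.random.randn(2, 3) * 0.01
b1 = np.zeros((1, 3))
W2 = np.random.randn(3, 1) * 0.01
b2 = np.zeros((1, 1))

learning_rate = 0.1
for epoch in range(1000):
    # Forward pass: compute predictions and cache intermediate values.
    a2, a1, z1, z2 = forward_propagation(X, W1, b1, W2, b2)

    # Backward pass: compute gradients of the loss with respect to each parameter.
    dW1, db1, dW2, db2 = backward_propagation(X, y, W2, a1, a2)

    # Gradient-descent update: move each parameter against its gradient.
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2

# Fresh forward pass with the final parameters.
a2, _, _, _ = forward_propagation(X, W1, b1, W2, b2)
print("Predictions after training:")
print(a2)

In practice you would also track the loss per epoch and adjust the learning rate or stop early if training plateaus, but this loop captures the cycle of forward pass, backward pass, and parameter update that the whole lesson builds toward.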

Conclusion

Implementing the forward and backward pass is a fundamental step in mastering neural networks—and it's a true rite of passage for anyone delving into deep learning.

Once you’ve demonstrated a solid grasp of the basics, interviewers may follow up with deeper conceptual questions to assess your intuition and problem-solving ability. These might include:

  • “Why can't we initialize all weights to zero?”

  • “How is backpropagation related to the chain rule?”

  • “Why is ReLU often preferred over Tanh in deeper networks?”

  • “How does batch size influence training dynamics and convergence?”

Preparing thoughtful answers to these kinds of questions can further showcase your depth and readiness for real-world challenges.