
Neural Networks Training

Explore fundamental techniques in neural network training including data preprocessing, forward and backward propagation, parameter initialization, and optimization. This lesson prepares you to explain how neural networks learn, implement training steps in code, and understand essential interview questions in AI engineering roles.

A staple query in modern AI and ML interviews focuses on the fundamental mechanisms that enable neural networks to learn from data. This topic appears consistently across technical interviews for AI engineering roles because it tests several essential competencies that employers look for when evaluating candidates.

Interviewers use it to assess several dimensions of technical competence:

  • Understanding of core mathematical concepts, such as the chain rule.

  • Ability to translate theoretical principles into working code.

  • Awareness of why each step in the process is necessary, not just how it functions.

A basic implementation with explanation typically takes 15–25 minutes. Senior candidates may be asked to expand the discussion to include optimization techniques, alternative architectures, or debugging strategies. Mastering this question prepares you for natural follow-ups, such as vanishing gradients, batch normalization, and modern optimization methods, providing a strong foundation for demonstrating expertise.

What are forward and backward propagation in a neural network?

Neural networks learn by passing inputs through layers that apply a linear transformation and a nonlinearity. Their trainable parameters are weights and biases, which are adjusted so the network’s outputs match the targets. A loss function measures the mismatch.

Training consists of two phases: the forward pass computes predictions, and the backward pass applies the chain rule to compute gradients of the loss with respect to each parameter. An optimizer such as stochastic gradient descent updates parameters in the direction that reduces the loss, scaled by a learning rate.

Effective training also depends on good preprocessing, sensible initialization, suitable activation functions, and selecting the appropriate final layer and loss function. ReLU supports gradient flow in deep models, and softmax with cross-entropy is standard for multiclass classification.

This short answer is interview-friendly: it clearly states what a neural network does, how it learns, and which practical aspects matter. It’s intentionally concise; the sections below dig into the forward/backward math, code, initialization, and data preparation. Upcoming lessons also cover common follow-ups, such as vanishing gradients, batch normalization, and optimization strategies, so you can expand and defend any part of this answer during an interview.

Interview trap: An interviewer might ask, “Does the backward pass happen when we use the model to predict a cat in a new photo?” and candidates often respond, “Yes.” However, that’s incorrect! Backpropagation is only used during training to update weights. During inference (prediction), only the forward pass occurs.

Can you explain the key components of a neural network and how they work together?

Interviewers ask this to see whether you understand a neural network as an integrated system rather than a chain of formulas. A strong answer shows how data flows through the model and why each part is necessary.

  • Input data: This is the raw information the network relies on to learn patterns—numerical features, images, audio signals, text embeddings, anything representable as numbers. Its quality has a direct impact on the model’s performance. Before training even begins, the data often needs to be cleaned, scaled, encoded, or imputed. Normalization keeps values in a sensible range, so gradients behave predictably. Handling missing or corrupted values prevents the network from learning noise rather than structure. Careful preprocessing ensures that the model sees consistent, meaningful signals instead of being overwhelmed by mismatched scales or irrelevant variation.

  • Weights and biases: These are the adjustable numerical parameters that define how information moves through the network. Each weight controls how strongly one unit influences another, shaping the computation the network performs. Biases act as offsets, giving each neuron the flexibility to shift its activation threshold. During training, the backward pass computes gradients that indicate how each weight and bias should change to reduce error. Over many iterations, this tuning process embeds the network’s knowledge in these parameters. After training, the weights and biases alone determine how new inputs are transformed into predictions.

Educative Byte: In many trained models, the values of the weights and biases—just long arrays of numbers—are the entire model. Everything the network has “learned” is stored in those parameters.

  • Forward pass: The process that transforms input data through matrix multiplications and activation functions to produce an output prediction. Each layer builds on the previous layer’s output, creating increasingly abstract representations.

  • Backward pass: The mechanism for computing how much each parameter contributed to the prediction error, then updating the parameters to improve performance. It uses the chain rule from calculus to propagate error gradients backward through the network.

Quick answer for interview: Inputs are transformed layer-by-layer: each layer applies a linear map (weights + biases) followed by a nonlinearity to produce progressively higher-level representations (forward pass). The backward pass uses the chain rule to compute the gradients of the loss with respect to those parameters, and an optimizer updates the weights. Good preprocessing and proper initialization help keep training stable.

We will examine each step in detail in the coming sections.

How do you prepare input data before training a neural network?

Raw data rarely arrives in a form suitable for learning. Just as a good recipe depends on well-prepared ingredients, a neural network depends on clean, consistent input. Interviewers ask about preprocessing—scaling, normalization, handling missing values, encoding categories—to see whether you understand that data preparation is a core part of model development rather than a side task.

At this stage, they don’t expect a catalog of every preprocessing method. They want to hear that you know the fundamentals and can explain why they matter. Scaling features—using standardization or min-max normalization—keeps input values on comparable ranges, which helps gradient-based methods behave predictably. Mentioning this shows you understand that preprocessing affects stability and convergence.

Interview trap: An interviewer might ask, “To make things efficient, should we normalize the entire dataset before splitting it into training and testing sets?” and candidates often say, “Yes, that keeps the code cleaner.”

However, that’s incorrect! This causes data leakage, where information from your test set (such as the mean and variance) “leaks” into your training process, resulting in artificially high accuracy scores that won’t hold up in production. You must first split the data, then calculate statistics only on the training set.
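
To make this concrete, here is a minimal NumPy sketch of leakage-safe standardization using a small hypothetical feature matrix; the key point is that the mean and standard deviation come from the training split only:

import numpy as np

# Hypothetical data: 6 samples, 2 features on very different scales.
X = np.array([[1.0, 200.0], [2.0, 180.0], [3.0, 220.0],
              [4.0, 210.0], [5.0, 190.0], [6.0, 230.0]])

# Split FIRST: here, the first 4 rows are training, the last 2 are test.
X_train, X_test = X[:4], X[4:]

# Compute statistics on the training split only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# Apply the same training statistics to both splits, so no test-set
# information leaks into the preprocessing.
X_train_scaled = (X_train - mu) / sigma
X_test_scaled = (X_test - mu) / sigma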

You should also mention that you check for missing or corrupted values and choose a strategy based on context: imputing with the mean, median, or mode; using more sophisticated imputation if appropriate; or removing incomplete rows when they are few or unimportant. The key is showing that the decision depends on the dataset’s size, purpose, and structure.
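
As a small illustration, here is one way to impute missing values with the median in NumPy (a toy example; in practice the choice between mean, median, mode, or dropping rows depends on the dataset, as noted above):

import numpy as np

# Hypothetical feature column with missing values encoded as NaN.
col = np.array([3.2, np.nan, 4.1, 5.0, np.nan, 3.8])

# Median imputation: compute the median over observed values only,
# then fill the gaps with it.
median = np.nanmedian(col)
col_imputed = np.where(np.isnan(col), median, col)
print(col_imputed)  # [3.2  3.95 4.1  5.   3.95 3.8 ]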

If the role involves text, images, or more complex modalities, it’s appropriate to reference techniques like embeddings or domain-specific transformations to show adaptability without overcomplicating your answer.

Quick answer for interview: Before training a neural network, I clean and preprocess the input data so the model receives consistent signals. This usually includes scaling features, handling missing values, and encoding categories when necessary. Good preprocessing stabilizes training and ensures each feature contributes meaningfully to the learning process.

How do you initialize the parameters of a neural network, and why does it matter?

Just as every great chef selects the right blend of spices and ingredients before starting the cooking process, a neural network’s performance heavily depends on how its trainable parameters—weights and biases—are initialized.

  • Weights determine how input features are combined and transformed as they pass through the network. Rather than initializing weights to zero, you should set them to small random values. This randomness is critical because it breaks symmetry. Without it, every neuron in a layer would begin with the same values and learn identical features, severely limiting the network’s ability to capture diverse patterns.

Interview trap: An interviewer might ask, “Since we want the model to start neutral, why don’t we just initialize all the weights to zero?” However, do remember that if all weights are zero, every neuron in a hidden layer receives the same input and calculates the same gradient. They will all update in the exact same way, effectively making your massive neural network act like a single neuron. We must use random initialization to break this symmetry.

  • Biases are added to each neuron’s weighted input and help shift the activation function. Unlike weights, it is common to initialize biases to zero or a small constant. This is generally acceptable because biases do not suffer from the symmetry problem.

Interviewers don’t expect a list of every possible initialization strategy, but mentioning that you typically use small random values drawn from a normal or uniform distribution shows that you know the standard approaches. Most importantly, you should convey that good initialization helps the model learn efficiently and avoids early training difficulties.
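
A minimal sketch of this kind of initialization is below, matching the simple fixed-scale scheme used in this lesson’s code widgets (the helper name init_layer is just for illustration). Schemes such as Xavier/Glorot or He initialization, which scale the randomness by layer size, are common refinements worth naming in an interview:

import numpy as np

def init_layer(n_in, n_out, scale=0.01, seed=None):
    """Small random weights break symmetry; zero biases are fine."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, scale, size=(n_in, n_out))  # small random weights
    b = np.zeros((1, n_out))                        # biases start at zero
    return W, b

# Example: a layer with 2 inputs and 3 hidden units.
W1, b1 = init_layer(2, 3, seed=42)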

Quick answer for interview: I initialize weights with small, random values to break symmetry, allowing neurons to learn different features. Biases usually start at zero or a small constant since they don’t cause symmetry issues. Proper initialization gives the network a stable starting point and helps training converge smoothly.

Remember, the interviewer isn’t just after a rote response—they want to see that you know the reasoning behind these choices!

Can you explain how forward propagation works in a neural network?

Once your data is prepped and your parameters are in place, it’s time to start cooking. The forward pass is where the neural network mixes the ingredients to produce the final dish—the output prediction. Let’s take a look at what happens step-by-step:

  • The forward pass begins with a linear combination of inputs, z = XW + b, where X is the input data matrix, W is the weight matrix, and b is the bias. This operation is a matrix multiplication followed by a bias addition, and it creates a new representation of the data that highlights various features learned by the network.

  • The next step involves applying a nonlinear activation function (such as Sigmoid, ReLU, or Tanh) to the linear output, a = activation(z). Without a nonlinear activation, no matter how many layers you stack, the model would still act as a single linear transformation. Nonlinearity empowers the network to learn and model complex, nonlinear relationships in the data. For example, ReLU (Rectified Linear Unit) introduces nonlinearity by converting all negative values in z to zero, while letting positive values pass through unchanged (see the short sketch after this list). This simple operation enables the network to capture intricate patterns that a purely linear model would miss.

  • In deeper networks, the output of one layer (a) becomes the input to the next. This layered approach allows the network to progressively extract higher-level features from the raw input data.
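
Here is a short sketch of the activations mentioned above, implemented in NumPy so you can see how simple they are (the printed values are for a small hypothetical input):

import numpy as np

def relu(z):
    # Negative values become zero; positive values pass through unchanged.
    return np.maximum(0, z)

def sigmoid(z):
    # Squashes any real number into the open interval (0, 1).
    return 1 / (1 + np.exp(-z))

def tanh(z):
    # Squashes values into (-1, 1), centered at zero.
    return np.tanh(z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))     # [0.  0.  0.  1.5]
print(sigmoid(z))  # values strictly between 0 and 1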

Educative Byte: In a basic neural network designed for binary classification, the final layer typically applies a Sigmoid activation to produce an output between 0 and 1. This output represents the probability that a given input belongs to a certain class.


In contrast, for multi-class classification problems—where the input could belong to one of several classes—the final layer usually applies a Softmax activation. Softmax outputs a probability distribution across all possible classes, ensuring the probabilities sum to 1. The predicted class is the one with the highest probability.

For regression tasks, where the goal is to predict continuous values, the output layer might not use any activation function at all (i.e., linear activation), depending on the range and nature of the target variable.
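
To illustrate the multi-class case, here is a sketch of a Softmax output layer paired with cross-entropy loss (the logits and labels are hypothetical; subtracting the row-wise max is a standard numerical-stability trick):

import numpy as np

def softmax(z):
    # Subtract the row-wise max before exponentiating for numerical stability.
    exp_z = np.exp(z - z.max(axis=1, keepdims=True))
    return exp_z / exp_z.sum(axis=1, keepdims=True)

# Hypothetical logits for 2 samples and 3 classes.
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.2]])
probs = softmax(logits)
print(probs.sum(axis=1))     # each row sums to 1
print(probs.argmax(axis=1))  # predicted class per sample

# Cross-entropy against hypothetical integer labels.
y = np.array([0, 1])
loss = -np.log(probs[np.arange(len(y)), y]).mean()
print(loss)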

Interviewers want to see that you can explain how data flows through the network and will check whether you know how each layer’s transformation builds on the previous layer’s output. An interviewer might ask, “Why choose ReLU over Sigmoid in hidden layers?” They’re looking for an explanation that covers practical considerations such as preventing vanishing gradients and the efficiency of computation. It’s important to articulate that the forward pass isn’t just a series of equations—it’s the process by which the network “cooks” the raw data into a final prediction that can be compared against the ground truth.
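
A quick back-of-the-envelope illustration of the vanishing-gradient point: the Sigmoid’s derivative never exceeds 0.25, so repeated multiplication during backprop shrinks gradients rapidly, while ReLU’s derivative is exactly 1 for active units:

# sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)) peaks at 0.25 (at z = 0).
# Even in the best case, chaining 10 sigmoid layers multiplies the
# gradient by at most:
print(0.25 ** 10)  # ~9.5e-07 -- gradients all but vanish

# ReLU's derivative is 1 for positive inputs, so gradients pass through
# active units unshrunk, and max(0, z) is also cheaper to compute.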

How would you implement forward propagation in code?

By articulating these steps clearly, you show not only that you know the technical details but also that you understand the underlying purpose of each operation. Before moving on to backpropagation, note that at this stage an interviewer might ask you to implement forward propagation from scratch—without using deep learning libraries like PyTorch. They typically expect you to write simple Python code using a library like NumPy. The goal is to see if you can translate the conceptual understanding you’ve just discussed into code.

Try coding a forward propagation function for a simple neural network in Python using NumPy. The network should have:

  • One hidden layer.

  • A linear transformation (matrix multiplication plus bias) for each layer.

  • A nonlinear activation (e.g., Sigmoid) after the linear step.

  • Final output activation using a Sigmoid function (assuming binary classification).

Try it yourself

In the following code widget, we’ve included clear comments indicating where you should implement or modify your own code.

Python 3.10.4
import numpy as np

# Define the Sigmoid activation function.
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Forward propagation function for a 2-layer neural network.
def forward_propagation(X, W1, b1, W2, b2):
    """
    Arguments:
    X -- Input data of shape (n_samples, n_features)
    W1 -- Weights for the first layer (shape: [n_features, n_hidden])
    b1 -- Biases for the first layer (shape: [1, n_hidden])
    W2 -- Weights for the output layer (shape: [n_hidden, 1])
    b2 -- Bias for the output layer (shape: [1, 1])
    Returns:
    a2 -- Output of the network (predictions, shape: [n_samples, 1])
    a1 -- Activation from the hidden layer (for potential use in backpropagation)
    z1 -- Linear component for the first layer (for debugging or backprop)
    z2 -- Linear component for the output layer (for debugging or backprop)
    """
    # TODO: Compute the first layer linear transformation:
    # TODO: Compute the activation for the first layer:
    # TODO: Compute the second layer linear transformation:
    # TODO: Compute the output layer activation:
    # TODO: Return a2, a1, z1, z2
    pass

# Example usage:
if __name__ == "__main__":
    # TODO: Set the random seed for reproducibility.
    # TODO: Define sample input data.
    # TODO: Initialize the parameters W1, b1, W2, b2 for your neural network.
    # For instance, for a network with 2 input features, 3 hidden units, and 1 output:
    # W1 = ...
    # b1 = ...
    # W2 = ...
    # b2 = ...
    # TODO: Run forward propagation by calling forward_propagation(X, W1, b1, W2, b2)
    # TODO: Print the output (a2) of the forward pass.
    pass

Did it work? Great job if you got it right! And don’t worry if you didn’t; these things can be tricky on the first attempt, and that’s why we’re here to help.

Solution

You can find a valid solution in the widget below to compare your answer with:

Python 3.10.4
import numpy as np

# Define the Sigmoid activation function.
# You can modify or expand this if you want to experiment with other activations.
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Forward propagation for a 2-layer neural network.
def forward_propagation(X, W1, b1, W2, b2):
    """
    Performs forward propagation for a 2-layer neural network.
    Arguments:
    X -- Input data of shape (n_samples, n_features)
    W1 -- Weights for the first layer (shape: [n_features, n_hidden])
    b1 -- Biases for the first layer (shape: [1, n_hidden])
    W2 -- Weights for the output layer (shape: [n_hidden, 1])
    b2 -- Bias for the output layer (shape: [1, 1])
    Returns:
    a2 -- Output of the network (predictions, shape: [n_samples, 1])
    a1 -- Activation from the hidden layer (for potential use in backpropagation)
    z1 -- Linear component for the first layer (for debugging or backprop)
    z2 -- Linear component for the output layer (for debugging or backprop)
    """
    # --- First layer linear transformation ---
    # Compute z1 = X * W1 + b1
    # (Ensure the dimensions of X, W1, and b1 are compatible.)
    z1 = np.dot(X, W1) + b1
    # --- Activation function for the hidden layer ---
    # Here we're using Sigmoid, but you could choose ReLU or another activation.
    a1 = sigmoid(z1)
    # --- Second layer linear transformation ---
    # Compute z2 = a1 * W2 + b2
    z2 = np.dot(a1, W2) + b2
    # --- Activation function for the output layer ---
    # Using Sigmoid for binary classification; adjust if necessary.
    a2 = sigmoid(z2)
    return a2, a1, z1, z2

# Example usage to test your forward propagation function.
if __name__ == '__main__':
    np.random.seed(42)  # For reproducibility
    # Sample input: 2 samples, each with 2 features.
    X = np.array([[0.5, -0.2], [0.1, 0.4]])
    # Initialize parameters for a network with 2 input features, 3 hidden units, and 1 output unit.
    W1 = np.random.randn(2, 3) * 0.01  # 2x3 matrix for the first layer.
    b1 = np.zeros((1, 3))              # 1x3 bias vector for the first layer.
    W2 = np.random.randn(3, 1) * 0.01  # 3x1 matrix for the output layer.
    b2 = np.zeros((1, 1))              # 1x1 bias for the output layer.
    # Run forward propagation.
    a2, a1, z1, z2 = forward_propagation(X, W1, b1, W2, b2)
    print("Output probabilities (a2):")
    print(a2)

In the code above:

  • Lines 5–6: Defines the Sigmoid activation function, which takes an input z and returns a value between 0 and 1. This function will be applied to the outputs of each layer to introduce nonlinearity into the model.

  • Line 31: Computes the linear transformation for the first layer by performing a matrix multiplication of the input data X and the weight matrix W1, then adding the bias b1. This produces the linear combination z1.

  • Line 35: Applies the Sigmoid activation function to the output z1 of the first layer. This transforms the linear combination into a nonlinear activation a1, which is necessary to capture complex patterns in the data.

  • Line 39: Performs a similar linear transformation for the second layer. Here, the activated output a1 from the first layer is multiplied by the weights W2 of the second layer, and the bias b2 is added. This produces the linear combination z2 at the output layer.

  • Line 43: Applies the Sigmoid function to z2 to generate the final output a2 of the network. For binary classification tasks, this output represents the predicted probabilities that an input belongs to a particular class.

  • Line 45: Returns the final output a2, as well as intermediate values a1, z1, and z2. These intermediate values can be useful later for backpropagation or debugging purposes.

  • Line 49: The random seed is set for reproducibility, ensuring that random operations (like parameter initialization) yield the same results each time.

  • Line 52: Creates a sample input dataset X as a NumPy array with 2 samples, each having 2 features. This data serves as an example input for testing the forward propagation function.

  • Lines 55–58: Randomly initializes the weight matrices W1 and W2 with small values (scaled by 0.01) and initializes the bias vectors b1 and b2 to zeros. This setup is for a network with 2 input features, a hidden layer containing 3 units, and an output layer with 1 unit.

  • Lines 61–63: Calls the forward_propagation function with the sample input data and parameters, and then prints the output probabilities (a2). This demonstrates the full end-to-end forward pass of the network.

You can see how each block of code fits into the overall structure of the neural network's forward pass. This understanding is essential to implement the code and articulate the process effectively during an interview.

Can you explain how backpropagation works and why it’s needed?

Backpropagation is the mechanism that computes the gradient (i.e., the derivative) of the loss function with respect to each trainable parameter (weights and biases). These gradients tell you how much each parameter influenced the error—the off-flavor in your dish. By adjusting the parameters in the opposite direction of these gradients (scaled by a learning rate), you iteratively improve the network’s performance.

Let’s take a look at how this happens:

  • After performing the forward pass (the “cooking”), the network produces a prediction. You then compare this prediction to the actual values using a loss function—such as binary cross-entropy for classification or mean squared error for regression. This loss represents how “off” the dish is, or in other words, how far the network’s prediction is from the truth.

  • Backpropagation uses the chain rule from calculus to figure out how much each parameter (like weights and biases) influenced the error. Imagine you’re trying to determine how a small change in salt affects the taste of your dish; you’d measure that sensitivity with a derivative. In the network, you calculate these derivatives—or gradients—starting at the output layer and moving back to the input layer. Essentially, you multiply the local gradients at each layer to see how a small change in a parameter affects the final loss.

  • Once you know these gradients, you update each parameter in the opposite direction of its gradient. This is similar to reducing the amount of salt if a dish is too salty. Mathematically, every parameter θ (which could be a weight or a bias) is updated as follows:

θ = θ − η · ∂L/∂θ

Where η is the learning rate, which controls how big a step you take in adjusting the parameter.

By following these steps—calculating the error, using the chain rule to compute gradients, and then updating the parameters—you iteratively reduce the network’s error, much like a chef fine-tuning a recipe over multiple tastings. This systematic approach is what allows neural networks to learn and improve over time.
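
To make the update rule concrete, here is a minimal, self-contained sketch of gradient descent on a toy one-parameter loss (a hypothetical example, not this lesson’s network); the same θ = θ − η · ∂L/∂θ step is what an optimizer applies to every weight and bias:

# Toy loss: L(theta) = (theta - 3)^2, with gradient dL/dtheta = 2 * (theta - 3).
theta = 0.0  # initial parameter value
eta = 0.1    # learning rate: how big a step each update takes

for step in range(25):
    grad = 2 * (theta - 3)  # gradient of the loss at the current theta
    theta -= eta * grad     # step opposite to the gradient, scaled by eta

print(theta)  # approaches 3.0, the minimum of the loss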

How would you implement backpropagation during training?

After mastering forward propagation, interviewers may ask you to implement backpropagation from scratch. This step is crucial, as it demonstrates your understanding of how to compute the gradient (or derivative) of the loss function with respect to each trainable parameter, allowing you to adjust them to reduce the error. In our “busy kitchen” analogy, after tasting the dish (calculating the loss), backpropagation tells you how much each ingredient (parameter) affected the overall flavor, and then you tweak them accordingly.

Backpropagation in neural networks

The interviewer will expect you to know that backpropagation uses the chain rule from calculus to compute gradients for each layer, starting from the output and moving backward. You should also be able to explain that you first compute how “off” the output is (the error), then determine how sensitive this error is to each parameter by propagating the error backward. Consider the following problem:

Implement backpropagation for a simple two-layer neural network using NumPy. Your network should already have performed a forward pass (with one hidden layer and a final Sigmoid output). Now, your task is to compute the gradients for:

  • The output layer, using the derivative of the loss with respect to its linear output.

  • The hidden layer, by “backpropagating” the error using the chain rule.

  • Then, return the gradients for both layers.

Try it yourself

In the following code widget, we’ve included clear comments indicating where you should implement or modify your own code.

Python 3.10.4
import numpy as np

# Define the Sigmoid activation function.
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Define the derivative of the Sigmoid activation function.
def sigmoid_deriv(a):
    # TODO: Implement the derivative of the Sigmoid function.
    pass

# Forward propagation function (for context).
def forward_propagation(X, W1, b1, W2, b2):
    z1 = np.dot(X, W1) + b1
    a1 = sigmoid(z1)
    z2 = np.dot(a1, W2) + b2
    a2 = sigmoid(z2)
    return a2, a1, z1, z2

# Backward propagation function for a 2-layer neural network.
def backward_propagation(X, y, W2, a1, a2):
    """
    Arguments:
    X -- Input data of shape (n_samples, n_features)
    y -- True labels (vector of shape (n_samples,))
    W2 -- Weights for the output layer (shape: [n_hidden, 1])
    a1 -- Activation from the hidden layer (shape: [n_samples, n_hidden])
    a2 -- Output of the network (shape: [n_samples, 1])
    Returns:
    dW1 -- Gradient with respect to W1
    db1 -- Gradient with respect to b1
    dW2 -- Gradient with respect to W2
    db2 -- Gradient with respect to b2
    """
    # TODO: Determine the number of samples (m).
    # TODO: Compute the gradient of the loss with respect to z2.
    # Hint: For binary classification with a Sigmoid output and BCE loss, dz2 = a2 - y.
    # TODO: Compute the gradients for the output layer:
    # dW2 = (a1^T dot dz2) / m, and db2 = sum(dz2) / m.
    # TODO: Backpropagate the error to the hidden layer:
    # Compute dz1 = (dz2 dot W2^T) * sigmoid_deriv(a1).
    # TODO: Compute the gradients for the hidden layer:
    # dW1 = (X^T dot dz1) / m, and db1 = sum(dz1) / m.
    # TODO: Return dW1, db1, dW2, db2.
    pass

# Example usage:
if __name__ == "__main__":
    np.random.seed(42)  # For reproducibility
    # Sample input data: 2 samples with 2 features each.
    X = np.array([[0.5, -0.2],
                  [0.1, 0.4]])
    # Sample true labels.
    y = np.array([1, 0])
    # Initialize parameters for a network with 2 input features, 3 hidden units, and 1 output unit.
    W1 = np.random.randn(2, 3) * 0.01
    b1 = np.zeros((1, 3))
    W2 = np.random.randn(3, 1) * 0.01
    b2 = np.zeros((1, 1))
    # Run forward propagation.
    a2, a1, z1, z2 = forward_propagation(X, W1, b1, W2, b2)
    # TODO: Run backward propagation by calling backward_propagation with the appropriate arguments.
    # TODO: Print the gradients (dW1, db1, dW2, db2).
    pass

Did it work? Great job if you got it right! And don’t worry if you didn’t; these things can be tricky on the first attempt, and that’s why we’re here to help.

Solution

You can find a valid solution in the widget below to compare your answer with:

Python 3.10.4
import numpy as np

# Define the Sigmoid activation function.
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Define the derivative of the Sigmoid function.
def sigmoid_deriv(a):
    return a * (1 - a)

# Forward propagation function (for context).
def forward_propagation(X, W1, b1, W2, b2):
    z1 = np.dot(X, W1) + b1
    a1 = sigmoid(z1)
    z2 = np.dot(a1, W2) + b2
    a2 = sigmoid(z2)
    return a2, a1, z1, z2

# Backward propagation function for a 2-layer neural network.
def backward_propagation(X, y, W2, a1, a2):
    """
    Arguments:
    X -- Input data of shape (n_samples, n_features)
    y -- True labels (vector of shape (n_samples,))
    W2 -- Weights for the output layer (shape: [n_hidden, 1])
    a1 -- Activation from the hidden layer (shape: [n_samples, n_hidden])
    a2 -- Output from the network (shape: [n_samples, 1])
    Returns:
    dW1 -- Gradient of the loss with respect to W1
    db1 -- Gradient of the loss with respect to b1
    dW2 -- Gradient of the loss with respect to W2
    db2 -- Gradient of the loss with respect to b2
    """
    m = X.shape[0]  # Number of samples
    # Compute gradient of loss with respect to z2.
    # For binary classification with sigmoid and BCE loss, this simplifies to:
    dz2 = a2 - y.reshape(-1, 1)
    # Compute gradients for the output layer parameters.
    dW2 = np.dot(a1.T, dz2) / m
    db2 = np.sum(dz2, axis=0, keepdims=True) / m
    # Backpropagate to the hidden layer.
    dz1 = np.dot(dz2, W2.T) * sigmoid_deriv(a1)
    dW1 = np.dot(X.T, dz1) / m
    db1 = np.sum(dz1, axis=0, keepdims=True) / m
    return dW1, db1, dW2, db2

# Example usage:
if __name__ == "__main__":
    np.random.seed(42)  # For reproducibility
    # Sample input data: 2 samples with 2 features each.
    X = np.array([[0.5, -0.2],
                  [0.1, 0.4]])
    # Sample true labels.
    y = np.array([1, 0])
    # Initialize parameters for a network with 2 input features, 3 hidden units, and 1 output unit.
    W1 = np.random.randn(2, 3) * 0.01
    b1 = np.zeros((1, 3))
    W2 = np.random.randn(3, 1) * 0.01
    b2 = np.zeros((1, 1))
    # Run forward propagation.
    a2, a1, z1, z2 = forward_propagation(X, W1, b1, W2, b2)
    # Run backward propagation.
    dW1, db1, dW2, db2 = backward_propagation(X, y, W2, a1, a2)
    # Print the computed gradients.
    print("Gradient for W1 (dW1):")
    print(dW1)
    print("Gradient for b1 (db1):")
    print(db1)
    print("Gradient for W2 (dW2):")
    print(dW2)
    print("Gradient for b2 (db2):")
    print(db2)

In the code above:

  • Lines 8–9: Defines the derivative of the Sigmoid function. This derivative is crucial in backpropagation for determining how sensitive the activation is to changes in its input. It computes the gradient of the Sigmoid function given its activated output.

  • Line 36: Determines the number of samples m from the input data X.

  • Line 40: Computes dz2, the gradient of the loss with respect to z2. For binary classification with a Sigmoid output and binary cross-entropy loss, this simplifies to a2 − y.

  • Lines 43–44: Calculates gradients for the output layer. dW2 is computed by taking the dot product of the transpose of a1 (the hidden activations) and dz2, then dividing by m. Similarly, db2 is computed by summing dz2 over all samples and dividing by m.

  • Line 47: Backpropagates the error to the hidden layer. Here, dz1 is computed by multiplying the dot product of dz2 and the transpose of W2 with the derivative of the Sigmoid function applied to a1.

  • Lines 48–49: Computes gradients for the hidden layer. dW1 is the dot product of the transpose of the input X and dz1, averaged over all samples. db1 is the sum of dz1 across the batch, divided by m.

  • Line 51: Returns all the computed gradients: dW1, db1, dW2, and db2.

  • Line 73: The backward_propagation function is called with the input data, true labels, output layer weights, and intermediate activations. The gradients computed here can be used to update the weights and biases during training.

Interviewers want to see that you can clearly articulate how the model learns by adjusting its parameters based on the error from the predictions. They’ll check if you understand how backpropagation uses the chain rule to propagate error gradients from the output backward through each layer, updating every weight and bias. An interviewer might ask, “How does the error computed at the output layer get propagated back to the earlier layers?” They’re looking for a detailed explanation that covers how the gradients are computed—for example, why the derivative of the Sigmoid function matters—and how these gradients are averaged over the dataset and used for parameter updates. This shows that you grasp not only the individual equations but also the overall learning process.
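
For completeness, here is the short derivation behind dz2 = a2 − y in the solution above. It’s a sketch assuming a Sigmoid output a = σ(z) and binary cross-entropy loss, written for a single sample:

L = −[y log a + (1 − y) log(1 − a)], with a = σ(z)

∂L/∂z = (∂L/∂a) · (∂a/∂z) = (−y/a + (1 − y)/(1 − a)) · a(1 − a) = −y(1 − a) + (1 − y)a = a − y

This cancellation is exactly why pairing a Sigmoid output with binary cross-entropy yields such a clean gradient, and it’s a detail interviewers often probe.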

Conclusion

Implementing the forward and backward pass is a foundational milestone in understanding how neural networks learn. It’s also one of the clearest signals to interviewers that you grasp both the mathematics and the intuition behind modern deep learning models.

Once you’ve shown competence in these essentials, interviewers often probe the same underlying ideas using different wording. Their goal is to see whether your understanding is flexible rather than memorized.

Common ways interviewers may phrase these questions:

  • “Walk me through what happens inside a neural network during training.”

  • “How does the model use the loss to improve itself?”

  • “Why would identical initial weights cause problems?”

  • “What role do activation functions play in letting the network learn non-linear patterns?”

  • “Can you explain how gradients move backward through the network?”

  • “How do preprocessing choices affect gradient-based learning?”

  • “What happens if your features aren’t scaled properly during training?”

  • “How would you debug a network that isn’t learning?”

Each of these questions points back to the same core principles covered in this lesson: how data flows forward, how errors propagate backward, why parameter initialization matters, and how preprocessing sets the stage for stable learning.

Preparing for multiple phrasings helps you recognize the underlying concept instantly, even when the wording shifts. That flexibility is what interviewers interpret as real understanding, and it’s what sets confident candidates apart.