
Machine Learning Foundations

Explore the fundamental machine learning concepts behind AI models including neural network components, forward and backward propagation, weight initialization, gradient descent, and optimizers. This lesson builds a solid foundation for understanding how advanced AI systems learn and perform, critical for excelling in AI engineering interviews.

If you are preparing for an interview in AI engineering, one thing is certain: you will be tested extensively on the fundamentals. Before any interviewer asks about transformers, RLHF, or RAG pipelines, they will probe whether you genuinely understand how neural networks learn. These questions are filters.

A candidate who cannot clearly explain backpropagation or articulate why zero initialization fails will not be trusted to design a fine-tuning pipeline or debug a production model.

This lesson covers the foundational concepts almost every AI interview probes, in the order interviewers usually ask them. One important reminder before we start:

Important: Modern generative AI systems, including transformers, are not a different species of model. They are neural networks composed of the same ingredients you are learning here: linear transformations, nonlinear activations, normalization layers, residual connections, and gradient-based optimization. The scale is larger and the architecture more structured, but the learning mechanics are identical. If you understand this lesson, you understand the core engine behind GPT‑5.2, Claude Opus 4.6, Gemini 3.1, and every frontier model.

What are the core components of a neural network and how do they work together?

A neural network is a function composed of layers. Each layer applies a linear transformation followed by a nonlinear activation, passing the result to the next layer. The goal of training is to find weights and biases that produce correct outputs. Neural networks are powerful because they are universal function approximators: given enough parameters and at least one nonlinear hidden layer, they can approximate any continuous function to arbitrary precision.

The three components you must know:

  • Weights and biases are the adjustable parameters. After training, they are the model: they encode everything learned from the data.

  • Activation functions introduce nonlinearity. Without them, any depth of layers is equivalent to a single linear transformation. Think of activations as 'logic gates' that decide which information is important enough to pass forward. They turn the network from a simple calculator into a system capable of complex 'if-then' reasoning.

    • ReLU (max(0, x)) is the default for hidden layers because it filters out negative noise and keeps the "learning signal" strong.

    • Sigmoid maps output to (0, 1) for binary classification, essentially turning a raw score into a "percent confidence" for a Yes/No decision.

    • Softmax is the multi-choice version of Sigmoid; it ensures all category predictions sum to 100%, allowing the model to pick the single most likely winner from a group.

Interview Trap: “Why is ReLU preferred over Sigmoid in hidden layers?”

Sigmoid saturates for large positive or negative inputs. Its derivative approaches zero, causing gradients to vanish in deep networks. ReLU has derivative 1 in the positive region, enabling stronger gradient flow. It also creates sparse activations and is computationally simpler.

  • Loss functions measure prediction error and provide the training signal. Cross-entropy for classification, MSE for regression.
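The activations and losses above can be written in a few lines of NumPy. This is an illustrative sketch for building intuition, not a production implementation:

```python
import numpy as np

def relu(z):
    # Keeps positive signals, zeroes out negatives.
    return np.maximum(0, z)

def sigmoid(z):
    # Squashes any real number into (0, 1).
    return 1 / (1 + np.exp(-z))

def softmax(z):
    # Subtract the max for numerical stability; outputs sum to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(probs, target_index):
    # Penalizes assigning low probability to the true class.
    return -np.log(probs[target_index])

def mse(pred, target):
    # Mean squared error for regression.
    return np.mean((pred - target) ** 2)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))      # [0. 0. 3.]

# Sigmoid saturation: its derivative s * (1 - s) nearly vanishes for large |z|,
# which is exactly the vanishing-gradient problem from the interview trap above.
s = sigmoid(np.array([0.0, 10.0]))
print(s * (1 - s))  # ≈ [2.5e-01, 4.5e-05] -- almost no gradient at z = 10
```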

When someone says a model has 70 billion parameters, they mean it contains 70 billion individual weights and biases stored as floating-point numbers. Running the model consists of performing matrix multiplications with those numbers; it is nothing more than that.
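To make "parameters" concrete, here is a quick count for a hypothetical small network (the 784 → 128 → 10 shapes are an invented example, roughly an MNIST-style classifier):

```python
# Each dense layer mapping n_in inputs to n_out outputs stores an
# (n_in x n_out) weight matrix plus one bias per output unit.
def layer_params(n_in, n_out):
    return n_in * n_out + n_out

# Hypothetical 784 -> 128 -> 10 network, for illustration only
total = layer_params(784, 128) + layer_params(128, 10)
print(total)  # 101770 parameters; a 70B model is the same idea at vast scale
```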

In short, a neural network stacks layers that apply learned linear transformations followed by nonlinear activations, producing progressively more abstract representations. The loss measures error; training adjusts weights to minimize it. Networks can solve diverse problems because they are universal function approximators.

How does forward propagation work?

Forward propagation is the actual "thinking" process of the network. It is the sequence of operations that moves data from the raw input, through the hidden layers, to a final prediction.

Think of each layer as a station on an assembly line. Every single layer performs the exact same two-step operation before handing the data off to the next station.

  • Z = XW + b (linear transformation): The layer takes the incoming data (X), multiplies it by its learned weights (W), and adds a bias (b). This step aggregates the features to see what patterns are present.

  • A = activation(Z) (nonlinearity): That raw score (Z) is passed through an activation function (like ReLU). This filters the data, deciding which signals are important enough to pass forward.

This output (A ...
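The two steps above can be sketched as a forward pass through a hypothetical two-layer network. All shapes and the random inputs here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0, z)

# Batch of 4 samples with 3 features; hidden layer of 5 units; 2 outputs.
X = rng.normal(size=(4, 3))
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 2)), np.zeros(2)

# Station 1: linear transformation, then nonlinearity
Z1 = X @ W1 + b1
A1 = relu(Z1)

# Station 2: the filtered activations become the next layer's input
Z2 = A1 @ W2 + b2

print(Z2.shape)  # (4, 2) -- one 2-dimensional prediction per sample
```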