
What is Machine Learning?

Explore the fundamentals of machine learning, including different learning types, model training, and evaluation methods. Learn how models make predictions from data and the importance of avoiding overfitting and underfitting to ensure good performance on unseen data.

Machine learning (ML), a subfield of artificial intelligence, is concerned with programming computers to make decisions automatically. ML aims to build mathematical models capable of learning from existing data to make predictions or to find patterns and trends in the data.

Examples of ML applications

ML is needed when we want computers to recognize patterns in data. Learning may also be required when the solution to a problem changes over time. For instance, routing algorithms in a computer network can use past traffic data to make better routing decisions.

Note: ML requires example data to learn patterns and to make decisions.

Some real-world examples of ML include:

  • Disease diagnosis

  • Fraud detection in finance

  • Robotic process control in industry

  • Spam email detection

  • Search engines in web mining

Types of ML

The three main types of ML and their subtypes are given in the following figure.

  • Supervised learning: This uses the training examples along with labels to build an ML model. Its major types are:

    • Regression: This deals with continuous labels.

    • Classification: This deals with discrete labels.

  • Unsupervised learning: This uses the training examples without labels to build an ML model. Its major types are:

    • Clustering: This groups similar data based on input features.

    • Dimensionality reduction: This reduces the dimensions of the data to get only the most salient features.

    • Association: This finds associations or relationships between variables in large datasets.

  • Reinforcement learning: Here, the system learns to maximize some kind of reward over a series of actions, called a policy. Suppose we want to develop a program that plays chess. The system’s output is a sequence of moves, and it’s the policy, rather than any single action, that is crucial to winning the game. Learning a policy is the main task in reinforcement learning. Its main types are given below.

    • Positive reinforcement learning: This learns a series of actions to increase the frequency and strength of a favorable stimulus.

    • Negative reinforcement learning: This learns a series of actions to decrease the frequency and strength of an adverse stimulus.
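
As a tiny illustration of supervised classification (a sketch of mine, not from the text), the following hypothetical one-nearest-neighbor rule predicts the label of a new point from labeled training examples:

```python
# Toy supervised classification: a one-nearest-neighbor rule on labeled
# 1-D points. The data and labels below are made up for illustration.

def nearest_neighbor_predict(train, x):
    """Return the label of the training point closest to x.

    train: list of (feature, label) pairs -- the labeled examples
    x: a new, unlabeled feature value
    """
    closest = min(train, key=lambda pair: abs(pair[0] - x))
    return closest[1]

# Labeled training examples: small values are "low", large are "high".
train = [(1.0, "low"), (1.5, "low"), (8.0, "high"), (9.2, "high")]

print(nearest_neighbor_predict(train, 1.2))  # -> low
print(nearest_neighbor_predict(train, 8.5))  # -> high
```

Unsupervised methods such as clustering would receive only the feature values, without the `"low"`/`"high"` labels, and would have to discover the two groups on their own.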

Basic terminology in ML

ML model

An ML model is a mathematical description, or formula, that describes a dataset. It can be a simple linear formula or a complex model whose number of parameters varies with the dataset.

Model parameters and hyperparameters

Model parameters determine how the input data transforms into the desired output. For a simple linear model, $f(x) = \theta_1 x + \theta_0$, with input data $x$ and output $f(x)$, $\theta_0$ and $\theta_1$ are parameters of the model to be learned from all samples of the training data.
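
The linear model above can be sketched in a few lines of Python; the parameter values here are made up for illustration, since in practice they would be learned from the training data:

```python
# Sketch of the linear model f(x) = theta_1 * x + theta_0.
# These parameter values are illustrative, not learned.

theta_0 = 1.0   # intercept
theta_1 = 2.0   # slope

def f(x):
    """Linear model: maps input x to a predicted output."""
    return theta_1 * x + theta_0

print(f(3.0))  # -> 7.0
```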

Hyperparameters are different from model parameters. They determine the structure of the model and provide its configuration. Since their values can’t be determined from the example data, we specify them manually. Examples of hyperparameters include the number of model parameters to estimate and the number of iterations of the algorithm.

Training and test data

To model a dataset, we usually divide it into a training set and a test set. The former is used to build an ML model, while the latter tests the built model. The data points in the training set are excluded from the test set and vice versa. ML models try to generalize their prediction capabilities, so the selected test set isn’t used in building the model. It’s common practice to use 60 to 90 percent of the dataset as the training set; depending upon the size of the available dataset, this percentage can be adjusted. The remaining data points form the test set.
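
A minimal train/test split might look like the following sketch, assuming an 80/20 split (one choice within the 60 to 90 percent range mentioned above):

```python
# Minimal train/test split sketch. The 80/20 ratio and fixed seed are
# arbitrary choices for this demo.
import random

def train_test_split(data, train_fraction=0.8, seed=0):
    """Shuffle the dataset and split it into disjoint train and test sets."""
    shuffled = data[:]                      # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)   # deterministic shuffle for the demo
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

data = list(range(10))
train, test = train_test_split(data)
print(len(train), len(test))  # -> 8 2
```

Shuffling before splitting avoids a biased split when the dataset happens to be ordered, e.g., sorted by label.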

If the training set is too large to fit into our machine’s memory, we divide it into multiple batches. Since training is done on individual batches with fewer samples, the overall training requires less memory.
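
Splitting a training set into batches can be sketched as follows; the batch size here is an arbitrary illustrative choice:

```python
# Split a training set into mini-batches so only one batch needs to be
# in memory at a time.

def batches(data, batch_size):
    """Yield successive batches of at most batch_size examples."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

train = list(range(10))
for batch in batches(train, 4):
    print(batch)
# -> [0, 1, 2, 3], then [4, 5, 6, 7], then [8, 9]
```

Note that the last batch may be smaller than the others when the dataset size isn’t a multiple of the batch size.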

Cross-validation

Sometimes, we keep aside the test set, choose a part of the training set to train the model, and the rest (the validation set) to validate the model. The model is iteratively trained and validated on different validation sets generated randomly from the training set. This process is known as cross-validation (CV).

In each iteration of cross-validation, we get an error or accuracy score. We find the mean of the error or accuracy score to get an average score.
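
A rough k-fold cross-validation sketch, where `train_and_score` is a hypothetical stand-in for training on one split and scoring on its validation fold:

```python
# k-fold cross-validation: each fold serves as the validation set once,
# and the scores from all folds are averaged.

def k_fold_scores(data, k, train_and_score):
    """Split data into k folds and return the mean validation score."""
    fold_size = len(data) // k
    scores = []
    for i in range(k):
        # The i-th fold is held out for validation ...
        val = data[i * fold_size:(i + 1) * fold_size]
        # ... and the rest of the data is used for training.
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        scores.append(train_and_score(train, val))
    return sum(scores) / len(scores)   # average score over all folds

# Dummy scorer for illustration: the "score" is just the validation fold size.
avg = k_fold_scores(list(range(12)), 4, lambda tr, va: len(va))
print(avg)  # -> 3.0
```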

Error, loss function, and cost function

To check how well an ML model performs on a given dataset, we need to compare the predicted response, $f(x)$, for a given observation to the true label, $y$, for that observation. In supervised learning, a commonly used function, mean squared error (MSE), calculates the error:

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\big(f(x_i) - y_i\big)^2$$

Here, $n$, $f(x)$, and $y$ represent the number of observations, the output of the trained model, and the actual labels of the data points, respectively.

The squared difference between the output of the model and the actual label, $\big(f(x) - y\big)^2$, is a loss function that’s computed for each data point. The average of the loss over all data points forms a cost function. Therefore, MSE is a cost function. The training algorithm minimizes the cost function by reducing the difference between the predictions and the labels of the training examples.
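
MSE can be computed directly from its definition, the average of the squared differences over all data points; this is a minimal sketch:

```python
# Mean squared error: the average of (f(x) - y)^2 over all data points.

def mse(predictions, labels):
    """Average squared difference between predictions and true labels."""
    n = len(labels)
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / n

# Made-up predictions and labels for illustration.
preds = [2.0, 3.0, 5.0]
labels = [2.0, 4.0, 5.0]
print(mse(preds, labels))  # -> 0.3333...
```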

Note: One of the criteria to have a well-trained model is to have a minimum value of the chosen loss function.

Minimizing the cost function

The cost function measures the error between the model output and the actual labels. We seek to minimize this error. One of the most common algorithms to minimize the cost function is the gradient descent algorithm. It’s an iterative optimization algorithm that takes steps in a direction opposite to the direction of the gradient. If we use MSE as our cost function, the gradient descent algorithm updates the model parameters $\theta_j$ using the following iterative scheme:

$$\theta_{j+1} = \theta_j - \alpha \frac{\partial}{\partial \theta_j}\text{MSE}(\Theta)$$

Here, $\theta_j$ and $\theta_{j+1}$ correspond to the current and the updated values of the model parameters, respectively. Note that the MSE is a function of all the parameters, i.e., the vector $\Theta$. If we have $n$ model parameters, we compute the above equation $n$ times to update all the parameters. The constant $\alpha$, known as the learning rate, controls the size of each parameter update. A large value of $\alpha$ results in bigger steps toward the minimum of the cost function than a smaller value. Typically, $\alpha$ ranges from 0.001 to 0.1, but the optimal learning rate depends on the dataset and our ML model. We use a learning rate small enough to keep the algorithm from diverging and large enough to converge quickly to the minimum.
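
The update scheme can be sketched for the simple linear model f(x) = θ₁x + θ₀ with MSE as the cost function; the learning rate and iteration count below are arbitrary illustrative choices:

```python
# Gradient descent for the linear model f(x) = theta_1*x + theta_0
# with MSE as the cost function.

def gradient_descent(xs, ys, alpha=0.05, steps=2000):
    theta_0, theta_1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction errors f(x) - y for every training point.
        errors = [theta_1 * x + theta_0 - y for x, y in zip(xs, ys)]
        # Partial derivatives of MSE with respect to each parameter.
        grad_0 = 2 * sum(errors) / n
        grad_1 = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        # Step in the direction opposite to the gradient.
        theta_0 -= alpha * grad_0
        theta_1 -= alpha * grad_1
    return theta_0, theta_1

# Data generated from y = 2x + 1, so the learned parameters should
# converge near (theta_0, theta_1) = (1, 2).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
t0, t1 = gradient_descent(xs, ys)
print(round(t0, 3), round(t1, 3))
```

Raising `alpha` too far makes the updates overshoot and diverge, while a very small `alpha` would need far more than 2000 steps to converge, which is the trade-off described above.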

Overfitting and underfitting

Model overfitting and underfitting are among the main causes behind the poor performance of ML algorithms. The main aim of an ML model is to generalize from the training data to the unseen test data, where generalization is the model's ability to perform well on examples not used at the time of training/learning.

If a model is too simple to capture data trends, the model underfits the data. On the contrary, if a complex model captures the noise instead of the data trends, the model overfits the data.

The following figure illustrates the concept of overfitting and underfitting in a univariate dataset with a single input feature $x$ and a continuous output label $y$.

The orange squares represent the data points, whereas the blue curve shows the learned model. The linear model in (a) is unable to capture the true data pattern. In contrast, a higher-order polynomial in (c) overfits the data because it tries to fit most points (possibly noise) in the dataset. The optimal model in (b) captures the actual trend in the data.