What is Machine Learning?
Explore the fundamentals of machine learning, including different learning types, model training, and evaluation methods. Learn how models make predictions from data and the importance of avoiding overfitting and underfitting to ensure good performance on unseen data.
Machine learning (ML), a subfield of artificial intelligence, is concerned with programming computers to make decisions automatically. ML aims to build mathematical models capable of learning from existing data to make predictions or to find patterns and trends in the data.
Examples of ML applications
ML is required when we want computers to recognize patterns in data. Learning may also be required when the solution to a problem changes over time. For instance, routing algorithms in a computer network can use data about past traffic to improve their routing decisions.
Note: ML requires example data to learn patterns and to make decisions.
Some real-world examples of ML include:
Disease diagnosis
Fraud detection in finance
Robotic process control in industry
Spam email detection
Search engines in web mining
Types of ML
The three main types of ML, along with their subtypes, are given below.
Supervised learning: This uses the training examples along with labels to build an ML model. Its major types are:
Regression: This deals with continuous labels.
Classification: This deals with discrete labels.
Unsupervised learning: This uses the training examples without labels to build an ML model. Its major types are:
Clustering: This groups similar data based on input features.
Dimensionality reduction: This reduces the dimensions of the data to get only the most salient features.
Association: This finds associations or relationships between variables in large datasets.
Reinforcement learning: This maximizes some kind of cumulative reward over a series of actions, called a policy. Suppose we want to develop a computer program that plays chess. The system’s output is a sequence of actions, and it’s the policy, rather than any single action, that is crucial to winning the game. Learning such a policy is the main task in reinforcement learning. Its main types are given below.
Positive reinforcement learning: This learns a series of actions to increase the frequency and strength of a favorable stimulus.
Negative reinforcement learning: This learns a series of actions to decrease the frequency and strength of an adverse stimulus.
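The kind of data each learning type consumes can be sketched with toy values (all invented for illustration):

```python
# Supervised learning: each example pairs input features with a label.
# Regression labels are continuous; classification labels are discrete.
regression_data = [([1.0], 2.1), ([2.0], 3.9), ([3.0], 6.2)]   # (features, continuous y)
classification_data = [([1.0], "spam"), ([2.0], "ham")]        # (features, discrete class)

# Unsupervised learning: only the input features, with no labels;
# the algorithm must find structure (clusters, associations) on its own.
clustering_data = [[1.0, 1.1], [0.9, 1.0], [8.0, 8.2]]

# Reinforcement learning: the agent observes states, takes actions, and
# receives rewards; it learns a policy mapping states to actions.
episode = [("state_0", "move_pawn", 0.0), ("state_1", "take_knight", 1.0)]
```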
Basic terminology in ML
ML model
An ML model is a mathematical description or a formula that describes a dataset. It can be a simple linear formula or a complex model whose number of parameters can vary according to the dataset.
Model parameters and hyperparameters
Model parameters determine how the input data transforms into the desired output. For a simple linear model, $y = wx + b$, the weight $w$ and the bias $b$ are the model parameters, and their values are estimated from the example data.
Hyperparameters are different from model parameters. They decide the shape of the model and provide model configuration. Since the example data can’t determine their values, we manually specify them. Examples of hyperparameters include the number of model parameters to estimate and the number of iterations of the algorithm.
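The distinction can be sketched for a simple linear model $y = wx + b$ (the class and its attribute names are hypothetical):

```python
class LinearModel:
    """Sketch of y = w*x + b, separating parameters from hyperparameters."""

    def __init__(self, learning_rate=0.01, n_iterations=100):
        # Hyperparameters: they configure the training procedure; the
        # example data cannot determine them, so we specify them manually.
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        # Parameters: start at default values and are estimated from the
        # example data during training.
        self.w = 0.0
        self.b = 0.0

    def predict(self, x):
        return self.w * x + self.b
```

A tuning procedure would search over `learning_rate` and `n_iterations`, while training itself only adjusts `w` and `b`.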
Training and test data
To model a dataset, we usually divide it into a training set and a test set. The former is used to build the ML model, while the latter is used to test it; the two sets are disjoint, so no data point appears in both. Because ML models aim to generalize their prediction capabilities, the test set must not be used in building the model. It’s common practice to use 60 to 90 percent of the dataset as the training set, although this percentage can be adjusted depending on the size of the available dataset. The remaining data points form the test set.
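A minimal split can be sketched as follows; the 80 percent figure is one common choice within the range above, and the helper name is hypothetical:

```python
import random

def train_test_split(dataset, train_fraction=0.8, seed=42):
    """Shuffle the dataset and split it into disjoint train/test sets."""
    data = list(dataset)
    random.Random(seed).shuffle(data)   # shuffle so the split is unbiased
    cut = int(len(data) * train_fraction)
    return data[:cut], data[cut:]       # (training set, test set)

train, test = train_test_split(list(range(100)))
# No point appears in both sets, and together they cover the dataset.
assert set(train).isdisjoint(test)
assert len(train) == 80 and len(test) == 20
```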
If the training set is too large to fit into our machine’s memory, we divide it into multiple batches. Since training is done on individual batches with fewer samples, the overall training requires less memory.
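Batching can be sketched as slicing the training set into fixed-size chunks (a hypothetical helper):

```python
def batches(training_set, batch_size):
    """Yield successive fixed-size batches; the last one may be smaller."""
    for start in range(0, len(training_set), batch_size):
        yield training_set[start:start + batch_size]

chunks = list(batches(list(range(10)), batch_size=4))
# → [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```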
Cross-validation
Sometimes, we set the test set aside, use part of the training set to train the model, and use the rest (the validation set) to validate it. The model is iteratively trained and validated on different validation sets generated from the training set. This process is known as cross-validation (CV).
In each iteration of cross-validation, we get an error or accuracy score; averaging these scores over all iterations gives the model’s overall cross-validation score.
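One common CV scheme, k-fold cross-validation, can be sketched as follows; `score_model` is a hypothetical stand-in for training on one split and scoring on its validation fold:

```python
def k_fold_scores(training_set, k, score_model):
    """Validate on each of k folds in turn and return the average score."""
    fold_size = len(training_set) // k
    scores = []
    for i in range(k):
        # Fold i becomes the validation set; the rest is used for training.
        validation = training_set[i * fold_size:(i + 1) * fold_size]
        train = training_set[:i * fold_size] + training_set[(i + 1) * fold_size:]
        scores.append(score_model(train, validation))
    return sum(scores) / k   # mean error or accuracy across folds
```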
Error, loss function, and cost function
To check how well an ML model performs on a given dataset, we need to compare the predicted response, $\hat{y}$, with the actual label, $y$.
The difference between the output of the model and the actual label, $e = y - \hat{y}$, is the prediction error. A loss function measures this error for a single example, while the cost function aggregates the loss over all $n$ training examples. A common choice is the mean squared error (MSE):
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
Note: One of the criteria of a well-trained model is that it attains a minimum value of the chosen loss function.
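As a concrete sketch, the squared-error loss for a single example and the mean squared error (MSE) cost over a dataset can be written as:

```python
def squared_loss(y_true, y_pred):
    """Loss for a single example: the squared difference."""
    return (y_true - y_pred) ** 2

def mse(y_true, y_pred):
    """Cost over a dataset: the mean of the per-example losses."""
    n = len(y_true)
    return sum(squared_loss(t, p) for t, p in zip(y_true, y_pred)) / n

mse([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])  # → (0 + 0.25 + 1.0) / 3 ≈ 0.4167
```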
Minimizing the cost function
The cost function measures the error between the model output and the actual labels; we seek to minimize this error. One of the most common algorithms for minimizing the cost function is gradient descent, an iterative optimization algorithm that takes steps in the direction opposite to the gradient. If we use MSE as our cost function, $J(w, b)$, the gradient descent algorithm updates the model parameters $w$ and $b$ as follows:
$$w := w - \alpha \frac{\partial J}{\partial w}, \qquad b := b - \alpha \frac{\partial J}{\partial b}$$
Here, $\alpha$ is the learning rate, a hyperparameter that controls the step size.
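The update rule can be sketched for the simple linear model, assuming invented toy data that lie exactly on the line $y = 2x + 1$:

```python
def gradient_descent(data, alpha=0.05, n_iterations=1000):
    """Minimize the MSE cost of y = w*x + b by gradient descent."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(n_iterations):
        # Gradients of J = (1/n) * sum((w*x + b - y)^2) w.r.t. w and b.
        grad_w = (2 / n) * sum((w * x + b - y) * x for x, y in data)
        grad_b = (2 / n) * sum((w * x + b - y) for x, y in data)
        # Step opposite to the gradient, scaled by the learning rate alpha.
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b

w, b = gradient_descent([(1, 3), (2, 5), (3, 7)])  # converges near w=2, b=1
```

Too large an `alpha` makes the iterates diverge; too small an `alpha` makes convergence slow, which is why the learning rate is tuned as a hyperparameter.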
Overfitting and underfitting
Model overfitting and underfitting are among the main causes behind the poor performance of ML algorithms. The main aim of an ML model is to generalize from the training data to the unseen test data, where generalization is the model's ability to perform well on examples not seen during training.
If a model is too simple to capture data trends, the model underfits the data. On the contrary, if a complex model captures the noise instead of the data trends, the model overfits the data.
The following figure illustrates the concept of overfitting and underfitting in a univariate dataset with a single input feature, $x$.
The orange squares represent the data points, whereas the blue curve shows the learned model. The linear model in (a) is unable to capture the true data pattern. In contrast, a higher-order polynomial in (c) overfits the data because it tries to fit most points (possibly noise) in the dataset. The optimal model in (b) captures the actual trend in the data.
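The contrast can also be sketched numerically with invented noisy data: an underfit model (predicting the mean) ignores the trend entirely, while an overfit model (memorizing the nearest training point, noise included) achieves zero training error yet still errs on unseen data:

```python
import random

# Invented toy data: y = 2x plus Gaussian noise for training;
# noise-free points off the training grid for testing.
rng = random.Random(0)
train = [(float(x), 2 * x + rng.gauss(0, 1)) for x in range(20)]
test = [(x + 0.5, 2 * (x + 0.5)) for x in range(20)]

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# Underfit: a constant model, too simple to capture the trend.
mean_y = sum(y for _, y in train) / len(train)
underfit = lambda x: mean_y

# Overfit: memorize the training set (predict the nearest training point's y).
overfit = lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

assert mse(overfit, train) == 0.0                # perfect on training data...
assert mse(overfit, test) > 0.0                  # ...but it reproduces noise on unseen data
assert mse(underfit, test) > mse(overfit, test)  # the underfit model misses the trend entirely
```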