Measuring ML Model Performance
Explore the essentials of measuring machine learning model performance, including how to use loss functions to quantify errors. Understand key concepts such as training error, test error, overfitting, underfitting, and the bias-variance tradeoff. Learn how dataset splitting into training, validation, and test sets helps to select optimal model parameters for reliable predictions.
When we build an ML model, we use data to create it. Consider the problem of predicting a student's skill level from their answers to different questions. We take the data and create a classification model that rates the student's skill on a scale of 1 to 5 (integers). We then verify this model on the same data that was available to build it.
However, what happens when we deploy this model to production? It may work well for some students but give poor results for others. Perhaps some highly skilled students received lower ratings because of wrong answers to a particular question. So we may end up creating either a good model or a bad model that no one wants to use. Our goal is always to create the best model. When a business depends on ML models, a reliable, good model is essential.
Loss function
So, we want to know how successful the model is or the amount of loss associated with our predictions. This can be measured with a loss function.
Loss = L(y, f_w(x))
Here, y stands for the true value, and f_w(x) is the predicted value. L is the function that takes these two and computes the loss.
For example
Squared error function: (y − f_w(x))²
Absolute error function: |y − f_w(x)|
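As a quick illustration, here is a minimal sketch of these two loss functions in Python; the true values and predictions are made-up numbers for demonstration only:

```python
# Hypothetical true labels and model predictions (made-up numbers).
y_true = [3.0, 1.5, 4.0, 2.0]
y_pred = [2.5, 2.0, 3.0, 2.0]

def squared_error(y, f):
    """Squared error loss: (y - f_w(x))^2 for each example."""
    return [(yi - fi) ** 2 for yi, fi in zip(y, f)]

def absolute_error(y, f):
    """Absolute error loss: |y - f_w(x)| for each example."""
    return [abs(yi - fi) for yi, fi in zip(y, f)]

print(squared_error(y_true, y_pred))   # [0.25, 0.25, 1.0, 0.0]
print(absolute_error(y_true, y_pred))  # [0.5, 0.5, 1.0, 0.0]
```

Note how squared error punishes the larger mistake (1.0 vs. 0.25) more heavily than absolute error does, which is why the choice of loss function matters.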
Training error
Training error is the error we get for the training data. Training data is used for training the model. So, we create the model and get the error estimates on the same data.
How does training error change with increased model complexity?
Is training error a good performance measure for any ML model?
Now, the complete data looks like this:
A highly complex model achieves an optimistic (low) training error but misses the distribution of the true data.
Quiz: Training error
You have been given two datasets. Both have the same instances, and the features of one dataset are a subset of the other's. Which of the following statements is true about the training errors?
Training errors would be the same for both datasets for any model
Training errors would be higher for larger feature dataset
Training errors would be higher for smaller feature dataset
Can’t answer with the provided information
Test error
We can create a test set from the data. Test data is the data that was not used for model training. So, we train the model on training data and check its performance on testing data.
How does the testing error change with increasing model complexity?
Is testing error a good performance measure for any ML model?
Overfitting
Now let’s understand an important concept in machine learning: overfitting.
Overfitting occurs when there is a model with estimated parameters w′ whose training error is lower than that of some other model w, but whose true error is higher:
Error_train(w′) < Error_train(w), yet Error_true(w′) > Error_true(w)
In other words, if our model performs very well on training data but not on testing data or true data, that is overfitting.
From the example above, if we keep increasing the complexity, we can get a smaller training error, but the testing error may increase at some point.
For example, if we keep training models on training data and reduce the training error, we may get a perfect model on training data. However, it might not perform well on unseen data or test data.
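This effect can be seen concretely with a small simulation. The sketch below uses synthetic data from an assumed quadratic true function (both the function and the noise level are made up for illustration) and fits polynomials of increasing degree, comparing training and test MSE:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed true relationship (toy example): y = 0.5 * x^2 plus Gaussian noise.
def true_f(x):
    return 0.5 * x ** 2

x_train = np.linspace(-3, 3, 15)
y_train = true_f(x_train) + rng.normal(0, 1.0, x_train.size)
x_test = np.linspace(-3, 3, 100)
y_test = true_f(x_test) + rng.normal(0, 1.0, x_test.size)

def errors(degree):
    """Fit a polynomial of the given degree on the training set;
    return (training MSE, test MSE)."""
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train_mse, test_mse

for d in (1, 2, 10):
    tr, te = errors(d)
    print(f"degree={d:2d}  train MSE={tr:.3f}  test MSE={te:.3f}")
```

The training MSE can only go down as the degree increases (a higher-degree fit contains the lower-degree one), while the test MSE stops improving once the extra complexity starts fitting noise.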
Underfitting
Underfitting is a situation where the model has not learned enough about the data, producing poor generalization and bad predictions. For example, if we do not train the model enough, it will not be good enough for prediction. This is underfitting.
Sources of error
We can think of three sources of error in the model:
- Noise
- Bias
- Variance
Noise
Real data is noisy. When we fit a model on the dataset, data points deviate from the true relationship by some noise value. Even with the best parameters, we usually cannot reduce the error to zero. The true relationship between y and x is:
y = f_w(x) + ε_noise
This is also called an irreducible error. We cannot control this.
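A quick simulation shows this. Even if we predict with the true function itself (here an assumed linear relationship with Gaussian noise, both made up for illustration), the error floor is roughly the noise variance:

```python
import numpy as np

rng = np.random.default_rng(1)

x = np.linspace(0, 1, 10_000)
noise = rng.normal(0, 0.5, x.size)  # noise with standard deviation 0.5
y = 2 * x + noise                   # assumed true relationship plus noise

best_pred = 2 * x                   # predictions from the *true* function
mse = np.mean((y - best_pred) ** 2)
print(mse)                          # close to 0.25, the noise variance
```

No model can do better than this on average, which is why the noise term is called irreducible error.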
Bias
Bias measures how well our model can fit the true relationship between the dependent and independent variables. It is the difference between the true function and the average of our estimated fits over training datasets:
Bias(x) = f_true(x) − f_estimated(x)
Here, f_estimated(x) is the estimated fit averaged over all possible training datasets.
Variance
Variance measures how much the fitted models differ across training samples — how a specific fit deviates from the average estimated fit. If we fit the model on different samples of the dataset, the variance is the average squared difference between each specific fit and the average fit:
Variance(x) = average over all possible fits i of (f_estimated(x) − f_i(x))²
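These quantities can be estimated empirically by refitting a model on many resampled training sets. The sketch below assumes a sine true function and Gaussian noise (both arbitrary choices for illustration), fits polynomials of different degrees on each sample, and averages squared bias and variance over a grid of evaluation points:

```python
import numpy as np

rng = np.random.default_rng(42)

def true_f(x):
    return np.sin(x)  # assumed true relationship (toy example)

x_eval = np.linspace(0, np.pi, 50)   # points where we evaluate each fit
n_runs, n_train, noise_sd = 200, 20, 0.3

def bias_variance(degree):
    """Fit polynomials of `degree` on many noisy training samples;
    return (bias^2, variance), each averaged over x_eval."""
    fits = []
    for _ in range(n_runs):
        x = rng.uniform(0, np.pi, n_train)
        y = true_f(x) + rng.normal(0, noise_sd, n_train)
        coefs = np.polyfit(x, y, degree)
        fits.append(np.polyval(coefs, x_eval))
    fits = np.array(fits)            # shape: (n_runs, len(x_eval))
    mean_fit = fits.mean(axis=0)     # f_estimated(x): average fit
    bias_sq = np.mean((true_f(x_eval) - mean_fit) ** 2)
    variance = np.mean(fits.var(axis=0))
    return bias_sq, variance

for d in (1, 3, 9):
    b, v = bias_variance(d)
    print(f"degree={d}  bias^2={b:.4f}  variance={v:.4f}")
```

The low-complexity (degree-1) model shows high bias and low variance, while the high-complexity (degree-9) model shows low bias and high variance, which is exactly the tradeoff discussed next.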
What would the mean of noise error with the best model be?
Quiz: Bias and variance
For the low complexity model, what would the values of bias and variance be?
Low bias, low variance
Low bias, high variance
High bias, low variance
High bias, high variance
Bias-variance tradeoff
As model complexity increases, bias decreases while variance increases. This is known as the bias-variance tradeoff. You can see this in the illustration below.
Can we compute the bias and variance error?
How do true errors behave with increasing data and fixed model complexity?
How do training errors behave with increasing data and a fixed model complexity?
Quiz: Data size
For a fixed model complexity (parameter value), which of these errors exist? (Choose all that are valid)
Bias
Variance
Noise
Dataset splitting
Earlier, we saw how to divide the dataset into train and test sets. We train our model on the training dataset and check its performance on the testing dataset. For any model, we have different parameters to select, and these parameters control model complexity. For example, in a tree-based model, the parameters could include the depth of the tree or how many trees should be created in the model. Selecting good parameters can lead to low true error. Now the question is: how can we choose our parameters?
One way of doing this is training on the training dataset with pre-selected parameter values (or complexity) and evaluating the performance on the test dataset. We check the model performance for each parameter value and select the best model that gives the lowest testing error.
Any problem with this approach?
Yes! In the above approach, we make each decision based on the performance on the test data. In other words, we are exposing our test set while building the model, so the test error is no longer an honest estimate of true performance.
To handle this situation, we can split the data into three parts instead of two: train, validation, and test. We train the model on the training set. Then, during parameter selection, we evaluate each candidate on the validation set, and only after the final model is created do we measure its performance on the test set.
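Here is a minimal sketch of such a three-way split in plain Python; the 60/20/20 fractions and the toy dataset are arbitrary choices for illustration:

```python
import random

# Toy dataset of (feature, label) pairs; in practice these would be real examples.
data = [(i, i % 5) for i in range(100)]

def three_way_split(data, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the data and split it into train / validation / test sets."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

train_set, val_set, test_set = three_way_split(data)
print(len(train_set), len(val_set), len(test_set))  # 60 20 20
```

In use, we would fit each candidate parameter setting on `train_set`, pick the setting with the lowest error on `val_set`, and report the chosen model's error on `test_set` exactly once.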
Quiz: Performance of an ML model
If we have high errors on both training and testing sets, what could the problem be?
Low bias, low variance
Low bias, high variance
High bias, low variance
High bias, high variance