Improving ML Model Performance
Explore practical strategies to improve your machine learning model's performance. Understand how to optimize training data, choose appropriate evaluation metrics like F1 score, balance precision with inference time, analyze errors, handle real-world data variations, and apply transfer learning. This lesson prepares you to develop more robust and effective ML models for real applications.
In the previous lesson, you saw ways to measure a machine learning model's performance and set appropriate expectations. Now we move on to improving that performance.
Ideas for improving ML model performance
Here are a few general ideas for improving the performance of any ML model:
- Collect more data
- Increase the diversity of the training set
- Try a bigger or smaller network architecture
- Try a different gradient-descent technique
- Try an alternate network architecture
Assumptions in ML
When building models, we must keep a few assumptions in mind, including:
- Good performance on the training set
- Good performance on the validation set
- Good performance on the test set
- Good performance on real data
To satisfy all these assumptions, we start with the training data, apply some of these ideas, and then move on to the validation set, followed by the test set. If performance on real-world data is still poor, we need to enlarge the data in each of these sets and adapt the cost function to the real problem.
Single number evaluation metric
While performing any machine learning task, we should try to make our target metric a single number. This will help when comparing models.
Consider the example of three classifiers: C1, C2, and C3. We have evaluated their performance on the test set, with these results:
| Model | Precision | Recall |
|---|---|---|
| C1 | 0.80 | 0.90 |
| C2 | 0.85 | 0.86 |
| C3 | 0.88 | 0.82 |
Now, one model does better on precision, and another does better on recall. If we have a single goal (e.g., optimize only for precision or only for recall), we can simply choose one. However, if we have to consider both precision and recall, we need to combine these two measures. In statistics, we have the F1 score, which is the harmonic mean of precision and recall:

F1 = 2 × (precision × recall) / (precision + recall)
With these changes, the above table will look like this:
| Model | Precision | Recall | F1 Score |
|---|---|---|---|
| C1 | 0.80 | 0.90 | 0.847 |
| C2 | 0.85 | 0.86 | 0.854 |
| C3 | 0.88 | 0.82 | 0.848 |
We can then choose the model with the highest F1 score.
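As a quick check, the F1 scores in the table can be reproduced in a few lines of Python (the last digit may differ slightly depending on rounding):

```python
# F1 score: the harmonic mean of precision and recall.
def f1_score(precision, recall):
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs from the table above.
classifiers = {"C1": (0.80, 0.90), "C2": (0.85, 0.86), "C3": (0.88, 0.82)}

for name, (p, r) in classifiers.items():
    print(f"{name}: F1 = {f1_score(p, r):.3f}")
```

Running this confirms that C2 edges out the others once precision and recall are combined into one number.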
We also have to check other factors, such as inference time. Consider accuracy and inference time (the time for the model to take inputs and generate predictions) in the example below.
| Model | F1 Score | Inference Time |
|---|---|---|
| C4 | 0.88 | 52ms |
| C5 | 0.78 | 15ms |
| C6 | 0.90 | 350ms |
Suppose we need a model that can generate a prediction within 100 ms. Clearly, C6 is not an option: it requires 350 ms per prediction, which is unsuitable for this application. Hence, the F1 score and inference time are both important. The best model here is the one with the highest F1 score among those with an inference time under 100 ms.
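This selection rule can be sketched in a few lines of Python, using the names and numbers from the table above:

```python
# Pick the model with the highest F1 score among those that
# satisfy the 100 ms inference-time budget.
models = [
    ("C4", 0.88, 52),    # (name, F1 score, inference time in ms)
    ("C5", 0.78, 15),
    ("C6", 0.90, 350),   # best F1, but too slow for this application
]

BUDGET_MS = 100
eligible = [m for m in models if m[2] <= BUDGET_MS]
best = max(eligible, key=lambda m: m[1])
print(best[0])  # C4: highest F1 among the models fast enough
```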
Training and test set distribution
We are working on the sales prediction problem. We have data from five countries: the US, Canada, India, France, and Germany. Can we build a model on the US, Canada, and India datasets and test the model on the remaining countries?
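A common way to avoid such a distribution mismatch is to pool the data from all countries and shuffle it before splitting, so the training and test sets share the same country mix. A minimal sketch, with hypothetical record counts standing in for real sales data:

```python
import random

# Pool records from all five countries and shuffle before splitting,
# so the training and test sets come from the same distribution.
records = [(country, i)
           for country in ["US", "Canada", "India", "France", "Germany"]
           for i in range(100)]  # hypothetical sales records per country

random.seed(0)
random.shuffle(records)
split = int(0.8 * len(records))
train, test = records[:split], records[split:]

train_countries = {c for c, _ in train}
test_countries = {c for c, _ in test}
print(train_countries == test_countries)  # every country appears in both splits
```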
Comparing algorithms and human
The goal of any machine learning model is to solve a problem efficiently, often replacing the need for human labor on that task. For some tasks, like audio recognition and facial recognition, machine learning actually surpasses human performance and provides superior results.
Comparing the performance of an algorithm with that of a human shows how beneficial the algorithm is. If the algorithm surpasses human-level performance, it is more efficient, and we can still tune its parameters to improve it further. If it does not reach human-level performance, we can add human-labeled data to the training set to improve the computer's performance.
Human performance can thus help create better models. That's why large companies focus on data generated from human labor, running surveys or paid labeling tasks to obtain this data and improve their algorithms. The table below compares the accuracy of different approaches on two tasks.
| | Task-1 | Task-2 |
|---|---|---|
| Human Accuracy | 99.5% | 92% |
| Training Accuracy | 90% | 90% |
| Test Accuracy | 85% | 85% |
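One common heuristic (not spelled out above, but consistent with the table) is to compare the human-vs-training gap, often called avoidable bias, with the training-vs-test gap, which indicates variance. Whichever gap is larger suggests where to focus. A small sketch using the table's numbers:

```python
# Compare the human-vs-training gap (avoidable bias) with the
# training-vs-test gap (variance) to decide what to fix first.
def focus(human, train, test):
    return "bias" if (human - train) > (train - test) else "variance"

# (human, training, test) accuracies in percent, from the table above.
tasks = {"Task-1": (99.5, 90.0, 85.0), "Task-2": (92.0, 90.0, 85.0)}

for name, (h, tr, te) in tasks.items():
    print(f"{name}: bias gap {h - tr:.1f}, variance gap {tr - te:.1f}"
          f" -> reduce {focus(h, tr, te)}")
```

Here Task-1 would call for a better model or architecture, while Task-2 would call for more data or less overfitting.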
There are a few areas where algorithms perform better than humans:
- Calculating transit time
- Product recommendations
- Advertising
- Sales prediction
In general, we consider a machine learning model superior if:
- It gives accurate results on the training set.
- The difference between the training error and the testing error is not high.
Training Accuracy can be improved with:
- A better model
- A better architecture
Testing Accuracy can be improved by:
- More data
- Avoiding overfitting
Is it acceptable to get a good result only on testing data?
Error analysis of the model
When our model is not performing well, it is advisable to focus on where it is going wrong to improve it. To do so, we can collect the examples where the model is making wrong predictions and analyze the issues accordingly.
Does adding more data help create a better model?
| Model | Accuracy | Wrong Labels |
|---|---|---|
| C1 | 80% | 2% |
| C2 | 96% | 2% |
C1 and C2 are trained on different problems, and both datasets contain approximately 2% wrong labels. In the case of C1, the error is 20%, so the wrong labels account for only a small share of it; reducing the model's bias is the better way to increase accuracy. In the case of C2, however, the error is only 4%, so the 2% of wrong labels accounts for half of it. Here, spending time correcting labels and reducing bad data will improve the model's performance significantly.
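This reasoning can be made concrete by computing what share of each model's total error the ~2% of wrong labels could explain, using the table's numbers:

```python
# Fraction of a model's total error that mislabeled data could explain.
def mislabel_share(accuracy, wrong_label_rate):
    error = 1.0 - accuracy
    return wrong_label_rate / error

print(f"C1: {mislabel_share(0.80, 0.02):.0%} of the error")  # small share: improve the model
print(f"C2: {mislabel_share(0.96, 0.02):.0%} of the error")  # large share: fix the labels
```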
What is the ideal timeline for delivering and testing a model?
Data from different distribution
To solve any problem using machine learning, we start with the data. If the data is not available, we must generate it. Consider a voice-command smart home application. You want to design a system that takes a user's voice and performs the requested operations. A few examples are "Turn on the TV", "Switch off the light", "Increase fan speed by 2", "Open the door", etc. You start building this system by creating the data: you record these instructions in a studio and transcribe each voice command. Your system works well in your testing, and you deploy it to your smart devices.
But wait! Let’s say it goes terribly wrong. When the user instructs the system to switch on the light, the system turns off the fan. When asked to turn on the TV, it opens the door instead. The user becomes frustrated and uninstalls your system. Where did things go wrong?
One possibility is that the data during training differed from the real world. For example, in testing you only recorded a clean, isolated voice, but in practice a music system may also be playing, so that sound also reaches your system. Another user's home is near a loud road, so traffic noise interferes with your system. Many such problems can arise that you never accounted for in training.
A good system addresses every scenario that could occur once the application is deployed. It is necessary to train on these noisy examples and evaluate the system against them: collect noisy examples, prepare a noisy test set, and measure performance. If the system falls short, improve your model with new data. To handle different data distributions, you will need to:
- Check data manually to understand the problem's origin.
- Put more real data into the training and validation sets.
- Generate augmented data (e.g., noisy voice data or multiple voices talking at once).
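As a toy illustration of the last point, noise augmentation can be as simple as mixing random perturbations into a clean waveform. The sample values below are hypothetical; a real system would mix in recorded background audio such as music or traffic:

```python
import random

# Sketch of audio augmentation: mix background noise into a clean
# waveform (samples are floats in [-1, 1]) at a chosen level.
def add_noise(clean, noise_level=0.1, seed=0):
    rng = random.Random(seed)
    return [s + noise_level * rng.uniform(-1, 1) for s in clean]

clean = [0.0, 0.5, -0.5, 0.25]            # toy waveform
noisy = add_noise(clean, noise_level=0.1)
print(len(noisy) == len(clean))            # same length, perturbed samples
```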
Suppose you have created an image classification model. You added varied examples: blurry images, multiclass examples, etc. You separated the data into training and validation sets, built a model on the training data, and checked its performance on the validation set. Now you use a set of new images captured with your mobile phone as the test set and apply the model. These are the performance numbers:
- Training error: 10%
- Validation error: 9%
- Testing error: 1%
Is this case possible? How can you justify a low testing error?
Transfer learning
Transfer learning refers to taking knowledge gained from one problem and using it to solve another. It is used very often with deep learning techniques. We build a model on one problem's data, remove the last layer, and reuse the same network to solve other problems. If our new problem has too few data points to build a model from scratch, we can use transfer learning to create a new model.
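A toy sketch of the idea, with hypothetical layer names and weights standing in for a real pretrained network:

```python
# Toy sketch of transfer learning: reuse the pretrained layers'
# weights, and replace only the last (task-specific) layer.
pretrained = {
    "layer1": [0.2, -0.1, 0.4],   # hypothetical learned feature weights
    "layer2": [0.7, 0.3],
    "output": [0.9],              # old task's head, to be discarded
}

# Copy every layer except the old output head...
new_model = {name: w[:] for name, w in pretrained.items() if name != "output"}
# ...and attach a freshly initialized head for the new task.
new_model["output"] = [0.0, 0.0]  # e.g., two classes in the new problem

print(sorted(new_model))  # layer1 and layer2 reused, output replaced
```

In a real deep learning framework, the same pattern applies: load the pretrained network, swap its final layer, and train only the new head (or fine-tune the whole network) on the new, smaller dataset.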
Quiz: ML Model performance
Mark all the techniques that can, in general, improve machine learning model performance. (Multi-select)
Adding more data
Using simple architecture
Removing the validation and test sets and using the training data for testing.
Adding data points from different distributions.