How to use RandomizedSearchCV

In machine learning, hyperparameters control how a model is built and trained, and they strongly affect its performance and generalization ability. They are not learned from the training data but are set before training the model. Examples include the number of epochs, the number of hidden layers in a neural network, and the number of trees and the tree depth in a random forest.

Tuning hyperparameters

There are multiple methods to tune hyperparameters. One widely used method is grid search, implemented in scikit-learn as GridSearchCV. It searches exhaustively for the best hyperparameter combination in a predefined grid. Because every combination of parameters is tried, the search can be computationally expensive, and the runtime of GridSearchCV can be drastically higher than that of a search that only samples the grid.
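To get a feel for why an exhaustive search becomes expensive, the following sketch counts the candidates in a small grid using scikit-learn's ParameterGrid. The grid values are hypothetical and chosen only for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid, chosen only for illustration
param_grid = {
    "n_estimators": [64, 128, 256],
    "max_depth": [None, 4, 8, 16],
    "min_samples_split": [2, 4, 8],
}

n_candidates = len(ParameterGrid(param_grid))  # 3 * 4 * 3 combinations
cv_folds = 5
n_fits = n_candidates * cv_folds               # one fit per combination per fold

print("Candidates:", n_candidates)
print("Model fits with 5-fold CV:", n_fits)
```

Adding one more hyperparameter with four values would multiply both numbers by four, which is why the cost of a grid search grows so quickly.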

Another method is randomized parameter optimization, implemented in scikit-learn as RandomizedSearchCV. It samples a fixed number of hyperparameter combinations from predefined distributions and returns the combination with the best cross-validated score. Because it does not try all possible combinations, it is usually much cheaper than a grid search. However, because of its random nature, it is not guaranteed to find the global optimum.
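The sampling step can be looked at in isolation with ParameterSampler, the scikit-learn utility that draws random parameter combinations from lists and distributions. This is a minimal sketch; the search space below is an assumption made for illustration.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Hypothetical search space, chosen only for illustration
param_distributions = {
    "n_estimators": randint(64, 256),   # integers drawn from [64, 256)
    "max_depth": [None, 4, 8, 16],      # list: each value is equally likely
}

# Draw 5 random combinations; random_state makes the draw repeatable
samples = list(ParameterSampler(param_distributions, n_iter=5, random_state=0))

for params in samples:
    print(params)
```

Only these five combinations would be evaluated, no matter how many combinations the search space contains.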

Parameters

RandomizedSearchCV takes the following parameters:

estimator: This is a model for which hyperparameters need tuning.
param_distributions: This is a dictionary with parameter names as keys and either lists of values or distributions as values. The search samples candidate settings from it.
- If a value is a distribution, it must provide an rvs method for sampling (as the distributions in scipy.stats do).
- If a value is a list, it is sampled uniformly, so every entry has an equal probability of being chosen.
- If a list of dictionaries is given, a dictionary is first sampled uniformly, and then the parameters are sampled from that dictionary as described above.
n_iter: This is an optional parameter that determines the number of random parameter combinations to try. Its default value is set to 10.
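scoring: This is also an optional parameter that determines the metric used to evaluate the model's cross-validated performance on the held-out folds. Its default value is set to None, in which case the estimator's own score method is used.
n_jobs: This optional parameter sets the number of parallel jobs to run. Its default value is set to None, which means one job unless a joblib parallel backend is active. A value of -1 uses all available processors.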
refit: This optional parameter has a default value that is set to True. It refits the model using the best-found hyperparameters.
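cv: This is an optional parameter that determines the cross-validation splitting strategy. It can be an integer number of folds, a cross-validation splitter, or an iterable of train/test splits. Its default value is set to None, which uses 5-fold cross-validation.
verbose: This is an optional parameter that controls how many messages are printed during the search. Higher values print more. Its default value is set to 0.
pre_dispatch: This controls the number of jobs dispatched during parallel execution. Its default value is set to '2*n_jobs'.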
random_state: This is an integer that acts as a seed for the pseudo-random number generator, which makes the sampled combinations reproducible. It can also be a RandomState instance, which provides more control over how random numbers are generated from various probability distributions. Its default value is set to None.
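error_score: If its value is set to 'raise', the error that occurs during model fitting is raised. If a numeric value is set, a FitFailedWarning is raised instead, and that value is used as the score for the failed fit. Its default value is set to nan.
return_train_score: If set to True, training scores are also computed and stored in cv_results_. They show how well the estimator fits the training data, which helps in spotting overfitting, at extra computational cost. Its default value is set to False.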
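Several of these parameters can be seen working together in a short sketch. To keep the run fast, it tunes a Ridge regressor with a log-uniform distribution over alpha. The estimator, the range of alpha, and the chosen values of n_iter and cv are assumptions made for this illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

X, y = load_diabetes(return_X_y=True)

search = RandomizedSearchCV(
    estimator=Ridge(),
    param_distributions={"alpha": loguniform(1e-3, 1e2)},  # continuous distribution
    n_iter=8,                 # try 8 sampled values instead of the default 10
    scoring="r2",             # explicit metric instead of the estimator's score method
    cv=3,                     # 3-fold cross-validation
    n_jobs=1,                 # run the fits sequentially
    random_state=0,           # repeatable sampling
    return_train_score=True,  # also record scores on the training folds
)
search.fit(X, y)

results = search.cv_results_
print("Best alpha:", search.best_params_["alpha"])
print("Best mean validation R^2:", search.best_score_)
print("Train scores recorded:", "mean_train_score" in results)
```

The cv_results_ dictionary holds one entry per sampled combination, so it can be inspected to compare all candidates rather than only the best one.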

Code example

Let's see the code example for how to use the RandomizedSearchCV function.

from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_diabetes
from scipy.stats import randint

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Create the model (the diabetes target is continuous, so a regressor is used)
model = RandomForestRegressor(random_state=42)

param_distributions = {'n_estimators': randint(64, 256),
                       'max_depth': [None, 4, 8, 16],
                       'min_samples_split': [2, 4, 8]}

# Create the RandomizedSearchCV object
random_search = RandomizedSearchCV(estimator=model, param_distributions=param_distributions, cv=5, verbose=1, random_state=42)

# Perform the search
random_search.fit(X_train, y_train)

# Retrieve the best hyperparameters
best_n_estimators = random_search.best_params_['n_estimators']
best_max_depth = random_search.best_params_['max_depth']
best_min_samples_split = random_search.best_params_['min_samples_split']

# Print the results
print("Best number of estimators:", best_n_estimators)
print("Best maximum depth:", best_max_depth)
print("Best minimum number of samples split:", best_min_samples_split)

Explanation

Lines 1–4: We import the following:
- RandomizedSearchCV, the class that performs the randomized hyperparameter search
- train_test_split, a function that shuffles the dataset and splits it into training and testing data with a specified ratio
- RandomForestRegressor, the estimator whose hyperparameters we tune (the diabetes target is continuous, so a regressor is appropriate)
- load_diabetes, which loads the diabetes dataset
- randint, which defines a discrete uniform distribution over integers
Line 6: The load_diabetes function returns the input features (X) and target values (y). Setting return_X_y=True returns the features and target values as separate arrays.
Line 7: The train_test_split function splits the data into 67% training and 33% testing. The random_state sets the seed for the random number generator that determines how the data is shuffled.
Line 10: An instance of the RandomForestRegressor class is created, with a fixed random_state for reproducibility.
Lines 12–14: We define a dictionary specifying the hyperparameters to search. The n_estimators value is drawn from the integers 64 to 255, while max_depth and min_samples_split are sampled uniformly from the given lists.
Line 17: The RandomizedSearchCV object is initialized with the estimator, the hyperparameter distributions, 5-fold cross-validation, verbose output, and a random_state so that the sampled combinations are repeatable. Because n_iter is not set, 10 combinations are tried.
Line 20: The fit method is called to perform the randomized search for the best hyperparameters.
Lines 22–30: We retrieve the best hyperparameters found during the search and print them.
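The test split created by train_test_split is not used by the search itself. A minimal sketch of how the tuned model could be checked on it follows. It repeats a shortened version of the search so that it runs on its own, and it uses a RandomForestRegressor with smaller forests, fewer iterations, and 3 folds, all assumptions made to keep this sketch fast.

```python
from scipy.stats import randint
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Shortened search (assumption: reduced ranges and n_iter to keep the run fast)
random_search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_distributions={"n_estimators": randint(16, 64), "max_depth": [None, 4, 8]},
    n_iter=4,
    cv=3,
    random_state=42,
)
random_search.fit(X_train, y_train)

# With refit=True (the default), best_estimator_ is retrained on all of X_train
best_model = random_search.best_estimator_
test_r2 = best_model.score(X_test, y_test)  # R^2 on data the search never saw

print("Mean cross-validated R^2 of the best candidate:", random_search.best_score_)
print("R^2 on the held-out test set:", test_r2)
```

Comparing the two scores gives a rough check on whether the cross-validated estimate carries over to unseen data.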

Conclusion

We can enhance the tuning process for our model by using different ranges of hyperparameters and utilizing more training data. This can potentially improve the model performance. RandomizedSearchCV offers a straightforward and flexible approach to optimizing model performance. It provides a balance between thoroughly exploring different hyperparameter values and efficiently using computational power.
