How to use the RandomizedSearchCV function
In machine learning, hyperparameters are tuned to maximize a model's performance and generalization ability. These parameters are not learned from the training data but are selected before training the model. Examples include the number of epochs, the number of hidden layers in a neural network, and the number of trees and the tree depth in a random forest.
Tuning hyperparameters
There are multiple methods to tune the hyperparameters. One of the widely used methods is grid search, which scikit-learn implements in GridSearchCV. It searches exhaustively for the best hyperparameter combination in a predefined grid. Because every combination of parameters is tried, the search can be computationally expensive, and the runtime of GridSearchCV can be drastically higher than that of a method that tries only a sample of the combinations.
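To get a feel for how quickly an exhaustive grid grows, the following minimal sketch counts the combinations in a small grid using scikit-learn's ParameterGrid. The grid values and the fold count are assumptions chosen only for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid, chosen only for illustration
param_grid = {
    "n_estimators": [64, 128, 256],
    "max_depth": [None, 4, 8, 16],
    "min_samples_split": [2, 4, 8],
}

# ParameterGrid enumerates every combination of the listed values
grid = ParameterGrid(param_grid)
n_combinations = len(grid)            # 3 * 4 * 3 = 36

# With 5-fold cross-validation, each combination is fitted 5 times
cv_folds = 5
n_fits = n_combinations * cv_folds    # 36 * 5 = 180, not counting the final refit

print("Combinations:", n_combinations)
print("Model fits:", n_fits)
```

Adding one more hyperparameter with four candidate values would multiply both numbers by four, which is why exhaustive search becomes expensive so quickly.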
Another method is randomized parameter optimization, which scikit-learn implements in RandomizedSearchCV. It samples hyperparameter combinations from predefined lists or distributions and returns the combination that gives the best score. Instead of trying all possible combinations, it evaluates only a fixed number of randomly sampled ones, which keeps the computational cost under control. However, because of its random nature, it is not guaranteed to find the global optimum.
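The sampling step can be previewed on its own with scikit-learn's ParameterSampler, which draws candidate combinations from lists and distributions without fitting any model. This is a minimal sketch; the parameter names and ranges are assumptions that match a random forest.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Lists are sampled uniformly; distributions are sampled through their rvs method
param_distributions = {
    "n_estimators": randint(64, 256),   # integers from 64 to 255
    "max_depth": [None, 4, 8, 16],
    "min_samples_split": [2, 4, 8],
}

# Draw 5 random candidate combinations; random_state makes the draw repeatable
sampler = ParameterSampler(param_distributions, n_iter=5, random_state=0)
candidates = list(sampler)

for params in candidates:
    print(params)
```

Only these five candidates would be evaluated, regardless of how many combinations the search space contains.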
Parameters
RandomizedSearchCV takes the following parameters:
estimator: This is the model whose hyperparameters need tuning.
param_distributions: This is a dictionary with parameter names as keys and either lists of values or distributions as values. These are the hyperparameters that the search samples from. If a value is a distribution, it must provide an rvs method for sampling (as the distributions in scipy.stats do). If a value is a list, it is sampled uniformly, that is, every element has an equal probability of being chosen. If the input is a list of dictionaries, a dictionary is first sampled uniformly, and the parameters are then sampled using that dictionary.
n_iter: This is an optional parameter that determines the number of random parameter combinations to try. Its default value is 10.
scoring: This is an optional parameter that determines the scoring strategy used to evaluate each candidate on the held-out cross-validation folds. Its default value is None, in which case the estimator's own score method is used.
n_jobs: This optional parameter sets the number of parallel jobs to run. Its default value is None, which in most settings means a single job.
refit: This optional parameter has a default value of True. It refits the model on the whole dataset passed to fit using the best-found hyperparameters.
cv: This is an optional parameter that sets the cross-validation splitting strategy, for example, the number of folds. Its default value is None, which means 5-fold cross-validation.
verbose: This is an optional parameter that controls how many messages are printed. Its default value is 0.
pre_dispatch: This determines the number of jobs dispatched during parallel computation. Its default value is '2*n_jobs'.
random_state: This is an integer that acts as a seed for the pseudo-random number generator. It can also be a RandomState instance, which provides more control for generating random numbers from various probability distributions. Its default value is None.
error_score: If its value is set to 'raise', the error that occurs during model fitting is raised. If a numeric value is set, a FitFailedWarning is raised and that value is used as the score of the failed fit. Its default value is nan.
return_train_score: This determines whether the training scores are included in the results. Training scores tell how well the estimator fits the training data. Its default value is False.
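As a rough illustration of several of these parameters together, the sketch below passes a list of dictionaries as param_distributions and sets n_iter, scoring, cv, n_jobs, random_state, and return_train_score explicitly. The estimator (SVC), the synthetic dataset, and all value ranges are assumptions picked to keep the example small; they are not part of the original walkthrough.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Small synthetic dataset (an assumption, used only to keep the sketch fast)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# List of dictionaries: one dictionary is picked uniformly for each candidate,
# then its parameters are sampled (lists uniformly, distributions via rvs)
param_distributions = [
    {"kernel": ["linear"], "C": loguniform(1e-2, 1e2)},
    {"kernel": ["rbf"], "C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-4, 1e0)},
]

search = RandomizedSearchCV(
    estimator=SVC(),
    param_distributions=param_distributions,
    n_iter=8,                  # number of sampled candidates
    scoring="accuracy",        # metric computed on the held-out folds
    cv=3,                      # 3-fold cross-validation
    n_jobs=1,                  # run the fits sequentially
    random_state=0,            # make the sampling repeatable
    return_train_score=True,   # also record scores on the training folds
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Candidates evaluated:", len(search.cv_results_["params"]))
```

Because return_train_score is True, cv_results_ also contains keys such as mean_train_score, which can help spot candidates that overfit.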
Code example
Let's see the code example for how to use the RandomizedSearchCV function.
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_diabetes
from scipy.stats import randint

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Create the model (the diabetes target is continuous, so a regressor is used)
model = RandomForestRegressor()

param_distributions = {'n_estimators': randint(64, 256),
                       'max_depth': [None, 4, 8, 16],
                       'min_samples_split': [2, 4, 8]}

# Create the RandomizedSearchCV object
random_search = RandomizedSearchCV(estimator=model, param_distributions=param_distributions, cv=5, verbose=1)

# Perform the search
random_search.fit(X_train, y_train)

# Retrieve the best hyperparameters
best_n_estimators = random_search.best_params_['n_estimators']
best_max_depth = random_search.best_params_['max_depth']
best_min_samples_split = random_search.best_params_['min_samples_split']

# Print the results
print("Best number of estimators:", best_n_estimators)
print("Best maximum depth:", best_max_depth)
print("Best minimum number of samples split:", best_min_samples_split)
Explanation
Lines 1–4: We import the following:
RandomizedSearchCV, which optimizes the hyperparameters.
train_test_split, which shuffles the dataset and splits it into training and testing data with a specific ratio.
RandomForestRegressor, which is our estimator for tuning the hyperparameters. A regressor is used because the diabetes target is a continuous value.
load_diabetes, which loads the diabetes dataset.
randint, which defines a discrete uniform distribution over integers.
Line 6: The load_diabetes function returns the input features (X) and target values (y). Setting return_X_y=True returns the features and target values as separate arrays.
Line 7: The train_test_split function splits the data into 67% training and 33% testing. The random_state sets the seed for the random number generator that determines how the data is shuffled.
Line 10: An instance of the RandomForestRegressor class is created.
Lines 12–14: We define a dictionary specifying the hyperparameters. The n_estimators value is sampled from the integers 64 to 255, while max_depth and min_samples_split are sampled from lists.
Line 17: The RandomizedSearchCV object is created using the estimator, the hyperparameter distributions, 5-fold cross-validation, and verbose output.
Line 20: The fit method is called to perform the randomized search for the best hyperparameters.
Lines 22–30: We retrieve the best hyperparameters found during the search and print them.
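After fitting, the search object exposes more than best_params_. The following sketch, which assumes a smaller search than the walkthrough (fewer trees, 3 folds, 5 candidates) so that it runs quickly, shows one way to read the best cross-validated score, list every candidate that was tried, and check the refitted model on the held-out test split.

```python
from scipy.stats import randint
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# Reduced search space and budget (assumptions made to keep the sketch fast)
search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_distributions={
        "n_estimators": randint(20, 60),
        "max_depth": [None, 4, 8, 16],
        "min_samples_split": [2, 4, 8],
    },
    n_iter=5,
    cv=3,
    random_state=42,
)
search.fit(X_train, y_train)

# Mean cross-validated score of the best candidate (R^2 for a regressor by default)
best_cv_score = search.best_score_

# With refit=True (the default), best_estimator_ is already retrained on X_train
test_score = search.best_estimator_.score(X_test, y_test)

# cv_results_ holds one entry per sampled candidate
for params, mean_score in zip(
    search.cv_results_["params"], search.cv_results_["mean_test_score"]
):
    print(round(mean_score, 3), params)

print("Best CV score:", round(best_cv_score, 3))
print("Test score:", round(test_score, 3))
```

Comparing the cross-validated score with the test score gives a quick check that the selected hyperparameters generalize beyond the folds used during the search.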
Conclusion
We can enhance the tuning process for our model by using different ranges of hyperparameters and utilizing more training data. This can potentially improve the model performance. RandomizedSearchCV offers a straightforward and flexible approach to optimizing model performance. It provides a balance between thoroughly exploring different hyperparameter values and efficiently using computational power.