In machine learning, hyperparameters are tuned to maximize a model's performance and generalization ability. They are not learned from the training data but are chosen before the model is trained. Examples include the number of epochs, the number of hidden layers in a neural network, and the number of trees or the tree depth in a random forest.
There are multiple methods to tune hyperparameters. One widely used method is grid search, implemented in scikit-learn's GridSearchCV class. It searches exhaustively over a predefined grid for the best hyperparameter combination. Because every combination of parameter values is tried, it can be computationally expensive. This is why a search that evaluates only some of the combinations can have a drastically lower runtime than GridSearchCV.
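To make the cost of an exhaustive search concrete, the following sketch counts the combinations in a small grid with scikit-learn's ParameterGrid helper. The grid values and the 5-fold setting are assumptions chosen only for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid, chosen only for illustration
param_grid = {
    "n_estimators": [64, 128, 256],
    "max_depth": [None, 4, 8, 16],
    "min_samples_split": [2, 4, 8],
}

grid = ParameterGrid(param_grid)

# A grid search tries every combination: 3 * 4 * 3 of them here
n_combinations = len(grid)

# With k-fold cross-validation, each combination is fitted k times
cv_folds = 5
n_fits = n_combinations * cv_folds

print("Combinations in the grid:", n_combinations)
print("Model fits with 5-fold cross-validation:", n_fits)
```

Adding one more hyperparameter with four candidate values would multiply both numbers by four, which is how grids become expensive quickly.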
Another method is randomized parameter optimization, which is implemented in the RandomizedSearchCV class. RandomizedSearchCV samples hyperparameter combinations from predefined distributions and reports the combination that achieves the best cross-validated score. Instead of trying all possible combinations, it evaluates only a fixed number of sampled ones. However, because of its random nature, it is not guaranteed to find the global optimum.
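The sampling step can be sketched on its own with scikit-learn's ParameterSampler utility, which draws parameter combinations from a dictionary of lists and distributions. The distributions, the number of draws, and the seed below are illustrative assumptions.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Hypothetical search space: one distribution and one list of values
param_distributions = {
    "n_estimators": randint(64, 256),  # integers from 64 to 255
    "max_depth": [None, 4, 8, 16],     # sampled uniformly from the list
}

# Draw 5 random combinations; random_state makes the draws repeatable
sampler = ParameterSampler(param_distributions, n_iter=5, random_state=0)
samples = list(sampler)

for params in samples:
    print(params)
```

Only these five combinations would be evaluated, regardless of how many combinations the search space contains.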
The RandomizedSearchCV constructor takes the following parameters:
estimator: This is the model whose hyperparameters need tuning.
param_distributions: This is a dictionary with parameter names as keys and either lists of values or distributions as values. The search samples candidate values for these parameters.
If a value is a distribution, it must provide an rvs method for sampling, as the distributions in scipy.stats do.
If a value is a list, it is sampled uniformly.
If the input is a list of dictionaries, a dictionary is first sampled uniformly, and the parameters are then sampled using that dictionary.
n_iter: This optional parameter determines the number of random parameter combinations to try. Its default value is 10.
scoring: This optional parameter determines the scoring strategy used to evaluate the model on the held-out cross-validation folds. Its default value is None, in which case the estimator's own score method is used.
n_jobs: This optional parameter sets the number of parallel jobs to run. Its default value is None.
refit: This optional parameter has a default value of True. When it is enabled, the estimator is refit on the whole training set using the best hyperparameters found.
cv: This optional parameter sets the cross-validation splitting strategy, for example, the number of folds. Its default value is None, which uses 5-fold cross-validation.
verbose: This optional parameter controls how many messages are printed during the search. Its default value is 0.
pre_dispatch: This controls the number of jobs dispatched during parallel computation. Its default value is '2*n_jobs'.
random_state: This is an integer that seeds the pseudo-random number generator used to sample the parameters. It can also be a RandomState instance, which provides more control over generating random numbers from various probability distributions. Its default value is None.
error_score: If its value is set to 'raise', the error that occurs during model fitting is raised. If a numeric value is set, that value is assigned as the score of the failed fit, and a FitFailedWarning is raised. Its default value is nan.
return_train_score: If set to True, the results also include the training scores, which tell how well the estimator fits the training data. Its default value is False.
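The parameters above can be combined in one call. The following is a minimal sketch, not a recommended configuration: the synthetic dataset (make_classification), the LogisticRegression estimator, the loguniform range for C, and all the argument values are assumptions picked to keep the example small and fast.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic classification problem (illustrative only)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

search = RandomizedSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e2)},  # provides rvs()
    n_iter=8,                 # number of sampled combinations
    scoring="accuracy",       # metric computed on each held-out fold
    cv=3,                     # 3-fold cross-validation
    n_jobs=1,                 # run fits sequentially
    random_state=0,           # repeatable sampling
    return_train_score=True,  # also record training scores
    error_score="raise",      # fail loudly if a fit raises an error
)
search.fit(X, y)

results = search.cv_results_
print("Candidates evaluated:", len(results["params"]))
print("Best C:", search.best_params_["C"])
print("Best mean CV accuracy:", search.best_score_)
print("Train scores recorded:", "mean_train_score" in results)
```

The cv_results_ dictionary holds one entry per sampled combination, so its length matches n_iter.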
Let's look at a code example that shows how to use RandomizedSearchCV.
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_diabetes
from scipy.stats import randint

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Create the model (a regressor, because the diabetes target is continuous)
model = RandomForestRegressor()

param_distributions = {'n_estimators': randint(64, 256),
                       'max_depth': [None, 4, 8, 16],
                       'min_samples_split': [2, 4, 8]}

# Create the RandomizedSearchCV object
random_search = RandomizedSearchCV(estimator=model, param_distributions=param_distributions, cv=5, verbose=1)

# Perform the search
random_search.fit(X_train, y_train)

# Extract the best hyperparameters
best_n_estimators = random_search.best_params_['n_estimators']
best_max_depth = random_search.best_params_['max_depth']
best_min_samples_split = random_search.best_params_['min_samples_split']

# Print the results
print("Best number of estimators:", best_n_estimators)
print("Best maximum depth:", best_max_depth)
print("Best minimum number of samples split:", best_min_samples_split)

Lines 1–4: We import the following:
RandomizedSearchCV: the class that performs the randomized hyperparameter search
train_test_split: a function that shuffles the dataset and splits it into training and testing data with a specific ratio
RandomForestRegressor: the estimator whose hyperparameters we tune
load_diabetes: a function that loads the diabetes dataset
randint: a SciPy distribution that generates discrete random integers
Line 6: The load_diabetes function returns the input features (X) and target values (y). Setting return_X_y=True returns the features and target values as separate arrays.
Line 7: The train_test_split function splits the data into 67% training and 33% testing. The random_state sets the seed for the random number generator that determines how the data is shuffled.
Line 10: An instance of the RandomForestRegressor class is created. A regressor is used because the diabetes target is a continuous value.
Lines 12–14: We are defining a dictionary specifying the hyperparameters.
Line 17: The RandomizedSearchCV object is initialized with the estimator, the hyperparameter distributions, 5-fold cross-validation, and verbose output.
Line 20: The fit method is called to perform the randomized search for the best hyperparameters.
Lines 22–30: We extract and print the best hyperparameters found during the search.
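Since refit defaults to True, the fitted search object also holds a model retrained with the best hyperparameters, which can be evaluated on the held-out test split. The sketch below assumes a RandomForestRegressor (the diabetes target is continuous) and a deliberately small search space and budget so that it runs quickly. These values are illustrative and are not the ones used in the walkthrough above.

```python
from scipy.stats import randint
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# Small, illustrative search space and budget
search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions={
        "n_estimators": randint(10, 50),
        "max_depth": [None, 4, 8],
    },
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)

# refit=True (the default) retrains the best configuration on all of X_train
best_model = search.best_estimator_

# predict() and score() on the search object delegate to best_estimator_
predictions = search.predict(X_test)
test_r2 = search.score(X_test, y_test)

print("Best parameters:", search.best_params_)
print("Mean cross-validated R^2:", search.best_score_)
print("Test-set R^2:", test_r2)
```

Comparing best_score_ with the test-set score gives a rough check on whether the cross-validated estimate carries over to unseen data.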
We can enhance the tuning process for our model by using different ranges of hyperparameters and utilizing more training data. This can potentially improve the model's performance. RandomizedSearchCV offers a straightforward and flexible approach to optimizing model performance. It provides a balance between thoroughly exploring different hyperparameter values and efficiently using computational power.