Using GridSearch for Hyperparameter Tuning

Hyperparameter tuning can be quite the task. It can be heavily manual and rely on gut feel as to what the value for each parameter should roughly be. If we did it in a completely manual way, I imagine that depression rates in data scientists would go through the roof, but luckily we have GridSearch.

GridSearch (GridSearchCV in scikit-learn) is an exhaustive search: it tries every combination of the hyperparameter values you define in a Python dictionary, scores each one with cross-validation, and reports the best config for you.
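To get a feel for what "exhaustive" means, here is a minimal sketch that just counts the combinations a grid produces. The grid values are made up for illustration; ParameterGrid is the scikit-learn helper that enumerates the combinations of such a dictionary.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: 3 values x 2 values x 2 values
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [5, 10],
    "bootstrap": [True, False],
}

# Expand the dictionary into every possible combination
combinations = list(ParameterGrid(param_grid))

print(len(combinations))  # 12 combinations in total
print(combinations[0])    # one candidate config, as a plain dict
```

Each of those 12 configs is then trained once per cross-validation fold, so with 5 folds that is already 60 model fits. The count multiplies quickly as you add more values.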

Let’s go through the below block of code. Here, I am defining a random forest classification model. In the parameter grid, I’ve set some values (e.g. n_estimators, max_depth and bootstrap). I’ve defined some of them below; you can find out more on the scikit-learn website.
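As a rough sketch of what such a block could look like (this is an illustration rather than the exact code from this post, and every value in the grid is a placeholder I picked for the example):

```python
from sklearn.ensemble import RandomForestClassifier

# The model whose hyperparameters we want to tune
# (random_state is only there to make runs repeatable)
model = RandomForestClassifier(random_state=42)

# Candidate values for each hyperparameter; illustrative values only
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, None],
    "max_features": ["sqrt", 0.2, None],
    "min_samples_leaf": [1, 10, 50],
    "bootstrap": [True, False],
}
```

The dictionary keys have to match the parameter names of the model exactly, otherwise the search will complain when you fit it.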

Random Forest Parameters

  • n_estimators is the number of trees you want to include in your forest. The trees essentially vote & the majority wins. More trees generally give you more accurate and more stable predictions when released into the wild, but the gains flatten out after a point and training gets slower. The key is to use as many trees as your processor and time budget can comfortably handle.
  • max_features is the maximum number of features that a random forest is allowed to consider each time it splits a node in a decision tree. Increasing it usually (but not always) improves model accuracy, but it slows down training. Here are the options:
    • None – considers all of the features at every split. (Older scikit-learn versions also accepted auto, which has since been deprecated.)
    • sqrt – considers the square root of the total number of features at each split. E.g. with 100 features, that would be 10 features per split
    • 0.2 – considers 20% of the features at each split.
  • min_samples_leaf (min leaf size): The leaf is the end node of the decision tree. A smaller leaf makes the tree more prone to capturing noise in the training data & overfitting. Setting a larger minimum (e.g. around 50 observations, if your dataset is big enough) should help remove some of the overfitting.
  • bootstrap is all about the sampling methodology. Our training dataset is almost always a sample of the population, and bootstrapping is the notion of repeatedly sampling from that dataset. But instead of removing each drawn data point from the training dataset, bootstrapping puts it back, so the same data point can appear several times in one sample and in multiple samples. This means each tree is trained on a slightly different distribution of the data, which makes the trees less alike & the forest as a whole more robust.
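Bootstrapping is easier to see than to describe, so here is a tiny sketch using NumPy directly. This is not how scikit-learn does it internally, and the 10 "rows" are just stand-ins for a training dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)  # pretend these are 10 training rows

# Draw 10 rows WITH replacement: a row goes back into the pool after
# being drawn, so duplicates are allowed
bootstrap_sample = rng.choice(data, size=data.size, replace=True)

# Rows that never got drawn for this sample
left_out = np.setdiff1d(data, bootstrap_sample)

print(np.sort(bootstrap_sample))
print(left_out)
```

With bootstrap=True, each tree in the forest gets its own sample like this one, which is why no two trees see quite the same data.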

Defining grid search

Where I define best, we create the grid search:

  • estimator is the model, which we defined to be a random forest classifier
  • The list of parameters that grid search will run through is defined in the param_grid dictionary
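  • cv stands for cross-validation, which I covered in detail here. Essentially, it is how many blocks (folds) you want to split your data into. Each config is trained on all but one block and scored on the remaining one, once per block, and the scores are averaged to give a more reliable estimate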
  • n_jobs defines how many jobs can run in parallel. Setting this to -1 means that it should use all processors.

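So, now we have defined what the grid search will loop through, we set it going by calling .fit. Once it has finished, the best_score_ attribute holds the best mean cross-validation score of all the grid search attempts and best_params_ holds the config that achieved it. Calling .score then evaluates that best model on whatever data you pass in.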
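Putting those pieces together, a minimal end-to-end sketch might look like the following. The dataset is synthetic (make_classification) and the grid is deliberately tiny so it runs quickly; both are my assumptions for the example, not the data or values from this post. I have kept the variable name best to match the description above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for a real dataset
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Small illustrative grid: 2 x 2 x 2 = 8 combinations
param_grid = {
    "n_estimators": [20, 50],
    "max_depth": [3, None],
    "bootstrap": [True, False],
}

best = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=3,      # 3 folds per combination
    n_jobs=1,  # set to -1 to use all processors
)
best.fit(X_train, y_train)

print(best.best_params_)            # the winning config
print(best.best_score_)             # its mean cross-validation score
print(best.score(X_test, y_test))   # the refitted best model on held-out data
```

Scoring on a held-out test set at the end is a design choice: best_score_ comes from the same data the search was tuned on, so it tends to be a little optimistic.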

Using Randomized Search

The thing is… grid search is really, really slow, because it searches absolutely everything. Of course, by searching every combination, you can be sure that you’re getting the best model the grid has to offer, but sometimes that is extremely expensive.

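To get around this, we can use RandomizedSearch (RandomizedSearchCV in scikit-learn). It follows almost exactly the same config as grid search (the dictionary is passed as param_distributions rather than param_grid), but instead of searching absolutely everything, it randomly samples combinations from your parameter list. You define the number of samples it takes with the n_iter parameter.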
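Here is a minimal sketch of the randomized version, again on synthetic data with made-up ranges. One nice extra is that the values can be distributions (here scipy.stats.randint) as well as plain lists, so the search is not limited to a handful of hand-picked values.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for a real dataset
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Lists are sampled uniformly; randint(a, b) draws integers from a to b - 1
param_distributions = {
    "n_estimators": randint(20, 100),
    "max_depth": [3, 5, None],
    "min_samples_leaf": randint(1, 20),
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,         # only 5 randomly sampled configs are tried
    cv=3,
    n_jobs=1,         # set to -1 to use all processors
    random_state=42,  # makes the random sampling repeatable
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

The trade-off is that the best config found is only the best of the ones that happened to be sampled, so a larger n_iter buys a better chance of landing near the optimum at the cost of more fits.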

Kodey