Getting started with Sci-Kit Learn AutoML

An AutoML workflow should be able to preprocess data, select the right model to use, tune the hyperparameters and provide us with the best possible model as a result.

One such AutoML library is auto-sklearn. It automatically finds the right algorithm for the dataset you have provided and tunes the hyperparameters for you. It also handles missing values, and scales and encodes data.

More than that, it creates an ensemble pipeline. That is, it will train several models and then use them in conjunction with one another to make a final prediction. Each of those models is weighted differently, and the weights determine each model's influence on the final prediction.
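
For intuition, the snippet below is a minimal sketch (plain NumPy, not auto-sklearn's internals) of how a weighted ensemble combines the predicted class probabilities of several models; the probability values and weights are made up purely for illustration.

import numpy as np
# Hypothetical predicted class probabilities from three fitted models
# (rows = samples, columns = classes); values are illustrative only
proba_a = np.array([[0.9, 0.1], [0.4, 0.6]])
proba_b = np.array([[0.7, 0.3], [0.2, 0.8]])
proba_c = np.array([[0.6, 0.4], [0.5, 0.5]])
# The ensemble weights determine each model's influence on the final prediction
weights = [0.5, 0.3, 0.2]
# Weighted average of the probabilities, then pick the most likely class per sample
ensemble_proba = sum(w * p for w, p in zip(weights, [proba_a, proba_b, proba_c]))
final_prediction = ensemble_proba.argmax(axis=1)
print(final_prediction)  # [0 1]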

We can look at the classifiers it has at its disposal here:

import autosklearn.pipeline.components.classification
# List every classifier auto-sklearn can choose from
for name in autosklearn.pipeline.components.classification.ClassifierChoice.get_components():
    print(name)
adaboost
bernoulli_nb
decision_tree
extra_trees
gaussian_nb
gradient_boosting
k_nearest_neighbors
lda
liblinear_svc
libsvm_svc
mlp
multinomial_nb
passive_aggressive
qda
random_forest
sgd

We can also print all the available feature preprocessors using the code below.

import autosklearn.pipeline.components.feature_preprocessing
# List every feature preprocessor auto-sklearn can choose from
for name in autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice.get_components():
    print(name)
densifier
extra_trees_preproc_for_classification
extra_trees_preproc_for_regression
fast_ica
feature_agglomeration
kernel_pca
kitchen_sinks
liblinear_svc_preprocessor
no_preprocessing
nystroem_sampler
pca
polynomial
random_trees_embedding
select_percentile_classification
select_percentile_regression
select_rates_classification
select_rates_regression
truncatedSVD

Configuring auto-sklearn

Using the library is straightforward. The steps are as follows:

  1. Import all the libraries we need
  2. Read our CSV data into a dataframe
  3. Complete a test/train split on the data
  4. Define the model as an auto-sklearn classifier
  5. Fit the model with our dataset
  6. Make predictions using the model
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score
import autosklearn.classification
# Read the CSV data into a dataframe and separate the target from the features
df = pd.read_csv('/home/Datasets/heart.csv')
output = df['target']
features = df[['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal']]
# Train/test split
train_features, test_features, train_labels, test_labels = train_test_split(features, output, test_size=0.2, random_state=42)
# Define the model as an auto-sklearn classifier, fit it and make predictions
model = autosklearn.classification.AutoSklearnClassifier()
model.fit(train_features, train_labels)
predictions = model.predict(test_features)
print("Accuracy score", accuracy_score(test_labels, predictions))

This takes a very long time to run, which is okay, as it's saving you the hours you'd usually spend tuning parameters and testing different models. It will output a large number of different models that auto-sklearn assessed and built into your ensemble pipeline.

You can control the number of models that are added to your ensemble using the ensemble_size parameter. If you only want it to return the single best model, from an accuracy perspective, you can pass ensemble_size=1 into the AutoSklearnClassifier() function. You can also train your model with n_jobs=-1 to use all processor cores, and remove memory limits with memory_limit=None.

We may also only want to include the n best models in our ensemble. In this instance, we can use the ensemble_nbest=5 parameter, where 5 is the number of models you want to include.
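
Putting those options together, a configuration might look like the sketch below. The values are illustrative, and the parameter names are those used by the auto-sklearn version described in this post.

import autosklearn.classification
# Only return the single best model, use every core, and lift the memory cap
model = autosklearn.classification.AutoSklearnClassifier(
    ensemble_size=1,    # number of models combined into the final ensemble
    n_jobs=-1,          # use all processor cores
    memory_limit=None,  # remove the per-run memory limit
)
# Alternatively, build the ensemble from only the 5 best models found
model = autosklearn.classification.AutoSklearnClassifier(ensemble_nbest=5)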

If we want to train only certain types of model, we can pass include_estimators=['random_forest', 'gradient_boosting'] into the AutoSklearnClassifier() function, or we can exclude specific models with exclude_estimators=['random_forest'].

Similarly, we can define which preprocessors we wish to use with include_preprocessors=['pca'], or exclude those we don't want with exclude_preprocessors=['pca'].
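
For example, restricting the search space might look like the sketch below (these parameter names match the version used in this post; newer auto-sklearn releases express the same idea through include/exclude dictionaries instead).

import autosklearn.classification
# Only consider random forests and gradient boosting, with PCA as the feature preprocessor
model = autosklearn.classification.AutoSklearnClassifier(
    include_estimators=['random_forest', 'gradient_boosting'],
    include_preprocessors=['pca'],
)
# Or rule specific components out of the search instead
model = autosklearn.classification.AutoSklearnClassifier(
    exclude_estimators=['random_forest'],
    exclude_preprocessors=['pca'],
)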

We can also define our resampling strategy (e.g. resampling_strategy='cv', resampling_strategy_arguments={'folds': 5}), which would give us 5-fold cross-validation.

Finally, we can control the time taken for this activity. By default, it's 1 hour. However, time_left_for_this_task allows you to adjust the amount of time you'll allow auto-sklearn to search for the best model, and per_run_time_limit allows you to cut out models that take too long to fit.
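
A sketch combining the time budget and cross-validation options is below. The time values are illustrative, and to my knowledge, when using the 'cv' resampling strategy you also need to call refit() on the training data before predicting.

import autosklearn.classification
# Illustrative budgets: 10 minutes overall, at most 30 seconds per candidate model,
# evaluated with 5-fold cross-validation
model = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=600,   # total search budget in seconds (default is 1 hour)
    per_run_time_limit=30,         # cut off any single model fit after 30 seconds
    resampling_strategy='cv',
    resampling_strategy_arguments={'folds': 5},
)
model.fit(train_features, train_labels)
# With cross-validation resampling, refit on the full training set before predicting
model.refit(train_features, train_labels)
predictions = model.predict(test_features)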

Other useful parameters can be found in the auto-sklearn documentation.

The Models

If we now look at the model itself with print(model), we see the output below, which gives a nice view of the chosen model. Here, we have gaussian_nb, which is a Naive Bayes model; it has used a median imputation strategy, run some encoding and done some data scaling for us.

SimpleClassificationPipeline({'balancing:strategy': 'weighting', 'classifier:__choice__': 'gaussian_nb', 'data_preprocessing:categorical_transformer:categorical_encoding:__choice__': 'encoding', 'data_preprocessing:categorical_transformer:category_coalescence:__choice__': 'minority_coalescer', 'data_preprocessing:numerical_transformer:imputation:strategy': 'median', 'data_preprocessing:numerical_transformer:rescaling:__choice__': 'quantile_transformer', 'feature_preprocessor:__choice__': 'select_rates_classification', 'data_preprocessing:categorical_transformer:category_coalescence:minority_coalescer:minimum_fraction': 0.009151554238227241, 'data_preprocessing:numerical_transformer:rescaling:quantile_transformer:n_quantiles': 1726, 'data_preprocessing:numerical_transformer:rescaling:quantile_transformer:output_distribution': 'normal', 'feature_preprocessor:select_rates_classification:alpha': 0.027868485240680432, 'feature_preprocessor:select_rates_classification:score_func': 'chi2', 'feature_preprocessor:select_rates_classification:mode': 'fpr'},
dataset_properties={
  'task': 1,
  'sparse': False,
  'multilabel': False,
  'multiclass': False,
  'target_type': 'classification',
  'signed': False})

We can also take a look at sprint_statistics(), which gives us an overview of what auto-sklearn did for us.

print(model.sprint_statistics())
auto-sklearn results:
  Dataset name: 4d969e18-f480-11eb-823d-00163e10be7a
  Metric: accuracy
  Best validation score: 0.875000
  Number of target algorithm runs: 747
  Number of successful target algorithm runs: 563
  Number of crashed target algorithm runs: 184
  Number of target algorithms that exceeded the time limit: 0
  Number of target algorithms that exceeded the memory limit: 0

We can now take a look at the leaderboard, which gives you a view of the best models, including their execution duration and cost. This is a really nice way to view the best performing models among those that were tested. We can alter how this is displayed using the leaderboard's parameters.

print(model.leaderboard())
          rank  ensemble_weight                type    cost  duration
model_id                                                             
481          1             0.10                 mlp  0.1250  1.113225
663          2             0.08         extra_trees  0.1250  1.651528
506          3             0.10                 mlp  0.1500  1.269330
643          4             0.04       liblinear_svc  0.1500  0.887935
710          5             0.02  passive_aggressive  0.1500  0.961617
25           6             0.38            adaboost  0.1625  1.270192
172          7             0.02                 mlp  0.1625  0.926105
348          8             0.02  passive_aggressive  0.1625  0.906898
351          9             0.02  passive_aggressive  0.1625  0.748605
380         10             0.04  passive_aggressive  0.1625  1.074969
399         11             0.04  passive_aggressive  0.1625  0.960480
436         12             0.10       liblinear_svc  0.1625  0.880505
442         13             0.04  passive_aggressive  0.1625  1.072690

We can also choose to print more information about the models in the leaderboard. This is really useful, as we can now see the preprocessors that were applied to each of the models too.

fields=[ 'data_preprocessors', 'feature_preprocessors']
print(model.leaderboard(detailed=True, include=fields))
                                         data_preprocessors  \
model_id                                                      
25        [one_hot_encoding, minority_coalescer, normalize]   
172       [one_hot_encoding, minority_coalescer, power_t...   
348       [one_hot_encoding, no_coalescense, quantile_tr...   
351       [one_hot_encoding, minority_coalescer, quantil...   
380       [encoding, minority_coalescer, quantile_transf...   
399        [encoding, no_coalescense, quantile_transformer]   
436       [one_hot_encoding, minority_coalescer, quantil...   
442        [encoding, no_coalescense, quantile_transformer]   
481       [encoding, minority_coalescer, power_transformer]   
506       [encoding, minority_coalescer, power_transformer]   
643           [encoding, no_coalescense, power_transformer]   
663       [encoding, minority_coalescer, power_transformer]   
710       [one_hot_encoding, minority_coalescer, power_t...   
                  feature_preprocessors  
model_id                                 
25                         [polynomial]  
172                          [fast_ica]  
348                        [polynomial]  
351                        [polynomial]  
380                        [kernel_pca]  
399                        [polynomial]  
436                        [kernel_pca]  
442                        [polynomial]  
481       [select_rates_classification]  
506       [select_rates_classification]  
643                        [kernel_pca]  
663             [feature_agglomeration]  
710                        [kernel_pca]  

We can get a little more information on each of the models tested, as below. Here I am particularly interested in model_id 25. In this output, you'll see all of the model's hyperparameters listed. The approach below comes from a GitHub question I raised on the main project page.

wanted_model_id = 25
wanted_model = None
# The fitted models are keyed by (seed, model_id, budget)
for (seed, model_id, budget), x in model.automl_.models_.items():
    if model_id == wanted_model_id:
        wanted_model = x
print(wanted_model)

SimpleClassificationPipeline({'balancing:strategy': 'none', 'classifier:__choice__': 'adaboost', 'data_preprocessing:categorical_transformer:categorical_encoding:__choice__': 'one_hot_encoding', 'data_preprocessing:categorical_transformer:category_coalescence:__choice__': 'minority_coalescer', 'data_preprocessing:numerical_transformer:imputation:strategy': 'mean', 'data_preprocessing:numerical_transformer:rescaling:__choice__': 'normalize', 'feature_preprocessor:__choice__': 'polynomial', 'classifier:adaboost:algorithm': 'SAMME', 'classifier:adaboost:learning_rate': 0.13167493237005792, 'classifier:adaboost:max_depth': 2, 'classifier:adaboost:n_estimators': 56, 'data_preprocessing:categorical_transformer:category_coalescence:minority_coalescer:minimum_fraction': 0.05897357701860171, 'feature_preprocessor:polynomial:degree': 3, 'feature_preprocessor:polynomial:include_bias': 'False', 'feature_preprocessor:polynomial:interaction_only': 'False'},
dataset_properties={
  'task': 1,
  'sparse': False,
  'multilabel': False,
  'multiclass': False,
  'target_type': 'classification',
  'signed': False})

We can also look at all of the models as below.

print(model.show_models())
[(1.000000, SimpleClassificationPipeline({'balancing:strategy': 'none', 'classifier:__choice__': 'lda', 'data_preprocessing:categorical_transformer:categorical_encoding:__choice__': 'encoding', 'data_preprocessing:categorical_transformer:category_coalescence:__choice__': 'minority_coalescer', 'data_preprocessing:numerical_transformer:imputation:strategy': 'most_frequent', 'data_preprocessing:numerical_transformer:rescaling:__choice__': 'quantile_transformer', 'feature_preprocessor:__choice__': 'kernel_pca', 'classifier:lda:shrinkage': 'manual', 'classifier:lda:tol': 1.2307224406357523e-05, 'data_preprocessing:categorical_transformer:category_coalescence:minority_coalescer:minimum_fraction': 0.009593734803209422, 'data_preprocessing:numerical_transformer:rescaling:quantile_transformer:n_quantiles': 880, 'data_preprocessing:numerical_transformer:rescaling:quantile_transformer:output_distribution': 'normal', 'feature_preprocessor:kernel_pca:kernel': 'poly', 'feature_preprocessor:kernel_pca:n_components': 70, 'classifier:lda:shrinkage_factor': 0.4278351678874593, 'feature_preprocessor:kernel_pca:coef0': -0.7133494187560384, 'feature_preprocessor:kernel_pca:degree': 3, 'feature_preprocessor:kernel_pca:gamma': 0.02316439941089123},
dataset_properties={
  'task': 1,
  'sparse': False,
  'multilabel': False,
  'multiclass': False,
  'target_type': 'classification',
  'signed': False})),
]

A worked example

In the below, we're doing a test/train split and using auto-sklearn with an ensemble size of 5. That means it will combine up to 5 models to build the ensemble.

from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score
import autosklearn.classification
# features and output are the dataframes created from the heart dataset earlier
train_features, test_features, train_labels, test_labels = train_test_split(features, output, test_size=0.2, random_state=42)
model = autosklearn.classification.AutoSklearnClassifier(ensemble_size=5)
model.fit(train_features, train_labels)
predictions = model.predict(test_features)

We can now check the leaderboard to see the models chosen for the ensemble, along with the weight assigned to each of them.

print(model.leaderboard())
          rank  ensemble_weight type    cost  duration
model_id                                              
236          1              0.4  mlp  0.1250  1.053711
329          2              0.2  mlp  0.1375  0.985523
688          3              0.2  mlp  0.1375  1.322353
724          4              0.2  mlp  0.1375  1.118769

We can look at the detailed configuration of each model. In the example below, we can see that it's using one-hot encoding, the imputation strategy is the mean, and so on.

wanted_model_id = 236
wanted_model = None
for (seed, model_id, budget), x in model.automl_.models_.items():
    if model_id == wanted_model_id:
        wanted_model = x

print(wanted_model)

SimpleClassificationPipeline({'balancing:strategy': 'weighting', 'classifier:__choice__': 'mlp', 'data_preprocessing:categorical_transformer:categorical_encoding:__choice__': 'one_hot_encoding', 'data_preprocessing:categorical_transformer:category_coalescence:__choice__': 'minority_coalescer', 'data_preprocessing:numerical_transformer:imputation:strategy': 'mean', 'data_preprocessing:numerical_transformer:rescaling:__choice__': 'quantile_transformer', 'feature_preprocessor:__choice__': 'fast_ica', 'classifier:mlp:activation': 'tanh', 'classifier:mlp:alpha': 4.472366621925168e-05, 'classifier:mlp:batch_size': 'auto', 'classifier:mlp:beta_1': 0.9, 'classifier:mlp:beta_2': 0.999, 'classifier:mlp:early_stopping': 'valid', 'classifier:mlp:epsilon': 1e-08, 'classifier:mlp:hidden_layer_depth': 1, 'classifier:mlp:learning_rate_init': 0.0010918286117142493, 'classifier:mlp:n_iter_no_change': 32, 'classifier:mlp:num_nodes_per_layer': 57, 'classifier:mlp:shuffle': 'True', 'classifier:mlp:solver': 'adam', 'classifier:mlp:tol': 0.0001, 'data_preprocessing:categorical_transformer:category_coalescence:minority_coalescer:minimum_fraction': 0.0222450229324372, 'data_preprocessing:numerical_transformer:rescaling:quantile_transformer:n_quantiles': 886, 'data_preprocessing:numerical_transformer:rescaling:quantile_transformer:output_distribution': 'normal', 'feature_preprocessor:fast_ica:algorithm': 'parallel', 'feature_preprocessor:fast_ica:fun': 'exp', 'feature_preprocessor:fast_ica:whiten': 'False', 'classifier:mlp:validation_fraction': 0.1},
dataset_properties={
  'task': 1,
  'sparse': False,
  'multilabel': False,
  'multiclass': False,
  'target_type': 'classification',
  'signed': False})

We can look at the classification report for the model:

print(metrics.classification_report(test_labels, predictions))
              precision    recall  f1-score   support
           0       0.80      0.83      0.81        29
           1       0.84      0.81      0.83        32
    accuracy                           0.82        61
   macro avg       0.82      0.82      0.82        61
weighted avg       0.82      0.82      0.82        61
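
We already imported confusion_matrix above, so we can also check where the misclassifications fall:

from sklearn.metrics import confusion_matrix
# Rows are the true classes, columns the predicted classes
print(confusion_matrix(test_labels, predictions))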

We can then store the entire ensemble of models in a pickle file, which can be loaded and used for predictions at a later date.

import pickle
# Save the fitted ensemble to disk, then load it back and predict
pickle.dump(model, open('model.pickle', 'wb'))
model = pickle.load(open('model.pickle', 'rb'))
model.predict(test_features)

array([0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0,
       1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0])

I hope this was a useful introduction to auto-sklearn. From my perspective, it is a very powerful library, although it does lack the complete visibility you would expect as a data scientist.
