# %% [markdown]
# ## Title:
#
# Exercise: Hyperparameter tuning
#
# ## Description:
#
# ### Tuning the hyperparameters
#
# Random Forests perform very well out of the box, using the pre-set hyperparameters in sklearn. Some of the tunable parameters are:
#
# - The number of trees in the forest: n_estimators, int, default=100
# - The complexity of each tree: stop splitting when a leaf has <= min_samples_leaf samples
# - The sampling scheme: the number of features to consider at any given split: max_features {"auto", "sqrt", "log2"}, int or float, default="auto"
#
# (A minimal sketch of setting these parameters appears after the hints below.)
#
# ## Instructions:
#
# - Read the datafile diabetes.csv as a Pandas data frame.
# - Assign the predictor and response variables as mentioned in the scaffold.
# - Split the data into train and validation sets.
# - Define a vanilla Random Forest and fit the model on the entire data.
# - For various hyperparameters of the model, define different Random Forest models and train them on the data.
# - Compare the results across the models.
#
# ## Hints:
#
# <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html" target="_blank">RandomForestClassifier()</a>
# Defines the RandomForestClassifier and includes more details on the definition and range of values for its tunable parameters.
#
# <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.predict_proba" target="_blank">model.predict_proba(X)</a>
# Predicts class probabilities for X.
#
# <a href="https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html" target="_blank">roc_auc_score(y_test, y_proba)</a>
# Calculates the area under the receiver operating characteristic curve (ROC AUC).
#
# <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html" target="_blank">GridSearchCV()</a>
# Performs an exhaustive search over specified parameter values for an estimator.
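# %% [markdown]
# Before starting, here is a minimal sketch (illustrative only, not part of the graded exercise) of how the three tunable parameters listed above map onto the `RandomForestClassifier` constructor:

# %%
from sklearn.ensemble import RandomForestClassifier

# Each keyword corresponds to one of the parameters described in the overview
sketch_rf = RandomForestClassifier(n_estimators=100,      # number of trees
                                   min_samples_leaf=1,    # tree complexity
                                   max_features='sqrt')   # features per split
sketch_rf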
# %%
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.inspection import permutation_importance

%matplotlib inline
# %%
# Read the dataset and take a quick look
df = pd.read_csv("diabetes.csv")
df.head()

# %%
# Assign the predictor and response variables.
# Outcome is the response and all the other columns are the predictors
X = df.drop("Outcome", axis=1)
y = df['Outcome']
# %%
# Set the seed for reproducibility of results
seed = 0

# Split the data into train and test sets with the mentioned seed
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.33,
                                                    random_state=seed)
# %% [markdown]
# ### Vanilla random forest
#
# Start by training a Random Forest classifier using the default parameters, and calculate the Receiver Operating Characteristic Area Under the Curve (ROC AUC). This metric is often more informative than accuracy for classification problems because it remains meaningful when the dataset is imbalanced.
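# %% [markdown]
# As a quick illustration of why ROC AUC is preferred here (a hypothetical toy example, separate from the exercise data): a classifier that always predicts the majority class scores high accuracy on an imbalanced set, yet its AUC is only 0.5.

# %%
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy 90/10 imbalanced labels (hypothetical, not the diabetes data)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 90 + [1] * 10)

dummy = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
print('Accuracy:', accuracy_score(y_toy, dummy.predict(X_toy)))              # 0.9
print('ROC AUC :', roc_auc_score(y_toy, dummy.predict_proba(X_toy)[:, 1]))   # 0.5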
# %%
### edTest(test_vanilla) ###
# Define a Random Forest classifier with random_state = seed
vanilla_rf = RandomForestClassifier(random_state=seed)

# Fit the model on the entire data (as per the instructions; note that the
# test rows are therefore seen during training, so the AUC below is optimistic)
vanilla_rf.fit(X=X, y=y)

# Calculate ROC AUC on the test set
y_proba = vanilla_rf.predict_proba(X_test)[:, 1]
auc = np.round(roc_auc_score(y_test, y_proba), 2)
print(f'Plain RF AUC on test set: {auc}')
# %%
# Number of samples and features
num_features = X_train.shape[1]
num_samples = X_train.shape[0]
num_samples, num_features
# %% [markdown]
# ### 1. Number of trees, `n_estimators`, default = 100
#
# The number of trees needs to be large enough for the $oob$ error to stabilize at its lowest possible value. Plot the $oob$ error of a random forest as a function of the number of trees. Trees in a RF are called `estimators`. A good starting point is 10 times the number of features; however, adjusting other hyperparameters will influence the optimal number of trees.
# %%
%%time
from collections import OrderedDict

clf = RandomForestClassifier(warm_start=True,
                             oob_score=True,
                             min_samples_leaf=40,
                             max_depth=10,
                             random_state=seed)

error_rate = {}

# Range of `n_estimators` values to explore.
min_estimators = 80
max_estimators = 500

for i in range(min_estimators, max_estimators + 1):
    clf.set_params(n_estimators=i)
    clf.fit(X_train.values, y_train.values)

    # Record the OOB error for each `n_estimators=i` setting.
    oob_error = 1 - clf.oob_score_
    error_rate[i] = oob_error
# %%
%%time
# Generate the "OOB error rate" vs. "n_estimators" plot.
# OOB error rate = num_misclassified / total observations (%)
xs = []
ys = []
for label, clf_err in error_rate.items():
    xs.append(label)
    ys.append(clf_err)

plt.plot(xs, ys)
plt.xlim(min_estimators, max_estimators)
plt.xlabel("n_estimators")
plt.ylabel("OOB error rate")
plt.show();
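# %% [markdown]
# The best setting can also be read directly off the recorded errors (a small sketch using the `error_rate` dict built above): pick the `n_estimators` value with the lowest OOB error.

# %%
best_n = min(error_rate, key=error_rate.get)
print(f'n_estimators with lowest OOB error: {best_n} '
      f'(OOB error {error_rate[best_n]:.3f})')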
# %% [markdown]
# ### 2. `min_samples_leaf`, default = 1
#
# The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression. We will plot the OOB error for various values of `min_samples_leaf` against `n_estimators`.
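# %% [markdown]
# To see the smoothing effect of `min_samples_leaf` in isolation (an illustrative sketch, not part of the graded exercise), compare the size of a single decision tree grown with different leaf sizes:

# %%
for leaf in [1, 5, 40]:
    tree = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=seed)
    tree.fit(X_train, y_train)
    print(f'min_samples_leaf={leaf:>2}: '
          f'{tree.get_n_leaves()} leaves, depth {tree.get_depth()}')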
# %%
%%time
from collections import OrderedDict

ensemble_clfs = [
    (1,
     RandomForestClassifier(warm_start=True,
                            min_samples_leaf=1,
                            oob_score=True,
                            max_depth=10,
                            random_state=seed)),
    (5,
     RandomForestClassifier(warm_start=True,
                            min_samples_leaf=5,
                            oob_score=True,
                            max_depth=10,
                            random_state=seed))
]

# Map a label (the value of `min_samples_leaf`) to a list of (n_estimators, oob error) tuples.
error_rate = OrderedDict((label, []) for label, _ in ensemble_clfs)

min_estimators = 80
max_estimators = 500

for label, clf in ensemble_clfs:
    for i in range(min_estimators, max_estimators + 1):
        clf.set_params(n_estimators=i)
        clf.fit(X_train.values, y_train.values)

        # Record the OOB error for each model. Error is 1 - oob_score.
        # oob_score: score of the training dataset obtained using an
        # out-of-bag estimate.
        # OOB error rate is num_misclassified / total observations (%)
        oob_error = 1 - clf.oob_score_
        error_rate[label].append((i, oob_error))

for label, clf_err in error_rate.items():
    xs, ys = zip(*clf_err)
    plt.plot(xs, ys, label=f'min_samples_leaf={label}')

plt.xlim(min_estimators, max_estimators)
plt.xlabel("n_estimators")
plt.ylabel("OOB error rate")
plt.legend(loc="upper right")
plt.show();
# %%
# Find the (min_samples_leaf, n_estimators) pair with the lowest OOB error
err = 100
best_num_estimators = 0
best_leaf = None
for label, clf_err in error_rate.items():
    # Lowest error for this leaf size; ties broken in favor of more estimators
    num_estimators, error = min(clf_err, key=lambda n: (n[1], -n[0]))
    if error < err:
        err = error
        best_num_estimators = num_estimators
        best_leaf = label

print(f'Optimum num of estimators: {best_num_estimators} \nmin_samples_leaf: {best_leaf}')
# %% [markdown]
# Re-train the Random Forest classifier using the new values for the parameters and calculate the ROC AUC. Include another parameter, `max_features`: the number of features to consider when looking for the best split.
# %%
### edTest(test_estimators) ###
estimators_rf = RandomForestClassifier(n_estimators=best_num_estimators,
                                       random_state=seed,
                                       oob_score=True,
                                       min_samples_leaf=best_leaf,
                                       max_features='sqrt')

# Fit the model on the training data
estimators_rf.fit(X_train, y_train);

# Calculate ROC AUC on the test set
y_proba = estimators_rf.predict_proba(X_test)[:, 1]
estimators_auc = np.round(roc_auc_score(y_test, y_proba), 2)
print(f'Educated RF AUC on test set: {estimators_auc}')
# %% [markdown]
# Look at the model's parameters:

# %%
estimators_rf.get_params()
# %% [markdown]
# ### 3. Performing a cross-validation search
#
# Once we have some idea of the range of optimal values for the number of trees, and perhaps a couple of other parameters, and if enough computing power is available, we can perform an exhaustive search over other parameter values.
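# %% [markdown]
# When the grid is large, `RandomizedSearchCV` samples a fixed number of parameter settings instead of trying them all. Below is a sketch of this cheaper alternative (the grid values here are hypothetical, for illustration; the exercise itself uses `GridSearchCV` in the next cell):

# %%
from sklearn.model_selection import RandomizedSearchCV

rand_search = RandomizedSearchCV(RandomForestClassifier(random_state=seed),
                                 param_distributions={'min_samples_split': [2, 5, 10],
                                                      'min_samples_leaf': [1, 5, 10, 20]},
                                 n_iter=5,           # number of sampled settings
                                 scoring='roc_auc',
                                 random_state=seed,
                                 n_jobs=-1)
rand_search.fit(X_train, y_train)
print('Best CV AUC:', np.round(rand_search.best_score_, 2))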
# %%
from sklearn.model_selection import GridSearchCV

do_grid_search = True

if do_grid_search:
    rf = RandomForestClassifier(n_jobs=-1,
                                n_estimators=best_num_estimators,
                                oob_score=True,
                                max_features='sqrt',
                                min_samples_leaf=best_leaf,
                                random_state=seed).fit(X_train, y_train)

    param_grid = {'min_samples_split': [2, 5]}

    scoring = {'AUC': 'roc_auc'}

    grid_search = GridSearchCV(rf,
                               param_grid,
                               scoring=scoring,
                               refit='AUC',
                               return_train_score=True,
                               n_jobs=-1)

    results = grid_search.fit(X_train, y_train)
    print(results.best_estimator_.get_params())

    best_rf = results.best_estimator_

    # Calculate ROC AUC on the test set
    y_proba = best_rf.predict_proba(X_test)[:, 1]
    auc = np.round(roc_auc_score(y_test, y_proba), 2)
    print(f'GridSearchCV RF AUC on test set: {auc}')
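# %% [markdown]
# Finally, per the instructions, compare the results across the models (a short summary sketch, assuming the cells above were run in order):

# %%
for name, model in [('Plain', vanilla_rf),
                    ('Educated', estimators_rf),
                    ('GridSearchCV', best_rf)]:
    proba = model.predict_proba(X_test)[:, 1]
    print(f'{name} RF AUC on test set: {np.round(roc_auc_score(y_test, proba), 2)}')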