# %% [markdown]
# ## Title:
#
# Exercise: Hyperparameter tuning
#
# ## Description:
#
# ### Tuning the hyperparameters
#
# Random Forests perform very well out of the box, using the pre-set hyperparameters in sklearn. Some of the tunable parameters are:
#
# - The number of trees in the forest: n_estimators, int, default=100
# - The complexity of each tree: stop splitting when a leaf has <= min_samples_leaf samples
# - The sampling scheme: the number of features to consider at any given split: max_features {"auto", "sqrt", "log2"}, int or float, default="auto"
#
# (A minimal sketch of setting these parameters appears after the hints below.)
#
# ## Instructions:
#
# - Read the datafile diabetes.csv as a Pandas data frame.
# - Assign the predictor and response variables as mentioned in the scaffold.
# - Split the data into train and validation sets.
# - Define a vanilla Random Forest and fit the model on the entire data.
# - For various hyperparameters of the model, define different Random Forest models and train them on the data.
# - Compare the results across the models.
#
# ## Hints:
#
# <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html" target="_blank">RandomForestClassifier()</a>
# Defines the RandomForestClassifier and includes more details on the definition and range of values for its tunable parameters.
#
# <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.predict_proba" target="_blank">model.predict_proba(X)</a>
# Predicts class probabilities for X.
#
# <a href="https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html" target="_blank">roc_auc_score(y_test, y_proba)</a>
# Calculates the area under the receiver operating characteristic curve (ROC AUC).
#
# <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html" target="_blank">GridSearchCV()</a>
# Performs an exhaustive search over specified parameter values for an estimator.
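# %% [markdown]
# Before starting, here is a minimal sketch (illustrative only, not part of the graded exercise) of how the three tunable parameters listed above map onto the `RandomForestClassifier` constructor:

# %%
from sklearn.ensemble import RandomForestClassifier

# Each keyword corresponds to one of the parameters described in the overview
sketch_rf = RandomForestClassifier(n_estimators=100,      # number of trees
                                   min_samples_leaf=1,    # tree complexity
                                   max_features='sqrt')   # features per split
sketch_rf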
# %%
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.inspection import permutation_importance

%matplotlib inline
# %%
# Read the dataset and take a quick look
df = pd.read_csv("diabetes.csv")
df.head()

# %%
# Assign the predictor and response variables.
# Outcome is the response and all the other columns are the predictors
X = df.drop("Outcome", axis=1)
y = df['Outcome']
# %%
# Set the seed for reproducibility of results
seed = 0

# Split the data into train and test sets with the mentioned seed
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.33,
                                                    random_state=seed)
# %% [markdown]
# ### Vanilla random forest
#
# Start by training a Random Forest classifier using the default parameters, and calculate the Receiver Operating Characteristic Area Under the Curve (ROC AUC). This metric is often more informative than accuracy for classification problems because it remains meaningful when the dataset is imbalanced.
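# %% [markdown]
# As a quick illustration of why ROC AUC is preferred here (a hypothetical toy example, separate from the exercise data): a classifier that always predicts the majority class scores high accuracy on an imbalanced set, yet its AUC is only 0.5.

# %%
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy 90/10 imbalanced labels (hypothetical, not the diabetes data)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 90 + [1] * 10)

dummy = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
print('Accuracy:', accuracy_score(y_toy, dummy.predict(X_toy)))              # 0.9
print('ROC AUC :', roc_auc_score(y_toy, dummy.predict_proba(X_toy)[:, 1]))   # 0.5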
# %%
### edTest(test_vanilla) ###
# Define a Random Forest classifier with random_state = seed
vanilla_rf = RandomForestClassifier(random_state=seed)

# Fit the model on the entire data (as per the instructions; note that the
# test rows are therefore seen during training, so the AUC below is optimistic)
vanilla_rf.fit(X=X, y=y)

# Calculate ROC AUC on the test set
y_proba = vanilla_rf.predict_proba(X_test)[:, 1]
auc = np.round(roc_auc_score(y_test, y_proba), 2)
print(f'Plain RF AUC on test set: {auc}')
# %%
# Number of samples and features
num_features = X_train.shape[1]
num_samples = X_train.shape[0]
num_samples, num_features
# %% [markdown]
# ### 1. Number of trees, `n_estimators`, default = 100
#
# The number of trees needs to be large enough for the $oob$ error to stabilize at its lowest possible value. Plot the $oob$ error of a random forest as a function of the number of trees. Trees in a RF are called `estimators`. A good starting point is 10 times the number of features; however, adjusting other hyperparameters will influence the optimal number of trees.
# %%
%%time
from collections import OrderedDict

clf = RandomForestClassifier(warm_start=True,
                             oob_score=True,
                             min_samples_leaf=40,
                             max_depth=10,
                             random_state=seed)

error_rate = {}

# Range of `n_estimators` values to explore.
min_estimators = 80
max_estimators = 500

for i in range(min_estimators, max_estimators + 1):
    clf.set_params(n_estimators=i)
    clf.fit(X_train.values, y_train.values)

    # Record the OOB error for each `n_estimators=i` setting.
    oob_error = 1 - clf.oob_score_
    error_rate[i] = oob_error
# %%
%%time
# Generate the "OOB error rate" vs. "n_estimators" plot.
# OOB error rate = num_misclassified / total observations (%)
xs = []
ys = []
for label, clf_err in error_rate.items():
    xs.append(label)
    ys.append(clf_err)

plt.plot(xs, ys)
plt.xlim(min_estimators, max_estimators)
plt.xlabel("n_estimators")
plt.ylabel("OOB error rate")
plt.show();
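# %% [markdown]
# The best setting can also be read directly off the recorded errors (a small sketch using the `error_rate` dict built above): pick the `n_estimators` value with the lowest OOB error.

# %%
best_n = min(error_rate, key=error_rate.get)
print(f'n_estimators with lowest OOB error: {best_n} '
      f'(OOB error {error_rate[best_n]:.3f})')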
# %% [markdown]
# ### 2. `min_samples_leaf`, default = 1
#
# The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression. We will plot the OOB error for various values of `min_samples_leaf` against `n_estimators`.
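# %% [markdown]
# To see the smoothing effect of `min_samples_leaf` in isolation (an illustrative sketch, not part of the graded exercise), compare the size of a single decision tree grown with different leaf sizes:

# %%
for leaf in [1, 5, 40]:
    tree = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=seed)
    tree.fit(X_train, y_train)
    print(f'min_samples_leaf={leaf:>2}: '
          f'{tree.get_n_leaves()} leaves, depth {tree.get_depth()}')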
# %%
%%time
from collections import OrderedDict

ensemble_clfs = [
    (1,
     RandomForestClassifier(warm_start=True,
                            min_samples_leaf=1,
                            oob_score=True,
                            max_depth=10,
                            random_state=seed)),
    (5,
     RandomForestClassifier(warm_start=True,
                            min_samples_leaf=5,
                            oob_score=True,
                            max_depth=10,
                            random_state=seed))
]

# Map a label (the value of `min_samples_leaf`) to a list of (n_estimators, oob error) tuples.
error_rate = OrderedDict((label, []) for label, _ in ensemble_clfs)

min_estimators = 80
max_estimators = 500

for label, clf in ensemble_clfs:
    for i in range(min_estimators, max_estimators + 1):
        clf.set_params(n_estimators=i)
        clf.fit(X_train.values, y_train.values)

        # Record the OOB error for each model. Error is 1 - oob_score.
        # oob_score: score of the training dataset obtained using an
        # out-of-bag estimate.
        # OOB error rate is num_misclassified / total observations (%)
        oob_error = 1 - clf.oob_score_
        error_rate[label].append((i, oob_error))

for label, clf_err in error_rate.items():
    xs, ys = zip(*clf_err)
    plt.plot(xs, ys, label=f'min_samples_leaf={label}')

plt.xlim(min_estimators, max_estimators)
plt.xlabel("n_estimators")
plt.ylabel("OOB error rate")
plt.legend(loc="upper right")
plt.show();
# %%
# Find the (min_samples_leaf, n_estimators) pair with the lowest OOB error
err = 100
best_num_estimators = 0
best_leaf = None
for label, clf_err in error_rate.items():
    # Lowest error for this leaf size; ties broken in favor of more estimators
    num_estimators, error = min(clf_err, key=lambda n: (n[1], -n[0]))
    if error < err:
        err = error
        best_num_estimators = num_estimators
        best_leaf = label

print(f'Optimum num of estimators: {best_num_estimators} \nmin_samples_leaf: {best_leaf}')
# %% [markdown]
# Re-train the Random Forest classifier using the new values for the parameters and calculate the ROC AUC. Include another parameter, `max_features`: the number of features to consider when looking for the best split.
# %%
### edTest(test_estimators) ###
estimators_rf = RandomForestClassifier(n_estimators=best_num_estimators,
                                       random_state=seed,
                                       oob_score=True,
                                       min_samples_leaf=best_leaf,
                                       max_features='sqrt')

# Fit the model on the training data
estimators_rf.fit(X_train, y_train);

# Calculate ROC AUC on the test set
y_proba = estimators_rf.predict_proba(X_test)[:, 1]
estimators_auc = np.round(roc_auc_score(y_test, y_proba), 2)
print(f'Educated RF AUC on test set: {estimators_auc}')
# %% [markdown]
# Look at the model's parameters:

# %%
estimators_rf.get_params()
# %% [markdown]
# ### 3. Performing a cross-validation search
#
# Once we have some idea of the range of optimal values for the number of trees, and perhaps a couple of other parameters, and if enough computing power is available, we can perform an exhaustive search over other parameter values.
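# %% [markdown]
# When the grid is large, `RandomizedSearchCV` samples a fixed number of parameter settings instead of trying them all. Below is a sketch of this cheaper alternative (the grid values here are hypothetical, for illustration; the exercise itself uses `GridSearchCV` in the next cell):

# %%
from sklearn.model_selection import RandomizedSearchCV

rand_search = RandomizedSearchCV(RandomForestClassifier(random_state=seed),
                                 param_distributions={'min_samples_split': [2, 5, 10],
                                                      'min_samples_leaf': [1, 5, 10, 20]},
                                 n_iter=5,           # number of sampled settings
                                 scoring='roc_auc',
                                 random_state=seed,
                                 n_jobs=-1)
rand_search.fit(X_train, y_train)
print('Best CV AUC:', np.round(rand_search.best_score_, 2))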
# %%
from sklearn.model_selection import GridSearchCV

do_grid_search = True

if do_grid_search:
    rf = RandomForestClassifier(n_jobs=-1,
                                n_estimators=best_num_estimators,
                                oob_score=True,
                                max_features='sqrt',
                                min_samples_leaf=best_leaf,
                                random_state=seed).fit(X_train, y_train)

    param_grid = {'min_samples_split': [2, 5]}

    scoring = {'AUC': 'roc_auc'}

    grid_search = GridSearchCV(rf,
                               param_grid,
                               scoring=scoring,
                               refit='AUC',
                               return_train_score=True,
                               n_jobs=-1)

    results = grid_search.fit(X_train, y_train)
    print(results.best_estimator_.get_params())

    best_rf = results.best_estimator_

    # Calculate ROC AUC on the test set
    y_proba = best_rf.predict_proba(X_test)[:, 1]
    auc = np.round(roc_auc_score(y_test, y_proba), 2)
    print(f'GridSearchCV RF AUC on test set: {auc}')
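# %% [markdown]
# Finally, per the instructions, compare the results across the models (a short summary sketch, assuming the cells above were run in order):

# %%
for name, model in [('Plain', vanilla_rf),
                    ('Educated', estimators_rf),
                    ('GridSearchCV', best_rf)]:
    proba = model.predict_proba(X_test)[:, 1]
    print(f'{name} RF AUC on test set: {np.round(roc_auc_score(y_test, proba), 2)}')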