# hyper_tuning.py
# %% [markdown]
# ## Title :
#
# Exercise: Hyperparameter tuning
#
# ## Description :
#
# ### Tuning the hyperparameters
#
# Random Forests perform very well out-of-the-box, with the pre-set hyperparameters in sklearn. Some of the tunable parameters are:
#
# - The number of trees in the forest: n_estimators, int, default=100
# - The complexity of each tree: stop when a leaf has <= min_samples_leaf samples
# - The sampling scheme: the number of features to consider at any given split: max_features {"auto", "sqrt", "log2"}, int or float, default="auto"
#
# ## Instructions:
#
# - Read the datafile diabetes.csv as a Pandas data frame.
# - Assign the predictor and response variables as mentioned in the scaffold.
# - Split the data into train and validation sets.
# - Define a vanilla Random Forest and fit the model on the entire data.
# - For various hyperparameters of the model, define different Random Forest models and train on the data.
# - Compare the results of each model.
#
# ## Hints:
#
# <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html" target="_blank">RandomForestClassifier()</a>
# Defines the RandomForestClassifier and includes more details on the definition and range of values for its tunable parameters.
#
# <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.predict_proba" target="_blank">model.predict_proba(X)</a>
# Predicts class probabilities for X.
#
# <a href="https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html" target="_blank">roc_auc_score(y_test, y_proba)</a>
# Calculates the area under the receiver operating characteristic curve (ROC AUC).
#
# <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html" target="_blank">GridSearchCV()</a>
# Performs an exhaustive search over specified parameter values for an estimator.
# %%
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.inspection import permutation_importance
%matplotlib inline
# %%
# Read the dataset and take a quick look
df = pd.read_csv("diabetes.csv")
df.head()


# %%
# Assign the predictor and response variables.
# Outcome is the response and all the other columns are the predictors
X = df.drop("Outcome", axis=1)
y = df["Outcome"]


# %%
# Set the seed for reproducibility of results
seed = 0

# Split the data into train and test sets with the mentioned seed
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.33,
                                                    random_state=seed)
# %% [markdown]
# ### Vanilla random forest
#
# Start by training a Random Forest Classifier using the default parameters and calculating the Receiver Operating Characteristic Area Under the Curve (ROC AUC). This metric is more informative than accuracy for a classification problem, since it is not distorted by an imbalanced dataset. Below, we first take a quick look at the class balance.
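# %%
# Quick sanity check (an addition to the original scaffold): the class
# proportions of the response, which motivate preferring ROC AUC over accuracy.
y.value_counts(normalize=True)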
# %%
### edTest(test_vanilla) ###

# Define a Random Forest classifier with random_state = seed
vanilla_rf = RandomForestClassifier(random_state=seed)

# Fit the model on the entire data, as the scaffold instructs.
# Note that X_test is part of this training data, so the AUC below is
# optimistic compared to fitting on X_train alone.
vanilla_rf.fit(X=X, y=y)

# Calculate the AUC/ROC on the test set
y_proba = vanilla_rf.predict_proba(X_test)[:, 1]
auc = np.round(roc_auc_score(y_test, y_proba), 2)
print(f'Plain RF AUC on test set: {auc}')
# %%
# Number of samples and features
num_features = X_train.shape[1]
num_samples = X_train.shape[0]
num_samples, num_features
# %% [markdown]
# ### 1. Number of trees, `n_estimators`, default = 100
#
# The number of trees needs to be large enough for the out-of-bag (OOB) error to stabilize at its lowest possible value. Plot the OOB error of a random forest as a function of the number of trees. Trees in an RF are called `estimators`. A good starting point is 10 times the number of features; however, adjusting other hyperparameters will influence the optimal number of trees (see the rule-of-thumb check below).
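# %%
# Rule-of-thumb check (an addition to the scaffold): 10 trees per feature
# gives a rough starting point for n_estimators, which motivates
# min_estimators = 80 in the next cell.
10 * num_features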
# %%
%%time
clf = RandomForestClassifier(warm_start=True,
                             oob_score=True,
                             min_samples_leaf=40,
                             max_depth=10,
                             random_state=seed)

error_rate = {}

# Range of `n_estimators` values to explore.
min_estimators = 80
max_estimators = 500

for i in range(min_estimators, max_estimators + 1):
    clf.set_params(n_estimators=i)
    # warm_start=True reuses the trees fit so far and only adds new ones.
    clf.fit(X_train.values, y_train.values)

    # Record the OOB error for each `n_estimators=i` setting.
    oob_error = 1 - clf.oob_score_
    error_rate[i] = oob_error
# %%
%%time
# Generate the "OOB error rate" vs. "n_estimators" plot.
# OOB error rate = num_misclassified / total observations (%)
xs = []
ys = []
for label, clf_err in error_rate.items():
    xs.append(label)
    ys.append(clf_err)
plt.plot(xs, ys)
plt.xlim(min_estimators, max_estimators)
plt.xlabel("n_estimators")
plt.ylabel("OOB error rate")
plt.show();
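# %%
# Optional read-off (an addition to the scaffold): the n_estimators value with
# the smallest recorded OOB error. With a noisy curve this is only indicative;
# ties are broken by insertion order.
min(error_rate, key=error_rate.get)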
# %% [markdown]
# ### 2. `min_samples_leaf`, default = 1
#
# The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression. We will plot the OOB error for several values of `min_samples_leaf` as a function of `n_estimators` (after a brief single-tree illustration below).
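# %%
# A minimal single-tree sketch (not part of the original scaffold) of the
# smoothing effect: a larger min_samples_leaf forces leaves to hold more
# samples, so the tree grows far fewer leaves.
coarse_tree = DecisionTreeClassifier(min_samples_leaf=40, random_state=seed).fit(X_train, y_train)
fine_tree = DecisionTreeClassifier(min_samples_leaf=1, random_state=seed).fit(X_train, y_train)
print(f'Leaves with min_samples_leaf=40: {coarse_tree.get_n_leaves()}')
print(f'Leaves with min_samples_leaf=1:  {fine_tree.get_n_leaves()}')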
# %%
%%time
from collections import OrderedDict
ensemble_clfs = [
    (1,
     RandomForestClassifier(warm_start=True,
                            min_samples_leaf=1,
                            oob_score=True,
                            max_depth=10,
                            random_state=seed)),
    (5,
     RandomForestClassifier(warm_start=True,
                            min_samples_leaf=5,
                            oob_score=True,
                            max_depth=10,
                            random_state=seed))
]

# Map a label (the value of `min_samples_leaf`) to a list of
# (n_estimators, oob_error) pairs.
error_rate = OrderedDict((label, []) for label, _ in ensemble_clfs)

min_estimators = 80
max_estimators = 500

for label, clf in ensemble_clfs:
    for i in range(min_estimators, max_estimators + 1):
        clf.set_params(n_estimators=i)
        clf.fit(X_train.values, y_train.values)

        # Record the OOB error for each model. Error is 1 - oob_score.
        # oob_score: score of the training dataset obtained using an
        # out-of-bag estimate.
        # OOB error rate = num_misclassified / total observations (%)
        oob_error = 1 - clf.oob_score_
        error_rate[label].append((i, oob_error))

for label, clf_err in error_rate.items():
    xs, ys = zip(*clf_err)
    plt.plot(xs, ys, label=f'min_samples_leaf={label}')

plt.xlim(min_estimators, max_estimators)
plt.xlabel("n_estimators")
plt.ylabel("OOB error rate")
plt.legend(loc="upper right")
plt.show();
# %%
# For each min_samples_leaf setting, find the lowest OOB error and keep the
# overall best combination (ties broken in favor of more estimators).
err = 100
best_num_estimators = 0
best_leaf = None
for label, clf_err in error_rate.items():
    num_estimators, error = min(clf_err, key=lambda n: (n[1], -n[0]))
    if error < err:
        err = error
        best_num_estimators = num_estimators
        best_leaf = label

print(f'Optimum num of estimators: {best_num_estimators} \nmin_samples_leaf: {best_leaf}')
# %% [markdown]
# Re-train the Random Forest Classifier using the new values of these parameters and calculate the AUC/ROC. Include another parameter, `max_features`: the number of features to consider when looking for the best split (see the quick check below).
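# %%
# Quick check (an addition to the scaffold): with max_features='sqrt',
# each split considers roughly sqrt(num_features) candidate features.
int(np.sqrt(num_features))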
# %%
### edTest(test_estimators) ###
estimators_rf = RandomForestClassifier(n_estimators=best_num_estimators,
                                       random_state=seed,
                                       oob_score=True,
                                       min_samples_leaf=best_leaf,
                                       max_features='sqrt')

# Fit the model on the training data
estimators_rf.fit(X_train, y_train);

# Calculate the AUC/ROC on the test set
y_proba = estimators_rf.predict_proba(X_test)[:, 1]
estimators_auc = np.round(roc_auc_score(y_test, y_proba), 2)
print(f'Educated RF AUC on test set: {estimators_auc}')
# %% [markdown]
# Look at the model's parameters.

# %%
estimators_rf.get_params()
# %% [markdown]
# ### 3. Performing a cross-validation search
#
# Once we have some idea of the optimal ranges for the number of trees and perhaps a couple of other parameters, and if we have enough computing power, we can perform an exhaustive search over the remaining parameter values.
# %%
from sklearn.model_selection import GridSearchCV

do_grid_search = True

if do_grid_search:
    # GridSearchCV clones and refits the estimator itself, so there is no
    # need to fit `rf` beforehand.
    rf = RandomForestClassifier(n_jobs=-1,
                                n_estimators=best_num_estimators,
                                oob_score=True,
                                max_features='sqrt',
                                min_samples_leaf=best_leaf,
                                random_state=seed)

    param_grid = {
        'min_samples_split': [2, 5]}

    scoring = {'AUC': 'roc_auc'}

    grid_search = GridSearchCV(rf,
                               param_grid,
                               scoring=scoring,
                               refit='AUC',
                               return_train_score=True,
                               n_jobs=-1)

    results = grid_search.fit(X_train, y_train)
    print(results.best_estimator_.get_params())
    best_rf = results.best_estimator_

    # Calculate the AUC/ROC on the test set
    y_proba = best_rf.predict_proba(X_test)[:, 1]
    auc = np.round(roc_auc_score(y_test, y_proba), 2)
    print(f'GridSearchCV RF AUC on test set: {auc}')
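# %%
# Optional inspection (an addition to the scaffold): the cross-validated AUC
# for each candidate in the grid, assuming the search above was run. The
# column names follow GridSearchCV's cv_results_ convention for the 'AUC'
# scorer defined above.
if do_grid_search:
    cv_df = pd.DataFrame(results.cv_results_)
    print(cv_df[['param_min_samples_split', 'mean_test_AUC', 'std_test_AUC']])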