# %% [markdown]
# ## Title:
# Bagging Classification with Decision Boundary
#
# ## Description:
# The goal of this exercise is to use **Bagging** (Bootstrap Aggregation) to solve a classification problem and visualize the influence of Bagging on trees with varying depths.
#
# Your final plot will resemble the one below.
#
# <img src="./fig2.png" style="width: 1500px;">
#
# ## Instructions:
#
# - Read the dataset `agriland.csv`.
# - Assign the predictor and response variables as `X` and `y`.
# - Split the data into train and test sets with `test_size=0.2` and `random_state=44`.
# - Fit a single `DecisionTreeClassifier()` and find the accuracy of your prediction.
# - Complete the helper function `prediction_by_bagging()` to find the average predictions for a given number of bootstraps (a short sketch of the bootstrap idea follows the hints below).
# - Perform `Bagging` using the helper function, and compute the new accuracy.
# - Plot the accuracy as a function of the number of bootstraps.
# - Use the helper code to plot the decision boundaries for varying `max_depth` along with `num_bootstraps`. Investigate the effect of increasing the number of bootstraps on the variance.
#
# ## Hints:
#
# <a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn-tree-decisiontreeclassifier" target="_blank">sklearn.tree.DecisionTreeClassifier()</a>
# A decision tree classifier.
#
# <a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.fit" target="_blank">DecisionTreeClassifier.fit()</a>
# Build a decision tree classifier from the training set (X, y).
#
# <a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.predict" target="_blank">DecisionTreeClassifier.predict()</a>
# Predict class or regression value for X.
#
# <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html" target="_blank">train_test_split()</a>
# Split arrays or matrices into random train and test subsets.
#
# <a href="https://docs.scipy.org/doc//numpy-1.10.4/reference/generated/numpy.random.choice.html" target="_blank">np.random.choice</a>
# Generates a random sample from a given 1-D array.
#
# <a href="https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.subplots.html" target="_blank">plt.subplots()</a>
# Create a figure and a set of subplots.
#
# <a href="https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.axes.Axes.plot.html" target="_blank">ax.plot()</a>
# Plot y versus x as lines and/or markers.
#
# **Note: This exercise is auto-graded and you can try multiple attempts.**

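# %% [markdown]
# Before starting, here is a minimal, self-contained sketch of the bootstrap idea used in the helper function below.
# A bootstrap sample is drawn from a toy array (hypothetical data, not `agriland.csv`) by sampling indices
# *with* replacement, so each bagged tree sees a slightly different training set.

# %%
import numpy as np

# Hypothetical toy response vector, used only to illustrate resampling
toy_y = np.array([10, 20, 30, 40, 50])

# np.random.choice samples with replacement by default (replace=True),
# so a bootstrap sample has the same size as the original but may repeat rows
toy_indexes = np.random.choice(np.arange(toy_y.shape[0]), size=toy_y.shape[0])

print(toy_indexes)         # e.g. [3 0 3 1 4] -- duplicates are expected
print(toy_y[toy_indexes])  # the bootstrapped sample
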
# %%
# Import necessary libraries
%matplotlib inline
import numpy as np
import pandas as pd
from sklearn import metrics
import scipy.optimize as opt
import scipy.stats as stats
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Used for plotting later
from matplotlib.colors import ListedColormap
cmap_bold = ListedColormap(['#F7345E','#80C3BD'])
cmap_light = ListedColormap(['#FFF4E5','#D2E3EF'])


# %%
# Read the file 'agriland.csv' as a Pandas dataframe
df = pd.read_csv('../DATA/agriland.csv')

# Take a quick look at the data
# Note that the latitude & longitude values are normalized
df.head()


# %%
# Set the values of latitude & longitude predictor variables
X = df.iloc[:,:-1].values

# Use the column "land_type" as the response variable
y = df.iloc[:,-1].values

# %%
# Split the data into train and test sets, with test size = 0.2
# and set the random state as 44
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=44)


# %%
# Define the max_depth of the decision tree
max_depth = 2

# Define a decision tree classifier with a max depth as defined above
# and set the random_state as 44
clf = DecisionTreeClassifier(max_depth=max_depth,random_state=44)

# Fit the model on the training data
clf.fit(X_train,y_train)


# %%
# Use the trained model to predict on the test set
prediction = clf.predict(X_test)

# Calculate the accuracy of the test predictions of a single tree
single_acc = accuracy_score(y_test,prediction)

# Print the accuracy of the tree
print(f'Single tree Accuracy is {single_acc*100}%')


# %%
# Complete the function below to get the prediction by bagging

# Inputs: X_train, y_train to train your data
# X_to_evaluate: samples that you are going to predict (evaluate)
# num_bootstraps: how many trees you want to train
# Output: an array of predicted classes for X_to_evaluate

def prediction_by_bagging(X_train, y_train, X_to_evaluate, num_bootstraps):

    # List to store every array of predictions
    predictions = []

    # Generate num_bootstraps number of trees
    for i in range(num_bootstraps):

        # Sample the data for each bootstrap; here we actually bootstrap indices,
        # because we want the same subset of rows for X_train and y_train
        resample_indexes = np.random.choice(np.arange(y_train.shape[0]), size=y_train.shape[0])

        # Get a bootstrapped version of the data using the above indices
        X_boot = X_train[resample_indexes]
        y_boot = y_train[resample_indexes]

        # Initialize a Decision Tree on the bootstrapped data
        # Use the same max_depth and random_state as above
        clf = DecisionTreeClassifier(max_depth=2,random_state=44)

        # Fit the model on the bootstrapped training set
        clf.fit(X_boot,y_boot)

        # Use the trained model to predict on the X_to_evaluate samples
        pred = clf.predict(X_to_evaluate)

        # Append the predictions to the predictions list
        predictions.append(pred)

    # The list "predictions" holds [prediction_array_0, prediction_array_1, ..., prediction_array_n]
    # To get the majority vote for each sample, take the column (per-sample) average
    # of the predictions and threshold it at 0.5
    predictions_avg = np.mean(predictions, axis=0)
    average_prediction = (predictions_avg > 0.5).astype(int)

    # Return the average prediction
    return average_prediction


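# %% [markdown]
# As a quick sanity check, the averaging-and-threshold step above is just a majority vote.
# A minimal illustration on a hypothetical array of per-tree 0/1 predictions (three trees, three samples),
# not the agriland data:

# %%
toy_predictions = np.array([[0, 1, 1],
                            [1, 1, 0],
                            [0, 1, 1]])   # rows = trees, columns = samples

# Column-wise average across trees, thresholded at 0.5, gives the majority class per sample
toy_vote = (toy_predictions.mean(axis=0) > 0.5).astype(int)
print(toy_vote)  # expected: [0 1 1]
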
# %%
### edTest(test_bag_acc) ###

# Define the number of bootstraps
num_bootstraps = 200

# Call the prediction_by_bagging function with appropriate parameters
y_pred = prediction_by_bagging(X_train,y_train,X_test,num_bootstraps=num_bootstraps)

# Compare the average predictions to the true test set values
# and compute the accuracy
bagging_accuracy = accuracy_score(y_test,y_pred)

# Print the bagging accuracy
print(f'Accuracy with Bootstrap Aggregation is {bagging_accuracy*100}%')


# %%
# Helper code to plot accuracy vs number of bagged trees

n = np.linspace(1,250,250).astype(int)
acc = []
for n_i in n:
    acc.append(np.mean(prediction_by_bagging(X_train, y_train, X_test, n_i)==y_test))
plt.figure(figsize=(10,8))
plt.plot(n,acc,alpha=0.7,linewidth=3,color='#50AEA4', label='Model Prediction')
plt.title('Accuracy vs. Number of trees in Bagging',fontsize=24)
plt.xlabel('Number of trees',fontsize=16)
plt.ylabel('Accuracy',fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(loc='best',fontsize=12)
plt.show()


# %% [markdown]
# ## Bagging Visualization
#
# Bagging does well to reduce overfitting, but only up to a certain extent.
#
# Vary the `max_depth` and `numboot` variables to see how Bagging helps reduce overfitting, with the help of the visualization below.

# %%
# Make plots for three different values of `max_depth`
fig,axes = plt.subplots(1,3,figsize=(20,6))

# Make a list of three max_depths to investigate
max_depth = [2,5,100]

# Fix the number of bootstraps
numboot = 100

for index,ax in enumerate(axes):

    for i in range(numboot):
        # Draw a bootstrap sample of the full dataframe (sampling rows with replacement)
        df_new = df.sample(frac=1,replace=True)
        y = df_new.land_type.values
        X = df_new[['latitude', 'longitude']].values

        # Fit a decision tree of the chosen depth on the bootstrapped data
        dtree = DecisionTreeClassifier(max_depth=max_depth[index])
        dtree.fit(X, y)

        # Plot the bootstrapped points
        ax.scatter(X[:, 0], X[:, 1], c=y-1, s=50,alpha=0.5,edgecolor="k",cmap=cmap_bold)

        # Build a grid over the predictor space to draw the decision boundary
        plot_step_x1= 0.1
        plot_step_x2= 0.1
        x1min, x1max= X[:,0].min(), X[:,0].max()
        x2min, x2max= X[:,1].min(), X[:,1].max()
        x1, x2 = np.meshgrid(np.arange(x1min, x1max, plot_step_x1), np.arange(x2min, x2max, plot_step_x2) )

        # Re-cast every coordinate in the meshgrid as a 2D point
        Xplot= np.c_[x1.ravel(), x2.ravel()]

        # Predict the class for every grid point and overlay the translucent decision regions
        y = dtree.predict( Xplot )
        y= y.reshape( x1.shape )
        cs = ax.contourf(x1, x2, y, alpha=0.02)

    ax.set_xlabel('Latitude',fontsize=14)
    ax.set_ylabel('Longitude',fontsize=14)
    ax.set_title(f'Max depth = {max_depth[index]}',fontsize=20)


# %% [markdown]
# ## Mindchow 🍲
# Play around with the following parameters:
#
# - max_depth
# - numboot
#
# Based on your observations, answer the questions below (a rough sketch for exploring the last one follows):
#
# - How does the plot change with varying `max_depth`?
#
# - How does the plot change with varying `numboot`?
#
# - How are the three plots essentially different?
#
# - Do more bootstraps reduce overfitting for
#     - High depth?
#     - Low depth?

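# %% [markdown]
# One rough, optional way to explore the last question numerically: re-run the bagging loop at a low and a high
# depth, with few and many bootstraps, and compare test accuracies. The loop is re-implemented inline because the
# helper above fixes `max_depth=2`; the depth and bootstrap values here are arbitrary choices.

# %%
for depth_try in [2, 100]:
    for boots_try in [10, 200]:
        preds = []
        for _ in range(boots_try):
            # Bootstrap the training indices and fit one tree per bootstrap
            idx = np.random.choice(np.arange(y_train.shape[0]), size=y_train.shape[0])
            tree = DecisionTreeClassifier(max_depth=depth_try).fit(X_train[idx], y_train[idx])
            preds.append(tree.predict(X_test))
        # Majority vote across trees, then test accuracy
        vote = (np.mean(preds, axis=0) > 0.5).astype(int)
        print(f'max_depth={depth_try:3d}, bootstraps={boots_try:3d}: test accuracy = {np.mean(vote == y_test):.3f}')
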
# %%