# %% [markdown]
# ## Title:
# Bagging Classification with Decision Boundary
#
# ## Description:
# The goal of this exercise is to use **Bagging** (Bootstrap Aggregation) to solve a classification problem and visualize the influence of Bagging on trees of varying depths.
#
# Your final plot will resemble the one below.
#
# <img src="./fig2.png" style="width: 1500px;">
#
# ## Instructions:
#
# - Read the dataset `agriland.csv`.
# - Assign the predictor and response variables as `X` and `y`.
# - Split the data into train and test sets with `test_size=0.2` and `random_state=44`.
# - Fit a single `DecisionTreeClassifier()` and find the accuracy of your prediction.
# - Complete the helper function `prediction_by_bagging()` to find the average predictions for a given number of bootstraps.
# - Perform `Bagging` using the helper function, and compute the new accuracy.
# - Plot the accuracy as a function of the number of bootstraps.
# - Use the helper code to plot the decision boundaries for varying `max_depth` values with a fixed number of bootstraps (`numboot`). Investigate the effect of increasing the number of bootstraps on the variance.
#
# ## Hints:
#
# <a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn-tree-decisiontreeclassifier" target="_blank">sklearn.tree.DecisionTreeClassifier()</a>
# A decision tree classifier.
#
# <a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.fit" target="_blank">DecisionTreeClassifier.fit()</a>
# Build a decision tree classifier from the training set (X, y).
#
# <a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.predict" target="_blank">DecisionTreeClassifier.predict()</a>
# Predict class or regression value for X.
#
# <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html" target="_blank">train_test_split()</a>
# Split arrays or matrices into random train and test subsets.
#
# <a href="https://docs.scipy.org/doc//numpy-1.10.4/reference/generated/numpy.random.choice.html" target="_blank">np.random.choice</a>
# Generates a random sample from a given 1-D array.
#
# <a href="https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.subplots.html" target="_blank">plt.subplots()</a>
# Create a figure and a set of subplots.
#
# <a href="https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.axes.Axes.plot.html" target="_blank">ax.plot()</a>
# Plot y versus x as lines and/or markers.
#
# **Note: This exercise is auto-graded and you can try multiple attempts.**
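# %% [markdown]
# The `np.random.choice` hint above is the key ingredient for bootstrapping. The short cell below is purely illustrative (not part of the graded steps, and the toy array is made up): sampling indices with replacement produces a bootstrap sample in which some rows repeat and others are left out, and sampling *indices* lets us pick the same rows from both the predictors and the response.
# %%
# Illustrative only: bootstrap resampling with np.random.choice
import numpy as np

toy = np.array([10, 20, 30, 40, 50])
# Sample indices with replacement (the default), one per original row
boot_idx = np.random.choice(np.arange(toy.shape[0]), size=toy.shape[0])
print(boot_idx)       # e.g. [3 0 3 4 1] -- duplicates are expected
print(toy[boot_idx])  # the corresponding bootstrap sample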
# %%
# Import necessary libraries
%matplotlib inline
import numpy as np
import pandas as pd
from sklearn import metrics
import scipy.optimize as opt
import scipy.stats as stats
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Used for plotting later
from matplotlib.colors import ListedColormap
cmap_bold = ListedColormap(['#F7345E','#80C3BD'])
cmap_light = ListedColormap(['#FFF4E5','#D2E3EF'])
# %%
# Read the file 'agriland.csv' as a Pandas dataframe
df = pd.read_csv('../DATA/agriland.csv')

# Take a quick look at the data
# Note that the latitude & longitude values are normalized
df.head()
# %%
# Set the values of latitude & longitude predictor variables
X = df.iloc[:,:-1].values

# Use the column "land_type" as the response variable
y = df.iloc[:,-1].values
# %%
# Split the data into train and test sets with test_size = 0.2
# and set the random state as 44
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=44)
# %%
# Define the max_depth of the decision tree
max_depth = 2

# Define a decision tree classifier with a max depth as defined above
# and set the random_state as 44
clf = DecisionTreeClassifier(max_depth=max_depth, random_state=44)

# Fit the model on the training data
clf.fit(X_train, y_train)
# %%
# Use the trained model to predict on the test set
prediction = clf.predict(X_test)

# Calculate the accuracy of the test predictions of a single tree
single_acc = accuracy_score(y_test, prediction)

# Print the accuracy of the tree
print(f'Single tree Accuracy is {single_acc*100}%')
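# %% [markdown]
# Since this exercise is about how Bagging tames the variance of a decision tree, a quick ungraded check (a minimal sketch, not part of the graded steps) is to compare the single tree's train and test accuracy; a large gap, especially at higher `max_depth`, is the overfitting that Bagging is meant to reduce.
# %%
# Ungraded check: compare train vs. test accuracy of the single tree.
# A large gap between the two indicates overfitting (high variance).
train_acc = accuracy_score(y_train, clf.predict(X_train))
print(f'Single tree train accuracy is {train_acc*100}%, test accuracy is {single_acc*100}%')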
# %%
# Complete the function below to get the prediction by bagging
# Inputs:  X_train, y_train to train your data
#          X_to_evaluate: samples that you are going to predict (evaluate)
#          num_bootstraps: how many trees you want to train
# Output:  an array of predicted classes for X_to_evaluate
def prediction_by_bagging(X_train, y_train, X_to_evaluate, num_bootstraps):
    # List to store every array of predictions
    predictions = []

    # Generate num_bootstraps number of trees
    for i in range(num_bootstraps):
        # Sample data to perform the bootstrap; here we actually bootstrap indices,
        # because we want the same subset of rows for X_train and y_train
        resample_indexes = np.random.choice(np.arange(y_train.shape[0]), size=y_train.shape[0])

        # Get a bootstrapped version of the data using the above indices
        X_boot = X_train[resample_indexes]
        y_boot = y_train[resample_indexes]

        # Initialize a Decision Tree on the bootstrapped data
        # Use the same max_depth and random_state as above
        clf = DecisionTreeClassifier(max_depth=2, random_state=44)

        # Fit the model on the bootstrapped training set
        clf.fit(X_boot, y_boot)

        # Use the trained model to predict on the X_to_evaluate samples
        pred = clf.predict(X_to_evaluate)

        # Append the predictions to the predictions list
        predictions.append(pred)

    # The list "predictions" holds [prediction_array_0, prediction_array_1, ..., prediction_array_n];
    # to get the majority vote for each sample, take the column-wise average
    # of the stacked predictions and threshold it at 0.5
    predictions_avg = np.mean(predictions, axis=0)
    average_prediction = [1 if avg > 0.5 else 0 for avg in predictions_avg]

    # Return the majority-vote prediction
    return average_prediction
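# %% [markdown]
# The majority vote in `prediction_by_bagging()` works by stacking the per-tree prediction arrays, averaging down each column, and thresholding at 0.5. The toy example below (purely illustrative, with made-up 0/1 predictions from three hypothetical trees over four samples) shows the same computation in isolation.
# %%
# Illustrative only: majority vote via a column-wise average
toy_predictions = [
    np.array([1, 0, 1, 1]),   # predictions from hypothetical tree 1
    np.array([1, 0, 0, 1]),   # predictions from hypothetical tree 2
    np.array([0, 0, 1, 1]),   # predictions from hypothetical tree 3
]
# Fraction of trees voting 1 for each sample
toy_avg = np.mean(toy_predictions, axis=0)
# Threshold at 0.5 to get the majority vote
toy_vote = [1 if avg > 0.5 else 0 for avg in toy_avg]
print(toy_avg)   # approximately [0.67, 0.0, 0.67, 1.0]
print(toy_vote)  # [1, 0, 1, 1]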
# %%
### edTest(test_bag_acc) ###
# Define the number of bootstraps
num_bootstraps = 200

# Call the prediction_by_bagging function with appropriate parameters
y_pred = prediction_by_bagging(X_train, y_train, X_test, num_bootstraps=num_bootstraps)

# Compare the average predictions to the true test set values
# and compute the accuracy
bagging_accuracy = accuracy_score(y_test, y_pred)

# Print the bagging accuracy
print(f'Accuracy with Bootstrap Aggregation is {bagging_accuracy*100}%')
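# %% [markdown]
# As an optional sanity check (not part of the graded steps), scikit-learn ships the same bootstrap-and-vote idea as `sklearn.ensemble.BaggingClassifier`. With a depth-2 tree as the base estimator and the same number of estimators, its test accuracy should land in the same ballpark as the helper function above; the base estimator is passed positionally here to stay compatible across scikit-learn versions.
# %%
# Optional sanity check (ungraded): scikit-learn's built-in bagging ensemble
from sklearn.ensemble import BaggingClassifier

bagger = BaggingClassifier(
    DecisionTreeClassifier(max_depth=2, random_state=44),
    n_estimators=num_bootstraps,
    random_state=44,
)
bagger.fit(X_train, y_train)
sklearn_bag_acc = accuracy_score(y_test, bagger.predict(X_test))
print(f'BaggingClassifier accuracy is {sklearn_bag_acc*100}%')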
# %%
# Helper code to plot accuracy vs number of bagged trees
n = np.linspace(1, 250, 250).astype(int)
acc = []
for n_i in n:
    acc.append(np.mean(prediction_by_bagging(X_train, y_train, X_test, n_i) == y_test))

plt.figure(figsize=(10, 8))
plt.plot(n, acc, alpha=0.7, linewidth=3, color='#50AEA4', label='Model Prediction')
plt.title('Accuracy vs. Number of trees in Bagging', fontsize=24)
plt.xlabel('Number of trees', fontsize=16)
plt.ylabel('Accuracy', fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(loc='best', fontsize=12)
plt.show();
# %% [markdown]
# ## Bagging Visualization
#
# Bagging does well to reduce overfitting, but only up to a certain extent.
#
# Vary the `max_depth` and `numboot` variables below to see how Bagging helps reduce overfitting, with the help of the decision-boundary visualization.
# %%
# Make plots for three different values of `max_depth`
fig, axes = plt.subplots(1, 3, figsize=(20, 6))

# Make a list of three max_depths to investigate
max_depth = [2, 5, 100]

# Fix the number of bootstraps
numboot = 100

for index, ax in enumerate(axes):
    for i in range(numboot):
        # Draw a bootstrap sample of the full dataframe
        df_new = df.sample(frac=1, replace=True)
        y = df_new.land_type.values
        X = df_new[['latitude', 'longitude']].values

        # Fit a tree of the given depth on the bootstrapped data
        dtree = DecisionTreeClassifier(max_depth=max_depth[index])
        dtree.fit(X, y)
        ax.scatter(X[:, 0], X[:, 1], c=y-1, s=50, alpha=0.5, edgecolor="k", cmap=cmap_bold)

        # Build a grid over the predictor space
        plot_step_x1 = 0.1
        plot_step_x2 = 0.1
        x1min, x1max = X[:, 0].min(), X[:, 0].max()
        x2min, x2max = X[:, 1].min(), X[:, 1].max()
        x1, x2 = np.meshgrid(np.arange(x1min, x1max, plot_step_x1), np.arange(x2min, x2max, plot_step_x2))

        # Re-cast every coordinate in the meshgrid as a 2D point
        Xplot = np.c_[x1.ravel(), x2.ravel()]

        # Predict the class and overlay a nearly transparent decision boundary
        y = dtree.predict(Xplot)
        y = y.reshape(x1.shape)
        cs = ax.contourf(x1, x2, y, alpha=0.02)

    ax.set_xlabel('Latitude', fontsize=14)
    ax.set_ylabel('Longitude', fontsize=14)
    ax.set_title(f'Max depth = {max_depth[index]}', fontsize=20)
# %% [markdown]
# ## Mindchow 🍲
# Play around with the following parameters:
#
# - max_depth
# - numboot
#
# Based on your observations, answer the questions below:
#
# - How does the plot change with varying `max_depth`?
#
# - How does the plot change with varying `numboot`?
#
# - How are the three plots essentially different?
#
# - Do more bootstraps reduce overfitting for
#     - High depth
#     - Low depth
# %%
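# %% [markdown]
# One possible way to probe the last question numerically (purely illustrative, not graded): repeat the bagged prediction a few times for a small and a large number of bootstraps and compare the spread of the resulting test accuracies; a smaller spread suggests lower variance. The repeat count and the bootstrap counts chosen below are arbitrary.
# %%
# Illustrative only: spread of the bagged test accuracy for few vs. many bootstraps
for nb in [5, 200]:
    accs = [
        accuracy_score(y_test, prediction_by_bagging(X_train, y_train, X_test, nb))
        for _ in range(5)
    ]
    print(f'num_bootstraps={nb}: mean accuracy={np.mean(accs):.3f}, std={np.std(accs):.3f}')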