joemccray

Machine Learning with Python

Oct 31st, 2017
################################
# Machine Learning with Python #
################################
Reference:
http://machinelearningmastery.com/machine-learning-in-python-step-by-step/


---------------------------Type This-----------------------------------
sudo apt install -y python-scipy python-numpy python-matplotlib python-matplotlib-data python-pandas python-sklearn python-sklearn-pandas python-sklearn-lib python-scikits-learn
-----------------------------------------------------------------------

---------------------------Type This-----------------------------------
vi libcheck.py

-----------------------------------------------------------------------
#!/usr/bin/env python

# Check the versions of libraries

# Python version
import sys
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))
-----------------------------------------------------------------------


---------------------------Type This-----------------------------------
python

import pandas
# pandas 0.20 and later expose scatter_matrix in pandas.plotting;
# on older releases use: from pandas.tools.plotting import scatter_matrix
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC


url = "https://raw.githubusercontent.com/johnniev5/Microsoft-Malware-Classification-Challenge/master/sampleSubmission.csv"
names = ["Id", "Prediction1", "Prediction2", "Prediction3", "Prediction4", "Prediction5", "Prediction6", "Prediction7", "Prediction8", "Prediction9"]
# If the first row of the CSV is already a header, add header=0 so that row is not read as data
dataset = pandas.read_csv(url, names=names)

-----------------------------------------------------------------------


Summarize the Dataset
---------------------
We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the shape property.

---------------------------Type This-----------------------------------

>>> print(dataset.shape)

>>> print(dataset.head(20))
-----------------------------------------------------------------------

You should see the first 20 rows of the data:


Statistical Summary
-------------------

Now we can take a look at a summary of each attribute.

This includes the count, the mean, the min and max values, and some percentiles.

---------------------------Type This-----------------------------------

>>> print(dataset.describe())

-----------------------------------------------------------------------
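
To make the summary columns less opaque, here is a minimal sketch that computes the same kind of statistics by hand with the standard library. The summarize helper and the sample column are illustrative assumptions, not part of the walkthrough's dataset, and this is not how pandas implements describe().

```python
import statistics

def summarize(values):
    """Return count, mean, std, min, quartiles and max for a list of numbers."""
    ordered = sorted(values)
    # method="inclusive" interpolates linearly between the data points
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "count": len(ordered),
        "mean": statistics.mean(ordered),
        "std": statistics.stdev(ordered),  # sample standard deviation
        "min": ordered[0],
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": ordered[-1],
    }

# Hypothetical column of values, used only for illustration.
column = [1, 2, 3, 4, 5]
summary = summarize(column)
for key, value in summary.items():
    print("{:<6}{}".format(key, value))
```

Each key corresponds to one row of the table that describe() prints for a numeric column.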


Class Distribution
------------------
Let’s now take a look at the number of instances (rows) that belong to each class. We can view this as an absolute count.

---------------------------Type This-----------------------------------

>>> # 'class' must name the label column; the names list defined earlier has no such column, so substitute your own
>>> print(dataset.groupby('class').size())

-----------------------------------------------------------------------

We can see that each class has the same number of instances.
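
The per-class count is simply a tally of how often each label occurs. A minimal sketch with the standard library, assuming a made-up list of labels (the values below are hypothetical and not taken from the dataset):

```python
from collections import Counter

# Hypothetical class labels, one per row; not taken from the dataset above.
labels = ["A", "B", "A", "C", "B", "A"]

# Count how many rows belong to each class.
class_counts = Counter(labels)
for label, count in sorted(class_counts.items()):
    print(label, count)
```

Unequal counts like these indicate an imbalanced dataset, which is worth knowing before reading accuracy figures.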


Data Visualization
------------------

We now have a basic idea about the data. We need to extend that with some visualizations.

We are going to look at two types of plots:

- Univariate plots to better understand each attribute.
- Multivariate plots to better understand the relationships between attributes.


Univariate Plots
----------------

We start with some univariate plots, that is, plots of each individual variable.

Given that the input variables are numeric, we can create box and whisker plots of each.

---------------------------Type This-----------------------------------

>>> # layout must provide at least one cell per numeric column; (2,5) holds up to 10
>>> dataset.plot(kind='box', subplots=True, layout=(2,5), sharex=False, sharey=False)
>>> plt.show()

-----------------------------------------------------------------------

This gives us a much clearer idea of the distribution of the input attributes:


******************* INSERT DIAGRAM SCREENSHOT *******************


We can also create a histogram of each input variable to get an idea of the distribution.

---------------------------Type This-----------------------------------

>>> dataset.hist()
>>> plt.show()
-----------------------------------------------------------------------

It looks like perhaps two of the input variables have a Gaussian distribution. This is useful to note as we can use algorithms that can exploit this assumption.


******************* INSERT DIAGRAM SCREENSHOT *******************


Multivariate Plots
------------------
Now we can look at the interactions between the variables.

First let’s look at scatterplots of all pairs of attributes. This can be helpful to spot structured relationships between input variables.

---------------------------Type This-----------------------------------

>>> scatter_matrix(dataset)
>>> plt.show()
-----------------------------------------------------------------------

Note the diagonal grouping of some pairs of attributes. This suggests a high correlation and a predictable relationship.
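
A diagonal band in a scatterplot can be checked numerically with the Pearson correlation coefficient. The sketch below computes it by hand; the pearson helper and the sample values are illustrative assumptions rather than columns from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the sums of squared deviations.
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical attribute values: ys_up rises with xs, ys_down falls with xs.
xs = [1, 2, 3, 4, 5]
ys_up = [3, 5, 7, 9, 11]
ys_down = [10, 8, 6, 4, 2]

r_up = pearson(xs, ys_up)
r_down = pearson(xs, ys_down)
print(r_up, r_down)
```

Values close to 1 or -1 correspond to the tight diagonal groupings in the scatter matrix; values near 0 correspond to shapeless clouds.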

******************* INSERT DIAGRAM SCREENSHOT *******************


Create a Validation Dataset
---------------------------

We need to know whether the model we create is any good.

Later, we will use statistical methods to estimate the accuracy of the models that we create on unseen data. We also want a more concrete estimate of the accuracy of the best model on unseen data by evaluating it on actual unseen data.

That is, we are going to hold back some data that the algorithms will not get to see and we will use this data to get a second and independent idea of how accurate the best model might actually be.

We will split the loaded dataset into two parts: 80% that we will use to train our models and 20% that we will hold back as a validation dataset.

---------------------------Type This-----------------------------------

>>> array = dataset.values
>>> # columns 0-3 are taken as the input features and column 4 as the label; adjust both slices to your column layout
>>> X = array[:,0:4]
>>> Y = array[:,4]
>>> validation_size = 0.20
>>> seed = 7
>>> X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
-----------------------------------------------------------------------
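
The idea behind the hold-out split can be sketched in plain Python: shuffle the rows with a fixed seed, then set aside a fraction for validation. The holdout_split helper below is a hypothetical illustration; it does not reproduce the exact rows that train_test_split selects.

```python
import random

def holdout_split(rows, validation_size, seed):
    """Shuffle rows reproducibly and return (train, validation) lists."""
    shuffled = list(rows)
    # A dedicated Random instance keeps the shuffle repeatable for a given seed.
    random.Random(seed).shuffle(shuffled)
    n_validation = int(round(len(shuffled) * validation_size))
    return shuffled[n_validation:], shuffled[:n_validation]

# Ten hypothetical rows, identified here by their index only.
rows = list(range(10))
train, validation = holdout_split(rows, 0.20, seed=7)
print(len(train), len(validation))
```

Because the seed is fixed, calling the helper again with the same arguments returns the same split, which is what makes later comparisons repeatable.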


Test Harness
------------
We will use 10-fold cross validation to estimate accuracy.

This will split our dataset into 10 parts, train on 9 and test on 1 and repeat for all combinations of train-test splits.
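
As a rough illustration of that splitting scheme, the sketch below generates the train and test indices for each fold using contiguous, unshuffled folds. The kfold_indices helper is a hypothetical name introduced here, not scikit-learn's implementation.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) for each of n_splits contiguous folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop

# 20 hypothetical samples split into 10 folds: each fold tests on 2, trains on 18.
folds = list(kfold_indices(20, 10))
for train, test in folds:
    print(len(train), test)
```

Every sample appears in exactly one test fold, so each row is used for testing once and for training nine times.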

---------------------------Type This-----------------------------------

>>> seed = 7
>>> scoring = 'accuracy'
-----------------------------------------------------------------------

We are using the 'accuracy' metric to evaluate models.
This is the number of correctly predicted instances divided by the total number of instances in the dataset. scikit-learn reports it as a fraction between 0 and 1; multiply by 100 to get a percentage (e.g. 95% accurate).
We will use the scoring variable when we build and evaluate each model next.
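
The accuracy calculation itself is short enough to write out by hand. A minimal sketch, with a hypothetical accuracy helper and made-up labels:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical labels: four of the five predictions are correct.
y_true = ["a", "b", "a", "c", "b"]
y_pred = ["a", "b", "a", "c", "a"]

score = accuracy(y_true, y_pred)
print("accuracy: {:.2f} ({:.0f}%)".format(score, score * 100))
```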


Build Models
------------
We don’t know which algorithms would be good on this problem or what configurations to use. We get an idea from the plots that some of the classes are partially linearly separable in some dimensions, so we are expecting generally good results.

Let’s evaluate 6 different algorithms:

- Logistic Regression (LR)
- Linear Discriminant Analysis (LDA)
- K-Nearest Neighbors (KNN)
- Classification and Regression Trees (CART)
- Gaussian Naive Bayes (NB)
- Support Vector Machines (SVM)

This is a good mixture of simple linear (LR and LDA) and nonlinear (KNN, CART, NB and SVM) algorithms. We reset the random number seed before each run to ensure that the evaluation of each algorithm is performed using exactly the same data splits. This makes the results directly comparable.

Let’s build and evaluate our six models:

-----------------------------------------------------------------------


# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
# evaluate each model in turn
results = []
names = []
for name, model in models:
    # recent scikit-learn releases require shuffle=True whenever random_state is set
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)


-----------------------------------------------------------------------


Select Best Model
-----------------
We now have 6 models and accuracy estimations for each. We need to compare the models to each other and select the most accurate.

Running the example above, we get the following raw results:


******************* INSERT DIAGRAM SCREENSHOT *******************


We can see that it looks like KNN has the largest estimated accuracy score.

We can also create a plot of the model evaluation results and compare the spread and the mean accuracy of each model.
There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times (10-fold cross validation).


-----------------------------------------------------------------------

# Compare Algorithms
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
-----------------------------------------------------------------------

You can see that the box and whisker plots are squashed at the top of the range, with many samples achieving 100% accuracy.
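
The comparison described in this section can also be done numerically: take the mean and spread of each model's per-fold scores and pick the highest mean. The sketch below uses invented scores and placeholder model names purely to show the mechanics; they are not results from this walkthrough.

```python
import statistics

# Hypothetical per-fold accuracies (3 folds for brevity), NOT real results.
fold_scores = {
    "model_a": [0.90, 0.95, 0.85],
    "model_b": [1.00, 0.95, 0.95],
    "model_c": [0.90, 0.90, 0.95],
}

# Mean and population standard deviation of the fold scores for each model.
summary = {
    name: (statistics.mean(scores), statistics.pstdev(scores))
    for name, scores in fold_scores.items()
}

# The model with the highest mean accuracy is the candidate to keep.
best_name = max(summary, key=lambda name: summary[name][0])

for name, (mean, std) in summary.items():
    print("%s: %f (%f)" % (name, mean, std))
print("best:", best_name)
```

When two means are close, the spread is worth a look too, since a model with a slightly lower mean but much smaller deviation may be the safer choice.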


Make Predictions
----------------