makispaiktis

PREPROCESSING BASIC RULES AND CODE

Jun 23rd, 2023 (edited)
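# Imports needed by the snippets below
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder


#### 1. Read the dataset and separate target from predictors

# Read the dataset
X = pd.read_csv('../input/train.csv', index_col='Id')
# Remove rows with missing target
X.dropna(axis=0, subset=['SalePrice'], inplace=True)
# Separate target from predictors
y = X.SalePrice
X.drop(['SalePrice'], axis=1, inplace=True)

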
#### 2. Split the dataset

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)


#### 3. Separate into numerical and categorical columns

# Same split as in step 2, but the full frames keep the '_full' suffix so the selected columns can be copied out below
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]
# Low-cardinality categorical columns (fewer than 10 distinct values), suitable for OH encoding
categorical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype == "object" and X_train_full[cname].nunique() < 10]
# Every categorical column, regardless of cardinality
categorical_cols_ALL = [cname for cname in X_train_full.columns if X_train_full[cname].dtype == "object"]
print('Numerical columns:', numerical_cols, '\nCategorical columns:', categorical_cols, '\n\n')

my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()


#### 4. Cleaning - Drop cols with missing data - I create a variable with the names of these columns

# X_test is the separate test set; the paste never shows it being loaded, so it must already exist here
cols_with_missing = [col for col in X.columns if X[col].isnull().any()]
X.drop(cols_with_missing, axis=1, inplace=True)
X_test.drop(cols_with_missing, axis=1, inplace=True)


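#### Sketch (not part of the original paste): the same rule on made-up toy frames. The names toy_train, toy_valid and
#### toy_missing are hypothetical. It shows that the list of columns is built from ONE frame only, so a column with
#### gaps only in the second frame is not dropped and still needs handling (for example by imputation).

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 'b' has a gap in the first frame, 'c' only in the second
toy_train = pd.DataFrame({'a': [1, 2, 3], 'b': [1.0, np.nan, 3.0], 'c': [7.0, 8.0, 9.0]})
toy_valid = pd.DataFrame({'a': [4, 5], 'b': [4.0, 5.0], 'c': [np.nan, 1.0]})

# Names of the columns with missing values, taken from the first frame only
toy_missing = [col for col in toy_train.columns if toy_train[col].isnull().any()]

# Drop the same columns from both frames so their columns stay aligned
reduced_toy_train = toy_train.drop(toy_missing, axis=1)
reduced_toy_valid = toy_valid.drop(toy_missing, axis=1)

print('Dropped:', toy_missing)
print('Still missing in second frame:', reduced_toy_valid.isnull().any().to_dict())
```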
#### 5. Preprocessing - Drop columns with categorical data

# score_dataset(X_train, X_valid, y_train, y_valid) is a helper that returns the validation MAE; it is not defined in this paste
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])
print("MAE from Approach 1 (Drop categorical variables):")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))


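#### Sketch (not part of the original paste): score_dataset is called above but never defined. A minimal version,
#### ASSUMING it fits a RandomForestRegressor (same settings as the model in step 8) and returns the validation MAE.
#### The real helper may differ. The toy_* data is synthetic and only there to show the call.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error


def score_dataset(X_train, X_valid, y_train, y_valid):
    """Assumed behaviour: fit a random forest on the training split, return validation MAE."""
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)


# Synthetic check (hypothetical data): the target depends mostly on feature 'f1'
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({'f1': rng.normal(size=60), 'f2': rng.normal(size=60)})
toy_y = 3 * toy_X['f1'] + rng.normal(scale=0.1, size=60)

# First 40 rows for training, last 20 for validation
toy_mae = score_dataset(toy_X[:40], toy_X[40:], toy_y[:40], toy_y[40:])
print('Toy MAE:', toy_mae)
```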
#### 6. Preprocessing - Ordinal encoding ----> (object_cols = good_label_cols + bad_label_cols)
#### If some categorical variables have a natural order (like "Beginner", "Amateur", "Advanced"), I can
#### ordinal-encode them. But first, I have to check that every value appearing in the validation set also appears in the
#### training set. I do so by creating the following 2 variables: good_label_cols and bad_label_cols
#### Note: OrdinalEncoder() numbers the categories in sorted (alphabetical) order unless I pass categories=[...] myself

object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]
good_label_cols = [col for col in object_cols if set(X_valid[col]).issubset(set(X_train[col]))]
bad_label_cols = list(set(object_cols)-set(good_label_cols))
print('Categorical columns (ALL OF THEM):', object_cols, '\n')
print('Categorical columns that will be ordinal encoded:', good_label_cols, '\n')
print('\nCategorical columns that will be dropped from the dataset:', bad_label_cols, '\n')

label_X_train = X_train.drop(bad_label_cols, axis=1)
label_X_valid = X_valid.drop(bad_label_cols, axis=1)

# Fit on the training data only, then reuse the same mapping for the validation data
ordinal_encoder = OrdinalEncoder()
label_X_train[good_label_cols] = ordinal_encoder.fit_transform(label_X_train[good_label_cols])
label_X_valid[good_label_cols] = ordinal_encoder.transform(label_X_valid[good_label_cols])



#### 7. Preprocessing - One Hot encoding ----> (object_cols = low_cardinality_cols + high_cardinality_cols)
#### For features that do not have the 'order-attribute' (Company: 'Microsoft', 'Amazon', 'Google'). OH Encoding creates new columns in
#### the dataset. The number of columns created equals the number of distinct values the categorical variable takes. In the above example,
#### there are 3 companies in the 'Company' column, so OH Encoder will create 3 columns. After that, we have to delete the old
#### column that contained this information. So, the new dataset has 2 more columns than the original one. Of course, if more than
#### one column has this kind of values, we do the same thing for each of them.

object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]
object_nunique = list(map(lambda col: X_train[col].nunique(), object_cols))
# Number of distinct values per categorical column, sorted from lowest to highest
d = dict(sorted(zip(object_cols, object_nunique), key=lambda x: x[1]))
print(d)

low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]
high_cardinality_cols = list(set(object_cols)-set(low_cardinality_cols))
print('Categorical columns that will be one-hot encoded:', low_cardinality_cols)
print('\nCategorical columns that will be dropped from the dataset:', high_cardinality_cols)

# sparse_output=False needs scikit-learn >= 1.2; older versions spell it sparse=False
oh_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols_train = pd.DataFrame(oh_encoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(oh_encoder.transform(X_valid[low_cardinality_cols]))
# The encoder output loses the index, so put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Dropping every object column also removes the high-cardinality ones
number_X_train = X_train.drop(object_cols, axis=1)
number_X_valid = X_valid.drop(object_cols, axis=1)

OH_X_train = pd.concat([OH_cols_train, number_X_train], axis=1)
OH_X_valid = pd.concat([OH_cols_valid, number_X_valid], axis=1)
# The encoded columns are named 0, 1, 2, ...; recent scikit-learn versions need all column names to be strings
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)



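#### Sketch (not part of the original paste): the 'Company' example from the comment above on a made-up toy frame.
#### The toy_* names and the 'Employees' column are hypothetical. It checks the "3 new columns, 1 old column removed,
#### so 2 more columns" arithmetic. I call .toarray() on the default sparse output so it runs on old and new scikit-learn.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({'Company': ['Microsoft', 'Amazon', 'Google', 'Amazon'],
                    'Employees': [10, 20, 30, 40]})

toy_encoder = OneHotEncoder(handle_unknown='ignore')
# One 0/1 column per distinct company; categories_ holds them in sorted order
toy_oh = pd.DataFrame(toy_encoder.fit_transform(toy[['Company']]).toarray(),
                      index=toy.index,
                      columns=toy_encoder.categories_[0])

# Replace the old 'Company' column with the 3 new columns
toy_encoded = pd.concat([toy.drop('Company', axis=1), toy_oh], axis=1)
print(toy_encoded.columns.tolist())          # ['Employees', 'Amazon', 'Google', 'Microsoft']
print(toy_encoded.shape[1] - toy.shape[1])   # 2
```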
#### 8. Pipelines - Preprocessor and Model

# Numerical columns: fill missing values with a constant (0 by default)
numerical_transformer = SimpleImputer(strategy='constant')
# Categorical columns: fill missing values with the most frequent value, then OH encode
categorical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# numerical_cols and categorical_cols come from step 3
preprocessor = ColumnTransformer(transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)])
model = RandomForestRegressor(n_estimators=100, random_state=0)


clf = Pipeline(steps=[('preprocessor', preprocessor), ('model', model)])
clf.fit(X_train, y_train)
preds = clf.predict(X_valid)
print('MAE:', mean_absolute_error(y_valid, preds))



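#### Sketch (not part of the original paste): the same pipeline layout on made-up toy data, to show that the raw frames
#### (with gaps and with a category never seen in training) can go straight into fit/predict. All toy_* names, the
#### 'Area' and 'Type' columns and the numbers are hypothetical; n_estimators=10 is only there to keep it fast.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

toy_train = pd.DataFrame({'Area': [50.0, np.nan, 80.0, 120.0],
                          'Type': ['flat', 'house', np.nan, 'house']})
toy_y = [100, 150, 160, 240]
toy_valid = pd.DataFrame({'Area': [np.nan, 90.0],
                          'Type': ['villa', 'flat']})   # 'villa' is unseen in training

toy_numerical = SimpleImputer(strategy='constant')
toy_categorical = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),
                                  ('onehot', OneHotEncoder(handle_unknown='ignore'))])
toy_preprocessor = ColumnTransformer(transformers=[
    ('num', toy_numerical, ['Area']),
    ('cat', toy_categorical, ['Type'])])

toy_clf = Pipeline(steps=[('preprocessor', toy_preprocessor),
                          ('model', RandomForestRegressor(n_estimators=10, random_state=0))])
toy_clf.fit(toy_train, toy_y)
toy_preds = toy_clf.predict(toy_valid)   # one prediction per validation row
print(toy_preds)
```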
#### 9. Output to CSV file - Requires a 2nd dataset ONLY FOR TESTING, used after you have validated your model
# X_test is that 2nd dataset (indexed by 'Id'); clf is the fitted pipeline from step 8
preds_test = clf.predict(X_test)
output = pd.DataFrame({'Id': X_test.index, 'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)