Kaggle: Credit risk (Model: Decision Tree)
A commonly used model for exploring classification problems is the decision tree. A detailed explanation is given in the post "What are decision trees and CARTS?".
Decision trees are the building blocks for random forests and gradient boosted trees.
Loading in required modules
# importing all system modules
import os
import sys
import warnings
from pathlib import Path
warnings.filterwarnings('ignore')
if sys.platform == 'linux':
    sys.path.append('/home/randlow/github/blog/listings/machine-learning/') # linux
elif sys.platform == 'win32':
    sys.path.append('\\Users\\randl\\github\\blog2\\listings\\machine-learning\\') # win32
# importing data science modules
import pandas as pd
import numpy as np
import sklearn
import scipy as sp
import pickleshare
# importing graphics modules
import matplotlib.pyplot as plt
import seaborn as sns
# importing personal data science modules
import rand_eda as eda
Loading pickled dataframes
To see how the dataframes below were obtained, see the post Kaggle: Credit risk (Feature Engineering).
home = str(Path.home())
if sys.platform == 'linux':
    inputDir = "/datasets/kaggle/home-credit-default-risk" # linux
elif sys.platform == 'win32':
    # raw string so the backslashes are not read as escape sequences
    inputDir = r"\datasets\kaggle\home-credit-default-risk" # windows
storeDir = home+inputDir+'/pickleshare'
db = pickleshare.PickleShareDB(storeDir)
print(db.keys())
df_app_test_align = db['df_app_test_align']
df_app_train_align = db['df_app_train_align']
#df_app_train_align_expert = db['df_app_train_align_expert']
#df_app_test_align_expert = db['df_app_test_align_expert']
#df_app_train_poly_align = db['df_app_train_poly_align']
#df_app_test_poly_align = db['df_app_test_poly_align']
Selection of feature set for model training & testing
Assign whichever datasets you want to use to train and test. As part of feature engineering, you will often build new and different feature datasets and want to try each one to evaluate whether it improves model performance.
As the imputer is fitted on the training data and used to transform both the training and test datasets, the training data needs to have the same features as the test dataset. This means that the TARGET column must be removed from the training dataset and stored in train_labels for later use.
train = df_app_train_align.copy()
test = df_app_test_align.copy()
train_labels = train.pop('TARGET')
feat_names = list(train.columns)
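The column bookkeeping above can be illustrated on a toy example. This is a minimal sketch, not the post's data: the frames toy_train and toy_test and the feature names feat_a and feat_b are hypothetical, and only the TARGET column name comes from the competition.

```python
import pandas as pd

# Hypothetical toy frames; only the TARGET column name comes from the post.
toy_train = pd.DataFrame({
    "feat_a": [100.0, 200.0, 150.0],
    "feat_b": [500.0, 900.0, 700.0],
    "TARGET": [0, 1, 0],
})
toy_test = pd.DataFrame({
    "feat_a": [120.0, 180.0],
    "feat_b": [600.0, 800.0],
})

# pop() removes the column in place and returns it as a Series.
toy_labels = toy_train.pop("TARGET")

# After the pop, both frames expose the same feature columns in the same order.
same_columns = list(toy_train.columns) == list(toy_test.columns)
print(same_columns)
print(toy_labels.tolist())
```

Because pop works in place, calling it on a copy (as the post does with .copy()) leaves the original pickled dataframe untouched.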
Feature set preprocessing
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan,strategy='median')
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range= (0,1))
We fit the imputer and scaler on the training data only, then apply both transformations to the training and test datasets.
The imputer and scaler accept DataFrames as inputs and return NumPy arrays, so train and test are arrays by the time they are passed to scikit-learn's machine learning models. The column names are lost in the conversion, which is why feat_names was saved earlier.
imputer.fit(train)
train = imputer.transform(train)
test = imputer.transform(test)
scaler.fit(train)
train = scaler.transform(train)
test = scaler.transform(test)
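To make the fit-on-train, transform-both pattern concrete, here is a small sketch on a single made-up column (toy_train and toy_test are hypothetical). It suggests two things worth keeping in mind: missing test values are filled with the training median, and scaled test values can fall outside the 0 to 1 range because the minimum and maximum come from the training data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical one-column data with missing values in both sets.
toy_train = pd.DataFrame({"x": [1.0, np.nan, 3.0, 5.0]})
toy_test = pd.DataFrame({"x": [np.nan, 7.0]})

imp = SimpleImputer(missing_values=np.nan, strategy="median")
sc = MinMaxScaler(feature_range=(0, 1))

# Statistics are learned from the training data only:
# median of [1, 3, 5] is 3, minimum is 1, maximum is 5.
train_arr = sc.fit_transform(imp.fit_transform(toy_train))

# The test data reuses those statistics; nothing is refitted here.
test_arr = sc.transform(imp.transform(toy_test))

print(train_arr.ravel())  # training values end up inside [0, 1]
print(test_arr.ravel())   # 7.0 is above the training maximum, so it scales past 1
```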
Model implementation (Decision Tree)
from sklearn import tree
from sklearn.model_selection import train_test_split
dtree = tree.DecisionTreeClassifier(random_state=50,max_depth=5)
dtree = dtree.fit(train,train_labels)
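Before submitting, it can help to estimate performance locally on a hold-out split. The sketch below is not part of the original workflow: it uses synthetic data from make_classification as a stand-in for the Home Credit features (an assumption), and scores with ROC AUC, the metric the competition uses for its leaderboard.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (an assumption, not the real features).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           weights=[0.9, 0.1], random_state=50)

# Hold out 20% of the rows, keeping the class ratio the same in both parts.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=50)

# Same hyperparameters as the post's tree.
clf = DecisionTreeClassifier(random_state=50, max_depth=5)
clf.fit(X_fit, y_fit)

# ROC AUC is computed from the predicted probability of the positive class.
val_scores = clf.predict_proba(X_val)[:, 1]
val_auc = roc_auc_score(y_val, val_scores)
print(f"hold-out ROC AUC: {val_auc:.3f}")
```

With the real data, X and y would be replaced by train and train_labels; the hold-out score is only an estimate and may differ from the Kaggle public score.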
Drawing the decision tree
- export_graphviz exports the fitted decision tree classifier in DOT format (returned as a string when out_file=None).
- graphviz.Source().render() accepts the DOT source and renders it into the specified format.
- Image() displays the rendered file in the Jupyter notebook.
import graphviz
from IPython.display import Image
dot_data = tree.export_graphviz(dtree, out_file=None,
                                filled=True, rounded=True,
                                special_characters=True,
                                feature_names=feat_names,
                                class_names=['FALSE', 'TRUE'])
graphviz.Source(dot_data,format='png').render('dtree_render')
Image('dtree_render.png')
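If Graphviz is not installed, scikit-learn's export_text offers a plain-text view of the same split rules. The following is a sketch on synthetic data with hypothetical feature names (feat_0 to feat_3), not the tree fitted above.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Small synthetic problem (an assumption) so the printed tree stays readable.
X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=50)
names = [f"feat_{i}" for i in range(X.shape[1])]  # hypothetical feature names

small_tree = DecisionTreeClassifier(random_state=50, max_depth=2).fit(X, y)

# export_text returns the split rules as an indented string.
rules = export_text(small_tree, feature_names=names)
print(rules)
```

For the post's model, the equivalent call would pass dtree and feat_names.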
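We apply our fitted decision tree to the test dataset. predict_proba returns one probability column per class, so we keep the second column, the predicted probability that TARGET equals 1.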
dtree_pred = dtree.predict_proba(test)[:,1]
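The [:,1] indexing relies on the column order of predict_proba, which follows the classifier's classes_ attribute. A minimal sketch on made-up data shows how to check this rather than assume it:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny, perfectly separable toy data (hypothetical).
X = np.array([[0.0], [0.1], [0.9], [1.0]])
y = np.array([0, 0, 1, 1])

clf = DecisionTreeClassifier(random_state=50, max_depth=1).fit(X, y)
proba = clf.predict_proba(X)

# Columns of proba are ordered like clf.classes_.
print(clf.classes_)

# Look up the column of the positive class instead of hard-coding it.
positive_col = list(clf.classes_).index(1)
p_default = proba[:, positive_col]
print(p_default)
```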
Kaggle submission
We create the submission dataframe as per the Kaggle home-credit-default-risk competition guidelines.
submit = pd.DataFrame()
submit['SK_ID_CURR'] = df_app_test_align.index
submit['TARGET'] = dtree_pred
print(submit.head())
print(submit.shape)
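A few cheap checks before writing the file can catch a malformed submission. This sketch uses made-up IDs and probabilities (toy_ids and toy_pred are hypothetical); the two column names are the ones used above.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical applicant IDs and predicted probabilities.
toy_ids = [100001, 100005, 100013]
toy_pred = np.array([0.08, 0.35, 0.02])

toy_submit = pd.DataFrame({"SK_ID_CURR": toy_ids, "TARGET": toy_pred})

# One row per applicant, and probabilities inside [0, 1].
ids_unique = toy_submit["SK_ID_CURR"].is_unique
probs_valid = toy_submit["TARGET"].between(0, 1).all()

# Write to an in-memory buffer to inspect the header without touching disk.
buffer = io.StringIO()
toy_submit.to_csv(buffer, index=False)
header = buffer.getvalue().splitlines()[0]
print(header, ids_unique, probs_valid)
```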
Submit the CSV file to Kaggle for scoring.
submit.to_csv('decision-tree-home-loan-credit-risk.csv',index=False)
!kaggle competitions submit -c home-credit-default-risk -f decision-tree-home-loan-credit-risk.csv -m 'submitted'
We review our decision tree score from Kaggle and find a slight improvement: 0.697, compared with 0.662 for the logit model (publicScore). We will try other feature-engineered datasets and more sophisticated machine learning models in the next posts.
!kaggle competitions submissions -c home-credit-default-risk
Converting IPython notebook to Python code
This allows us to run the code in Spyder.
!jupyter nbconvert ml_kaggle-home-loan-credit-risk-model-decision-tree.ipynb --output-dir='~/github/blog2/listings/machine-learning/' --to python