Kaggle: Credit risk (Model: Decision Tree)

A commonly used model for exploring classification problems is the decision tree. A detailed explanation is given in the post on What are decision trees and CARTS?.

Decision trees are the building blocks for random forests and gradient boosted trees.
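
As a rough illustration of what a CART-style tree does at each node, the sketch below computes the Gini impurity of a set of labels and the size-weighted impurity of a candidate split. The function names gini and split_impurity and the toy labels are made up for this example; scikit-learn's own implementation is considerably more optimized.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)


# Toy labels (hypothetical): 1 = default, 0 = repaid
parent = [0, 0, 0, 0, 1, 1]
left, right = [0, 0, 0, 0], [1, 1]  # a split that separates the classes perfectly

print(gini(parent))                 # impurity before the split
print(split_impurity(left, right))  # impurity after the split
```

A tree grown with the Gini criterion (the default for DecisionTreeClassifier) prefers the candidate split with the lowest weighted impurity.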

Loading in required modules

In [2]:
# importing all system modules
import os
import sys
import warnings
from pathlib import Path
if sys.platform == 'linux':
    sys.path.append('/home/randlow/github/blog/listings/machine-learning/') # linux
elif sys.platform == 'win32':
    sys.path.append('\\Users\\randl\\github\\blog2\\listings\\machine-learning\\') # win32

# importing data science modules
import pandas as pd
import numpy as np
import sklearn
import scipy as sp
import pickleshare

# importing graphics modules
import matplotlib.pyplot as plt
import seaborn as sns

# importing personal data science modules
import rand_eda as eda

Loading pickled dataframes

To see how the dataframes below were obtained, see the post Kaggle: Credit risk (Feature Engineering).

In [3]:
home = str(Path.home())
if sys.platform == 'linux':
    inputDir = "/datasets/kaggle/home-credit-default-risk" # linux
elif sys.platform == 'win32':
    inputDir = r"\datasets\kaggle\home-credit-default-risk" # windows (raw string so the backslashes are not treated as escapes)

storeDir = home+inputDir+'/pickleshare'

db = pickleshare.PickleShareDB(storeDir)

df_app_test_align = db['df_app_test_align'] 
df_app_train_align = db['df_app_train_align'] 
#df_app_train_align_expert  = db['df_app_train_align_expert'] 
#df_app_test_align_expert = db['df_app_test_align_expert'] 
#df_app_train_poly_align = db['df_app_train_poly_align']
#df_app_test_poly_align = db['df_app_test_poly_align'] 
['df_app_test_align', 'df_app_train_align', 'df_app_train_corr_target', 'df_app_train_align_expert', 'df_app_test_align_expert', 'df_app_train_poly_align', 'df_app_test_poly_align']
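
The platform-specific branches above can probably be avoided by letting pathlib join the path components. The following is a minimal sketch of that idea, assuming the same directory layout under the home directory; the names dataset_parts and store_dir are introduced here for illustration only.

```python
from pathlib import Path

# Location of the dataset relative to the home directory (taken from the cell above)
dataset_parts = ("datasets", "kaggle", "home-credit-default-risk")

# joinpath uses the correct separator for the current platform
store_dir = Path.home().joinpath(*dataset_parts, "pickleshare")

print(store_dir)
```

str(store_dir) could then be passed to pickleshare.PickleShareDB in place of storeDir.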

Selection of feature set for model training & testing

Assign whichever datasets you want to train and test on. As part of feature engineering, you will often build several different feature datasets and want to try each one to see whether it improves model performance.

As the imputer is being fitted on the training data and used to transform both the training and test datasets, the training data needs to have the same number of features as the test dataset. This means that the TARGET column must be removed from the training dataset, and stored in train_labels for use later.

In [4]:
train = df_app_train_align.copy()
test = df_app_test_align.copy()

train_labels = train.pop('TARGET')
feat_names = list(train.columns)
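
To make the column-alignment requirement concrete, here is a small sketch on toy dataframes. The column names other than TARGET and all of the values are made up for this example.

```python
import pandas as pd

# Toy stand-ins for the aligned training and test dataframes (values are made up)
toy_train = pd.DataFrame({"AMT_INCOME": [100.0, 200.0, 150.0],
                          "AMT_CREDIT": [500.0, 800.0, 650.0],
                          "TARGET": [0, 1, 0]})
toy_test = pd.DataFrame({"AMT_INCOME": [120.0, 180.0],
                         "AMT_CREDIT": [550.0, 700.0]})

# pop() removes the column in place and returns it as a Series
toy_labels = toy_train.pop("TARGET")

# After popping TARGET the two frames should have identical columns, in the same order
assert list(toy_train.columns) == list(toy_test.columns)
print(toy_train.shape, toy_test.shape, toy_labels.shape)
```

A check like the assert above is a cheap guard to run whenever a different feature dataset is swapped in.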

Feature set preprocessing

In [5]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan,strategy='median')

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range= (0,1))

We fit the imputer and scaler on the training data only, then apply both transformations to the training and test datasets.

The imputer and scaler accept DataFrames as inputs and return NumPy arrays, so after this step train and test are arrays ready to be passed to scikit-learn's machine learning models.

In [6]:
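imputer.fit(train)  # learn the column medians from the training data only
train = imputer.transform(train)
test = imputer.transform(test)

scaler.fit(train)  # learn the column min/max from the imputed training data
train = scaler.transform(train)
test = scaler.transform(test)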
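
The fit-on-train, transform-both pattern can be checked on a toy example. This is a sketch with made-up values; the names toy_train, toy_test, train_arr and test_arr are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Toy data (made up) with one missing value in each frame
toy_train = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, 30.0]})
toy_test = pd.DataFrame({"a": [np.nan], "b": [40.0]})

imputer = SimpleImputer(missing_values=np.nan, strategy="median")
scaler = MinMaxScaler(feature_range=(0, 1))

# fit_transform learns the medians and the min/max from the training data only
train_arr = scaler.fit_transform(imputer.fit_transform(toy_train))
# the test data is transformed with the statistics learned from the training data
test_arr = scaler.transform(imputer.transform(toy_test))

print(type(train_arr).__name__)
print(train_arr)
print(test_arr)
```

Because the scaler only knows the training range, test values can fall outside [0, 1], as column b does here.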

Model implementation (Decision Tree)

In [7]:
from sklearn import tree
from sklearn.model_selection import train_test_split

dtree = tree.DecisionTreeClassifier(random_state=50,max_depth=5)
dtree = dtree.fit(train,train_labels)
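
The cell above imports train_test_split but fits on the full training set. One way it could be used is to hold out part of the training data and estimate the ROC AUC (the metric this competition is scored on) before submitting. The sketch below does this on synthetic data from make_classification rather than on the credit data, so the number it prints says nothing about the real model.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the credit data (not the real dataset)
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=50)

# Hold out 20% of the rows, keeping the class balance the same in both parts
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=50)

clf = DecisionTreeClassifier(random_state=50, max_depth=5)
clf.fit(X_tr, y_tr)

# AUC is computed from the predicted probability of the positive class
val_auc = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
print(f"validation AUC: {val_auc:.3f}")
```

A holdout estimate like this makes it possible to compare settings such as max_depth without spending Kaggle submissions.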

Drawing the decision tree

  • export_graphviz exports the fitted decision tree classifier in DOT format (as a string when out_file=None).
  • graphviz.Source().render() takes the DOT source and renders it into the specified format.
  • Image() displays the rendered image in the Jupyter notebook.

In [8]:
import graphviz
from IPython.display import Image
dot_data = tree.export_graphviz(dtree,out_file=None,
                    filled=True, rounded=True,
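
The export-and-render steps listed above can be sketched end to end on a toy dataset. This example uses the iris data and a depth-2 tree rather than the credit model, and the helper render_png is a hypothetical name introduced here; it is defined but not called, since rendering needs both the graphviz Python package and the Graphviz dot executable to be installed.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
toy_tree = DecisionTreeClassifier(random_state=50, max_depth=2)
toy_tree.fit(iris.data, iris.target)

# out_file=None makes export_graphviz return the DOT source as a string
dot_source = export_graphviz(toy_tree, out_file=None,
                             feature_names=iris.feature_names,
                             class_names=list(iris.target_names),
                             filled=True, rounded=True)


def render_png(dot_source, filename="toy_tree"):
    """Render DOT source to <filename>.png and return the path of the image."""
    import graphviz  # imported lazily so the rest of the sketch runs without it
    return graphviz.Source(dot_source, format="png").render(filename, cleanup=True)


print(dot_source[:60])
```

In a notebook, Image(filename=render_png(dot_source)) would then display the rendered tree.
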

We apply our fitted decision tree to the test dataset to predict the probability that TARGET equals 1 for each row.

In [9]:
dtree_pred = dtree.predict_proba(test)[:,1]
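
The [:,1] indexing picks one column out of the array returned by predict_proba. The sketch below, on made-up toy data, shows how the columns line up with the classifier's classes_ attribute.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data (made up): one feature, binary label
X = np.array([[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]])
y = np.array([0, 0, 0, 1, 0, 1])

clf = DecisionTreeClassifier(random_state=50, max_depth=1).fit(X, y)

proba = clf.predict_proba(X)  # shape (n_samples, n_classes)
print(clf.classes_)           # column order of proba
p_default = proba[:, 1]       # probability of the class labelled 1
print(p_default)
```

Each row of proba sums to 1, so the second column alone is enough for a binary submission.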

Kaggle submission

We create the submission dataframe as per the Kaggle home-credit-default-risk competition guidelines.

In [10]:
submit = pd.DataFrame()
submit['SK_ID_CURR'] = df_app_test_align.index
submit['TARGET'] = dtree_pred
0      100001  0.079853
1      100005  0.103296
2      100013  0.022893
3      100028  0.034714
4      100038  0.088519
(48744, 2)
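
The cell that writes the dataframe to disk is not shown in this post. A minimal sketch of that step, assuming pandas' to_csv is used, might look like the following. The toy values are copied from the output above, and the temporary directory is only there to keep the example self-contained.

```python
import os
import tempfile

import pandas as pd

# Toy submission frame; IDs and probabilities are copied from the output above
toy_submit = pd.DataFrame({"SK_ID_CURR": [100001, 100005, 100013],
                           "TARGET": [0.079853, 0.103296, 0.022893]})

# Written to a temporary directory here; the notebook would write next to itself
out_path = os.path.join(tempfile.mkdtemp(),
                        "decision-tree-home-loan-credit-risk.csv")

# index=False keeps the row index out of the file, leaving exactly two columns
toy_submit.to_csv(out_path, index=False)

# Read the file back to confirm its layout
check = pd.read_csv(out_path)
print(list(check.columns), check.shape)
```

Leaving out index=False would add an unnamed index column to the file, which does not match the two-column layout shown above.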

Submit the CSV file to Kaggle for scoring.

In [11]:
!kaggle competitions submit -c home-credit-default-risk -f decision-tree-home-loan-credit-risk.csv -m 'submitted'
100%|██████████████████████████████████████| 1.25M/1.25M [00:00<00:00, 1.43MB/s]
Successfully submitted to Home Credit Default Risk

We review our decision tree scores from Kaggle and find a modest improvement in publicScore to 0.698, compared to 0.662 for the logit model. We will try other feature-engineered datasets and more sophisticated machine learning models in the next posts.

In [12]:
!kaggle competitions submissions -c home-credit-default-risk
fileName                                 date                 description  status    publicScore  privateScore  
---------------------------------------  -------------------  -----------  --------  -----------  ------------  
decision-tree-home-loan-credit-risk.csv  2019-02-23 17:11:25  submitted    complete  0.69792      0.68776       
lightgbm-home-loan-credit-risk.csv       2019-02-19 04:16:44  submitted    complete  0.74158      0.74351       
random-forest-home-loan-credit-risk.csv  2019-02-19 04:15:01  submitted    complete  0.74158      0.74351       
random-forest-home-loan-credit-risk.csv  2019-02-11 17:31:03  submitted    complete  0.68694      0.68886       
random-forest-home-loan-credit-risk.csv  2019-02-11 05:24:55  submitted    complete  0.68694      0.68886       
random-forest-home-loan-credit-risk.csv  2019-02-11 05:10:40  submitted    complete  0.68694      0.68886       
logit-home-loan-credit-risk.csv          2019-02-11 04:52:51  submitted    complete  0.66223      0.67583       
random-forest-home-loan-credit-risk.csv  2019-02-11 04:44:50  submitted    complete  0.68694      0.68886       
logit-home-loan-credit-risk.csv          2019-02-08 04:08:33  submitted    complete  0.66223      0.67583       

Converting the IPython notebook to Python code

This allows us to run the code in Spyder.

In [13]:
!jupyter nbconvert ml_kaggle-home-loan-credit-risk-model-decision-tree.ipynb --output-dir='~/github/blog2/listings/machine-learning/' --to python