Kaggle: Credit risk (Exploratory Data Analysis)

An important topic in regulatory capital modelling in banking is credit risk. Credit risk is the loss to a bank's loan portfolio when its customers default on their loans (i.e., miss their loan repayments). These loans can be home loans, credit cards, car loans, personal loans, corporate loans, etc. (i.e., mortgages, revolving lines of credit, retail loans, wholesale loans). Credit risk also extends to securitized products, and a related post covers capital modelling as applied to securitized financial products.

Typically, expected loss (i.e., credit risk) is given as follows,

$$ EL = PD \times LGD \times EAD $$

where $EL$ is Expected Loss, $PD$ is Probability of Default, $LGD$ is Loss Given Default, and $EAD$ is Exposure at Default.
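
As a toy illustration of the formula (the figures below are hypothetical and not taken from any dataset), the calculation is a simple product:

# Toy expected-loss calculation with hypothetical inputs
pd_default = 0.02   # probability of default over the horizon
lgd = 0.45          # loss given default (fraction of exposure lost)
ead = 250000        # exposure at default ($)

expected_loss = pd_default * lgd * ead
print('Expected loss: ${:,.0f}'.format(expected_loss))  # Expected loss: $2,250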

Under F-IRB (Foundation Internal Ratings-Based), banks calculate their own $PD$ risk parameter, while the other risk parameters such as $LGD$ and $EAD$ are provided by the nation's banking supervisor (i.e., APRA in Australia, the Fed/OCC in the US, the PRA in the UK), except for retail exposures. The Basel A-IRB (Advanced Internal Ratings-Based) approach allows banks to calculate all of their own risk parameters, subject to certain regulatory guidelines. More details can be found from the Bank for International Settlements (BIS).

Since mortgages are an important component of a bank's lending activity and business, we explore a mortgage dataset from Kaggle. In the Kaggle dataset, we are given information on customers of a bank and whether or not they have defaulted on their home loans. The task at hand is therefore modelling the probability of default ($PD$). As $PD$ is a core requirement of credit risk modelling, this is an excellent introduction to credit risk.

Machine Learning applications are made up of the following steps:

  1. Exploratory Data Analysis (EDA)
  2. Feature Engineering
  3. Machine Learning Algorithm Selection
  4. Hyper-parameter tuning

EDA is an exercise where we explore the given dataset and perform several tasks (a minimal pandas sketch of some of these checks follows the list):

  • Understand your training and test datasets
    • How big are your training and test datasets? How many datatypes are there in your datasets?
    • Is there missing data in your datasets? How many features have more than 50% of their data missing?
    • Do you have categorical features? How many unique categories are there in each categorical feature? Do you need to perform label encoding or one-hot encoding?
    • Do you have erroneous/outlier data in your features? Do you need to replace these outlier/erroneous points with NaN?
    • What do your two population distributions (i.e., TARGET=0 and TARGET=1) look like for each feature? For each feature, are the populations' KDEs/histograms graphically/statistically different?
    • Which features exhibit the highest +ve and -ve correlations with your TARGET data?
    • For features that exhibit the highest +ve/-ve correlations with the TARGET data, do they exhibit correlations with each other?
    • Are there any interesting features that should be explored further as to how they impact the TARGET data?
    • Do your training and test datasets need to be aligned?
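
As promised above, here is a minimal sketch of how a few of these checks translate into pandas. It is only an illustration (quick_overview is a hypothetical helper, not part of my eda module), assuming the training and test dataframes are already loaded and the label column is named TARGET:

import pandas as pd

def quick_overview(df_train, df_test):
    # Size and datatype mix of each dataset
    print('Shapes (train, test):', df_train.shape, df_test.shape)
    print('Dtypes:\n', df_train.dtypes.value_counts())
    # Missing data: how many features have more than 50% missing?
    miss_pct = df_train.isnull().mean() * 100
    print('Columns with >50% missing:', (miss_pct > 50).sum())
    # Categorical features and their number of unique categories
    cat_cols = df_train.select_dtypes('object').columns
    print('Categorical columns:', {col: df_train[col].nunique() for col in cat_cols})
    # Is the dataset balanced?
    print('Class balance (%):\n', df_train['TARGET'].value_counts(normalize=True) * 100)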

I make use of my own exploratory data analysis module that can be found in the /listings/machine-learning/ section.

Importing required modules

In [3]:
# importing file system modules
import os
import sys
import warnings
warnings.filterwarnings('ignore')
if sys.platform == 'linux':
    sys.path.append('/home/randlow/github/blog2/listings/machine-learning/') # linux
elif sys.platform == 'win32':
    sys.path.append('\\Users\\randl\\github\\blog2\\listings\\machine-learning\\') # win32

# importing data science modules
import pandas as pd
import numpy as np
import sklearn
from sklearn import preprocessing
import scipy as sp
import pickleshare

# importing graphics modules
import matplotlib.pyplot as plt
import seaborn as sns

# importing personal data science modules
import rand_eda as eda

Reading in training and test datasets

We read in the datasets, changing the directories depending on whether I am using my Windows or ChromeOS machine. I create an additional copy of the training and test datasets, as I will be making many adjustments to the dataframes and would like to compare them with the originals.

In [4]:
from pathlib import Path
home = str(Path.home())
if sys.platform == 'linux':
    inputDir = "/datasets/kaggle/home-credit-default-risk/" # linux
elif sys.platform == 'win32':
    inputDir = "\\datasets\\kaggle\\home-credit-default-risk\\" # windows

fullDir = home+inputDir

df_app_train = pd.read_csv(fullDir+'application_train.csv',index_col=0)
df_app_test = pd.read_csv(fullDir+'application_test.csv',index_col=0)

df_app_train_origin = df_app_train.copy()
df_app_test_origin = df_app_test.copy()

Comparing training and test datasets

We compare the training and test datasets, where column 0 is the training dataset and column 1 is the test dataset. We see that the training dataset is unbalanced and is as large as 570MB with 121 columns, whereas the test dataset is 90MB with 120 columns as it does not include the TARGET column. The command also prints out the categorical features in both datasets.

Also returned is the missing-value information for each dataset. We see that 67 columns in the training dataset have missing values, which is about 55% of the total number of features available. 41 columns have a critical number of missing values, meaning that more than 50% of the rows in those features are missing.
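
The counts above come from my eda helper, but the underlying calculation is straightforward. A minimal sketch of how such a missing-value summary can be computed with pandas (an illustration of the idea, not the helper's actual code):

# Per-column missing-value counts and percentages for the training dataset
miss_count = df_app_train.isnull().sum()
miss_pct = (df_app_train.isnull().mean() * 100).round(1)

miss_table = pd.DataFrame({'Missing Values': miss_count, 'Percentage': miss_pct})
miss_table = miss_table[miss_table['Missing Values'] > 0].sort_values('Percentage', ascending=False)

print('Cols with missing values:', len(miss_table))
print('Cols with >50% missing (critical):', (miss_table['Percentage'] > 50).sum())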

In [3]:
comb_basic_info,comb_miss_val,comb_miss_val_app_train,comb_miss_val_app_test = eda.print_compare_df(df_app_train,df_app_test)

print('\nCombined basic info:\n{}'.format(comb_basic_info))
print('\nCombined missing info:\n{}'.format(comb_miss_val))
Combined basic info:
                                                                  0  \
Num rows                                                     307511   
Num cols                                                        121   
Dtype                          float64(65), int64(40), object(16),    
Memory (MB)                                                     570   
True (%)                                                          8   
Is Balanced                                                      No   
Categorical cols  NAME_CONTRACT_TYPE(2), CODE_GENDER(3), FLAG_OW...   

                                                                  1  
Num rows                                                      48744  
Num cols                                                        120  
Dtype                          float64(65), int64(39), object(16),   
Memory (MB)                                                      90  
True (%)                                                        N/A  
Is Balanced                                                     N/A  
Categorical cols  NAME_CONTRACT_TYPE(2), CODE_GENDER(2), FLAG_OW...  

Combined missing info:
                                   0   1
Cols Missing Values               67  64
Cols Missing Values (%)           55  53
Cols Critical Missing Values      41  29
Cols Critical Missing Values (%)  34  24

We compare the feature columns that differ between the test and training datasets, because there may be categories that exist in the training dataset but not in the test dataset. We can see that for the CODE_GENDER column the training dataset (i.e., DF1) has 4 entries denoted as XNA that are non-existent in the test dataset. Likewise, in the NAME_INCOME_TYPE feature, Maternity leave occurs 5 times in the training dataset but does not exist in the test dataset. The function lists all categorical feature columns whose categories differ between the training and test datasets.

The bar charts provide a graphical representation of the same data, and make it easier to see the frequency of each categorical observation in each feature for the training and test datasets.

This information is useful as it impacts how encoding is performed on the training and test datasets. You may also use it to decide to remove rows from the training dataset whose categories have no occurrences in the test dataset and only a few occurrences in the training dataset.
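
To illustrate the kind of comparison the helper performs, here is a minimal sketch for a single categorical column (CODE_GENDER); categories that are missing from one dataset show up as NaN counts:

# Compare category frequencies for one column between the training and test datasets
col = 'CODE_GENDER'
freq = pd.concat([df_app_train[col].value_counts(),
                  df_app_test[col].value_counts()],
                 axis=1, keys=['train', 'test'])
print(freq)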

In [4]:
eda.print_compare_df(df_app_train,df_app_test,printCompareData=False)
Out[4]:
(                                                                  0  \
 Num rows                                                     307511   
 Num cols                                                        121   
 Dtype                          float64(65), int64(40), object(16),    
 Memory (MB)                                                     570   
 True (%)                                                          8   
 Is Balanced                                                      No   
 Categorical cols  NAME_CONTRACT_TYPE(2), CODE_GENDER(3), FLAG_OW...   
 
                                                                   1  
 Num rows                                                      48744  
 Num cols                                                        120  
 Dtype                          float64(65), int64(39), object(16),   
 Memory (MB)                                                      90  
 True (%)                                                        N/A  
 Is Balanced                                                     N/A  
 Categorical cols  NAME_CONTRACT_TYPE(2), CODE_GENDER(2), FLAG_OW...  ,
                                    0   1
 Cols Missing Values               67  64
 Cols Missing Values (%)           55  53
 Cols Critical Missing Values      41  29
 Cols Critical Missing Values (%)  34  24,
                               Missing Values  Percentage
 COMMONAREA_MEDI                       214865        69.9
 COMMONAREA_AVG                        214865        69.9
 COMMONAREA_MODE                       214865        69.9
 NONLIVINGAPARTMENTS_MEDI              213514        69.4
 NONLIVINGAPARTMENTS_MODE              213514        69.4
 NONLIVINGAPARTMENTS_AVG               213514        69.4
 LIVINGAPARTMENTS_MODE                 210199        68.4
 LIVINGAPARTMENTS_MEDI                 210199        68.4
 LIVINGAPARTMENTS_AVG                  210199        68.4
 FONDKAPREMONT_MODE                    210295        68.4
 FLOORSMIN_MODE                        208642        67.8
 FLOORSMIN_MEDI                        208642        67.8
 FLOORSMIN_AVG                         208642        67.8
 YEARS_BUILD_MODE                      204488        66.5
 YEARS_BUILD_MEDI                      204488        66.5
 YEARS_BUILD_AVG                       204488        66.5
 OWN_CAR_AGE                           202929        66.0
 LANDAREA_AVG                          182590        59.4
 LANDAREA_MEDI                         182590        59.4
 LANDAREA_MODE                         182590        59.4
 BASEMENTAREA_MEDI                     179943        58.5
 BASEMENTAREA_AVG                      179943        58.5
 BASEMENTAREA_MODE                     179943        58.5
 EXT_SOURCE_1                          173378        56.4
 NONLIVINGAREA_MEDI                    169682        55.2
 NONLIVINGAREA_MODE                    169682        55.2
 NONLIVINGAREA_AVG                     169682        55.2
 ELEVATORS_MEDI                        163891        53.3
 ELEVATORS_MODE                        163891        53.3
 ELEVATORS_AVG                         163891        53.3
 ...                                      ...         ...
 HOUSETYPE_MODE                        154297        50.2
 LIVINGAREA_MEDI                       154350        50.2
 LIVINGAREA_MODE                       154350        50.2
 LIVINGAREA_AVG                        154350        50.2
 FLOORSMAX_MEDI                        153020        49.8
 FLOORSMAX_AVG                         153020        49.8
 FLOORSMAX_MODE                        153020        49.8
 YEARS_BEGINEXPLUATATION_AVG           150007        48.8
 YEARS_BEGINEXPLUATATION_MEDI          150007        48.8
 YEARS_BEGINEXPLUATATION_MODE          150007        48.8
 TOTALAREA_MODE                        148431        48.3
 EMERGENCYSTATE_MODE                   145755        47.4
 OCCUPATION_TYPE                        96391        31.3
 EXT_SOURCE_3                           60965        19.8
 AMT_REQ_CREDIT_BUREAU_WEEK             41519        13.5
 AMT_REQ_CREDIT_BUREAU_DAY              41519        13.5
 AMT_REQ_CREDIT_BUREAU_MON              41519        13.5
 AMT_REQ_CREDIT_BUREAU_QRT              41519        13.5
 AMT_REQ_CREDIT_BUREAU_HOUR             41519        13.5
 AMT_REQ_CREDIT_BUREAU_YEAR             41519        13.5
 NAME_TYPE_SUITE                         1292         0.4
 DEF_30_CNT_SOCIAL_CIRCLE                1021         0.3
 OBS_60_CNT_SOCIAL_CIRCLE                1021         0.3
 DEF_60_CNT_SOCIAL_CIRCLE                1021         0.3
 OBS_30_CNT_SOCIAL_CIRCLE                1021         0.3
 EXT_SOURCE_2                             660         0.2
 AMT_GOODS_PRICE                          278         0.1
 DAYS_LAST_PHONE_CHANGE                     1         0.0
 CNT_FAM_MEMBERS                            2         0.0
 AMT_ANNUITY                               12         0.0
 
 [67 rows x 2 columns],
                               Missing Values  Percentage
 COMMONAREA_MODE                        33495        68.7
 COMMONAREA_MEDI                        33495        68.7
 COMMONAREA_AVG                         33495        68.7
 NONLIVINGAPARTMENTS_MEDI               33347        68.4
 NONLIVINGAPARTMENTS_AVG                33347        68.4
 NONLIVINGAPARTMENTS_MODE               33347        68.4
 FONDKAPREMONT_MODE                     32797        67.3
 LIVINGAPARTMENTS_MODE                  32780        67.2
 LIVINGAPARTMENTS_MEDI                  32780        67.2
 LIVINGAPARTMENTS_AVG                   32780        67.2
 FLOORSMIN_MEDI                         32466        66.6
 FLOORSMIN_MODE                         32466        66.6
 FLOORSMIN_AVG                          32466        66.6
 OWN_CAR_AGE                            32312        66.3
 YEARS_BUILD_AVG                        31818        65.3
 YEARS_BUILD_MEDI                       31818        65.3
 YEARS_BUILD_MODE                       31818        65.3
 LANDAREA_MODE                          28254        58.0
 LANDAREA_AVG                           28254        58.0
 LANDAREA_MEDI                          28254        58.0
 BASEMENTAREA_MEDI                      27641        56.7
 BASEMENTAREA_AVG                       27641        56.7
 BASEMENTAREA_MODE                      27641        56.7
 NONLIVINGAREA_MEDI                     26084        53.5
 NONLIVINGAREA_MODE                     26084        53.5
 NONLIVINGAREA_AVG                      26084        53.5
 ELEVATORS_MEDI                         25189        51.7
 ELEVATORS_MODE                         25189        51.7
 ELEVATORS_AVG                          25189        51.7
 APARTMENTS_MODE                        23887        49.0
 ...                                      ...         ...
 ENTRANCES_MODE                         23579        48.4
 ENTRANCES_MEDI                         23579        48.4
 ENTRANCES_AVG                          23579        48.4
 LIVINGAREA_AVG                         23552        48.3
 LIVINGAREA_MEDI                        23552        48.3
 LIVINGAREA_MODE                        23552        48.3
 FLOORSMAX_MEDI                         23321        47.8
 FLOORSMAX_MODE                         23321        47.8
 FLOORSMAX_AVG                          23321        47.8
 YEARS_BEGINEXPLUATATION_MEDI           22856        46.9
 YEARS_BEGINEXPLUATATION_MODE           22856        46.9
 YEARS_BEGINEXPLUATATION_AVG            22856        46.9
 TOTALAREA_MODE                         22624        46.4
 EMERGENCYSTATE_MODE                    22209        45.6
 EXT_SOURCE_1                           20532        42.1
 OCCUPATION_TYPE                        15605        32.0
 EXT_SOURCE_3                            8668        17.8
 AMT_REQ_CREDIT_BUREAU_MON               6049        12.4
 AMT_REQ_CREDIT_BUREAU_HOUR              6049        12.4
 AMT_REQ_CREDIT_BUREAU_QRT               6049        12.4
 AMT_REQ_CREDIT_BUREAU_WEEK              6049        12.4
 AMT_REQ_CREDIT_BUREAU_DAY               6049        12.4
 AMT_REQ_CREDIT_BUREAU_YEAR              6049        12.4
 NAME_TYPE_SUITE                          911         1.9
 DEF_60_CNT_SOCIAL_CIRCLE                  29         0.1
 OBS_60_CNT_SOCIAL_CIRCLE                  29         0.1
 DEF_30_CNT_SOCIAL_CIRCLE                  29         0.1
 OBS_30_CNT_SOCIAL_CIRCLE                  29         0.1
 EXT_SOURCE_2                               8         0.0
 AMT_ANNUITY                               24         0.0
 
 [64 rows x 2 columns])

Encoding categorical variables

Label encoding

Label encoding is used when we have a 2-state categorical variable (i.e., Y/N, M/F), replacing the categories with 1s and 0s (i.e., Y=1/N=0, M=1/F=0). One-hot encoding creates a separate feature column for each category (i.e., Y, N, M, F); where an observation has category Y, the Y feature column is 1 and the N feature column is 0.

We perform label encoding for 2-state categorical variables to reduce the size of the feature set, and one-hot encoding for categories with N states for accuracy.
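
The label_encoding_df function used below is from my own eda module. A minimal sketch of the underlying idea, using scikit-learn's LabelEncoder on every object column with at most two categories (an illustration only, not the module's actual implementation):

from sklearn.preprocessing import LabelEncoder

def label_encode_2state(df):
    le = LabelEncoder()
    encoded_cols = []
    for col in df.select_dtypes('object').columns:
        if df[col].nunique(dropna=False) <= 2:        # only 2-state categorical columns
            df[col] = le.fit_transform(df[col].astype(str))
            encoded_cols.append(col)
    return df, encoded_cols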

We see below that for the training (test) dataset, 3 (4) columns were 2-state categorical variables and thus label encoded. The difference arises because, as found in the earlier comparison of the training and test datasets, CODE_GENDER in the training dataset has an additional category labeled XNA that does not exist in the test dataset, so it is not a 2-state variable there.

In [5]:
df_app_train, label_encode_list_train = eda.label_encoding_df(df_app_train)
df_app_test, label_encode_list_test = eda.label_encoding_df(df_app_test) # label-encoding

print('label_encode columns (Training) {}'.format(label_encode_list_train))
print('label_encode columns (Testing) {}'.format(label_encode_list_test))
NAME_CONTRACT_TYPE
FLAG_OWN_CAR
FLAG_OWN_REALTY
3 columns were label encoded
NAME_CONTRACT_TYPE
CODE_GENDER
FLAG_OWN_CAR
FLAG_OWN_REALTY
4 columns were label encoded
label_encode columns (Training) ['NAME_CONTRACT_TYPE', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY']
label_encode columns (Testing) ['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY']

We analyze the NAME_CONTRACT_TYPE feature before and after label encoding. First, in df_app_train_origin we see below that NAME_CONTRACT_TYPE contains Cash loans and Revolving loans. Thus the NAME_CONTRACT_TYPE feature is a 2-state categorical variable.

In the df_app_train that has gone through label encoding, we see that all instances of Cash loans are now 0, and Revolving loans are now 1.

In [6]:
df_app_train_origin.head()
Out[6]:
TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
SK_ID_CURR
100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 351000.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0
100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 1129500.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 135000.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 297000.0 ... 0 0 0 0 NaN NaN NaN NaN NaN NaN
100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 513000.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 121 columns

In [7]:
df_app_train.head()
Out[7]:
TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
SK_ID_CURR
100002 1 0 M 0 1 0 202500.0 406597.5 24700.5 351000.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0
100003 0 0 F 0 0 0 270000.0 1293502.5 35698.5 1129500.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
100004 0 1 M 1 1 0 67500.0 135000.0 6750.0 135000.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
100006 0 0 F 0 1 0 135000.0 312682.5 29686.5 297000.0 ... 0 0 0 0 NaN NaN NaN NaN NaN NaN
100007 0 0 M 0 1 0 121500.0 513000.0 21865.5 513000.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 121 columns

One-hot encoding

One-hot encoding is performed by the get_dummies function in Pandas.

In [8]:
df_app_train = pd.get_dummies(df_app_train)
df_app_test = pd.get_dummies(df_app_test)

We can see that the number of feature columns has gone up from 121 to 242. This is because one-hot encoding creates an additional feature column for every entry in a categorical column. Thus label-encoding helps to reduce the number of feature variables for 2-state categorical variables.
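
As a toy illustration of why the column count grows (hypothetical values for a single categorical column), get_dummies expands one column into one column per category:

# One-hot encoding a toy 3-category column produces 3 feature columns
toy = pd.DataFrame({'NAME_TYPE_SUITE': ['Family', 'Unaccompanied', 'Family', 'Children']})
print(pd.get_dummies(toy))
# Columns: NAME_TYPE_SUITE_Children, NAME_TYPE_SUITE_Family, NAME_TYPE_SUITE_Unaccompanied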

We also see that the training and test datasets have different numbers of feature columns.

In [9]:
print('Training dataset shape: ',df_app_train.shape)
print('Testing dataset shape: ',df_app_test.shape)
Training dataset shape:  (307511, 242)
Testing dataset shape:  (48744, 237)

We would like to see what the feature differences are between the training and test datasets, so we create a set of the column names for each. A set is a datatype that can only accept unique entries. We find that the additional features in the training dataset relative to the test dataset are NAME_INCOME_TYPE_Maternity leave, NAME_FAMILY_STATUS_Unknown, and the CODE_GENDER_M/F/XNA columns (plus TARGET, which only exists in the training dataset).

In [10]:
app_train_col = set(df_app_train.columns)
app_test_col = set(df_app_test.columns)

exc_col = app_train_col-app_test_col
print(exc_col)
{'NAME_INCOME_TYPE_Maternity leave', 'TARGET', 'NAME_FAMILY_STATUS_Unknown', 'CODE_GENDER_F', 'CODE_GENDER_M', 'CODE_GENDER_XNA'}

Aligning training & test datasets

We need to align the training and test datasets so that they have the same set of features; otherwise, a machine learning model trained on one dataset cannot be applied to the other.

Thus, we first store the TARGET column in a separate variable, since we need to keep it in the training dataset. The align function performs an INNER join on axis=1 (i.e., the columns), so it only retains features that exist in both datasets. We then add the TARGET column back to the training dataset.
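
As a toy illustration of the align semantics (hypothetical two-row dataframes, not the Kaggle data), an inner join on axis=1 keeps only the columns shared by both:

# Toy example: align drops any column that is not present in both dataframes
a = pd.DataFrame([[1, 2, 0]], columns=['A', 'B', 'TARGET'])
b = pd.DataFrame([[3, 4]], columns=['A', 'C'])
a2, b2 = a.align(b, join='inner', axis=1)
print(list(a2.columns), list(b2.columns))   # ['A'] ['A']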

In [11]:
train_target_labels = df_app_train['TARGET']
df_app_train_align, df_app_test_align = df_app_train.align(df_app_test,join='inner',axis=1)
df_app_train_align['TARGET'] = train_target_labels

We see that the training dataset now has the same number of features as the test dataset plus one (i.e., TARGET).

In [12]:
print('Training dataset shape: ',df_app_train_align.shape)
print('Testing dataset shape: ',df_app_test_align.shape)
Training dataset shape:  (307511, 237)
Testing dataset shape:  (48744, 236)

We can evaluate the difference between the features of the aligned test and training datasets compared to before alignment. We can see that, to maintain consistency between the test and training datasets, all columns associated with CODE_GENDER_M/F/XNA, NAME_INCOME_TYPE_Maternity leave, and NAME_FAMILY_STATUS_Unknown are removed.

In [13]:
app_train_align_col = set(df_app_train_align.columns)
app_test_align_col = set(df_app_test_align.columns)

diff_align_train = app_train_col - app_train_align_col
diff_align_test = app_test_col - app_test_align_col

print('\nDifference in aligned datasets (Training):\n{}'.format(diff_align_train))
print('\nDifference in aligned datasets (Test):\n{}'.format(diff_align_test))
Difference in aligned datasets (Training):
{'NAME_INCOME_TYPE_Maternity leave', 'NAME_FAMILY_STATUS_Unknown', 'CODE_GENDER_F', 'CODE_GENDER_M', 'CODE_GENDER_XNA'}

Difference in aligned datasets (Test):
{'CODE_GENDER'}

Analyze outliers, missing & erroneous data

An important task is to evaluate the quality of the feature dataset for outliers, missing values, or erroneous data.

The function plot_kde_hist_var performs the following:

  • Splits the dataset into two populations of TARGET==0 and TARGET==1 (i.e., 'No default', and 'Default')
  • Produces the histogram of each population
  • Produces the KDE of each population
  • Produces the rugplot of each population.

The histograms and KDEs show us the shape of each population's distribution for the feature. If the population distributions look very different, the feature might be key in distinguishing the two populations (i.e., TARGET=1/0). The rug plots can be less useful in this respect as there may be too many datapoints, but they are generally useful univariate graphical representations of a distribution. The most detailed approach to evaluating the cleanliness of your data is to produce these plots for every feature and examine each one.
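
plot_kde_hist_var is part of my own eda module; a rough sketch of the kind of per-feature plot it produces (an illustration only, assuming a numeric feature and the binary TARGET column) is:

import matplotlib.pyplot as plt
import seaborn as sns

def plot_feature_by_target(df, feature):
    # Overlay the distributions of one feature for the two TARGET populations
    fig, ax = plt.subplots()
    sns.kdeplot(df.loc[df['TARGET'] == 0, feature].dropna(), label='No default (TARGET=0)', ax=ax)
    sns.kdeplot(df.loc[df['TARGET'] == 1, feature].dropna(), label='Default (TARGET=1)', ax=ax)
    ax.set_xlabel(feature)
    ax.set_ylabel('Density')
    ax.legend()
    plt.show()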

As an example, we examine the first 10 features of the training dataset.

In [14]:
varList = df_app_train_align.columns[:10].values
hVal_list,pVal_list= eda.plot_kde_hist_var(df_app_train_align,varList)

We can examine outlier/erroneous values by looking at the populations. For example, we can see that DAYS_BIRTH varies smoothly across the entire distribution with no obvious outliers. We can also say that those who default are generally younger, as the Default population has relatively higher frequencies at smaller ages (i.e., DAYS_BIRTH values closer to zero).

For DAYS_EMPLOYED, we can tell this feature exhibits erroneous data, as there is a sudden spike in occurrences at around 365,000 days, so we should investigate this further.

In [15]:
varList = ['DAYS_BIRTH','DAYS_EMPLOYED']
hVal_list,pVal_list= eda.plot_kde_hist_var(df_app_train_align,varList,drawAll=True)

Analyzing the DAYS_EMPLOYED feature for errors

We can investigate DAYS_EMPLOYED in greater detail and find that it has a maximum value of 365243, which does not make sense as this corresponds to about 1,000 years (i.e., no one can be employed for 1,000 years).

In [16]:
df_app_train_align['DAYS_EMPLOYED'].describe()
Out[16]:
count    307511.000000
mean      63815.045904
std      141275.766519
min      -17912.000000
25%       -2760.000000
50%       -1213.000000
75%        -289.000000
max      365243.000000
Name: DAYS_EMPLOYED, dtype: float64

How often does this erroneous value occur? 55374 times in the DAYS_EMPLOYED column.

In [17]:
maxVal = max(df_app_train_align['DAYS_EMPLOYED'])
numErrorVal = df_app_train_align[df_app_train_align['DAYS_EMPLOYED']==maxVal].shape[0]
print(numErrorVal)
55374

We need to analyze whether the erroneous value of 365243 occurs with low or high frequency in the datasets, and whether it occurs in any other columns besides DAYS_EMPLOYED. We find it only appears in DAYS_EMPLOYED and constitutes 18% of that feature in the training dataset, which is quite high.

Each column that exhibits the erroneous value is listed in errCol_list_train; here it is only the DAYS_EMPLOYED feature.
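
chk_val_col is also from my own eda module; the idea can be sketched as follows (an approximation of what it reports, not its actual implementation):

def check_value_in_columns(df, value):
    # For a suspect value, report which columns contain it and how often (in %)
    hits = (df == value).sum()
    hits = hits[hits > 0]
    report = pd.DataFrame({'Error val %': hits / len(df) * 100})
    return report, list(hits.index)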

In [18]:
df_col_errVal_train, errCol_list_train = eda.chk_val_col(df_app_train_align,maxVal)
df_col_errVal_test, errCol_list_test = eda.chk_val_col(df_app_test_align,maxVal)


print('\nTraining dataset:\n{}'.format(df_col_errVal_train))
print('\nTesting dataset:\n{}'.format(df_col_errVal_test))

print('\nColumns containing error values (Training):\n{}'.format(errCol_list_train))
print('\nColumns containing error values (Test):\n{}'.format(errCol_list_test))

eda.plot_kde_hist_var(df_app_train_align,errCol_list_train)
Training dataset:
               Error val %
DAYS_EMPLOYED    18.007161

Testing dataset:
               Error val %
DAYS_EMPLOYED    19.025931

Columns containing error values (Training):
['DAYS_EMPLOYED']

Columns containing error values (Test):
['DAYS_EMPLOYED']
Out[18]:
([True], [2.9178193590053594e-149])

Now that we've identified how often, and in which columns, the erroneous value occurs, we need to replace it with NaN.
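
fill_errorVal_df is part of my eda module; in essence, the replacement can be done with pandas replace (a sketch of the idea, not the helper's exact code):

import numpy as np

# Replace the erroneous sentinel value with NaN in each affected column
for col in errCol_list_train:
    df_app_train_align[col] = df_app_train_align[col].replace(maxVal, np.nan)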

In [19]:
df_app_train_align = eda.fill_errorVal_df(df_app_train_align,errCol_list_train,maxVal)
df_app_test_align = eda.fill_errorVal_df(df_app_test_align,errCol_list_test,maxVal)

Now that we've replaced the erroneous value with NaN, the resulting histogram of DAYS_EMPLOYED makes more sense.

In [20]:
eda.plot_kde_hist_var(df_app_train_align,errCol_list_train,drawAll=False)
Out[20]:
([True], [2.9178193590053594e-149])

Correlation analysis

The simplest way to evaluate which features are most relevant to TARGET is correlation. We obtain a correlation matrix of the training dataset and sort it to see the features that have the highest positive and negative correlations with TARGET.

For +ve correlation, DAYS_BIRTH, DAYS_EMPLOYED, and REGION_RATING_CLIENT_x are highest, but the magnitudes are low since the correlations are only around 0.06-0.08.

For -ve correlation, we have stronger results, as the EXT_SOURCE_x features are between -0.15 and -0.18.

In [21]:
df_app_train_corr = df_app_train_align.corr()
df_app_train_corr_target = df_app_train_corr['TARGET'].sort_values()
print('+ve corr: \n{0}'.format(df_app_train_corr_target.tail(20)))
print('-ve corr: \n{0}'.format(df_app_train_corr_target.head(20)))
+ve corr: 
OCCUPATION_TYPE_Drivers                              0.030303
DEF_60_CNT_SOCIAL_CIRCLE                             0.031276
DEF_30_CNT_SOCIAL_CIRCLE                             0.032248
LIVE_CITY_NOT_WORK_CITY                              0.032518
OWN_CAR_AGE                                          0.037612
DAYS_REGISTRATION                                    0.041975
OCCUPATION_TYPE_Laborers                             0.043019
FLAG_DOCUMENT_3                                      0.044346
REG_CITY_NOT_LIVE_CITY                               0.044395
FLAG_EMP_PHONE                                       0.045982
NAME_EDUCATION_TYPE_Secondary / secondary special    0.049824
REG_CITY_NOT_WORK_CITY                               0.050994
DAYS_ID_PUBLISH                                      0.051457
DAYS_LAST_PHONE_CHANGE                               0.055218
NAME_INCOME_TYPE_Working                             0.057481
REGION_RATING_CLIENT                                 0.058899
REGION_RATING_CLIENT_W_CITY                          0.060893
DAYS_EMPLOYED                                        0.074958
DAYS_BIRTH                                           0.078239
TARGET                                               1.000000
Name: TARGET, dtype: float64
-ve corr: 
EXT_SOURCE_3                           -0.178919
EXT_SOURCE_2                           -0.160472
EXT_SOURCE_1                           -0.155317
NAME_EDUCATION_TYPE_Higher education   -0.056593
NAME_INCOME_TYPE_Pensioner             -0.046209
ORGANIZATION_TYPE_XNA                  -0.045987
FLOORSMAX_AVG                          -0.044003
FLOORSMAX_MEDI                         -0.043768
FLOORSMAX_MODE                         -0.043226
EMERGENCYSTATE_MODE_No                 -0.042201
HOUSETYPE_MODE_block of flats          -0.040594
AMT_GOODS_PRICE                        -0.039645
REGION_POPULATION_RELATIVE             -0.037227
ELEVATORS_AVG                          -0.034199
ELEVATORS_MEDI                         -0.033863
FLOORSMIN_AVG                          -0.033614
FLOORSMIN_MEDI                         -0.033394
WALLSMATERIAL_MODE_Panel               -0.033119
LIVINGAREA_AVG                         -0.032997
LIVINGAREA_MEDI                        -0.032739
Name: TARGET, dtype: float64

Population distribution graphical analysis

We extract the lists of the most negatively and positively correlated features with TARGET (excluding TARGET itself) and plot the KDEs/histograms to see how the shapes of the TARGET=1 and TARGET=0 population distributions differ for these more salient features.

In [22]:
var_neg_corr_list = df_app_train_corr_target.head(10).index.values   # most negative correlations
var_pos_corr_list = df_app_train_corr_target[-2:-10:-1].index.values # most positive correlations (excluding TARGET itself)

print(var_neg_corr_list)
print(var_pos_corr_list)

eda.plot_kde_hist_var(df_app_train_align,var_neg_corr_list,drawAll=True)
eda.plot_kde_hist_var(df_app_train_align,var_pos_corr_list,drawAll=True)
['EXT_SOURCE_3' 'EXT_SOURCE_2' 'EXT_SOURCE_1'
 'NAME_EDUCATION_TYPE_Higher education' 'NAME_INCOME_TYPE_Pensioner'
 'ORGANIZATION_TYPE_XNA' 'FLOORSMAX_AVG' 'FLOORSMAX_MEDI' 'FLOORSMAX_MODE'
 'EMERGENCYSTATE_MODE_No']
['DAYS_BIRTH' 'DAYS_EMPLOYED' 'REGION_RATING_CLIENT_W_CITY'
 'REGION_RATING_CLIENT' 'NAME_INCOME_TYPE_Working' 'DAYS_LAST_PHONE_CHANGE'
 'DAYS_ID_PUBLISH' 'REG_CITY_NOT_WORK_CITY']
Out[22]:
([True, True, True, True, True, True, True, True],
 [2.5656843067790899e-295,
  2.9178193590053594e-149,
  1.03323934151479e-81,
  3.5676464026189418e-82,
  3.570558750907294e-221,
  1.8654513632029434e-186,
  1.3904472512309756e-144,
  8.401013793736072e-124])

We can see that for the EXT_SOURCE_x, DAYS_EMPLOYED, and DAYS_BIRTH features, the population distributions for TARGET=1 and TARGET=0 are quite different. Thus we take a closer look at these by generating a heatmap.

Additional analysis for DAYS_EMPLOYED

A more careful analysis is made of DAYS_EMPLOYED by converting it into years, which is easier to understand and interpret. We see that the longest employment duration is about 50 years.

In [26]:
daysEmp_data = df_app_train_align[['TARGET','DAYS_EMPLOYED']].copy()  # copy to avoid chained-assignment issues
daysEmp_data.loc[:,'YEARS_EMPLOYED'] = daysEmp_data['DAYS_EMPLOYED']/365
daysEmp_data['YEARS_EMPLOYED'].hist()
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x7eb6351a02b0>

We cut up the dataset into 10 equal portions of 5 years each, and calculate the mean of the TARGET in each bin.

In [28]:
daysEmp_data['YEARS_BINNED'] = pd.cut(daysEmp_data['YEARS_EMPLOYED'],bins=np.linspace(-50,0,num=11))
daysEmp_group = daysEmp_data.groupby('YEARS_BINNED').mean()
daysEmp_group
Out[28]:
TARGET DAYS_EMPLOYED YEARS_EMPLOYED
YEARS_BINNED
(-50.0, -45.0] 0.000000 -16862.916667 -46.199772
(-45.0, -40.0] 0.004762 -15325.461905 -41.987567
(-40.0, -35.0] 0.019231 -13517.634615 -37.034615
(-35.0, -30.0] 0.041237 -11791.317526 -32.304980
(-30.0, -25.0] 0.039903 -9960.885429 -27.290097
(-25.0, -20.0] 0.047566 -8092.456438 -22.171114
(-20.0, -15.0] 0.048457 -6306.064394 -17.276889
(-15.0, -10.0] 0.057786 -4465.836653 -12.235169
(-10.0, -5.0] 0.073740 -2629.270088 -7.203480
(-5.0, 0.0] 0.105712 -864.017884 -2.367172

A bar plot shows us that those who default the most usually have the lowest number of days/years in employment.

In [29]:
sns.barplot(daysEmp_group.index,daysEmp_group['TARGET']*100)
plt.xticks(rotation=60)
plt.ylabel('% default')
plt.xlabel('Days Employed Groups (Years)')
Out[29]:
Text(0.5,0,'Days Employed Groups (Years)')

We can see that the above histogram with the bins is more informative than the KDE below.

In [30]:
eda.plot_kde_hist_var(df_app_train_align,['DAYS_EMPLOYED'],drawAll=True)
Out[30]:
([True], [2.9178193590053594e-149])

Correlation heatmap of salient features

In [23]:
df_ext_src = df_app_train_align[['TARGET','EXT_SOURCE_3','EXT_SOURCE_2','EXT_SOURCE_1','DAYS_BIRTH','DAYS_EMPLOYED']] 
df_ext_src_corr = df_ext_src.corr()
sns.heatmap(df_ext_src_corr,vmin=-1.0,vmax=1.0,annot=True)
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x7eb67aa65be0>

We create a pair grid to analyze the most important features of the dataset graphically. As the original dataset is quite large, we take a sample of it: we remove all rows that have NaN and then take a random sample of 5000 points.

We have a 5x5 grid in the pair grid, as TARGET is not plotted directly but is instead shown via the coloring scheme (i.e., hue). Orange is TARGET=1 (default), and blue is TARGET=0 (no default).

The pairgrid can be explained as follows:

  • Upper triangle: This is a scatter plot between the two variables in the X & Y axes, and has the TARGET variable as a different hue.
  • Diagonal: This is a kde plot of the distribution of each variable.
  • Lower triangle: This is a bivariate KDE plot of the two variables.

Looking at DAYS_BIRTH and the EXT_SOURCE_x features, we can see that for TARGET=1 (i.e., orange) there is a strong negative correlation. It is harder to make similar comments about the correlations between the other variables and TARGET.

In [34]:
df_ext_src_sample = df_ext_src.dropna().sample(5000)
grid = sns.PairGrid(data = df_ext_src_sample, diag_sharey=True,
                    hue = 'TARGET', 
                    vars = [x for x in list(df_ext_src_sample.columns) if x != 'TARGET'])

grid.map_upper(plt.scatter, alpha = 0.2)
grid.map_diag(sns.kdeplot)
grid.map_lower(sns.kdeplot, cmap = plt.cm.OrRd_r);

Pickling data

We end the EDA at this point and pickle all the necessary dataframes for the next step which is feature engineering.

In [35]:
storeDir = fullDir+'/pickleshare'

db = pickleshare.PickleShareDB(storeDir)
db['df_app_test_align'] = df_app_test_align
db['df_app_train_align'] = df_app_train_align
db['df_app_train_corr_target'] = df_app_train_corr_target

Summary

Remember EDA is about the following:

  • Understanding basic characteristics of your dataset
    • Size of your dataset and number of features
    • Number of datatypes in your feature set.
    • Number of categorical data types.
    • Is the dataset unbalanced?
  • Performing label encoding or one-hot encoding
  • Analyzing correlations within your dataset
    • Correlations between features and the TARGET
    • Correlations between features themselves
  • Analyzing missing/erroneous/outlier data.
    • Analyzing population distributions (i.e., TARGET=1, TARGET=0) for each feature for differences
    • Replacing erroneous/outlier values with NaN.

Converting iPython notebook to Python code

This allows us to run the code in Spyder.

In [2]:
!jupyter nbconvert ml_kaggle-home-loan-credit-risk-eda.ipynb --to python
[NbConvertApp] Converting notebook ml_kaggle-home-loan-credit-risk-eda.ipynb to python
[NbConvertApp] Writing 22055 bytes to ml_kaggle-home-loan-credit-risk-eda.py
