Kaggle: Credit risk (Exploratory Data Analysis)
An important topic in regulatory capital modelling in banking is credit risk: the loss a bank suffers on its loan portfolio when customers default on their loans (i.e., miss or stop making their repayments). These loans can be home loans, credit cards, car loans, personal loans, corporate loans, etc. (i.e., mortgages, revolving lines of credit, retail loans, wholesale loans). Credit risk also extends to securitized products, and a related post covers capital modelling as applied to securitized financial products.
Typically, expected loss (i.e., credit risk) is given as follows,
$$ EL = PD \times LGD \times EAD $$
where $EL$ is Expected Loss, $PD$ is Probability of Default, $LGD$ is Loss Given Default, and $EAD$ is Exposure at Default.
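As a quick worked example (illustrative numbers only, not from any real portfolio): a loan with a 2% probability of default, a 40% loss given default, and a $500,000 exposure at default has an expected loss of $4,000.
# Worked example of the expected-loss formula (illustrative numbers only)
pd_ = 0.02       # PD: probability of default (named pd_ to avoid shadowing pandas)
lgd = 0.40       # LGD: fraction of the exposure lost if the borrower defaults
ead = 500000     # EAD: exposure at default, in dollars
el = pd_ * lgd * ead
print('Expected loss: ${:,.0f}'.format(el))  # Expected loss: $4,000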
Under F-IRB (Foundation Internal Ratings-Based), banks calculate their own $PD$ risk parameter, while the remaining risk parameters such as $LGD$ and $EAD$ are provided by the nation's banking supervisor (i.e., APRA in Australia, the FED/OCC in the US, the PRA in the UK), except for retail exposures. The Basel A-IRB (Advanced Internal Ratings-Based) approach allows banks to calculate all of their own risk parameters, subject to certain regulatory guidelines. More details can be found from the Bank for International Settlements (BIS).
Since mortgages are an important component of a bank's lending activity and business, we explore a mortgage dataset from Kaggle. In the Kaggle dataset, we are given information on customers of a bank and whether or not they have defaulted on their home loans. Thus, the task at hand is modelling the probability of default, $PD$. As estimating $PD$ is a basic requirement of credit risk modelling, this is an excellent introduction to credit risk.
Machine Learning applications are made up of the following steps:
- Exploratory Data Analysis (EDA)
- Feature Engineering
- Machine Learning Algorithm Selection
- Hyper-parameter tuning
EDA is an exercise where we explore the given dataset and perform several tasks:
- Understand your training and test datasets
- How big are your training and test datasets? How many datatypes are there in your datasets?
- Is there missing data in your datasets? How many features have more than 50% of their data missing?
- Do you have categorical features? How many unique categories are in each categorical feature? Do you need to perform label encoding or one-hot encoding?
- Do you have erroneous/outlier data in your features? Do you need to replace these outlier/erroneous points with NaN?
- What do the two population distributions (i.e., TARGET=0 and TARGET=1) look like for each feature? Does any feature have KDEs/histograms that are graphically/statistically different between the two populations?
- Which features exhibit the highest +ve and -ve correlations with your TARGET data?
- For features that exhibit the highest +ve/-ve correlations with the TARGET data, do they exhibit correlations with each other?
- Are there any interesting features that should be explored further as to how they impact the TARGET data?
- Do your training and test datasets need to be aligned?
I make use of my own exploratory data analysis module that can be found in the /listings/machine-learning/ section.
Importing required modules¶
# importing file system modules
import os
import sys
import warnings
warnings.filterwarnings('ignore')
if sys.platform == 'linux':
    sys.path.append('/home/randlow/github/blog2/listings/machine-learning/') # linux
elif sys.platform == 'win32':
    sys.path.append('\\Users\\randl\\github\\blog2\\listings\\machine-learning\\') # win32
# importing data science modules
import pandas as pd
import numpy as np
import sklearn
from sklearn import preprocessing
import scipy as sp
import pickleshare
# importing graphics modules
import matplotlib.pyplot as plt
import seaborn as sns
# importing personal data science modules
import rand_eda as eda
Reading in training and test datasets¶
We read in the datasets, changing the input directory depending on whether I am using my Windows or ChromeOS (Linux) machine. I create an additional copy of the training and test datasets, as I will be making many adjustments to the dataframes and would like to compare them with the originals.
from pathlib import Path
home = str(Path.home())
if sys.platform == 'linux':
    inputDir = "/datasets/kaggle/home-credit-default-risk/" # linux
elif sys.platform == 'win32':
    inputDir = "\\datasets\\kaggle\\home-credit-default-risk\\" # windows

fullDir = home+inputDir
df_app_train = pd.read_csv(fullDir+'application_train.csv',index_col=0)
df_app_test = pd.read_csv(fullDir+'application_test.csv',index_col=0)
df_app_train_origin = df_app_train.copy()
df_app_test_origin = df_app_test.copy()
Comparing training and test datasets¶
Comparing both training and test datasets, where column 0 is the training dataset and column 1 is the test dataset, we see that the training dataset is unbalanced and is as large as 570MB with 121 columns, whereas the test dataset is 90MB with 120 columns, as it does not include the TARGET column. The command also prints out the categorical features in both datasets.
Also returned is the missing-value information for the dataset. We see that 67 columns in the training dataset are missing values, which is about 55% of the total number of features available. 41 columns have a critical number of missing values, meaning that more than 50% of the rows in those features are missing.
comb_basic_info,comb_miss_val,comb_miss_val_app_train,comb_miss_val_app_test = eda.print_compare_df(df_app_train,df_app_test)
print('\nCombined basic info:\n{}'.format(comb_basic_info))
print('\nCombined missing info:\n{}'.format(comb_miss_val))
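For readers without my `rand_eda` module, here is a minimal sketch of the kind of missing-value summary produced above (an assumption about the module's behaviour, not its actual code; the 50% 'critical' threshold is the same one used above):
# Minimal sketch of a missing-value summary (assumes a pandas DataFrame)
def missing_value_summary(df, critical_frac=0.5):
    miss = df.isnull().sum()
    miss = miss[miss > 0].sort_values(ascending=False)
    pct = miss/len(df)
    summary = pd.DataFrame({'n_missing': miss, 'pct_missing': pct})
    n_critical = (pct > critical_frac).sum()
    print('{} columns have missing values; {} are above {:.0%} missing'.format(len(summary), n_critical, critical_frac))
    return summary

missing_value_summary(df_app_train).head()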
Comparing feature columns that differ between the test and training datasets matters because there may be categorical values that exist in the training dataset but are non-existent in the test dataset. We can see that for the `CODE_GENDER` column, the training dataset (i.e., DF1) has 4 entries denoted as `XNA` that are non-existent in the test dataset. Likewise, in the `NAME_INCOME_TYPE` feature, `Maternity leave` occurs 5 times in the training dataset but does not exist in the test dataset. Thus, the function lists all categorical feature columns that have different categories between the training and test datasets.
The barcharts provide a graphical representation of the same data, and make it easier to see the frequency of each categorical observation in each feature for the training and test datasets.
This information is useful as it impacts how encoding is performed on the training and test datasets. You may also use this data to decide to remove rows in the training dataset whose categories have no occurrences in the test dataset and only a few occurrences in the training dataset, as sketched below.
eda.print_compare_df(df_app_train,df_app_test,printCompareData=False)
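As a sketch of the row-removal idea above (a hypothetical helper, not part of `rand_eda`), one could drop training rows whose categorical value never appears in the test dataset:
# Hypothetical helper: drop training rows whose category never occurs in the
# test dataset (e.g., the 4 'XNA' entries in CODE_GENDER)
def drop_train_only_categories(df_train, df_test, col):
    train_only = set(df_train[col].dropna()) - set(df_test[col].dropna())
    mask = df_train[col].isin(train_only)
    print('Dropping {} rows with {} values {}'.format(mask.sum(), col, train_only))
    return df_train[~mask]

# Example usage (commented out; we keep all rows in this notebook):
# df_app_train = drop_train_only_categories(df_app_train, df_app_test, 'CODE_GENDER')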
Encoding categorical variables¶
Label encoding¶
Label encoding is used when we have a 2-state categorical variable (i.e., Y/N, M/F), where we replace the categories with 1s and 0s (i.e., Y=1/N=0, M=1/F=0). One-hot encoding is when we create a separate feature column for each category (i.e., Y, N, M, F): for a row with Y, the Y feature column will be 1 and the N feature column will be 0.
We perform label encoding for 2-state categorical variables to reduce the size of the feature set, and one-hot encoding for categories with N states for accuracy.
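Here is a toy illustration of the two schemes on a made-up frame, using scikit-learn's LabelEncoder for the 2-state column and pandas' get_dummies for the N-state column:
# Toy illustration: label encoding (2-state) vs one-hot encoding (N-state)
toy = pd.DataFrame({'FLAG_OWN_CAR': ['Y', 'N', 'Y'],
                    'OCCUPATION': ['Laborers', 'Managers', 'Drivers']})

le = preprocessing.LabelEncoder()
toy['FLAG_OWN_CAR'] = le.fit_transform(toy['FLAG_OWN_CAR'])  # N -> 0, Y -> 1

print(pd.get_dummies(toy, columns=['OCCUPATION']))  # one new column per category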
We see below that for the training (test) dataset, 3 (4) columns were 2-state categorical variables. This is because, as found in the previous analysis comparing the training and test datasets, the `CODE_GENDER` column in the training dataset has an additional categorical entry labeled `XNA` that does not exist in the test dataset.
df_app_train, label_encode_list_train = eda.label_encoding_df(df_app_train)
df_app_test, label_encode_list_test = eda.label_encoding_df(df_app_test) # label-encoding
print('label_encode columns (Training) {}'.format(label_encode_list_train))
print('label_encode columns (Testing) {}'.format(label_encode_list_test))
We analyze the `NAME_CONTRACT_TYPE` feature before and after the label encoding. First, in `df_app_train_origin` we see below that `NAME_CONTRACT_TYPE` contains `Cash loans` and `Revolving loans`. Thus the `NAME_CONTRACT_TYPE` feature is a 2-state categorical variable.
In `df_app_train`, which has gone through label encoding, we see that every instance of `Cash loans` is now a 0, and every instance of `Revolving loans` is now a 1.
df_app_train_origin.head()
df_app_train.head()
One-hot encoding¶
One-hot encoding is performed by the `get_dummies` function in Pandas.
df_app_train = pd.get_dummies(df_app_train)
df_app_test = pd.get_dummies(df_app_test)
We can see that the number of feature columns has gone up from 121 to 242. This is because one-hot encoding creates an additional feature column for every entry in a categorical column. Thus label-encoding helps to reduce the number of feature variables for 2-state categorical variables.
We also see that the training and test datasets have different numbers of feature columns.
print('Training dataset shape: {}'.format(df_app_train.shape))
print('Testing dataset shape: {}'.format(df_app_test.shape))
We would like to see what the feature differences are between the training and test datasets, so we store the column names of each in a set. A set is a datatype that can only hold unique entries. Taking the set difference, we find that the additional features the training dataset has over the test dataset relate to `NAME_INCOME_TYPE_Maternity leave`, `NAME_FAMILY_STATUS_Unknown`, and `CODE_GENDER_M/F/XNA`.
app_train_col = set(df_app_train.columns)
app_test_col = set(df_app_test.columns)
exc_col = app_train_col-app_test_col
print(exc_col)
Aligning training & test datasets¶
We need to align the training and test datasets to have the same set of features; otherwise, the machine learning technique will not work, as the two datasets have different features.
Thus, we first store the `TARGET` column in another variable, as it needs to remain in the training dataset. In the `align` function, we perform an INNER join on `axis=1` (the columns), so only features that exist in both datasets are retained. We then add the `TARGET` column back to the training dataset. A toy example of `align` follows the code below.
train_target_labels = df_app_train['TARGET']
df_app_train_align, df_app_test_align = df_app_train.align(df_app_test,join='inner',axis=1)
df_app_train_align['TARGET'] = train_target_labels
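To see what `align` does, here is the toy example on two made-up frames: only the columns present in both frames survive the inner join.
# Toy example of DataFrame.align with an inner join on columns
left = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'ONLY_IN_LEFT': [5, 6]})
right = pd.DataFrame({'A': [7, 8], 'B': [9, 0]})
left_al, right_al = left.align(right, join='inner', axis=1)
print(list(left_al.columns))  # ['A', 'B'] -- ONLY_IN_LEFT is dropped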
We see that the training dataset now has the same number of features as the test dataset plus one (i.e., `TARGET`).
print('Training dataset shape: {}'.format(df_app_train_align.shape))
print('Testing dataset shape: {}'.format(df_app_test_align.shape))
We can evaluate the difference between the features of the aligned test & training datasets compared to before alignment. We can see that, to maintain consistency between the test and training datasets, all data associated with `CODE_GENDER_M/F/XNA`, `NAME_INCOME_TYPE_Maternity leave`, and `NAME_FAMILY_STATUS_Unknown` are removed.
app_train_align_col = set(df_app_train_align.columns)
app_test_align_col = set(df_app_test_align.columns)
diff_align_train = app_train_col - app_train_align_col
diff_align_test = app_test_col - app_test_align_col
print('\nDifference in aligned datasets (Training):\n{}'.format(diff_align_train))
print('\nDifference in aligned datasets (Test):\n{}'.format(diff_align_test))
Analyze outliers, missing & erroneous data¶
An important task is to evaluate the quality of the feature dataset for outliers, missing values, or erroneous data.
The function `plot_kde_hist_var` performs the following:
- Splits the dataset into two populations, `TARGET==0` and `TARGET==1` (i.e., 'No default' and 'Default')
- Produces the histogram of each population
- Produces the KDE of each population
- Produces the rugplot of each population
The histograms and KDEs show us the shape of each population's distribution for the feature. If the two population distributions look very different, the feature may be a key one for distinguishing the two populations (i.e., TARGET=1/0). The rug plots can be less useful in this respect, as there may be too many datapoints, but rugplots are generally useful graphical univariate representations of a distribution. The most detailed approach to evaluating the cleanliness of your data is to produce these plots for every feature and examine each one.
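For reference, here is a stripped-down sketch of what a `plot_kde_hist_var`-style helper does internally; the real `rand_eda` version also returns test statistics and p-values, which this sketch omits:
# Stripped-down sketch of a plot_kde_hist_var-style helper: overlay the
# TARGET==0 and TARGET==1 population distributions for each feature
def sketch_kde_hist(df, var_list, target='TARGET'):
    for var in var_list:
        fig, (ax_h, ax_k) = plt.subplots(1, 2, figsize=(10, 3))
        for val, label in [(0, 'No default'), (1, 'Default')]:
            data = df.loc[df[target] == val, var].dropna()
            ax_h.hist(data, bins=50, alpha=0.5, density=True, label=label)
            sns.kdeplot(data, ax=ax_k, label=label)
        ax_h.set_title('{} (histogram)'.format(var))
        ax_k.set_title('{} (KDE)'.format(var))
        ax_h.legend()
        plt.show()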
As an example, we examine the first 10 features of the training dataset.
varList = df_app_train_align.columns[:10].values
hVal_list,pVal_list= eda.plot_kde_hist_var(df_app_train_align,varList)
We can examine outlier/erroneous values by looking at the populations. For example, we can see that `DAYS_BIRTH` varies evenly across the entire distribution. We can also say that those who default are generally younger, as the `Default` population has higher frequencies for customers with lower `DAYS_BIRTH`.
For `DAYS_EMPLOYED`, we can tell this feature exhibits erroneous data, as there is a sudden spike in occurrences above 350,000 days, so we should investigate this further.
varList = ['DAYS_BIRTH','DAYS_EMPLOYED']
hVal_list,pVal_list= eda.plot_kde_hist_var(df_app_train_align,varList,drawAll=True)
Analyzing the `DAYS_EMPLOYED` feature for errors¶
We can investigate `DAYS_EMPLOYED` in greater detail and find that it has a maximum value of `365243`, which does not make sense, as this corresponds to about 1,000 years (i.e., 365243/365 ≈ 1,000, and no one can be employed for 1,000 years).
df_app_train_align['DAYS_EMPLOYED'].describe()
How often does this erroneous value occur? 55374 times in the `DAYS_EMPLOYED` column.
maxVal = max(df_app_train_align['DAYS_EMPLOYED'])
numErrorVal = df_app_train_align[df_app_train_align['DAYS_EMPLOYED']==maxVal].shape[0]
print(numErrorVal)
We need to analyze whether the erroneous value of `365243` occurs with low or high frequency in the datasets. We need to find out whether it occurs in any other columns besides `DAYS_EMPLOYED`, and whether it does so with high frequency. We find it only appears in `DAYS_EMPLOYED` and constitutes 18% of that feature in the training dataset, which is quite high.
Each column that exhibits the erroneous value is given in `errCol_list_train`, and this is only the `DAYS_EMPLOYED` feature. A minimal sketch of this kind of scan follows the output below.
df_col_errVal_train, errCol_list_train = eda.chk_val_col(df_app_train_align,maxVal)
df_col_errVal_test, errCol_list_test = eda.chk_val_col(df_app_test_align,maxVal)
print('\nTraining dataset:\n{}'.format(df_col_errVal_train))
print('\nTesting dataset:\n{}'.format(df_col_errVal_test))
print('\nColumns containing error values (Training):\n{}'.format(errCol_list_train))
print('\nColumns containing error values (Test):\n{}'.format(errCol_list_test))
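As mentioned above, here is a minimal sketch of what a `chk_val_col`-style scan does (an assumption about the module's internals, not its actual code): count occurrences of a given sentinel value in every column and report where it appears.
# Sketch of a chk_val_col-style scan for a sentinel value
def sketch_chk_val_col(df, err_val):
    counts = (df == err_val).sum()
    counts = counts[counts > 0]
    report = pd.DataFrame({'n_occurrences': counts,
                           'pct_of_column': counts/len(df)})
    return report, list(report.index)

report, err_cols = sketch_chk_val_col(df_app_train_align, maxVal)
print(report)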
eda.plot_kde_hist_var(df_app_train_align,errCol_list_train)
Now that we've identified how often, and in which columns, the erroneous value occurs, we need to replace it with `NaN`.
df_app_train_align = eda.fill_errorVal_df(df_app_train_align,errCol_list_train,maxVal)
df_app_test_align = eda.fill_errorVal_df(df_app_test_align,errCol_list_test,maxVal)
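In plain pandas, the same replacement can be sketched with `replace` (equivalent in spirit to `fill_errorVal_df`, assuming it simply swaps the sentinel for `NaN`):
# Plain-pandas sketch of the same fix: swap the sentinel value for NaN
# in the affected columns (already done above, so this is a no-op here)
for col in errCol_list_train:
    df_app_train_align[col] = df_app_train_align[col].replace(maxVal, np.nan)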
We see that now that we've replaced the erroneous value with `NaN`, the resulting histogram of `DAYS_EMPLOYED` makes more sense.
eda.plot_kde_hist_var(df_app_train_align,errCol_list_train,drawAll=False)
Correlation analysis¶
The simplest analysis for evaluating which features are most relevant to `TARGET` is correlation. We obtain a correlation matrix of the training dataset, and sort it to see the features that have the highest positive and negative correlation with `TARGET`.
For +ve correlation, `DAYS_BIRTH` and `REGION_RATING_CLIENT_x` are highest, but the magnitude is low, since the correlation is around 0.06-0.07.
For -ve correlation, we have stronger results, as the `EXT_SOURCE_x` features are between -0.15 and -0.18.
df_app_train_corr = df_app_train_align.corr()
df_app_train_corr_target = df_app_train_corr['TARGET'].sort_values()
print('+ve corr: \n{0}'.format(df_app_train_corr_target.tail(20)))
print('-ve corr: \n{0}'.format(df_app_train_corr_target.head(20)))
Population distribution graphical analysis¶
We extract the lists of the top 10 positively and negatively correlated features with `TARGET`, and plot the KDEs/histograms to see how the shapes of the `TARGET=1` and `TARGET=0` population distributions differ for these more salient features.
var_neg_corr_list = df_app_train_corr_target.head(10).index.values   # most negative correlations (list is sorted ascending)
var_pos_corr_list = df_app_train_corr_target[-2:-10:-1].index.values # most positive correlations (skipping TARGET itself at -1)
print(var_pos_corr_list)
print(var_neg_corr_list)
eda.plot_kde_hist_var(df_app_train_align,var_pos_corr_list,drawAll=True)
eda.plot_kde_hist_var(df_app_train_align,var_neg_corr_list,drawAll=True)
We can see that for the `EXT_SOURCE_x`, `DAYS_EMPLOYED`, and `DAYS_BIRTH` features, the population distributions for `TARGET=1` and `TARGET=0` are quite different. Thus we take a closer look at these by generating a heatmap.
Additional analysis for `DAYS_EMPLOYED`¶
A more careful analysis is made of `DAYS_EMPLOYED` by converting it into years, which is easier to understand and synthesize. We see that the maximum is 50 years.
daysEmp_data = df_app_train_align[['TARGET','DAYS_EMPLOYED']].copy() # copy to avoid SettingWithCopyWarning
daysEmp_data.loc[:,'YEARS_EMPLOYED'] = daysEmp_data['DAYS_EMPLOYED']/365
daysEmp_data['YEARS_EMPLOYED'].hist()
We `cut` the dataset into 10 equal bins of 5 years each, and calculate the mean of the `TARGET` in each bin.
daysEmp_data['YEARS_BINNED'] = pd.cut(daysEmp_data['YEARS_EMPLOYED'],bins=np.linspace(-50,0,num=11))
daysEmp_group = daysEmp_data.groupby('YEARS_BINNED').mean()
daysEmp_group
A barplot shows us that those who default most usually have the lowest number of days/years in employment.
sns.barplot(daysEmp_group.index,daysEmp_group['TARGET']*100)
plt.xticks(rotation=60)
plt.ylabel('% default')
plt.xlabel('Days Employed Groups (Years)')
We can see that the above histogram with the bins is more informative than the KDE below.
eda.plot_kde_hist_var(df_app_train_align,['DAYS_EMPLOYED'],drawAll=True)
Correlation heatmap of salient features¶
df_ext_src = df_app_train_align[['TARGET','EXT_SOURCE_3','EXT_SOURCE_2','EXT_SOURCE_1','DAYS_BIRTH','DAYS_EMPLOYED']]
df_ext_src_corr = df_ext_src.corr()
sns.heatmap(df_ext_src_corr,vmin=-1.0,vmax=1.0,annot=True)
We want to create a `pairplot` and a `pairgrid` to analyze the most important features of the dataset graphically. As the original dataset is quite large, we take a sample of it: we remove all rows that have `NaN` values and then take a random sample of 5000 points.
We have a 5x5 grid in the `pairgrid`, as `TARGET` is not included as a variable but is instead represented via the coloring scheme (i.e., the `hue`). Orange is `TARGET=1` (default), and blue is `TARGET=0` (no default).
The pairgrid can be explained as follows:
- Upper triangle: a scatter plot between the two variables on the X & Y axes, with the `TARGET` variable as a different hue.
- Diagonal: a KDE plot of the distribution of each variable.
- Lower triangle: a 2D KDE (density contour) plot of the two variables.
Looking at `DAYS_BIRTH` and the `EXT_SOURCE_x` features, we can see that for `TARGET=1` (i.e., orange) there is a high negative correlation. It is harder to make similar comments about the correlations between the other variables and `TARGET`.
df_ext_src_sample = df_ext_src.dropna().sample(5000)
grid = sns.PairGrid(data = df_ext_src_sample, diag_sharey=True,
hue = 'TARGET',
vars = [x for x in list(df_ext_src_sample.columns) if x != 'TARGET'])
grid.map_upper(plt.scatter, alpha = 0.2)
grid.map_diag(sns.kdeplot)
grid.map_lower(sns.kdeplot, cmap = plt.cm.OrRd_r);
Pickling data¶
We end the EDA at this point and pickle all the necessary dataframes for the next step, which is feature engineering.
storeDir = fullDir+'/pickleshare'
db = pickleshare.PickleShareDB(storeDir)
db['df_app_test_align'] = df_app_test_align
db['df_app_train_align'] = df_app_train_align
db['df_app_train_corr_target'] = df_app_train_corr_target
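In the follow-up feature-engineering notebook, the pickled dataframes can be read back the same way:
# Reading the pickled dataframes back (e.g., at the start of the next notebook)
db = pickleshare.PickleShareDB(storeDir)
df_app_test_align = db['df_app_test_align']
df_app_train_align = db['df_app_train_align']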
Summary¶
Remember, EDA is about the following:
- Understanding the basic characteristics of your dataset
  - Size of your dataset and number of features
  - Number of datatypes in your feature set
  - Number of categorical datatypes
  - Is the dataset unbalanced?
- Performing label encoding or one-hot encoding
- Analyzing correlations within your dataset
  - Correlations between features and the TARGET
  - Correlations between the features themselves
- Analyzing missing/erroneous/outlier data
  - Analyzing the population distributions (i.e., TARGET=1, TARGET=0) of each feature for differences
  - Replacing erroneous/outlier data with NaN
Converting the IPython notebook to Python code¶
This allows us to run the code in Spyder.
!jupyter nbconvert ml_kaggle-home-loan-credit-risk-eda.ipynb --to python