Kaggle: Credit risk
An important topic in regulatory capital modelling in banking is credit risk. Credit risk is the loss a bank incurs on its loan portfolio when customers default on their loans (i.e., fail to make their scheduled repayments).
Typically, expected loss (i.e., credit risk) is given as follows,
$$ EL = PD \times LGD \times EAD $$where $EL$ is Expected Loss, $PD$ is Probability of Default, $LGD$ is Loss Given Default, and $EAD$ is Exposure at Default.
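As a quick worked example (with made-up, purely illustrative risk parameters), the expected loss on a single exposure follows directly from the formula:
# Illustrative (made-up) risk parameters for a single loan
pd_ = 0.02      # probability of default over the horizon (2%)
lgd = 0.45      # loss given default (45% of the exposure is lost on default)
ead = 250_000   # exposure at default, in dollars
expected_loss = pd_ * lgd * ead
print('Expected loss: ${:,.0f}'.format(expected_loss))  # Expected loss: $2,250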
In the Basel Foundation Internal Ratings-Based (F-IRB) approach, banks estimate their own $PD$ risk parameter, while the other risk parameters such as $LGD$ and $EAD$ are prescribed by the national banking supervisor (e.g., APRA in Australia, the Fed/OCC in the US, the PRA in the UK). The Basel Advanced Internal Ratings-Based (A-IRB) approach allows banks to estimate all of their own risk parameters, subject to certain regulatory guidelines.
In the Kaggle dataset, we are given information on customers of a bank and whether or not they have defaulted on their home loans. The task at hand is therefore modelling the probability of default (i.e., PD). A related post covers capital modelling as applied to securitized financial products. Since PD is a basic modelling requirement of both the F-IRB and A-IRB approaches, we provide an example of how to model PD below.
Import all required modules¶
# datascience
import pandas as pd
import numpy as np
import sklearn
from sklearn import preprocessing
# File management
import os
# Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
import bokeh as bk
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
Read files into workspace¶
from pathlib import Path
home = str(Path.home())
inputDir = "/datasets/kaggle/home-credit-default-risk" # linux
#inputDir = "\datasets\kaggle\home-credit-default-risk" # windows
fullDir = home+inputDir
os.chdir(fullDir)
Read in the application training dataset as a dataframe, and set the index column to SK_ID_CURR.
df_app_train = pd.read_csv('application_train.csv',index_col=0)
df_app_test = pd.read_csv('application_test.csv',index_col=0)
Exploratory Data Analysis (EDA)¶
We will explore the characteristics of the dataset. Generally, this means answering the following questions:
- How many rows and columns in your dataset?
- Is your dataset unbalanced?
- How many columns have missing data and the percentage of missing data?
- How many columns are floats, ints, categorical, strings?
Print the first five rows of the dataframe
df_app_train.head()
df_app_train.info()
We see that there are more than 300,000 entries and 122 columns in the dataset. Most of these columns are floats (65) and integers (41); the remaining 16 are strings, reported by pandas as object. The total amount of memory used by the dataframe is almost 300 MB.
df_app_train.describe()
The above gives us descriptive statistics for each float or integer column. We get a sense that the dataset is probably unbalanced, because the mean of the TARGET variable is 0.08, i.e., close to zero.
df_app_train['TARGET'].value_counts()
This is verified since we see that there are almost 300,000 accounts with no default and about 25,000 accounts with default.
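A minimal way to express this imbalance as a rate rather than raw counts (reusing the dataframe loaded above):
# Fraction of accounts in each class; TARGET==1 (default) is roughly 8% of the data
class_rates = df_app_train['TARGET'].value_counts(normalize=True)
print(class_rates)
print('Default rate: {:.2%}'.format(class_rates[1]))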
Check for missing values¶
df_app_train.shape[0]
def missing_val_table(df, miss_val_thresh=50):
    # Count and percentage of missing values per column
    num_miss_val = df.isnull().sum()
    pct_miss_val = num_miss_val / df.shape[0] * 100
    tab_miss_val = pd.concat([num_miss_val, pct_miss_val], axis=1)
    tab_miss_val.columns = ['Missing Values', 'Percentage']
    # Keep only columns that actually have missing values, sorted by severity
    tab_miss_val = tab_miss_val[tab_miss_val['Missing Values'] > 0]
    tab_miss_val['Percentage'] = tab_miss_val['Percentage'].round(1)
    tab_miss_val.sort_values(['Percentage'], ascending=False, inplace=True)
    # Summary statistics on how widespread the missing data is
    numCol_miss_val = tab_miss_val.shape[0]
    numCol_total = df.shape[1]
    pct_col_miss_val = round((numCol_miss_val / numCol_total) * 100)
    numCol_Hi_MissVal = tab_miss_val[tab_miss_val['Percentage'] > miss_val_thresh].shape[0]
    pct_col_gt_threshold = round(numCol_Hi_MissVal / numCol_miss_val * 100)
    print('\nYour dataframe has {0} columns, of which {1} ({2}%) have missing values'
          .format(numCol_total, numCol_miss_val, pct_col_miss_val))
    print('\nOf these, {0} ({2}%) columns have more than {1}% missing values'
          .format(numCol_Hi_MissVal, miss_val_thresh, pct_col_gt_threshold))
    return tab_miss_val
df_app_train_missing = missing_val_table(df_app_train)
df_app_train_missing.head(20)
About half of our feature columns have missing values, and about 30 of them have more than 50 percent of missing values.
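One common follow-up, shown here only as an option and not applied in the rest of this notebook, is to drop columns whose missing percentage exceeds a chosen threshold before modelling. A minimal sketch using the table returned above:
# Drop columns with more than 50% missing values (the threshold is a judgement call;
# tree-based models such as LightGBM can often handle the NaNs directly instead)
cols_to_drop = df_app_train_missing[df_app_train_missing['Percentage'] > 50].index
df_app_train_reduced = df_app_train.drop(columns=cols_to_drop)
print('Dropped {} columns, {} remain'.format(len(cols_to_drop), df_app_train_reduced.shape[1]))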
df = df_app_train
Describe the dataframe's basic attributes
(numRow, numCol) = df.shape
# Basic information on dataframe
print('Your dataframe has {0} rows, and {1} columns'.format(numRow, numCol))
df_dtype = df.dtypes.value_counts()
# Is the dataset balanced? Flag it as unbalanced if the positive class is below 30% or above 70%
isBalanced = True
pctTarget_true = (df['TARGET'].sum()/numRow*100).round(2)
if pctTarget_true > 70 or pctTarget_true < 30:
    isBalanced = False
print('\nYour dataframe\'s target variable is True for {0}% and isBalanced: {1}'.format(pctTarget_true, isBalanced))
# Missing value summary
x = missing_val_table(df)
print(x.head(10))
print('\nThe columns are of the following types:\n{0}'.format(df_dtype))
print('\nThe number of unique entries in each object column is as follows:')
df_object = df.select_dtypes('object').nunique(dropna=False)
df_object
Encoding categorical variables¶
Since several object columns are categorical variables, we need to encode them. We use label encoding for columns with 2 unique values, and one-hot encoding for columns with more than 2 unique values.
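A toy example of the difference between the two encodings, on a made-up dataframe (reusing the pandas and sklearn.preprocessing imports from the top of the notebook):
# Made-up example columns: FLAG has 2 categories, CONTRACT has more than 2 uses in practice
toy = pd.DataFrame({'FLAG': ['Y', 'N', 'Y'],
                    'CONTRACT': ['Cash', 'Revolving', 'Cash']})
# Label encoding: a binary column becomes 0/1 within a single column
toy['FLAG_ENC'] = preprocessing.LabelEncoder().fit_transform(toy['FLAG'])
# One-hot encoding: each category gets its own indicator column
toy_onehot = pd.get_dummies(toy, columns=['CONTRACT'])
print(toy_onehot)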
Label encoding¶
df_app_train.head()
df_app_train['FLAG_OWN_CAR'][:10]
We search for columns of the object data type that have 2 unique values and label encode them. For example, 'FLAG_OWN_CAR' has 2 unique values, 'Y'/'N', so label encoding maps it to 1/0.
le = preprocessing.LabelEncoder()
le_count = 0
for col in df_app_train:
    if df_app_train[col].dtype == 'object':
        if df_app_train[col].nunique(dropna=False) <= 2:
            print(col)
            le_count += 1
            # Fit on the training column, then apply the same mapping to both train and test
            le.fit(df_app_train[col])
            df_app_train[col] = le.transform(df_app_train[col])
            df_app_test[col] = le.transform(df_app_test[col])
print('{0} columns were label encoded'.format(le_count))
df_app_train.head()
df_app_train['FLAG_OWN_CAR'][:10]
One-hot encoding¶
One-hot encoding will increase the number of columns in our datasets, because each unique category of a categorical variable gets its own indicator column.
df_app_train = pd.get_dummies(df_app_train)
df_app_test = pd.get_dummies(df_app_test)
Aligning training & testing data¶
print('Training dataset shape: {}'.format(df_app_train.shape))
print('Testing dataset shape: {}'.format(df_app_test.shape))
We can see that our training dataset has 242 columns while the testing dataset has 238 columns, so the two datasets are not aligned.
We can use the DataFrame align method to fix this, with a full explanation given on Stack Overflow.
Remember, both the training and testing datasets should have the same number of features (i.e., $x$); however, the training dataset also includes the TARGET column, giving it $x+1$ columns. Thus, we save the TARGET column of our training data elsewhere, because when we align the training and testing datasets we perform an inner join (i.e., any column not present in both datasets is discarded). We then add the TARGET column back to the training dataset afterwards.
train_labels = df_app_train['TARGET']
df_app_train, df_app_test = df_app_train.align(df_app_test,join='inner',axis=1)
Now when we evaluate the shapes of both the training and testing datasets, we see they have the same number of feature columns.
print('Training dataset shape: {}'.format(df_app_train.shape))
print('Testing dataset shape: {}'.format(df_app_test.shape))
df_app_train['TARGET'] = train_labels
print('Training dataset shape: {}'.format(df_app_train.shape))
print('Testing dataset shape: {}'.format(df_app_test.shape))
We can see that our dataset no longer contains strings or objects, since we converted them all to numeric columns during the label and one-hot encoding process. We only have integers and floats.
df_app_train.dtypes.value_counts()
Erroneous data¶
We want to evaluate our floating point columns to see if the distributions make sense or if there are errors. For each floating point column, we perform the following:
- Generating a histogram.
- Generating the descriptive statistics.
Age¶
The DAYS_BIRTH column records the applicant's age in days relative to the date of the current loan application (as a negative number), which is hard to interpret in its raw format. So we divide by -365 to convert it to years.
df_app_train['DAYS_BIRTH'].describe()
We see below that the average applicant age is about 43 years, the minimum is about 20 years, and the maximum is about 69 years, which makes sense.
(df_app_train['DAYS_BIRTH']/-365).describe()
(df_app_train['DAYS_BIRTH']/365).hist()
plt.xlabel('years')
plt.ylabel('frequency')
Although the DAYS_BIRTH variable looks fine, the negative values are a little strange, so we could take the absolute value.
# df_app_train['DAYS_BIRTH'] = abs(df_app_train['DAYS_BIRTH']) # keep it consistent with tutorial
Employment history¶
For DAYS_EMPLOYED, the maximum definitely doesn't make sense, as it corresponds to roughly +1,000 years! First, there shouldn't be any positive values, since the column counts days before the application was submitted. Secondly, 1,000 years of employment is obviously impossible.
df_app_train['DAYS_EMPLOYED'].describe()
We want a graphical representation of DAYS_EMPLOYED so we can see whether this is caused by a few small outliers or by many datapoints. What we find is that all the datapoints have negative values except for the erroneous points, which all share the sentinel value 365243.
Analyzing the anomalous data, we find there are precisely 55,374 such points, and they have a lower default rate than the normal points. The anomalous points also make up about 18% of the dataset, which is quite high, so we should probably do something about them.
anom_days_employed = df_app_train[df_app_train['DAYS_EMPLOYED']==365243]
norm_days_employed = df_app_train[df_app_train['DAYS_EMPLOYED']!=365243]
print(anom_days_employed.shape)
dr_anom = anom_days_employed['TARGET'].mean()*100
dr_norm = norm_days_employed['TARGET'].mean()*100
print('Default rate (Anomaly): {:.2f}'.format(dr_anom))
print('Default rate (Normal): {:.2f}'.format(dr_norm))
pct_anom_days_employed = (anom_days_employed.shape[0]/df_app_train.shape[0])*100
print(pct_anom_days_employed)
- Create an additional column that is TRUE for the anomalous datapoints.
- Replace DAYS_EMPLOYED for the anomalous rows with NaN.
- Produce the histogram of DAYS_EMPLOYED.
By setting DAYS_EMPLOYED to NaN for the anomalous datapoints, we see that the column now has a distribution we would expect. We have still retained the information about which points were anomalous, and we can choose to replace the NaNs with the median or another imputed value later.
df_app_train['DAYS_EMPLOYED_ANOM'] = df_app_train['DAYS_EMPLOYED'] == 365243
df_app_train['DAYS_EMPLOYED'].replace({365243:np.nan}, inplace=True)
# df_app_train['DAYS_EMPLOYED'] = abs(df_app_train['DAYS_EMPLOYED']) # commented out for consistency with tutorial
df_app_train['DAYS_EMPLOYED'].hist()
Any change to the training dataset needs to be applied to the testing dataset too. We can see that the test dataset exhibits the same strange outliers (9,274 of them).
df_app_test['DAYS_EMPLOYED'].hist()
print(df_app_test['DAYS_EMPLOYED'].describe())
print('\nTotal number of anomalous points: {0}\n'.format((df_app_test['DAYS_EMPLOYED'] == 365243).sum()))
We perform the same change on the test dataset and can see that the distribution of DAYS_EMPLOYED is now more sensible.
df_app_test['DAYS_EMPLOYED_ANOM'] = df_app_test['DAYS_EMPLOYED'] == 365243
df_app_test['DAYS_EMPLOYED'].replace({365243:np.nan},inplace=True)
df_app_test['DAYS_EMPLOYED'].hist()
#df_app_test['DAYS_EMPLOYED'] = abs(df_app_test['DAYS_EMPLOYED']) # commented out for consistency with tutorial
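If we later want to fill the NaNs we just introduced, a minimal sketch (not applied here) would impute the training-set median into both datasets:
# Sketch only: fill the anomaly-induced NaNs with the median learned from the training data
median_days_employed = df_app_train['DAYS_EMPLOYED'].median()
days_emp_train_imputed = df_app_train['DAYS_EMPLOYED'].fillna(median_days_employed)
days_emp_test_imputed = df_app_test['DAYS_EMPLOYED'].fillna(median_days_employed)  # reuse the training median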
Car anomalies¶
It seems quite strange that people would drive a car that is 91 years old.
(df_app_train['OWN_CAR_AGE']).describe()
We display a histogram and find that a few thousand cars are above 40 years old, which is quite old.
df_app_train['OWN_CAR_AGE'].hist()
We see that applicants whose cars are over 60 years old are more likely to default (8.38% vs 7.20%). There are quite a lot of applications (3,339) with cars over 60 years old.
However, as a percentage of all the points in our training data, the anomalous car-age points are only around 1.1%, so we can choose to ignore them. We can also use a correlation analysis to see whether this feature is important or not.
anom = df_app_train[df_app_train['OWN_CAR_AGE']>=60]
norm = df_app_train[df_app_train['OWN_CAR_AGE']<60]
print(anom.shape)
print('Default rate (Car Old):{:2.2f}'.format(anom['TARGET'].mean()*100))
print('Default rate (Car Norm):{:2.2f}'.format(norm['TARGET'].mean()*100))
print('Pct of anomalies:{:2.2f}'.format((anom.shape[0]/df_app_train.shape[0])*100))
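A quick check of that correlation (the full correlation matrix is computed in the next section):
# Pearson correlation between car age and the default flag; pandas drops the NaN pairs automatically
print(df_app_train['OWN_CAR_AGE'].corr(df_app_train['TARGET']))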
Correlation analysis¶
We now want to observe the correlations between our features and the target.
df_app_train_corr = df_app_train.corr()
df_app_train_corr.head()
Once we have the correlation matrix, we focus on which features have the strongest positive (negative) correlations with the target variable. We can see below that TARGET is positively correlated with the REGION_RATING_* variables, and negatively correlated with the EXT_SOURCE_* variables as well as DAYS_BIRTH and DAYS_EMPLOYED. This makes sense, as the probability of default should be lower if the applicant is older or has been employed longer.
Note that NAME_INCOME_TYPE_Working and the REGION_RATING_* variables are (encoded) categorical variables.
df_app_train_corr_target = df_app_train_corr['TARGET'].sort_values()
print('+ve corr: \n{0}'.format(df_app_train_corr_target.tail(20)))
print('-ve corr: \n{0}'.format(df_app_train_corr_target.head(20)))
Analyze features with greatest correlation magnitude¶
At this point we know that features like age, time in employment, and EXT_SOURCE_1 impact the likelihood of default. We therefore analyze the kernel density estimates (KDEs) of these feature distributions, comparing applicants that defaulted against those that did not, to see if we can extract any insight.
# Note: the series is sorted in ascending order, so head() holds the most *negative*
# correlations, while the reversed tail (skipping TARGET itself) holds the most *positive* ones.
var_pos_corr = df_app_train_corr_target.head(10).index.values
var_neg_corr = df_app_train_corr_target[-2:-10:-1].index.values
print(var_pos_corr)
print(var_neg_corr)
We plot the KDEs of the features most positively (negatively) correlated with TARGET. This lets us compare the distribution of each feature for the default and no-default groups. If a feature's distribution differs markedly between the two groups, that is a good sign the feature is informative, and we should look out for it.
We can see that EXT_SOURCE_3 shows the largest difference in distribution between default and no default.
numVar = var_pos_corr.shape[0]
plt.figure(figsize=(10,40))
for i, var in enumerate(var_pos_corr):
    dflt_var = df_app_train.loc[df_app_train['TARGET']==1, var]
    dflt_non_var = df_app_train.loc[df_app_train['TARGET']==0, var]
    plt.subplot(numVar, 1, i+1)
    sns.kdeplot(dflt_var, label='Default')
    sns.kdeplot(dflt_non_var, label='No Default')
    #plt.xlabel(var)
    plt.ylabel('Density')
    plt.title(var)
numVar = var_neg_corr.shape[0]
plt.figure(figsize=(10,40))
for i, var in enumerate(var_neg_corr):
    dflt_var = df_app_train.loc[df_app_train['TARGET']==1, var]
    dflt_non_var = df_app_train.loc[df_app_train['TARGET']==0, var]
    plt.subplot(numVar, 1, i+1)
    sns.kdeplot(dflt_var, label='Default')
    sns.kdeplot(dflt_non_var, label='No Default')
    #plt.xlabel(var)
    plt.ylabel('Density')
    plt.title(var)
Analyze DAYS_EMPLOYED¶
We take DAYS_EMPLOYED and add an additional column, YEARS_EMPLOYED. We then create a new column that bins each observation into 5-year bands; since there are about 50 employable years, we create 10 bins. Once each observation is in a bin, we can use a groupby to aggregate over each bin.
daysEmp_data = df_app_train[['TARGET','DAYS_EMPLOYED']]
daysEmp_data.loc[:,'YEARS_EMPLOYED'] = daysEmp_data['DAYS_EMPLOYED']/-365 # DAYS_EMPLOYED is negative, so divide by -365 to get positive years for the 0-50 bins below
daysEmp_data['YEARS_EMPLOYED'].hist()
daysEmp_data['YEARS_BINNED'] = pd.cut(daysEmp_data['YEARS_EMPLOYED'],bins=np.linspace(0,50,num=11))
daysEmp_data.head(10)
daysEmp_data['YEARS_BINNED'].unique()
After the group-by, we can see that the less time you've been employed, the more likely you are to default.
daysEmp_group = daysEmp_data.groupby('YEARS_BINNED').mean()
daysEmp_group
sns.barplot(daysEmp_group.index,daysEmp_group['TARGET']*100)
plt.xticks(rotation=60)
plt.ylabel('% default')
plt.xlabel('Days Employed Groups (Years)')
dflt_daysEmp = df_app_train.loc[df_app_train['TARGET']==1,'DAYS_EMPLOYED']
dflt_non_daysEmp = df_app_train.loc[df_app_train['TARGET']==0,'DAYS_EMPLOYED']
sns.kdeplot(dflt_daysEmp/365,label='Defaulted (Target==1)')
sns.kdeplot(dflt_non_daysEmp/365,label='Not Defaulted (Target==0)')
plt.xlabel('Time in employment (years)')
plt.ylabel('Density')
plt.title('Employment distribution for default & non-default')
Analyzing credit scores¶
We saw that the external source scores had the highest correlations with TARGET, followed by DAYS_BIRTH and DAYS_EMPLOYED. So we want to take a closer look at these features and their interplay with TARGET.
| Feature Name | Corr. with TARGET |
|---|---|
| EXT_SOURCE_3 | -0.178919 |
| EXT_SOURCE_2 | -0.160472 |
| EXT_SOURCE_1 | -0.155317 |
| DAYS_BIRTH | -0.078239 |
| DAYS_EMPLOYED | -0.074958 |
df_ext_src = df_app_train[['TARGET','EXT_SOURCE_3','EXT_SOURCE_2','EXT_SOURCE_1','DAYS_BIRTH']] # 'DAYS_EMPLOYED'
df_ext_src_corr = df_ext_src.corr()
sns.heatmap(df_ext_src_corr,vmin=-1.0,vmax=1.0,annot=True)
Additional graphical analysis for major features¶
We want to create a pairplot and a PairGrid to graphically analyze the most important features of the dataset. As the original dataset is quite large, we take a sample of it: we remove all rows that contain NaN and then take a random sample of 5,000 points.
This gives a 5x5 grid in the pairplot, since TARGET is explicitly included as one of the variables.
df_ext_src.shape
df_ext_src_sample = df_ext_src.dropna().sample(5000)
sns.pairplot(df_ext_src_sample)
We use PairGrid to create a more informative plot. In this PairGrid, TARGET is denoted by the hue: orange is TARGET==1 (default), and blue is TARGET==0 (no default).
The PairGrid can be explained as follows:
- Upper triangle: a scatter plot of the two variables on the X and Y axes, with TARGET as the hue.
- Diagonal: a KDE plot of each variable's distribution.
- Lower triangle: a 2D KDE (density) plot of the two variables.
grid = sns.PairGrid(data = df_ext_src_sample, diag_sharey=True,
hue = 'TARGET',
vars = [x for x in list(df_ext_src_sample.columns) if x != 'TARGET'])
# Upper is a scatter plot
grid.map_upper(plt.scatter, alpha = 0.2)
# Diagonal is a KDE plot
grid.map_diag(sns.kdeplot)
# Bottom is density plot
grid.map_lower(sns.kdeplot, cmap = plt.cm.OrRd_r);
Feature Engineering¶
Based on our empirical analysis of the correlations between the target variable and the feature variables, we can perform feature engineering. Typically, feature engineering means operations such as:
- Polynomial features: this includes interactions and powers of each feature variable.
- Expert knowledge features: new variables constructed from domain knowledge, such as ratios between existing columns (a short sketch follows below).
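As a sketch of the second bullet (not used further in this notebook), a few domain-driven ratios built from columns of application_train.csv such as AMT_CREDIT, AMT_INCOME_TOTAL, and AMT_ANNUITY:
# Example "expert knowledge" features (sketch only)
df_fe = df_app_train.copy()
df_fe['CREDIT_INCOME_PCT'] = df_fe['AMT_CREDIT'] / df_fe['AMT_INCOME_TOTAL']    # loan size relative to income
df_fe['ANNUITY_INCOME_PCT'] = df_fe['AMT_ANNUITY'] / df_fe['AMT_INCOME_TOTAL']  # repayment burden relative to income
df_fe['EMPLOYED_TO_AGE'] = df_fe['DAYS_EMPLOYED'] / df_fe['DAYS_BIRTH']         # share of life spent employed (both columns are negative day counts)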
print(var_pos_corr[0:10])
imp_var = var_pos_corr[0:4]
print(imp_var)
poly_features_train = df_app_train[imp_var]
poly_features_test = df_app_test[imp_var]
poly_target_train = df_app_train['TARGET']
poly_features_train.columns
Imputing NaN points¶
imputer = preprocessing.Imputer(strategy='median')
poly_features_train = imputer.fit_transform(poly_features_train) # fit finds the median of each column on the training data, then applies it
poly_features_test = imputer.transform(poly_features_test) # only transform, so the test data uses the medians fitted on the training data
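Note that preprocessing.Imputer has been removed in newer scikit-learn releases; on recent versions the equivalent (assuming the same variables as above) is sklearn.impute.SimpleImputer:
# Equivalent imputation on scikit-learn >= 0.20
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
poly_features_train = imputer.fit_transform(poly_features_train)  # learn the medians on the training data
poly_features_test = imputer.transform(poly_features_test)        # apply the same medians to the test data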
Creating polynomial features¶
poly_transformer = preprocessing.PolynomialFeatures(degree=3)
poly_transformer.fit(poly_features_train)
poly_features_train = poly_transformer.transform(poly_features_train)
poly_features_test = poly_transformer.transform(poly_features_test)
print('Polynomial features: {}'.format(poly_features_train.shape))
poly_transformer.get_feature_names()[:15]
poly_transformer.get_feature_names(input_features=imp_var)
Since we now have a bigger dataset with additional artificially created features, let's evaluate whether these new features have higher correlations than the original features. poly_features_train is a NumPy array, so we need to create a DataFrame out of it and add the TARGET column to it.
df_poly_features_train = pd.DataFrame(poly_features_train, columns = poly_transformer.get_feature_names(input_features=imp_var))
df_poly_features_train['TARGET'] = poly_target_train.values # use .values: the new DataFrame has a default index, whereas poly_target_train is indexed by SK_ID_CURR
poly_corrs = df_poly_features_train.corr()['TARGET'].sort_values()
print('+ve correlations:\n{}'.format(poly_corrs.tail(20)))
print('-ve correlations:\n{}'.format(poly_corrs.head(20)))
Now that we have a new df_poly_features_train that contains the polynomial features, we need to add the SK_ID_CURR column too.
df_app_train.index.values # SK_ID_CURR values live in the index of df_app_train, not in a column
df_poly_features_train['SK_ID_CURR'] = df_app_train.index.values
df_app_poly_train = df_app_train.reset_index().merge(df_poly_features_train, on='SK_ID_CURR', how='left')
Add new features to the test dataset¶
df_poly_features_test = pd.DataFrame(poly_features_test, columns = poly_transformer.get_feature_names(input_features=imp_var))
df_poly_features_test['SK_ID_CURR'] = df_app_test['SK_ID_CURR']
df_app_poly_test = df_app_test.reset_index().merge(df_poly_features_test, on='SK_ID_CURR', how='left')
Align the train and test datasets¶
df_app_poly_train, df_app_poly_test = df_app_poly_train.align(df_app_poly_test, join='inner', axis=1)
!jupyter nbconvert --to script ml_kaggle_home-loan-credit-risk.ipynb