Kaggle: Credit risk
An important topic in regulatory capital modelling in banking is credit risk. Credit risk is the loss a bank incurs on its loan portfolio when customers default on their loans (i.e., fail to make their scheduled repayments).
Typically, expected loss (i.e., credit risk) is given as follows,
$$ EL = PD \times LGD \times EAD $$where $EL$ is Expected Loss, $PD$ is Probability of Default, $LGD$ is Loss Given Default, and $EAD$ is Exposure at Default.
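As a quick worked example (with made-up, purely illustrative risk parameters), the expected loss on a single exposure follows directly from the formula:
# Illustrative (made-up) risk parameters for a single loan
pd_ = 0.02      # probability of default over the horizon (2%)
lgd = 0.45      # loss given default (45% of the exposure is lost on default)
ead = 250_000   # exposure at default, in dollars
expected_loss = pd_ * lgd * ead
print('Expected loss: ${:,.0f}'.format(expected_loss))  # Expected loss: $2,250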
In the Basel Foundation Internal Ratings-Based (F-IRB) approach, banks estimate their own $PD$ risk parameter, while the other risk parameters such as $LGD$ and $EAD$ are prescribed by the national banking supervisor (e.g., APRA in Australia, the Fed/OCC in the US, the PRA in the UK). The Basel Advanced Internal Ratings-Based (A-IRB) approach allows banks to estimate all of their own risk parameters, subject to certain regulatory guidelines.
In the Kaggle dataset, we are given information on customers of a bank and whether or not they have defaulted on their home loans. The task at hand is therefore modelling the probability of default (i.e., PD). A related post covers capital modelling as applied to securitized financial products. Since PD is a basic modelling requirement of both the F-IRB and A-IRB approaches, we provide an example of how to model PD below.
Import all required modules¶
# datascience
import pandas as pd
import numpy as np
import sklearn
from sklearn import preprocessing
# File management
import os
# Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
import bokeh as bk
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
Read files into workspace¶
from pathlib import Path
home = str(Path.home())
inputDir = "/datasets/kaggle/home-credit-default-risk" # linux
#inputDir = "\datasets\kaggle\home-credit-default-risk" # windows
fullDir = home+inputDir
os.chdir(fullDir)
Read in the application training dataset as a dataframe, and set the index column to SK_ID_CURR.
df_app_train = pd.read_csv('application_train.csv',index_col=0)
df_app_test = pd.read_csv('application_test.csv',index_col=0)
Exploratory Data Analysis (EDA)¶
We will explore the characteristics of the dataset. Generally, this means answering the following questions:
- How many rows and columns in your dataset?
- Is your dataset unbalanced?
- How many columns have missing data and the percentage of missing data?
- How many columns are floats, ints, categorical, strings?
Print the first five rows of the dataframe
df_app_train.head()
df_app_train.info()
We see that there are more than 300,000 entries and 122 columns in the dataset. Most of these columns are floats (65) and integers (41); the remaining 16 are strings, reported by pandas as object. The total amount of memory used by the dataframe is almost 300 MB.
df_app_train.describe()
The above gives us descriptive statistics for each float or integer column. We get a sense that the dataset is probably unbalanced, because the mean of the TARGET variable is 0.08, i.e., close to zero.
df_app_train['TARGET'].value_counts()
This is verified since we see that there are almost 300,000 accounts with no default and about 25,000 accounts with default.
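A minimal way to express this imbalance as a rate rather than raw counts (reusing the dataframe loaded above):
# Fraction of accounts in each class; TARGET==1 (default) is roughly 8% of the data
class_rates = df_app_train['TARGET'].value_counts(normalize=True)
print(class_rates)
print('Default rate: {:.2%}'.format(class_rates[1]))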
Check for missing values¶
df_app_train.shape[0]
def missing_val_table(df, miss_val_thresh=50):
    # Count and percentage of missing values per column
    num_miss_val = df.isnull().sum()
    pct_miss_val = num_miss_val / df.shape[0] * 100
    tab_miss_val = pd.concat([num_miss_val, pct_miss_val], axis=1)
    tab_miss_val.columns = ['Missing Values', 'Percentage']
    # Keep only columns that actually have missing values, sorted by severity
    tab_miss_val = tab_miss_val[tab_miss_val['Missing Values'] > 0]
    tab_miss_val['Percentage'] = tab_miss_val['Percentage'].round(1)
    tab_miss_val.sort_values(['Percentage'], ascending=False, inplace=True)
    # Summary statistics on how widespread the missing data is
    numCol_miss_val = tab_miss_val.shape[0]
    numCol_total = df.shape[1]
    pct_col_miss_val = round((numCol_miss_val / numCol_total) * 100)
    numCol_Hi_MissVal = tab_miss_val[tab_miss_val['Percentage'] > miss_val_thresh].shape[0]
    pct_col_gt_threshold = round(numCol_Hi_MissVal / numCol_miss_val * 100)
    print('\nYour dataframe has {0} columns, of which {1} ({2}%) have missing values'
          .format(numCol_total, numCol_miss_val, pct_col_miss_val))
    print('\nOf these, {0} ({2}%) columns have more than {1}% missing values'
          .format(numCol_Hi_MissVal, miss_val_thresh, pct_col_gt_threshold))
    return tab_miss_val
df_app_train_missing = missing_val_table(df_app_train)
df_app_train_missing.head(20)
About half of our feature columns have missing values, and about 30 of them have more than 50 percent of missing values.
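One common follow-up, shown here only as an option and not applied in the rest of this notebook, is to drop columns whose missing percentage exceeds a chosen threshold before modelling. A minimal sketch using the table returned above:
# Drop columns with more than 50% missing values (the threshold is a judgement call;
# tree-based models such as LightGBM can often handle the NaNs directly instead)
cols_to_drop = df_app_train_missing[df_app_train_missing['Percentage'] > 50].index
df_app_train_reduced = df_app_train.drop(columns=cols_to_drop)
print('Dropped {} columns, {} remain'.format(len(cols_to_drop), df_app_train_reduced.shape[1]))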
df = df_app_train
Describe the dataframe's basic attributes
(numRow, numCol) = df.shape
# Basic information on dataframe
print('Your dataframe has {0} rows, and {1} columns'.format(numRow, numCol))
df_dtype = df.dtypes.value_counts()
# Is the dataset balanced? Flag it as unbalanced if the positive class is below 30% or above 70%
isBalanced = True
pctTarget_true = (df['TARGET'].sum()/numRow*100).round(2)
if pctTarget_true > 70 or pctTarget_true < 30:
    isBalanced = False
print('\nYour dataframe\'s target variable is True for {0}% and isBalanced: {1}'.format(pctTarget_true, isBalanced))
# Missing value summary
x = missing_val_table(df)
print(x.head(10))
print('\nThe columns are of the following types:\n{0}'.format(df_dtype))
print('\nThe number of unique entries in each object column is as follows:')
df_object = df.select_dtypes('object').nunique(dropna=False)
df_object
Encoding categorical variables¶
Since several object columns are categorical variables, we need to encode them. We use label encoding for columns with 2 unique values, and one-hot encoding for columns with more than 2 unique values.
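A toy example of the difference between the two encodings, on a made-up dataframe (reusing the pandas and sklearn.preprocessing imports from the top of the notebook):
# Made-up example columns: FLAG has 2 categories, CONTRACT has more than 2 uses in practice
toy = pd.DataFrame({'FLAG': ['Y', 'N', 'Y'],
                    'CONTRACT': ['Cash', 'Revolving', 'Cash']})
# Label encoding: a binary column becomes 0/1 within a single column
toy['FLAG_ENC'] = preprocessing.LabelEncoder().fit_transform(toy['FLAG'])
# One-hot encoding: each category gets its own indicator column
toy_onehot = pd.get_dummies(toy, columns=['CONTRACT'])
print(toy_onehot)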
Label encoding¶
df_app_train.head()
df_app_train['FLAG_OWN_CAR'][:10]
We search for columns of the object data type that have 2 unique values and label encode them. For example, 'FLAG_OWN_CAR' has 2 unique values, 'Y'/'N', so label encoding maps it to 1/0.
le = preprocessing.LabelEncoder()
le_count = 0
for col in df_app_train:
    if df_app_train[col].dtype == 'object':
        if df_app_train[col].nunique(dropna=False) <= 2:
            print(col)
            le_count += 1
            # Fit on the training column, then apply the same mapping to both train and test
            le.fit(df_app_train[col])
            df_app_train[col] = le.transform(df_app_train[col])
            df_app_test[col] = le.transform(df_app_test[col])
print('{0} columns were label encoded'.format(le_count))
df_app_train.head()
df_app_train['FLAG_OWN_CAR'][:10]
One-hot encoding¶
One-hot encoding will increase the number of columns in our datasets, because each unique category of a categorical variable gets its own indicator column.
df_app_train = pd.get_dummies(df_app_train)
df_app_test = pd.get_dummies(df_app_test)
Aligning training & testing data¶
print('Training dataset shape: {}'.format(df_app_train.shape))
print('Testing dataset shape: {}'.format(df_app_test.shape))
We can see that our training dataset has 242 columns while the testing dataset has 238 columns, so the two datasets are not aligned.
We can use the DataFrame align method to fix this, with a full explanation given on Stack Overflow.
Remember, both the training and testing datasets should have the same number of features (i.e., $x$); however, the training dataset also includes the TARGET column, giving it $x+1$ columns. Thus, we save the TARGET column of our training data elsewhere, because when we align the training and testing datasets we perform an inner join (i.e., any column not present in both datasets is discarded). We then add the TARGET column back to the training dataset afterwards.
train_labels = df_app_train['TARGET']
df_app_train, df_app_test = df_app_train.align(df_app_test,join='inner',axis=1)
Now when we evaluate the shapes of both the training and testing datasets, we see they have the same number of feature columns.
print('Training dataset shape: {}'.format(df_app_train.shape))
print('Testing dataset shape: {}'.format(df_app_test.shape))
df_app_train['TARGET'] = train_labels
print('Training dataset shape: {}'.format(df_app_train.shape))
print('Testing dataset shape: {}'.format(df_app_test.shape))
We can see that our dataset no longer contains strings or objects, since we converted them all to numeric columns during the label and one-hot encoding process. We only have integers and floats.
df_app_train.dtypes.value_counts()
Erroneous data¶
We want to evaluate our floating point columns to see if the distributions make sense or if there are errors. For each floating point column, we perform the following:
- Generating a histogram.
- Generating the descriptive statistics.
Age¶
The DAYS_BIRTH column records the applicant's age in days relative to the date of the current loan application (as a negative number), which is hard to interpret in its raw format. So we divide by -365 to convert it to years.
df_app_train['DAYS_BIRTH'].describe()
We see below that the average applicant age is about 43 years, the minimum is about 20 years, and the maximum is about 69 years, which makes sense.
(df_app_train['DAYS_BIRTH']/-365).describe()
(df_app_train['DAYS_BIRTH']/365).hist()
plt.xlabel('years')
plt.ylabel('frequency')
Although the DAYS_BIRTH variable looks fine, the negative values are a little strange, so we could take the absolute value.
# df_app_train['DAYS_BIRTH'] = abs(df_app_train['DAYS_BIRTH']) # keep it consistent with tutorial
Employment history¶
For DAYS_EMPLOYED, the maximum definitely doesn't make sense, as it corresponds to roughly +1,000 years! First, there shouldn't be any positive values, since the column counts days before the application was submitted. Secondly, 1,000 years of employment is obviously impossible.
df_app_train['DAYS_EMPLOYED'].describe()
We want a graphical representation of DAYS_EMPLOYED so we can see whether this is caused by a few small outliers or by many datapoints. What we find is that all the datapoints have negative values except for the erroneous points, which all share the sentinel value 365243.
Analyzing the anomalous data, we find there are precisely 55,374 such points, and they have a lower default rate than the normal points. The anomalous points also make up about 18% of the dataset, which is quite high, so we should probably do something about them.
anom_days_employed = df_app_train[df_app_train['DAYS_EMPLOYED']==365243]
norm_days_employed = df_app_train[df_app_train['DAYS_EMPLOYED']!=365243]
print(anom_days_employed.shape)
dr_anom = anom_days_employed['TARGET'].mean()*100
dr_norm = norm_days_employed['TARGET'].mean()*100
print('Default rate (Anomaly): {:.2f}'.format(dr_anom))
print('Default rate (Normal): {:.2f}'.format(dr_norm))
pct_anom_days_employed = (anom_days_employed.shape[0]/df_app_train.shape[0])*100
print(pct_anom_days_employed)
- Create an additional column that is TRUE for the anomalous datapoints.
- Replace DAYS_EMPLOYED for the anomalous rows with NaN.
- Produce the histogram of DAYS_EMPLOYED.
By setting DAYS_EMPLOYED to NaN for the anomalous datapoints, we see that the column now has a distribution we would expect. We have still retained the information about which points were anomalous, and we can choose to replace the NaNs with the median or another imputed value later.
df_app_train['DAYS_EMPLOYED_ANOM'] = df_app_train['DAYS_EMPLOYED'] == 365243
df_app_train['DAYS_EMPLOYED'].replace({365243:np.nan}, inplace=True)
# df_app_train['DAYS_EMPLOYED'] = abs(df_app_train['DAYS_EMPLOYED']) # commented out for consistency with tutorial
df_app_train['DAYS_EMPLOYED'].hist()
Any change to the training dataset needs to be applied to the testing dataset too. We can see that the test dataset exhibits the same strange outliers (9,274 of them).
df_app_test['DAYS_EMPLOYED'].hist()
print(df_app_test['DAYS_EMPLOYED'].describe())
print('\nTotal number of anomalous points: {0}\n'.format((df_app_test['DAYS_EMPLOYED'] == 365243).sum()))
We perform the same change on the test dataset and can see that the distribution of DAYS_EMPLOYED is now more sensible.
df_app_test['DAYS_EMPLOYED_ANOM'] = df_app_test['DAYS_EMPLOYED'] == 365243
df_app_test['DAYS_EMPLOYED'].replace({365243:np.nan},inplace=True)
df_app_test['DAYS_EMPLOYED'].hist()
#df_app_test['DAYS_EMPLOYED'] = abs(df_app_test['DAYS_EMPLOYED']) # commented out for consistency with tutorial
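If we later want to fill the NaNs we just introduced, a minimal sketch (not applied here) would impute the training-set median into both datasets:
# Sketch only: fill the anomaly-induced NaNs with the median learned from the training data
median_days_employed = df_app_train['DAYS_EMPLOYED'].median()
days_emp_train_imputed = df_app_train['DAYS_EMPLOYED'].fillna(median_days_employed)
days_emp_test_imputed = df_app_test['DAYS_EMPLOYED'].fillna(median_days_employed)  # reuse the training median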
Car anomalies¶
It seems quite strange that people would drive a car that is 91 years old.
(df_app_train['OWN_CAR_AGE']).describe()
We display a histogram and find that a few thousand cars are above 40 years old, which is quite old.
df_app_train['OWN_CAR_AGE'].hist()
We see that applicants whose cars are over 60 years old are more likely to default (8.38% vs 7.20%). There are quite a lot of applications (3,339) with cars over 60 years old.
However, as a percentage of all the points in our training data, the anomalous car-age points are only around 1.1%, so we can choose to ignore them. We can also use a correlation analysis to see whether this feature is important or not.
anom = df_app_train[df_app_train['OWN_CAR_AGE']>=60]
norm = df_app_train[df_app_train['OWN_CAR_AGE']<60]
print(anom.shape)
print('Default rate (Car Old):{:2.2f}'.format(anom['TARGET'].mean()*100))
print('Default rate (Car Norm):{:2.2f}'.format(norm['TARGET'].mean()*100))
print('Pct of anomalies:{:2.2f}'.format((anom.shape[0]/df_app_train.shape[0])*100))
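A quick check of that correlation (the full correlation matrix is computed in the next section):
# Pearson correlation between car age and the default flag; pandas drops the NaN pairs automatically
print(df_app_train['OWN_CAR_AGE'].corr(df_app_train['TARGET']))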
Correlation analysis¶
We now want to observe the correlations between our features and the target.
df_app_train_corr = df_app_train.corr()
df_app_train_corr.head()
Once we have the correlation matrix, we focus on which features have the strongest positive (negative) correlations with the target variable. We can see below that TARGET is positively correlated with the REGION_RATING_* variables, and negatively correlated with the EXT_SOURCE_* variables as well as DAYS_BIRTH and DAYS_EMPLOYED. This makes sense, as the probability of default should be lower if the applicant is older or has been employed longer.
Note that NAME_INCOME_TYPE_Working and the REGION_RATING_* variables are (encoded) categorical variables.
df_app_train_corr_target = df_app_train_corr['TARGET'].sort_values()
print('+ve corr: \n{0}'.format(df_app_train_corr_target.tail(20)))
print('-ve corr: \n{0}'.format(df_app_train_corr_target.head(20)))
Analyze features with greatest correlation magnitude¶
At this point we know that features like age, time in employment, and EXT_SOURCE_1 impact the likelihood of default. We therefore analyze the kernel density estimates (KDEs) of these feature distributions, comparing applicants that defaulted against those that did not, to see if we can extract any insight.
# Note: the series is sorted in ascending order, so head() holds the most *negative*
# correlations, while the reversed tail (skipping TARGET itself) holds the most *positive* ones.
var_pos_corr = df_app_train_corr_target.head(10).index.values
var_neg_corr = df_app_train_corr_target[-2:-10:-1].index.values
print(var_pos_corr)
print(var_neg_corr)
We plot the KDEs of the features most positively (negatively) correlated with TARGET. This lets us compare the distribution of each feature for the default and no-default groups. If a feature's distribution differs markedly between the two groups, that is a good sign the feature is informative, and we should look out for it.
We can see that EXT_SOURCE_3 shows the largest difference in distribution between default and no default.
numVar = var_pos_corr.shape[0]
plt.figure(figsize=(10,40))
for i, var in enumerate(var_pos_corr):
    dflt_var = df_app_train.loc[df_app_train['TARGET']==1, var]
    dflt_non_var = df_app_train.loc[df_app_train['TARGET']==0, var]
    plt.subplot(numVar, 1, i+1)
    sns.kdeplot(dflt_var, label='Default')
    sns.kdeplot(dflt_non_var, label='No Default')
    #plt.xlabel(var)
    plt.ylabel('Density')
    plt.title(var)
numVar = var_neg_corr.shape[0]
plt.figure(figsize=(10,40))
for i, var in enumerate(var_neg_corr):
    dflt_var = df_app_train.loc[df_app_train['TARGET']==1, var]
    dflt_non_var = df_app_train.loc[df_app_train['TARGET']==0, var]
    plt.subplot(numVar, 1, i+1)
    sns.kdeplot(dflt_var, label='Default')
    sns.kdeplot(dflt_non_var, label='No Default')
    #plt.xlabel(var)
    plt.ylabel('Density')
    plt.title(var)
Analyze DAYS_EMPLOYED¶
We take DAYS_EMPLOYED and add an additional column, YEARS_EMPLOYED. We then create a new column that bins each observation into 5-year bands; since there are about 50 employable years, we create 10 bins. Once each observation is in a bin, we can use a groupby to aggregate over each bin.
daysEmp_data = df_app_train[['TARGET','DAYS_EMPLOYED']]
daysEmp_data.loc[:,'YEARS_EMPLOYED'] = daysEmp_data['DAYS_EMPLOYED']/-365 # DAYS_EMPLOYED is negative, so divide by -365 to get positive years for the 0-50 bins below
daysEmp_data['YEARS_EMPLOYED'].hist()
daysEmp_data['YEARS_BINNED'] = pd.cut(daysEmp_data['YEARS_EMPLOYED'],bins=np.linspace(0,50,num=11))
daysEmp_data.head(10)
daysEmp_data['YEARS_BINNED'].unique()
After the group-by, we can see that the less time you've been employed, the more likely you are to default.
daysEmp_group = daysEmp_data.groupby('YEARS_BINNED').mean()
daysEmp_group
sns.barplot(daysEmp_group.index,daysEmp_group['TARGET']*100)
plt.xticks(rotation=60)
plt.ylabel('% default')
plt.xlabel('Days Employed Groups (Years)')
dflt_daysEmp = df_app_train.loc[df_app_train['TARGET']==1,'DAYS_EMPLOYED']
dflt_non_daysEmp = df_app_train.loc[df_app_train['TARGET']==0,'DAYS_EMPLOYED']
sns.kdeplot(dflt_daysEmp/365,label='Defaulted (Target==1)')
sns.kdeplot(dflt_non_daysEmp/365,label='Not Defaulted (Target==0)')
plt.xlabel('Time in employment (years)')
plt.ylabel('Density')
plt.title('Employment distribution for default & non-default')
Analyzing credit scores¶
We saw that the external source scores had the highest correlations with TARGET, followed by DAYS_BIRTH and DAYS_EMPLOYED. So we want to take a closer look at these features and their interplay with TARGET.
| Feature Name | Corr. with TARGET |
|---|---|
| EXT_SOURCE_3 | -0.178919 |
| EXT_SOURCE_2 | -0.160472 |
| EXT_SOURCE_1 | -0.155317 |
| DAYS_BIRTH | -0.078239 |
| DAYS_EMPLOYED | -0.074958 |
df_ext_src = df_app_train[['TARGET','EXT_SOURCE_3','EXT_SOURCE_2','EXT_SOURCE_1','DAYS_BIRTH']] # 'DAYS_EMPLOYED'
df_ext_src_corr = df_ext_src.corr()
sns.heatmap(df_ext_src_corr,vmin=-1.0,vmax=1.0,annot=True)
Additional graphical analysis for major features¶
We want to create a pairplot and a PairGrid to graphically analyze the most important features of the dataset. As the original dataset is quite large, we take a sample of it: we remove all rows that contain NaN and then take a random sample of 5,000 points.
This gives a 5x5 grid in the pairplot, since TARGET is explicitly included as one of the variables.
df_ext_src.shape
df_ext_src_sample = df_ext_src.dropna().sample(5000)
sns.pairplot(df_ext_src_sample)
We use PairGrid to create a more informative plot. In this PairGrid, TARGET is denoted by the hue: orange is TARGET==1 (default), and blue is TARGET==0 (no default).
The PairGrid can be explained as follows:
- Upper triangle: a scatter plot of the two variables on the X and Y axes, with TARGET as the hue.
- Diagonal: a KDE plot of each variable's distribution.
- Lower triangle: a 2D KDE (density) plot of the two variables.
grid = sns.PairGrid(data = df_ext_src_sample, diag_sharey=True,
hue = 'TARGET',
vars = [x for x in list(df_ext_src_sample.columns) if x != 'TARGET'])
# Upper is a scatter plot
grid.map_upper(plt.scatter, alpha = 0.2)
# Diagonal is a KDE plot
grid.map_diag(sns.kdeplot)
# Bottom is density plot
grid.map_lower(sns.kdeplot, cmap = plt.cm.OrRd_r);
Feature Engineering¶
Based on our empirical analysis of the correlations between the target variable and the feature variables, we can perform feature engineering. Typically, feature engineering means operations such as:
- Polynomial features: this includes interactions and powers of each feature variable.
- Expert knowledge features: new variables constructed from domain knowledge, such as ratios between existing columns (a short sketch follows below).
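As a sketch of the second bullet (not used further in this notebook), a few domain-driven ratios built from columns of application_train.csv such as AMT_CREDIT, AMT_INCOME_TOTAL, and AMT_ANNUITY:
# Example "expert knowledge" features (sketch only)
df_fe = df_app_train.copy()
df_fe['CREDIT_INCOME_PCT'] = df_fe['AMT_CREDIT'] / df_fe['AMT_INCOME_TOTAL']    # loan size relative to income
df_fe['ANNUITY_INCOME_PCT'] = df_fe['AMT_ANNUITY'] / df_fe['AMT_INCOME_TOTAL']  # repayment burden relative to income
df_fe['EMPLOYED_TO_AGE'] = df_fe['DAYS_EMPLOYED'] / df_fe['DAYS_BIRTH']         # share of life spent employed (both columns are negative day counts)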
print(var_pos_corr[0:10])
imp_var = var_pos_corr[0:4]
print(imp_var)
poly_features_train = df_app_train[imp_var]
poly_features_test = df_app_test[imp_var]
poly_target_train = df_app_train['TARGET']
poly_features_train.columns
Imputing NaN points¶
imputer = preprocessing.Imputer(strategy='median')
poly_features_train = imputer.fit_transform(poly_features_train) # fit finds the median of each column on the training data, then applies it
poly_features_test = imputer.transform(poly_features_test) # only transform, so the test data uses the medians fitted on the training data
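Note that preprocessing.Imputer has been removed in newer scikit-learn releases; on recent versions the equivalent (assuming the same variables as above) is sklearn.impute.SimpleImputer:
# Equivalent imputation on scikit-learn >= 0.20
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
poly_features_train = imputer.fit_transform(poly_features_train)  # learn the medians on the training data
poly_features_test = imputer.transform(poly_features_test)        # apply the same medians to the test data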
Creating polynomial features¶
poly_transformer = preprocessing.PolynomialFeatures(degree=3)
poly_transformer.fit(poly_features_train)
poly_features_train = poly_transformer.transform(poly_features_train)
poly_features_test = poly_transformer.transform(poly_features_test)
print('Polynomial features: {}'.format(poly_features_train.shape))
poly_transformer.get_feature_names()[:15]
poly_transformer.get_feature_names(input_features=imp_var)
Since we now have a bigger dataset with additional artificially created features, let's evaluate whether these new features have higher correlations than the original features. poly_features_train is a NumPy array, so we need to create a DataFrame out of it and add the TARGET column to it.
df_poly_features_train = pd.DataFrame(poly_features_train, columns = poly_transformer.get_feature_names(input_features=imp_var))
df_poly_features_train['TARGET'] = poly_target_train.values # use .values: the new DataFrame has a default index, whereas poly_target_train is indexed by SK_ID_CURR
poly_corrs = df_poly_features_train.corr()['TARGET'].sort_values()
print('+ve correlations:\n{}'.format(poly_corrs.tail(20)))
print('-ve correlations:\n{}'.format(poly_corrs.head(20)))
Now that we have a new df_poly_features_train that contains the polynomial features, we need to add the SK_ID_CURR column too.
df_app_train.index.values # SK_ID_CURR values live in the index of df_app_train, not in a column
df_poly_features_train['SK_ID_CURR'] = df_app_train.index.values
df_app_poly_train = df_app_train.reset_index().merge(df_poly_features_train, on='SK_ID_CURR', how='left')
Add new features to the test dataset¶
df_poly_features_test = pd.DataFrame(poly_features_test, columns = poly_transformer.get_feature_names(input_features=imp_var))
df_poly_features_test['SK_ID_CURR'] = df_app_test['SK_ID_CURR']
df_app_poly_test = df_app_test.reset_index().merge(df_poly_features_test, on='SK_ID_CURR', how='left')
Align the train and test datasets¶
df_app_poly_train, df_app_poly_test = df_app_poly_train.align(df_app_poly_test, join='inner', axis=1)
!jupyter nbconvert --to script ml_kaggle_home-loan-credit-risk.ipynb