Kaggle: Credit risk (Feature Engineering: Part 2)
Feature engineering is an important part of machine learning: we try to engineer (i.e., modify/create) new features from our existing dataset that may be meaningful in predicting the TARGET.
In the kaggle home-credit-default-risk competition, we are given the following datasets:
application_train.csv
previous_application.csv
installments_payments.csv
bureau.csv
POS_CASH_balance.csv
bureau_balance.csv
credit_card_balance.csv
Each dataset provides more information about the loan applicant, such as how prompt they have been on their instalment payments, their credit history on other loans, the amount of cash or credit card balances they hold, etc. A data scientist/researcher should always investigate and create new features from all the information provided.
In this intermediate exercise, we analyze the other datasets to create new features to improve our model. We will:
- Apply aggregation functions (i.e., median, mean, max, min, count, etc.) to engineer new feature variables across numeric columns.
- Use correlation estimates and graphical analysis (i.e., KDE plots) to evaluate the predictive power of the newly engineered features.
Table of contents¶
Engineering a new feature (e.g., previous loan counts) from 'bureau.csv'
Engineering new features from all numeric columns from 'bureau.csv'
Evaluate usefulness of new features
Writing functions for aggregating numeric columns
Engineering new features from categorical columns
Writing functions for aggregating categorical columns
Applying agg. functions for bureau.csv
Loading in required modules¶
We mount Google Drive to our Colab instance so we can access any Python modules and datasets from this Colab notebook.
from google.colab import drive
drive.mount('/content/gdrive')
Add the Google Drive folder that is shared between the Crostini distribution and ChromeOS to the path. This allows any Python code/modules to be seamlessly integrated into this notebook.
# importing all system modules
import os
import sys
import warnings
from pathlib import Path
warnings.filterwarnings('ignore')
if sys.platform == 'linux':
    sys.path.append('/content/gdrive/My Drive/Colab Notebooks/modules/') # Google Cloud
    #sys.path.append('/home/randlow/github/blog/listings/machine-learning/') # Crostini
elif sys.platform == 'win32':
    sys.path.append('\\Users\\randl\\github\\blog2\\listings\\machine-learning\\') # win32
# importing data science modules
import pandas as pd
import numpy as np
import sklearn
import scipy as sp
import pickleshare
# importing graphics modules
import matplotlib.pyplot as plt
import seaborn as sns
# importing personal data science modules
import rand_eda as eda
Loading pickled dataframes¶
To see how the dataframes below were obtained, see the post Kaggle: Credit risk (Exploratory Data Analysis).
home = str(Path.home())
if sys.platform == 'linux':
    inputDir = '/content/gdrive/My Drive/Colab Notebooks/datasets/' # Google Cloud
    #inputDir = '/mnt/chromeOS/GoogleDrive/MyDrive/Colab Notebooks/datasets/' # Crostini
elif sys.platform == 'win32':
    inputDir = '\\datasets\\kaggle\\home-credit-default-risk\\' # windows
storeDir = inputDir+'kaggle/home-credit-default-risk/' # Google Cloud
#storeDir = home+inputDir+'kaggle/home-credit-default-risk/'
df_bureau = pd.read_csv(storeDir + 'bureau.csv')
df_bureau_bal = pd.read_csv(storeDir + 'bureau_balance.csv')
df_train = pd.read_csv(storeDir + 'application_train.csv')
'''
db = pickleshare.PickleShareDB(storeDir+'pickleshare')
print(db.keys())
df_app_test_align = db['df_app_test_align']
df_app_train_align = db['df_app_train_align']
'''
Engineering a new feature (e.g., previous loan counts) from 'bureau.csv'¶
Analyzing df_bureau, we see that for each loan applicant (e.g., 215354, 162297) we have a list of their historical loans. This data contains information such as whether each loan is closed or active, how many days the loan was overdue, the amount of credit taken, the credit type, and more.
We can assume that the number of previous loans can potentially impact the probability of default for each applicant. Based on the loan IDs (i.e., SK_ID_BUREAU) we can perform a number of aggregation operations such as mean, max, min, sum and evaluate whether these new features add any value to our model.
df_bureau.head(10)
For applicant 215354, we can see he has 7 active accounts and 5 closed accounts: x3 loans were credit cards, x1 was a car loan, and the rest were consumer credit loans. It is possible that the total number of previous loans held by the applicant has predictive power on whether the applicant should be approved for a new loan. After all, if an applicant has too many loans he might be in financial distress.
df_bureau[df_bureau['SK_ID_CURR']==215354]
We create a new dataframe where we group by the applicant ID (SK_ID_CURR) and apply a count operation to SK_ID_BUREAU. We also choose to use a standard index for our dataframe and rename the column to prev_loan_cnt.
df_prev_loan_cnt = df_bureau.groupby('SK_ID_CURR', as_index=False)['SK_ID_BUREAU'].count().rename(columns={'SK_ID_BUREAU':'prev_loan_cnt'})
df_prev_loan_cnt.head(10)
We merge our new dataframe of previous loan counts (df_prev_loan_cnt) with df_train on the applicant ID (SK_ID_CURR) using a left join. A left join means that any SK_ID_CURR that is in df_prev_loan_cnt but not in df_train will not be included in the new dataframe, and all rows in df_train will be retained. Where df_train has an applicant but df_prev_loan_cnt has no information, there will be a NaN. We see this is true for SK_ID_CURR = 100006.
df_train = df_train.merge(df_prev_loan_cnt, on='SK_ID_CURR', how='left')
df_train[['SK_ID_CURR','prev_loan_cnt']].head(10)
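To confirm the NaN for this specific applicant, we can filter for it directly (a quick check):
df_train.loc[df_train['SK_ID_CURR'] == 100006, ['SK_ID_CURR','prev_loan_cnt']]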
We fill all such entries (e.g., SK_ID_CURR = 100006) with 0.
df_train['prev_loan_cnt'] = df_train['prev_loan_cnt'].fillna(0)
df_train[['SK_ID_CURR','prev_loan_cnt']].head(10)
We see that our df_train now has the new column prev_loan_cnt appended to it.
df_train.head(10)
We use our function plot_kde_hist_var to evaluate the potential explanatory power of the prev_loan_cnt variable. By evaluating the plot and the correlation with TARGET, we find that prev_loan_cnt is likely to be a weak predictor of TARGET.
eda.plot_kde_hist_var(df_train,['EXT_SOURCE_3','prev_loan_cnt'])
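As a quick numeric check alongside the KDE plot (a minimal sketch), we can also compute the correlation of prev_loan_cnt with TARGET and compare its typical value across the two target classes:
# correlation of the new feature with the target
print(df_train['TARGET'].corr(df_train['prev_loan_cnt']))
# median previous loan count for each target class
print(df_train.groupby('TARGET')['prev_loan_cnt'].median())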
Engineering new features from all numeric columns from 'bureau.csv'¶
We initially only obtained prev_loan_cnt by grouping our dataset by the loan applicant ID (SK_ID_CURR) and using the count operation on the loan IDs (SK_ID_BUREAU).
However, in feature engineering we don't know in advance which variables might be important, and different aggregation calculations such as median, max, min, etc. applied to each numeric variable may lead to a better predictor.
We drop SK_ID_BUREAU because it is only important for prev_loan_cnt and has no meaning if we apply metrics such as median, min, max, etc.; SK_ID_BUREAU is only useful when applying count.
We perform the same groupby operation as before in 'Previous Loan Counts' on all numeric variables and reset the index.
df_bureau_agg = df_bureau.drop(columns=['SK_ID_BUREAU']).groupby(['SK_ID_CURR'], as_index=False).agg(['count','mean','median','min','max','sum']).reset_index()
df_bureau_agg.head()
We can see that using the groupby and agg functions leads to a multi-level index. The first level contains items such as ['DAYS_CREDIT','CREDIT_DAY_OVERDUE'] whereas the second level contains the aggregation functions such as ['count','median','mean','min','max','sum'].
To extract specific columns from the dataframe based on a MultiIndex, use the Level 0 name (e.g., DAYS_CREDIT, CREDIT_DAY_OVERDUE) and then the Level 1 name (e.g., mean, median). This returns a Series object. Use the loc function to return a DataFrame object.
df_bureau_agg['DAYS_CREDIT']['count'].head(3)
Use the loc function to return a DataFrame object.
df_bureau_agg.loc[:,[('DAYS_CREDIT','mean')]].head(3)
The xs command is another method for accessing multi-level indexes. It allows you to index specifically by listing the column names and the levels that you are referring to in the MultiIndex dataframe.
df_bureau_agg.xs(('DAYS_CREDIT','mean'),level=(0,1),axis=1).head(3)
We can see in the instance below that xs extracted the mean values for all the MultiIndex entries.
df_bureau_agg.xs(('mean'), level =(1),axis=1).head()
Evaluate usefulness of new features¶
We merge df_train and df_bureau_agg to evaluate the predictive power of our new features with respect to TARGET. To evaluate the predictive power of the new features we perform the following actions:
- Merge the new features with the training dataset.
- Calculate correlations between the new features and TARGET from the training dataset.
- Generate KDE plots of the features for the populations with TARGET=0 and TARGET=1.
df_train = df_train.merge(df_bureau_agg, how='left', on='SK_ID_CURR' )
df_train.head()
We can see that the merged dataframe contains only one level for the column index. Thus the MultiIndex columns from df_bureau_agg have been squashed into tuples.
df_train.columns
We convert these tuples into strings and update the column names of the training dataframe accordingly.
colName = list(df_train.columns)
colNewName = []
for col in colName:
    if isinstance(col,tuple):
        newCol = '_'.join(col)
        colNewName.append(newCol)
    else:
        colNewName.append(col)
df_train.columns = colNewName
print(df_train.columns)
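The same renaming can be written more compactly with a list comprehension (an equivalent one-liner sketch of the loop above):
df_train.columns = ['_'.join(col) if isinstance(col, tuple) else col for col in df_train.columns]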
We calculate the correlation of all variables in the dataframe with TARGET to evaluate whether our new bureau variables have a high magnitude of correlation with it.
corrVal = df_train.corr()['TARGET']
corrValabs = abs(corrVal).sort_values(ascending=False)
print('Highest Correlation:\n{}'.format(corrValabs.head(15)))
print('Lowest Correlation:\n{}'.format(corrValabs.tail(15)))
We see that a few of our new variables, such as (DAYS_CREDIT, x) and (DAYS_CREDIT_UPDATE, x), have higher correlations with TARGET, so we can run our KDE plots.
We plot the distributions of these new feature variables that exhibit higher correlations with TARGET. Bear in mind that the correlations are still around 0.06-0.09, which is quite low in the scheme of things.
The shapes of the KDEs of these features for the TARGET=0 and TARGET=1 populations are quite different, but this may be due to the imbalanced population of the dataset.
newFeat = ['DAYS_CREDIT_mean','DAYS_CREDIT_median','DAYS_CREDIT_min','DAYS_CREDIT_UPDATE_mean']
eda.plot_kde_hist_var(df_train,newFeat)
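To quantify how imbalanced the dataset is, we can check the class proportions of TARGET directly (a quick check):
print(df_train['TARGET'].value_counts(normalize=True))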
Functions for aggregating numeric columns¶
We write a function that aggregates all the numeric columns in a dataframe, and collapses the multi-index dataframe into a single index.
It is often easier to work with a single-index rather than a multi-index dataframe (e.g., when merging a multi-index with a single-index dataframe, the multi-index column names are converted into tuples).
We write a function in our rand_eda module to perform this operation so it can be repeated for other datasets.
df = df_bureau.copy()
groupby_var = 'SK_ID_CURR'
drop_var_list= ['SK_ID_BUREAU']
df_name = 'bureau'
df_bureau_agg_num = eda.agg_num(df, groupby_var, drop_var_list, df_name)
df_bureau_agg_num.head()
The code below is included in the rand_eda.py module. This is the module we use for all our data science activity.
'''
Takes a two-level multi-index and flattens it into a one-level DataFrame
Inputs
------
df (dataframe): a df with a multi_index column
Outputs
-------
df (dataframe): returns df with single-index columns
'''
def flatten_multi_index(df):
    colName = []
    for idx0,val0 in enumerate(df.columns.levels[0]):
        for idx1,val1 in enumerate(df.columns.levels[1]):
            combine = (val0,val1)
            colName.append('_'.join(combine))
    df.columns = colName
    return df
"""
Performs aggregation statistical functions on numeric columns of a dataset and
collapses the multi index into a single index.
Inputs
------
df (dataframe): dataframe to perform aggregation operations on
groupby_var (string): this is the column to group by
drop_var_list (list): list of columns to drop. These are columns that add no value when aggregation operations are applied (e.g., ID columns)
df_name (string): prefix added to the new aggregated column names
Outputs
-------
df_agg : a dataframe with a flattened multi_index and aggregate statistics across all numeric columns
"""
def agg_num(df, groupby_var, drop_var_list, df_name):
    for col in df:
        for drop_var in drop_var_list:
            if col == drop_var:
                df = df.drop(columns = col)
    # extract out the groupby ID variable, extract all numeric columns, then add back
    # the groupby ID variable
    groupby_ids = df[groupby_var]
    df_numeric = df.select_dtypes('number')
    df_numeric[groupby_var] = groupby_ids
    # group the dataframe accordingly
    df_agg = df_numeric.groupby(groupby_var).agg(['mean','median','sum','count','max','min'])
    # flatten the multi index naming convention to a single index
    # prefix the column names with a dataframe name (important if we merge this data into a training dataframe)
    df_agg = flatten_multi_index(df_agg)
    df_agg = df_agg.add_prefix('{}_'.format(df_name))
    return df_agg
Engineering new features from categorical columns¶
We print out basic information about the df_bureau dataframe and find that there are x3 categorical columns.
df_bureau.head()
info = eda.print_basic_info_df(df_bureau)
print(info)
A closer inspection shows that there are x3 categorical columns, such as CREDIT_ACTIVE and CREDIT_CURRENCY, that we may need to apply one-hot encoding to.
df_bureau.head()
We take a closer look at the categorical variables to evaluate how many unique entries there are, the names of these unique entries, and how often they occur in the dataset.
eda.extract_cat_var(df_bureau)
Using sklearn's OneHotEncoder¶
We apply scikit-learn's OneHotEncoder, which as of version 0.20 can work directly on string variables.
We can see from previous cells that CREDIT_ACTIVE consists of x4 unique values. Thus, we expect that applying OneHotEncoder would produce a matrix with x4 columns.
Tip
- OneHotEncoder expects a 2-D input (i.e., a DataFrame, not a Series), so use [[ ]] when selecting a single column.
- Use the handle_unknown='ignore' option of OneHotEncoder (see the sketch after this list) to ignore issues such as:
  - Unknown categories/missing values in the test set that are not in the training set.
  - OneHotEncoder can perform transform and ignore unknown categories/missing values on the test data.
  - OneHotEncoder cannot perform fit_transform on unknown categories/missing values for the training data.
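Below is a minimal sketch of the handle_unknown='ignore' behaviour, using a tiny made-up train/test split rather than the competition data:
from sklearn.preprocessing import OneHotEncoder
# toy example: 'Bad debt' appears only in the (made-up) test data
train_cat = pd.DataFrame({'CREDIT_ACTIVE': ['Active', 'Closed', 'Active']})
test_cat = pd.DataFrame({'CREDIT_ACTIVE': ['Closed', 'Bad debt']})
ohe_ignore = OneHotEncoder(sparse=False, handle_unknown='ignore')
ohe_ignore.fit(train_cat)              # fit only on the training categories
print(ohe_ignore.transform(test_cat))  # the unknown 'Bad debt' category becomes an all-zero row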
OneHotEncoder on a single DataFrame column¶
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False)
credit_act = df_bureau[['CREDIT_ACTIVE']]
credit_act_OHE = ohe.fit_transform(credit_act)
print(credit_act_OHE)
featnames = ohe.get_feature_names()
print(featnames)
df_credit_ohe = pd.DataFrame(credit_act_OHE,columns=featnames)
df_credit_ohe.head()
df_bureau['CREDIT_ACTIVE'].head()
We can see that OneHotEncoder performs an accurate operation by comparing the results of df_credit_ohe and df_bureau['CREDIT_ACTIVE'].
We also verify this by applying inverse_transform to the transformed OHE array to recover its original labelled values.
ohe.inverse_transform(credit_act_OHE)[:5]
OneHotEncoder on a full DataFrame¶
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False)
We perform one-hot encoding on all categorical columns of df_bureau. We see that 23 columns are created, with a 1 in the appropriate column.
df_bureau_categorical = ohe.fit_transform(df_bureau.select_dtypes('object'))
print(df_bureau_categorical.shape)
df_bureau_categorical[:5]
df_bureau_categorical_origin = ohe.inverse_transform(df_bureau_categorical)
print(df_bureau_categorical_origin)
featnames = ohe.get_feature_names()
print(featnames)
We can see above that scikit-learn's OneHotEncoder is useful and quite powerful. It is particularly useful if we are using the Pipeline functionality in scikit-learn.
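As an example, here is a minimal sketch of how OneHotEncoder could slot into a Pipeline via a ColumnTransformer; the LogisticRegression estimator is only an illustrative placeholder, and X_train/y_train are assumed to exist:
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
cat_cols = ['CREDIT_ACTIVE', 'CREDIT_CURRENCY', 'CREDIT_TYPE']
preprocess = ColumnTransformer([('ohe', OneHotEncoder(handle_unknown='ignore'), cat_cols)],
                               remainder='passthrough')
pipe = Pipeline([('encode', preprocess), ('model', LogisticRegression())])
# pipe.fit(X_train, y_train) would then encode the categorical columns and fit the model in one step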
df_bureau.head()
Using pandas' get_dummies¶
We use get_dummies from pandas to perform one-hot encoding.
df_categorical = pd.get_dummies(df_bureau.select_dtypes('object'))
df_categorical['SK_ID_CURR'] = df_bureau['SK_ID_CURR']
df_categorical.head()
df_categorical_grouped = df_categorical.groupby('SK_ID_CURR').agg(['sum','mean'])
df_categorical_grouped.head()
As the groupby function has led to a MultiIndex, we use our flatten_multi_index function to flatten it to a single index.
df_categorical_grouped = eda.flatten_multi_index(df_categorical_grouped)
df_categorical_grouped.head()
Functions for aggregating categorical columns¶
We create a function cnt_cat so we can perform this operation on other datasets easily.
df = df_bureau.copy()
groupby_var = 'SK_ID_CURR'
df_name = 'bureau'
df_bureau_cat_cnt = eda.cnt_cat(df,groupby_var,df_name)
df_bureau_cat_cnt.head()
The code below is included in the rand_eda.py module. This is the module we use for all our data science activity.
'''
Performs count aggregations on the categorical columns of a dataset and collapses the
multi-index into a single index.
Inputs
------
df (dataframe): input DataFrame
groupby_var (string): name of variable to group by
df_name (string): prefix added to the new aggregated column names
Outputs
-------
df_cat_agg (dataframe): output dataframe
'''
def cnt_cat(df,groupby_var,df_name):
    groupby_ids = df[groupby_var]
    df_cat = pd.get_dummies(df.select_dtypes('object'))
    df_cat[groupby_var] = groupby_ids
    df_cat_agg = df_cat.groupby(groupby_var).agg(['sum','mean'])
    df_cat_agg = flatten_multi_index(df_cat_agg)
    df_cat_agg = df_cat_agg.add_prefix('{}_'.format(df_name))
    return df_cat_agg
Applying agg. functions for bureau.csv¶
df_bureau.head()
Applying agg. functions for bureau_balance.csv¶
We apply the aggregation function to bureau balances. We see that this provides data on each loan and its status.
df_bureau_bal.head()
We see that the dataset is quite large and only contains x3 columns.
SK_ID_BUREAU: This is the identifying key of each loan in the bureau database.
MONTHS_BALANCE: These are the monthly balances of each loan.
STATUS: This is the status of each loan.
df_bureau_bal.info()
df_bureau_bal['STATUS'].unique()
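We can also check how often each STATUS value occurs (a quick check):
print(df_bureau_bal['STATUS'].value_counts())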
We group by SK_ID_BUREAU, which is the ID of each bureau loan, and give each new feature a prefix of bureau_bal. There are no columns to drop since there are only x3 columns, and aggregations of each of them produce useful features.
df = df_bureau_bal
groupby_var = 'SK_ID_BUREAU'
df_name = 'bureau_bal'
drop_var_list = list()
df_bureau_agg_cnt = eda.cnt_cat(df,groupby_var,df_name)
df_bureau_agg_cnt.head()
We can see that performing the agg. functions on the categorical columns has led to 16 columns, as there are 8 unique entries for STATUS.
df_bureau_agg_cnt.info()
We can see that performing the statistical aggregation operations on the numeric columns of bureau_balance.csv has resulted in x6 new columns.
df_bureau_agg_num = eda.agg_num(df, groupby_var,[],df_name)
df_bureau_agg_num.head()
df_bureau_agg_num.info()
The first merge we perform is between df_bureau_agg_num and df_bureau_agg_cnt.
We merge them on SK_ID_BUREAU (i.e., the loan ID), which is the index of both dataframes, using an outer join so that rows that appear in either dataframe are kept in the final dataframe. This ensures we do not lose any data from either dataframe.
df_bureau_loan = pd.merge(df_bureau_agg_num,df_bureau_agg_cnt, left_index=True, right_index=True, how='outer')
df_bureau_loan.head()
The second merge we perform appends SK_ID_CURR, the ID of each loan applicant, to our DataFrame of bureau loans (df_bureau_loan).
df_bureau_loan = pd.merge(df_bureau_loan,df_bureau[['SK_ID_CURR','SK_ID_BUREAU']],left_index=True, right_on='SK_ID_BUREAU',how='left').reset_index(drop=True)
df_bureau_loan.head()
Since we have aggregated the data for each loan, and associated the loans with clients, we can now group this data on a per-client basis.
df = df_bureau_loan.copy()
groupby_var = 'SK_ID_CURR'
drop_var_list = ['SK_ID_BUREAU']
df_name = 'client'
df_bureau_loan_by_client = eda.agg_num(df,groupby_var,drop_var_list,df_name)
df_bureau_loan_by_client.head()
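As with the earlier bureau aggregates, these client-level features could be merged back into the training set with a left join on the applicant ID (a sketch following the same pattern used above with df_bureau_agg; SK_ID_CURR is the index of df_bureau_loan_by_client):
df_train = df_train.merge(df_bureau_loan_by_client, how='left',
                          left_on='SK_ID_CURR', right_index=True)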
Converting the IPython notebook to Python code¶
This allows us to run the code in Spyder.
!jupyter nbconvert ml_kaggle-home-loan-credit-risk-feat-eng-p2.ipynb --to python