Kaggle: Credit risk (Feature Engineering: Part 2)

Feature engineering is an important part of machine learning: we try to engineer (i.e., modify or create) new features from our existing dataset that might be meaningful in predicting the TARGET.

In the kaggle home-credit-default-risk competition, we are given the following datasets:

  • application_train.csv
  • previous_application.csv
  • installments_payments.csv
  • bureau.csv
  • POS_CASH_balance.csv
  • bureau_balance.csv
  • credit_card_balance.csv

Each dataset provides more information about the loan applicant, such as how prompt they have been with their instalment payments, their credit history on other loans, and the amount of cash or credit card balances they hold. A data scientist/researcher should always investigate and create new features from all the information provided.

In this intermediate exercise, we will now analyze other datasets to create new features to improve our model.

  • Apply aggregation functions (i.e., median, mean, max, min, count, etc.) to engineer new feature variables across numeric columns.
  • Use correlation estimates and graphical analysis (i.e., KDE plots) to evaluate the predictive power of the newly engineered features (see the sketch below).
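As a preview, the general pattern we will follow is sketched below (a toy dataframe; only the column names SK_ID_CURR and AMT_CREDIT_SUM are borrowed from the real data):

import pandas as pd

# toy example of the aggregation pattern used throughout this post
df = pd.DataFrame({'SK_ID_CURR': [1, 1, 2],
                   'AMT_CREDIT_SUM': [1000.0, 2000.0, 500.0]})

# 1. aggregate each numeric column per applicant
agg = df.groupby('SK_ID_CURR')['AMT_CREDIT_SUM'].agg(['mean', 'max', 'min', 'count'])
print(agg)

# 2. the aggregated features are then merged onto the training set with a left join
#    and evaluated against TARGET (correlations, KDE plots), as shown in the sections below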

Loading in required modules

We mount Google Drive to our Colab instance so we can access any Python modules and datasets from this Colab notebook.

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')
Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/gdrive

Add the Google Drive folder that is shared between the Crostini distribution and ChromeOS to the path. This allows any Python code/modules to be seamlessly used in this notebook.

In [ ]:
# importing all system modules
import os
import sys
import warnings
from pathlib import Path
warnings.filterwarnings('ignore')
if sys.platform == 'linux':
    sys.path.append('/content/gdrive/My Drive/Colab Notebooks/modules/') # Google Cloud
    #sys.path.append('/home/randlow/github/blog/listings/machine-learning/') # Crostini
elif sys.platform == 'win32':
    sys.path.append('\\Users\\randl\\github\\blog2\\listings\\machine-learning\\') # win32

# importing data science modules
import pandas as pd
import numpy as np
import sklearn
import scipy as sp
import pickleshare

# importing graphics modules
import matplotlib.pyplot as plt
import seaborn as sns

# importing personal data science modules
import rand_eda as eda

Loading pickled dataframes

To see how the dataframes below were obtained, see the post Kaggle: Credit risk (Exploratory Data Analysis).

In [ ]:
home = str(Path.home())
if sys.platform == 'linux':
    inputDir =  '/content/gdrive/My Drive/Colab Notebooks/datasets/' # Google Cloud
    #inputDir = '/mnt/chromeOS/GoogleDrive/MyDrive/Colab Notebooks/datasets/' # Crostini
elif sys.platform == 'win32':
    inputDir = '\\datasets\\kaggle\\home-credit-default-risk\\' # windows

storeDir = inputDir+'kaggle/home-credit-default-risk/' # Google Cloud
#storeDir = home+inputDir+'kaggle/home-credit-default-risk/'
df_bureau = pd.read_csv(storeDir + 'bureau.csv')
df_bureau_bal = pd.read_csv(storeDir + 'bureau_balance.csv')
df_train = pd.read_csv(storeDir + 'application_train.csv')
In [27]:
'''
db = pickleshare.PickleShareDB(storeDir+'pickleshare')
print(db.keys())

df_app_test_align = db['df_app_test_align'] 
df_app_train_align = db['df_app_train_align'] 
'''
Out[27]:
"\ndb = pickleshare.PickleShareDB(storeDir+'pickleshare')\nprint(db.keys())\n\ndf_app_test_align = db['df_app_test_align'] \ndf_app_train_align = db['df_app_train_align'] \n"

Engineering a new feature (e.g., previous loan counts) from 'bureau.csv'

Analyzing df_bureau, we see that for each loan applicant (e.g., 215354, 162297) we have a list of their historical loans. This data contains information such as whether each loan is closed or still active, how many days the loan was overdue, the amount of credit taken, the credit type, and more.

We can assume that the number of previous loans can potentially impact the probability of default for each applicant. Based on the loan IDs (i.e., SK_ID_BUREAU) we can perform a number of aggregation operations such as mean, max, min, and sum, and evaluate whether these new features add any value to our model.

In [3]:
df_bureau.head(10)
Out[3]:
SK_ID_CURR SK_ID_BUREAU CREDIT_ACTIVE CREDIT_CURRENCY DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE CREDIT_TYPE DAYS_CREDIT_UPDATE AMT_ANNUITY
0 215354 5714462 Closed currency 1 -497 0 -153.0 -153.0 NaN 0 91323.00 0.00 NaN 0.0 Consumer credit -131 NaN
1 215354 5714463 Active currency 1 -208 0 1075.0 NaN NaN 0 225000.00 171342.00 NaN 0.0 Credit card -20 NaN
2 215354 5714464 Active currency 1 -203 0 528.0 NaN NaN 0 464323.50 NaN NaN 0.0 Consumer credit -16 NaN
3 215354 5714465 Active currency 1 -203 0 NaN NaN NaN 0 90000.00 NaN NaN 0.0 Credit card -16 NaN
4 215354 5714466 Active currency 1 -629 0 1197.0 NaN 77674.5 0 2700000.00 NaN NaN 0.0 Consumer credit -21 NaN
5 215354 5714467 Active currency 1 -273 0 27460.0 NaN 0.0 0 180000.00 71017.38 108982.62 0.0 Credit card -31 NaN
6 215354 5714468 Active currency 1 -43 0 79.0 NaN 0.0 0 42103.80 42103.80 0.00 0.0 Consumer credit -22 NaN
7 162297 5714469 Closed currency 1 -1896 0 -1684.0 -1710.0 14985.0 0 76878.45 0.00 0.00 0.0 Consumer credit -1710 NaN
8 162297 5714470 Closed currency 1 -1146 0 -811.0 -840.0 0.0 0 103007.70 0.00 0.00 0.0 Consumer credit -840 NaN
9 162297 5714471 Active currency 1 -1146 0 -484.0 NaN 0.0 0 4500.00 0.00 0.00 0.0 Credit card -690 NaN

For applicant 215354 we can see he has 6 active accounts and 5 closed accounts. We can see that x3 loans were credit cards, x1 was a car loan, and the rest were consumer credit loans. It is possible that the total number of previous loans held by the applicant may have predictive power on whether the applicant should be approved for his new loan. After all, if an applicant has too many loans, he might be in financial distress.

In [4]:
df_bureau[df_bureau['SK_ID_CURR']==215354]
Out[4]:
SK_ID_CURR SK_ID_BUREAU CREDIT_ACTIVE CREDIT_CURRENCY DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE CREDIT_TYPE DAYS_CREDIT_UPDATE AMT_ANNUITY
0 215354 5714462 Closed currency 1 -497 0 -153.0 -153.0 NaN 0 91323.0 0.00 NaN 0.0 Consumer credit -131 NaN
1 215354 5714463 Active currency 1 -208 0 1075.0 NaN NaN 0 225000.0 171342.00 NaN 0.0 Credit card -20 NaN
2 215354 5714464 Active currency 1 -203 0 528.0 NaN NaN 0 464323.5 NaN NaN 0.0 Consumer credit -16 NaN
3 215354 5714465 Active currency 1 -203 0 NaN NaN NaN 0 90000.0 NaN NaN 0.0 Credit card -16 NaN
4 215354 5714466 Active currency 1 -629 0 1197.0 NaN 77674.5 0 2700000.0 NaN NaN 0.0 Consumer credit -21 NaN
5 215354 5714467 Active currency 1 -273 0 27460.0 NaN 0.0 0 180000.0 71017.38 108982.62 0.0 Credit card -31 NaN
6 215354 5714468 Active currency 1 -43 0 79.0 NaN 0.0 0 42103.8 42103.80 0.00 0.0 Consumer credit -22 NaN
225157 215354 5714458 Closed currency 1 -1872 0 -776.0 -1203.0 NaN 0 494550.0 0.00 0.00 0.0 Consumer credit -696 NaN
225158 215354 5714459 Closed currency 1 -1734 0 -638.0 -1336.0 NaN 0 1084032.0 0.00 NaN 0.0 Car loan -1336 NaN
225159 215354 5714460 Closed currency 1 -1333 0 -1089.0 -987.0 NaN 0 102150.0 NaN NaN 0.0 Consumer credit -984 NaN
225160 215354 5714461 Closed currency 1 -1011 0 -830.0 -770.0 NaN 0 500463.0 NaN NaN 0.0 Consumer credit -767 NaN

We create a new dataframe where we group by the applicant ID (SK_ID_CURR) and apply a count operation to SK_ID_BUREAU. We also choose to use a standard index for our dataframe and rename the column to prev_loan_cnt.

In [5]:
df_prev_loan_cnt = df_bureau.groupby('SK_ID_CURR', as_index=False)['SK_ID_BUREAU'].count().rename(columns={'SK_ID_BUREAU':'prev_loan_cnt'})
df_prev_loan_cnt.head(10)
Out[5]:
SK_ID_CURR prev_loan_cnt
0 100001 7
1 100002 8
2 100003 4
3 100004 2
4 100005 3
5 100007 1
6 100008 3
7 100009 18
8 100010 2
9 100011 4

We merge our new dataframe of previous loan counts (df_prev_loan_cnt) onto the training set on the applicant ID (SK_ID_CURR) with a left join. A left join means that all rows in df_train are retained, while any SK_ID_CURR that is in df_prev_loan_cnt but not in df_train is excluded from the new dataframe. Where df_train has an applicant but df_prev_loan_cnt has no information, there will be a NaN. We see this is true for SK_ID_CURR = 100006.

In [6]:
df_train = df_train.merge(df_prev_loan_cnt, on='SK_ID_CURR', how='left')
df_train[['SK_ID_CURR','prev_loan_cnt']].head(10)
Out[6]:
SK_ID_CURR prev_loan_cnt
0 100002 8.0
1 100003 4.0
2 100004 2.0
3 100006 NaN
4 100007 1.0
5 100008 3.0
6 100009 18.0
7 100010 2.0
8 100011 4.0
9 100012 NaN

We fill all entries such as SK_ID_CURR=100006 with 0.

In [7]:
df_train['prev_loan_cnt'] = df_train['prev_loan_cnt'].fillna(0)
df_train[['SK_ID_CURR','prev_loan_cnt']].head(10)
Out[7]:
SK_ID_CURR prev_loan_cnt
0 100002 8.0
1 100003 4.0
2 100004 2.0
3 100006 0.0
4 100007 1.0
5 100008 3.0
6 100009 18.0
7 100010 2.0
8 100011 4.0
9 100012 0.0

We see that our df_train now has the new column prev_loan_cnt appended to it.

In [8]:
df_train.head(10)
Out[8]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR prev_loan_cnt
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 ... 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0 8.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 ... 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 4.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 ... 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 2.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 ... 0 0 0 NaN NaN NaN NaN NaN NaN 0.0
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 ... 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
5 100008 0 Cash loans M N Y 0 99000.0 490495.5 27517.5 ... 0 0 0 0.0 0.0 0.0 0.0 1.0 1.0 3.0
6 100009 0 Cash loans F Y Y 1 171000.0 1560726.0 41301.0 ... 0 0 0 0.0 0.0 0.0 1.0 1.0 2.0 18.0
7 100010 0 Cash loans M Y Y 0 360000.0 1530000.0 42075.0 ... 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 2.0
8 100011 0 Cash loans F N Y 0 112500.0 1019610.0 33826.5 ... 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0 4.0
9 100012 0 Revolving loans M N Y 0 135000.0 405000.0 20250.0 ... 0 0 0 NaN NaN NaN NaN NaN NaN 0.0

10 rows × 123 columns

We use our function plot_kde_hist_var to evaluate the potential explanatory power of the prev_loan_cnt variable. By evaluating the plot and the correlation with TARGET, we find that prev_loan_cnt is likely to be a weak predictor of TARGET.

In [ ]:
eda.plot_kde_hist_var(df_train,['EXT_SOURCE_3','prev_loan_cnt'])
Median Value of EXT_SOURCE_3 when Target (True): 0.379100
Median Value of EXT_SOURCE_3 when Target (False): 0.546023
Pearson Correlation of EXT_SOURCE_3 with Target (True): -0.178919
Median Value of prev_loan_cnt when Target (True): 3.000000
Median Value of prev_loan_cnt when Target (False): 4.000000
Pearson Correlation of prev_loan_cnt with Target (True): -0.010020
Out[ ]:
([True, True], [0.0, 1.1726988855032431e-45])
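Note that plot_kde_hist_var comes from the personal rand_eda module imported earlier and is not reproduced in this post. A minimal sketch of what such a helper might look like is shown below (purely illustrative; the statistical test values returned by the real function above are not reproduced):

import matplotlib.pyplot as plt
import seaborn as sns

def plot_kde_hist_var(df, var_list, target='TARGET'):
    # hypothetical sketch: KDE of each variable split by the binary target,
    # with the medians and Pearson correlation printed for a quick read
    for var in var_list:
        plt.figure(figsize=(8, 4))
        sns.kdeplot(df.loc[df[target] == 1, var].dropna(), label='Target=1')
        sns.kdeplot(df.loc[df[target] == 0, var].dropna(), label='Target=0')
        plt.xlabel(var)
        plt.ylabel('Density')
        plt.legend()
        print('Median Value of {} when Target (True): {:f}'.format(
            var, df.loc[df[target] == 1, var].median()))
        print('Median Value of {} when Target (False): {:f}'.format(
            var, df.loc[df[target] == 0, var].median()))
        print('Pearson Correlation of {} with Target (True): {:f}'.format(
            var, df[target].corr(df[var])))
    plt.show()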

Engineering new features from all numeric columns from 'bureau.csv'

We initially only obtained prev_loan_cnt by grouping our dataset by the loan applicant ID (SK_ID_CURR) and using the count operation on the loan IDs (SK_ID_BUREAU).

However, in feature engineering we don't know in advance which variables might be important, and different aggregation calculations such as median, max, min, etc. applied to each numeric variable may lead to better predictors.

We drop SK_ID_BUREAU because it has no meaning when metrics such as median, min, and max are applied to it; it is only useful when applying count, which we have already captured in prev_loan_cnt.

We perform the same groupby operation as before in 'Previous Loan Counts' on all numeric variables and reset the index.

In [9]:
df_bureau_agg = df_bureau.drop(columns=['SK_ID_BUREAU']).groupby(['SK_ID_CURR'], as_index=False).agg(['count','mean','median','min','max','sum']).reset_index()
df_bureau_agg.head()
Out[9]:
SK_ID_CURR DAYS_CREDIT CREDIT_DAY_OVERDUE ... DAYS_CREDIT_UPDATE AMT_ANNUITY
count mean median min max sum count mean median ... median min max sum count mean median min max sum
0 100001 7 -735.000000 -857.0 -1572 -49 -5145 7 0.0 0.0 ... -155.0 -155 -6 -652 7 3545.357143 0.0 0.0 10822.5 24817.5
1 100002 8 -874.000000 -1042.5 -1437 -103 -6992 8 0.0 0.0 ... -402.5 -1185 -7 -3999 7 0.000000 0.0 0.0 0.0 0.0
2 100003 4 -1400.750000 -1205.5 -2586 -606 -5603 4 0.0 0.0 ... -545.0 -2131 -43 -3264 0 NaN NaN NaN NaN 0.0
3 100004 2 -867.000000 -867.0 -1326 -408 -1734 2 0.0 0.0 ... -532.0 -682 -382 -1064 0 NaN NaN NaN NaN 0.0
4 100005 3 -190.666667 -137.0 -373 -62 -572 3 0.0 0.0 ... -31.0 -121 -11 -163 3 1420.500000 0.0 0.0 4261.5 4261.5

5 rows × 73 columns

We can see that using the groupby and agg functions leads to a multi-level column index. The first level contains items such as ['DAYS_CREDIT','CREDIT_DAY_OVERDUE'], whereas the second level contains the aggregation functions such as ['count','median','mean','min','max','sum'].

To extract specific columns from the dataframe based on a MultiIndex, use the Level 0 name (e.g., DAYS_CREDIT, CREDIT_DAY_OVERDUE) and then the Level 1 name (e.g., mean, median). This returns a Series object. Use the loc function to return a DataFrame object.

In [10]:
df_bureau_agg['DAYS_CREDIT']['count'].head(3)
Out[10]:
0    7
1    8
2    4
Name: count, dtype: int64

Use the loc function to return a DataFrame object

In [ ]:
df_bureau_agg.loc[:,[('DAYS_CREDIT','mean')]].head(3)
Out[ ]:
DAYS_CREDIT
mean
0 -735.00
1 -874.00
2 -1400.75

The xs method is another way of accessing a MultiIndex. It lets you index by listing the column names and the levels you are referring to in the MultiIndex dataframe.

In [ ]:
df_bureau_agg.xs(('DAYS_CREDIT','mean'),level=(0,1),axis=1).head(3)
Out[ ]:
DAYS_CREDIT
mean
0 -735.00
1 -874.00
2 -1400.75

We can see in the below instance that xs extracted all the mean values for all the Multi-Index entries.

In [ ]:
df_bureau_agg.xs(('mean'), level =(1),axis=1).head()
Out[ ]:
DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE DAYS_CREDIT_UPDATE AMT_ANNUITY
0 -735.000000 0.0 82.428571 -825.500000 NaN 0.0 207623.571429 85240.928571 0.00000 0.0 -93.142857 3545.357143
1 -874.000000 0.0 -349.000000 -697.500000 1681.029 0.0 108131.945625 49156.200000 7997.14125 0.0 -499.875000 0.000000
2 -1400.750000 0.0 -544.500000 -1097.333333 0.000 0.0 254350.125000 0.000000 202500.00000 0.0 -816.000000 NaN
3 -867.000000 0.0 -488.500000 -532.500000 0.000 0.0 94518.900000 0.000000 0.00000 0.0 -532.000000 NaN
4 -190.666667 0.0 439.333333 -123.000000 0.000 0.0 219042.000000 189469.500000 0.00000 0.0 -54.333333 1420.500000

Evaluate usefulness of new features

We merge df_train and df_bureau_agg so we can evaluate the predictive power of our new features against TARGET. To do this we perform the following actions:

  • Merge the new features with the training dataset.
  • Calculate correlations between the new features and TARGET from the training dataset.
  • Generate KDE plots of the features for the Target=0 and Target=1 populations.
In [ ]:
df_train = df_train.merge(df_bureau_agg, how='left', on='SK_ID_CURR' )
df_train.head()
Out[ ]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... (DAYS_CREDIT_UPDATE, median) (DAYS_CREDIT_UPDATE, min) (DAYS_CREDIT_UPDATE, max) (DAYS_CREDIT_UPDATE, sum) (AMT_ANNUITY, count) (AMT_ANNUITY, mean) (AMT_ANNUITY, median) (AMT_ANNUITY, min) (AMT_ANNUITY, max) (AMT_ANNUITY, sum)
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 ... -402.5 -1185.0 -7.0 -3999.0 7.0 0.0 0.0 0.0 0.0 0.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 ... -545.0 -2131.0 -43.0 -3264.0 0.0 NaN NaN NaN NaN 0.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 ... -532.0 -682.0 -382.0 -1064.0 0.0 NaN NaN NaN NaN 0.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 ... -783.0 -783.0 -783.0 -783.0 0.0 NaN NaN NaN NaN 0.0

5 rows × 195 columns

We can see that the merged dataframe contains only one level for its column index. The MultiIndex from df_bureau_agg has been squashed into tuples.

In [ ]:
df_train.columns
Out[ ]:
Index([                    'SK_ID_CURR',                         'TARGET',
                   'NAME_CONTRACT_TYPE',                    'CODE_GENDER',
                         'FLAG_OWN_CAR',                'FLAG_OWN_REALTY',
                         'CNT_CHILDREN',               'AMT_INCOME_TOTAL',
                           'AMT_CREDIT',                    'AMT_ANNUITY',
       ...
       ('DAYS_CREDIT_UPDATE', 'median'),    ('DAYS_CREDIT_UPDATE', 'min'),
          ('DAYS_CREDIT_UPDATE', 'max'),    ('DAYS_CREDIT_UPDATE', 'sum'),
               ('AMT_ANNUITY', 'count'),          ('AMT_ANNUITY', 'mean'),
              ('AMT_ANNUITY', 'median'),           ('AMT_ANNUITY', 'min'),
                 ('AMT_ANNUITY', 'max'),           ('AMT_ANNUITY', 'sum')],
      dtype='object', length=195)

We convert these tuples into strings and update the column names of the training dataframe accordingly.

In [ ]:
colName = list(df_train.columns)

colNewName = []
for col in colName:
  
  if isinstance(col,tuple):
    newCol = '_'.join(col)
    colNewName.append(newCol)
  else:
    colNewName.append(col)  

    
df_train.columns = colNewName
print(df_train.columns)
Index(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER',
       'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT', 'AMT_ANNUITY',
       ...
       'DAYS_CREDIT_UPDATE_median', 'DAYS_CREDIT_UPDATE_min',
       'DAYS_CREDIT_UPDATE_max', 'DAYS_CREDIT_UPDATE_sum', 'AMT_ANNUITY_count',
       'AMT_ANNUITY_mean', 'AMT_ANNUITY_median', 'AMT_ANNUITY_min',
       'AMT_ANNUITY_max', 'AMT_ANNUITY_sum'],
      dtype='object', length=195)
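As an aside, the loop above can be condensed into a single list comprehension that joins tuple columns with '_' and leaves plain string columns untouched; this is just an equivalent, more compact alternative:

df_train.columns = ['_'.join(col) if isinstance(col, tuple) else col
                    for col in df_train.columns]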

We calculate the correlation of all variables in the dataframe with TARGET to evaluate whether our new variables from bureau have a high magnitude of correlation with TARGET.

In [ ]:
corrVal = df_train.corr()['TARGET']
corrValabs = abs(corrVal).sort_values(ascending=False)
In [ ]:
print('Highest Correlation:\n{}'.format(corrValabs.head(15)))
Highest Correlation:
TARGET                         1.000000
EXT_SOURCE_3                   0.178919
EXT_SOURCE_2                   0.160472
EXT_SOURCE_1                   0.155317
DAYS_CREDIT_mean               0.089729
DAYS_CREDIT_median             0.085414
DAYS_BIRTH                     0.078239
DAYS_CREDIT_min                0.075248
DAYS_CREDIT_UPDATE_mean        0.068927
DAYS_CREDIT_UPDATE_median      0.067338
REGION_RATING_CLIENT_W_CITY    0.060893
REGION_RATING_CLIENT           0.058899
DAYS_ENDDATE_FACT_min          0.055887
DAYS_LAST_PHONE_CHANGE         0.055218
DAYS_CREDIT_ENDDATE_sum        0.053735
Name: TARGET, dtype: float64
In [ ]:
print('Lowest Correlation:\n{}'.format(corrValabs.tail(15)))
Lowest Correlation:
FLAG_DOCUMENT_10              0.001414
AMT_ANNUITY_mean              0.001391
FLAG_DOCUMENT_19              0.001358
AMT_ANNUITY_max               0.001120
AMT_REQ_CREDIT_BUREAU_HOUR    0.000930
AMT_REQ_CREDIT_BUREAU_WEEK    0.000788
FLAG_DOCUMENT_12              0.000756
AMT_CREDIT_SUM_DEBT_mean      0.000637
FLAG_MOBIL                    0.000534
FLAG_CONT_MOBILE              0.000370
FLAG_DOCUMENT_5               0.000316
AMT_CREDIT_SUM_DEBT_min       0.000242
FLAG_DOCUMENT_20              0.000215
CNT_CREDIT_PROLONG_min        0.000182
AMT_CREDIT_SUM_OVERDUE_min    0.000003
Name: TARGET, dtype: float64

We see that a few of our new variables, such as (DAYS_CREDIT, x) and (DAYS_CREDIT_UPDATE, x), have higher correlations with TARGET, so we can run our KDE plots.

We plot the distributions of these new feature variables that exhibit higher correlations with TARGET. Bear in mind that the correlations are still only around 0.06-0.09, which is quite low in the scheme of things.

The shapes of the KDEs of these features for the TARGET=0 and TARGET=1 populations are quite different, but this may be due to the imbalanced population of the dataset.

In [ ]:
newFeat = ['DAYS_CREDIT_mean','DAYS_CREDIT_median','DAYS_CREDIT_min','DAYS_CREDIT_UPDATE_mean']
eda.plot_kde_hist_var(df_train,newFeat)
Median Value of DAYS_CREDIT_mean when Target (True): -835.333333
Median Value of DAYS_CREDIT_mean when Target (False): -1067.000000
Pearson Correlation of DAYS_CREDIT_mean with Target (True): 0.089729
Median Value of DAYS_CREDIT_median when Target (True): -729.000000
Median Value of DAYS_CREDIT_median when Target (False): -976.000000
Pearson Correlation of DAYS_CREDIT_median with Target (True): 0.085414
Median Value of DAYS_CREDIT_min when Target (True): -1469.000000
Median Value of DAYS_CREDIT_min when Target (False): -1861.000000
Pearson Correlation of DAYS_CREDIT_min with Target (True): 0.075248
Median Value of DAYS_CREDIT_UPDATE_mean when Target (True): -348.873016
Median Value of DAYS_CREDIT_UPDATE_mean when Target (False): -492.800000
Pearson Correlation of DAYS_CREDIT_UPDATE_mean with Target (True): 0.068927
Out[ ]:
([True, True, True, True], [0.0, 0.0, 3.7630723481423284e-305, 0.0])

Functions for aggregating numeric columns

We write a function that aggregates all the numeric columns in a dataframe, and collapses the multi-index dataframe into a single index.

It is often easier to work with a single index rather than a MultiIndex dataframe (e.g., when merging a MultiIndex dataframe with a single-index dataframe, the MultiIndex values are converted into tuples).

We write a function in our rand_eda module to perform this operation so it can be repeated for other datasets.

In [11]:
df = df_bureau.copy()
groupby_var = 'SK_ID_CURR'
drop_var_list= ['SK_ID_BUREAU']
df_name = 'bureau'

df_bureau_agg_num = eda.agg_num(df, groupby_var, drop_var_list, df_name)
df_bureau_agg_num.head()
Out[11]:
bureau_DAYS_CREDIT_mean bureau_DAYS_CREDIT_median bureau_DAYS_CREDIT_sum bureau_DAYS_CREDIT_count bureau_DAYS_CREDIT_max bureau_DAYS_CREDIT_min bureau_CREDIT_DAY_OVERDUE_mean bureau_CREDIT_DAY_OVERDUE_median bureau_CREDIT_DAY_OVERDUE_sum bureau_CREDIT_DAY_OVERDUE_count ... bureau_DAYS_CREDIT_UPDATE_sum bureau_DAYS_CREDIT_UPDATE_count bureau_DAYS_CREDIT_UPDATE_max bureau_DAYS_CREDIT_UPDATE_min bureau_AMT_ANNUITY_mean bureau_AMT_ANNUITY_median bureau_AMT_ANNUITY_sum bureau_AMT_ANNUITY_count bureau_AMT_ANNUITY_max bureau_AMT_ANNUITY_min
SK_ID_CURR
100001 -735.000000 -857.0 -5145 7 -49 -1572 0.0 0.0 0 7 ... -652 7 -6 -155 3545.357143 0.0 24817.5 7 10822.5 0.0
100002 -874.000000 -1042.5 -6992 8 -103 -1437 0.0 0.0 0 8 ... -3999 8 -7 -1185 0.000000 0.0 0.0 7 0.0 0.0
100003 -1400.750000 -1205.5 -5603 4 -606 -2586 0.0 0.0 0 4 ... -3264 4 -43 -2131 NaN NaN 0.0 0 NaN NaN
100004 -867.000000 -867.0 -1734 2 -408 -1326 0.0 0.0 0 2 ... -1064 2 -382 -682 NaN NaN 0.0 0 NaN NaN
100005 -190.666667 -137.0 -572 3 -62 -373 0.0 0.0 0 3 ... -163 3 -11 -121 1420.500000 0.0 4261.5 3 4261.5 0.0

5 rows × 72 columns

The code below is included in the rand_eda.py module. This is the module we use for all our data science activity.

In [ ]:
'''
Takes a two-level multi-index and flattens it into a one-level DataFrame

inputs
------
    df (dataframe): a df with a multi_index column

outputs
-------
    df dataframe: returns df with single index columns

'''
def flatten_multi_index(df):

  colName = []

  for idx0,val0 in enumerate(df.columns.levels[0]):
    for idx1,val1 in enumerate(df.columns.levels[1]):
      combine = (val0,val1)
      colName.append('_'.join(combine))

  df.columns = colName

  return df
In [ ]:
"""
Performs aggregation statistical functions on numeric columns of a dataset and
 collapses the multi index into a single index.

Inputs
------
    df (dataframe): dataframe to perform aggregation operations on
    groupby_var (string): this is the column to group by
    drop_var_list (list): list of columns to drop. These are columns that add no value when aggregation operations are applied (e.g., ID columns)
    df_name (string): name of the dataframe, used as a prefix for the new column names

Outputs
-------
    df_agg : a dataframe with a flattened multi_index and aggregate statistics across all numeric columns
"""
def agg_num(df, groupby_var, drop_var_list, df_name):

    # drop any columns that add no value when aggregated (e.g., ID columns)
    for col in df:
        for drop_var in drop_var_list:
            if col == drop_var:
                df = df.drop(columns=col)

    # extract the groupby ID variable, select all numeric columns, then add back
    # the groupby ID variable
    groupby_ids = df[groupby_var]
    df_numeric = df.select_dtypes('number')
    df_numeric[groupby_var] = groupby_ids

    # group the dataframe and apply the aggregation functions
    df_agg = df_numeric.groupby(groupby_var).agg(['mean','median','sum','count','max','min'])

    # flatten the multi-index column names to a single index and prefix the column
    # names with the dataframe name (important if we merge this data into a training dataframe)
    df_agg = flatten_multi_index(df_agg)
    df_agg = df_agg.add_prefix('{}_'.format(df_name))

    return df_agg

Engineering new features from categorical columns

We print out basic information about the df_bureau dataframe and find that there are x3 categorical columns.

In [12]:
df_bureau.head()
info = eda.print_basic_info_df(df_bureau)
print(info)
Num rows                                                      1716428
Num cols                                                           17
Dtype                               float64(8), int64(6), object(3), 
Memory (MB)                                                       536
True (%)                                                          N/A
Is Balanced                                                       N/A
Categorical cols    CREDIT_ACTIVE(4), CREDIT_CURRENCY(4), CREDIT_T...
dtype: object

A closer inspection shows that there are x3 categorical columns (CREDIT_ACTIVE, CREDIT_CURRENCY, CREDIT_TYPE) to which we may need to apply one-hot encoding.

In [13]:
df_bureau.head()
Out[13]:
SK_ID_CURR SK_ID_BUREAU CREDIT_ACTIVE CREDIT_CURRENCY DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE CREDIT_TYPE DAYS_CREDIT_UPDATE AMT_ANNUITY
0 215354 5714462 Closed currency 1 -497 0 -153.0 -153.0 NaN 0 91323.0 0.0 NaN 0.0 Consumer credit -131 NaN
1 215354 5714463 Active currency 1 -208 0 1075.0 NaN NaN 0 225000.0 171342.0 NaN 0.0 Credit card -20 NaN
2 215354 5714464 Active currency 1 -203 0 528.0 NaN NaN 0 464323.5 NaN NaN 0.0 Consumer credit -16 NaN
3 215354 5714465 Active currency 1 -203 0 NaN NaN NaN 0 90000.0 NaN NaN 0.0 Credit card -16 NaN
4 215354 5714466 Active currency 1 -629 0 1197.0 NaN 77674.5 0 2700000.0 NaN NaN 0.0 Consumer credit -21 NaN

We take a closer look at the categorical variables to evaluate how many unique entries there are, the names of these unique entries, and how often they occur in the dataset.

In [8]:
eda.extract_cat_var(df_bureau)
Categorical column: CREDIT_ACTIVE

Number of unique entries: 4

Unique entry names:
['Closed' 'Active' 'Sold' 'Bad debt']

Value counts of each entry:
Closed      1079273
Active       630607
Sold           6527
Bad debt         21
Name: CREDIT_ACTIVE, dtype: int64

---------------------------------------------------------
Categorical column: CREDIT_CURRENCY

Number of unique entries: 4

Unique entry names:
['currency 1' 'currency 2' 'currency 4' 'currency 3']

Value counts of each entry:
currency 1    1715020
currency 2       1224
currency 3        174
currency 4         10
Name: CREDIT_CURRENCY, dtype: int64

---------------------------------------------------------
Categorical column: CREDIT_TYPE

Number of unique entries: 15

Unique entry names:
['Consumer credit' 'Credit card' 'Mortgage' 'Car loan' 'Microloan'
 'Loan for working capital replenishment' 'Loan for business development'
 'Real estate loan' 'Unknown type of loan' 'Another type of loan'
 'Cash loan (non-earmarked)' 'Loan for the purchase of equipment'
 'Mobile operator loan' 'Interbank credit'
 'Loan for purchase of shares (margin lending)']

Value counts of each entry:
Consumer credit                                 1251615
Credit card                                      402195
Car loan                                          27690
Mortgage                                          18391
Microloan                                         12413
Loan for business development                      1975
Another type of loan                               1017
Unknown type of loan                                555
Loan for working capital replenishment              469
Cash loan (non-earmarked)                            56
Real estate loan                                     27
Loan for the purchase of equipment                   19
Loan for purchase of shares (margin lending)          4
Interbank credit                                      1
Mobile operator loan                                  1
Name: CREDIT_TYPE, dtype: int64

---------------------------------------------------------

Using sklearn's OneHotEncoder

We apply scikit-learn's OneHotEncoder, which as of version 0.20 can operate directly on string variables.

We can see from previous cells that CREDIT_ACTIVE consists of x4 unique values. Thus, we expect that applying OneHotEncoder would produce a matrix with x4 columns.

Tip

  • OneHotEncoder expects 2-D input (e.g., a DataFrame), not a 1-D Series, so use [[ ]] when selecting a single column.
  • Use the handle_unknown='ignore' option of OneHotEncoder to handle issues such as:
    • Unknown categories/missing values in the test set that are not present in the training set.
    • OneHotEncoder can then transform the test data while ignoring unknown categories/missing values.
    • OneHotEncoder cannot ignore unknown categories/missing values when calling fit_transform on the training data (see the sketch below).
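To illustrate the handle_unknown tip, here is a minimal sketch using made-up train/test frames (not the competition data):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# toy frames for illustration only; 'Bad debt' appears in the test set but not in training
train = pd.DataFrame({'CREDIT_ACTIVE': ['Active', 'Closed', 'Sold']})
test = pd.DataFrame({'CREDIT_ACTIVE': ['Active', 'Bad debt']})

ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')
ohe.fit(train)              # categories are learned from the training data only
print(ohe.transform(test))  # the unseen 'Bad debt' row is encoded as all zeros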

OneHotEncoder on a single DataFrame column

In [ ]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False)

credit_act = df_bureau[['CREDIT_ACTIVE']]
credit_act_OHE = ohe.fit_transform(credit_act)
In [ ]:
print(credit_act_OHE)
[[0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 ...
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]]
In [ ]:
featnames = ohe.get_feature_names()
print(featnames)
['x0_Active' 'x0_Bad debt' 'x0_Closed' 'x0_Sold']
In [ ]:
df_credit_ohe = pd.DataFrame(credit_act_OHE,columns=featnames)
df_credit_ohe.head()
Out[ ]:
x0_Active x0_Bad debt x0_Closed x0_Sold
0 0.0 0.0 1.0 0.0
1 1.0 0.0 0.0 0.0
2 1.0 0.0 0.0 0.0
3 1.0 0.0 0.0 0.0
4 1.0 0.0 0.0 0.0
In [ ]:
df_bureau['CREDIT_ACTIVE'].head()
Out[ ]:
0    Closed
1    Active
2    Active
3    Active
4    Active
Name: CREDIT_ACTIVE, dtype: object

We can see that the OneHotEncoder performs an accurate operation by comparing the results between df_credit_ohe and df_bureau['CREDIT_ACTIVE'].

We also verify this by performing the inverse_transform applied to the transformed OHE array back to its original labeled values.

In [ ]:
ohe.inverse_transform(credit_act_OHE)[:5]
Out[ ]:
array([['Closed'],
       ['Active'],
       ['Active'],
       ['Active'],
       ['Active']], dtype=object)

OneHotEncoder on a full DataFrame

In [ ]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False)

We perform one-hot encoding on all categorical variables of df_bureau. We see that 23 columns are created, with a 1 in the appropriate column.

In [ ]:
df_bureau_categorical = ohe.fit_transform(df_bureau.select_dtypes('object'))
print(df_bureau_categorical.shape)
df_bureau_categorical[:5]
(1716428, 23)
Out[ ]:
array([[0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0.]])
In [ ]:
df_bureau_categorical_origin = ohe.inverse_transform(df_bureau_categorical)
print(df_bureau_categorical_origin)
[['Closed' 'currency 1' 'Consumer credit']
 ['Active' 'currency 1' 'Credit card']
 ['Active' 'currency 1' 'Consumer credit']
 ...
 ['Closed' 'currency 1' 'Consumer credit']
 ['Closed' 'currency 1' 'Consumer credit']
 ['Closed' 'currency 1' 'Microloan']]
In [ ]:
featnames = ohe.get_feature_names()
print(featnames)
['x0_Active' 'x0_Bad debt' 'x0_Closed' 'x0_Sold' 'x1_currency 1'
 'x1_currency 2' 'x1_currency 3' 'x1_currency 4' 'x2_Another type of loan'
 'x2_Car loan' 'x2_Cash loan (non-earmarked)' 'x2_Consumer credit'
 'x2_Credit card' 'x2_Interbank credit' 'x2_Loan for business development'
 'x2_Loan for purchase of shares (margin lending)'
 'x2_Loan for the purchase of equipment'
 'x2_Loan for working capital replenishment' 'x2_Microloan'
 'x2_Mobile operator loan' 'x2_Mortgage' 'x2_Real estate loan'
 'x2_Unknown type of loan']

We can see above that scikit-learn's OneHotEncoder is useful and quite powerful.

It is particularly useful if we are using the Pipeline functionality in scikit-learn (see the sketch below).
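For instance, here is a minimal sketch of how OneHotEncoder could slot into a scikit-learn Pipeline via a ColumnTransformer; the downstream estimator is purely illustrative:

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

cat_cols = ['CREDIT_ACTIVE', 'CREDIT_CURRENCY', 'CREDIT_TYPE']

# one-hot encode the categorical columns; for this sketch all other columns are dropped
preprocess = ColumnTransformer(
    transformers=[('ohe', OneHotEncoder(handle_unknown='ignore'), cat_cols)],
    remainder='drop')

clf = Pipeline([('prep', preprocess), ('model', LogisticRegression())])
# clf.fit(X_train, y_train) would then encode and fit in a single step,
# without leaking categories from the test set into the encoder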

In [ ]:
df_bureau.head()
Out[ ]:
SK_ID_CURR SK_ID_BUREAU CREDIT_ACTIVE CREDIT_CURRENCY DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE CREDIT_TYPE DAYS_CREDIT_UPDATE AMT_ANNUITY
0 215354 5714462 Closed currency 1 -497 0 -153.0 -153.0 NaN 0 91323.0 0.0 NaN 0.0 Consumer credit -131 NaN
1 215354 5714463 Active currency 1 -208 0 1075.0 NaN NaN 0 225000.0 171342.0 NaN 0.0 Credit card -20 NaN
2 215354 5714464 Active currency 1 -203 0 528.0 NaN NaN 0 464323.5 NaN NaN 0.0 Consumer credit -16 NaN
3 215354 5714465 Active currency 1 -203 0 NaN NaN NaN 0 90000.0 NaN NaN 0.0 Credit card -16 NaN
4 215354 5714466 Active currency 1 -629 0 1197.0 NaN 77674.5 0 2700000.0 NaN NaN 0.0 Consumer credit -21 NaN

Using panda's get_dummies

We use pandas' get_dummies function to perform one-hot encoding.

In [ ]:
df_categorical = pd.get_dummies(df_bureau.select_dtypes('object'))

df_categorical['SK_ID_CURR'] = df_bureau['SK_ID_CURR']  
df_categorical.head()
Out[ ]:
CREDIT_ACTIVE_Active CREDIT_ACTIVE_Bad debt CREDIT_ACTIVE_Closed CREDIT_ACTIVE_Sold CREDIT_CURRENCY_currency 1 CREDIT_CURRENCY_currency 2 CREDIT_CURRENCY_currency 3 CREDIT_CURRENCY_currency 4 CREDIT_TYPE_Another type of loan CREDIT_TYPE_Car loan ... CREDIT_TYPE_Loan for business development CREDIT_TYPE_Loan for purchase of shares (margin lending) CREDIT_TYPE_Loan for the purchase of equipment CREDIT_TYPE_Loan for working capital replenishment CREDIT_TYPE_Microloan CREDIT_TYPE_Mobile operator loan CREDIT_TYPE_Mortgage CREDIT_TYPE_Real estate loan CREDIT_TYPE_Unknown type of loan SK_ID_CURR
0 0 0 1 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 215354
1 1 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 215354
2 1 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 215354
3 1 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 215354
4 1 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 215354

5 rows × 24 columns

In [ ]:
df_categorical_grouped = df_categorical.groupby('SK_ID_CURR').agg(['sum','mean'])
df_categorical_grouped.head()
Out[ ]:
CREDIT_ACTIVE_Active CREDIT_ACTIVE_Bad debt CREDIT_ACTIVE_Closed CREDIT_ACTIVE_Sold CREDIT_CURRENCY_currency 1 ... CREDIT_TYPE_Microloan CREDIT_TYPE_Mobile operator loan CREDIT_TYPE_Mortgage CREDIT_TYPE_Real estate loan CREDIT_TYPE_Unknown type of loan
sum mean sum mean sum mean sum mean sum mean ... sum mean sum mean sum mean sum mean sum mean
SK_ID_CURR
100001 3 0.428571 0 0.0 4 0.571429 0 0.0 7 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100002 2 0.250000 0 0.0 6 0.750000 0 0.0 8 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100003 1 0.250000 0 0.0 3 0.750000 0 0.0 4 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100004 0 0.000000 0 0.0 2 1.000000 0 0.0 2 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100005 2 0.666667 0 0.0 1 0.333333 0 0.0 3 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0

5 rows × 46 columns

As the groupby operation has led to a MultiIndex, we write a function to flatten it into a single index.

In [ ]:
df_categorical_grouped = eda.flatten_multi_index(df_categorical_grouped)
df_categorical_grouped.head()
Out[ ]:
CREDIT_ACTIVE_Active_sum CREDIT_ACTIVE_Active_mean CREDIT_ACTIVE_Bad debt_sum CREDIT_ACTIVE_Bad debt_mean CREDIT_ACTIVE_Closed_sum CREDIT_ACTIVE_Closed_mean CREDIT_ACTIVE_Sold_sum CREDIT_ACTIVE_Sold_mean CREDIT_CURRENCY_currency 1_sum CREDIT_CURRENCY_currency 1_mean ... CREDIT_TYPE_Microloan_sum CREDIT_TYPE_Microloan_mean CREDIT_TYPE_Mobile operator loan_sum CREDIT_TYPE_Mobile operator loan_mean CREDIT_TYPE_Mortgage_sum CREDIT_TYPE_Mortgage_mean CREDIT_TYPE_Real estate loan_sum CREDIT_TYPE_Real estate loan_mean CREDIT_TYPE_Unknown type of loan_sum CREDIT_TYPE_Unknown type of loan_mean
SK_ID_CURR
100001 3 0.428571 0 0.0 4 0.571429 0 0.0 7 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100002 2 0.250000 0 0.0 6 0.750000 0 0.0 8 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100003 1 0.250000 0 0.0 3 0.750000 0 0.0 4 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100004 0 0.000000 0 0.0 2 1.000000 0 0.0 2 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100005 2 0.666667 0 0.0 1 0.333333 0 0.0 3 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0

5 rows × 46 columns

Functions for aggregating categorical columns

We create a function cnt_cat so we can perform this operation on other datasets easily.

In [14]:
df =  df_bureau.copy()
groupby_var = 'SK_ID_CURR'
df_name = 'bureau'

df_bureau_cat_cnt = eda.cnt_cat(df,groupby_var,df_name)
df_bureau_cat_cnt.head()
Out[14]:
bureau_CREDIT_ACTIVE_Active_sum bureau_CREDIT_ACTIVE_Active_mean bureau_CREDIT_ACTIVE_Bad debt_sum bureau_CREDIT_ACTIVE_Bad debt_mean bureau_CREDIT_ACTIVE_Closed_sum bureau_CREDIT_ACTIVE_Closed_mean bureau_CREDIT_ACTIVE_Sold_sum bureau_CREDIT_ACTIVE_Sold_mean bureau_CREDIT_CURRENCY_currency 1_sum bureau_CREDIT_CURRENCY_currency 1_mean ... bureau_CREDIT_TYPE_Microloan_sum bureau_CREDIT_TYPE_Microloan_mean bureau_CREDIT_TYPE_Mobile operator loan_sum bureau_CREDIT_TYPE_Mobile operator loan_mean bureau_CREDIT_TYPE_Mortgage_sum bureau_CREDIT_TYPE_Mortgage_mean bureau_CREDIT_TYPE_Real estate loan_sum bureau_CREDIT_TYPE_Real estate loan_mean bureau_CREDIT_TYPE_Unknown type of loan_sum bureau_CREDIT_TYPE_Unknown type of loan_mean
SK_ID_CURR
100001 3 0.428571 0 0.0 4 0.571429 0 0.0 7 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100002 2 0.250000 0 0.0 6 0.750000 0 0.0 8 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100003 1 0.250000 0 0.0 3 0.750000 0 0.0 4 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100004 0 0.000000 0 0.0 2 1.000000 0 0.0 2 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100005 2 0.666667 0 0.0 1 0.333333 0 0.0 3 1.0 ... 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0

5 rows × 46 columns

The code below is included in the rand_eda.py module. This is the module we use for all our data science activity.

In [ ]:
'''
Performs counting aggregations (sum and mean of the one-hot-encoded values) on the categorical columns of a dataset
and collapses the MultiIndex into a single index.

Inputs
------
    df (dataframe): input DataFrame
    groupby_var (string): name of the variable to group by
    df_name (string): name of the dataframe, used as a prefix for the new column names

Outputs
-------
    df_cat_agg (dataframe): output dataframe

'''

def cnt_cat(df,groupby_var,df_name):

    groupby_ids = df[groupby_var]
    df_cat= pd.get_dummies(df.select_dtypes('object'))
    df_cat[groupby_var] = groupby_ids
    df_cat_agg = df_cat.groupby(groupby_var).agg(['sum','mean'])


    df_cat_agg = flatten_multi_index(df_cat_agg)
    df_cat_agg = df_cat_agg.add_prefix('{}_'.format(df_name))


    return df_cat_agg

Applying agg. functions for bureau.csv

In [15]:
df_bureau.head()
Out[15]:
SK_ID_CURR SK_ID_BUREAU CREDIT_ACTIVE CREDIT_CURRENCY DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE CREDIT_TYPE DAYS_CREDIT_UPDATE AMT_ANNUITY
0 215354 5714462 Closed currency 1 -497 0 -153.0 -153.0 NaN 0 91323.0 0.0 NaN 0.0 Consumer credit -131 NaN
1 215354 5714463 Active currency 1 -208 0 1075.0 NaN NaN 0 225000.0 171342.0 NaN 0.0 Credit card -20 NaN
2 215354 5714464 Active currency 1 -203 0 528.0 NaN NaN 0 464323.5 NaN NaN 0.0 Consumer credit -16 NaN
3 215354 5714465 Active currency 1 -203 0 NaN NaN NaN 0 90000.0 NaN NaN 0.0 Credit card -16 NaN
4 215354 5714466 Active currency 1 -629 0 1197.0 NaN 77674.5 0 2700000.0 NaN NaN 0.0 Consumer credit -21 NaN

Applying agg. functions for bureau_balance.csv

We apply the aggregation function to bureau balances. We see that this provides data on each loan and its status.

In [9]:
df_bureau_bal.head()
Out[9]:
SK_ID_BUREAU MONTHS_BALANCE STATUS
0 5715448 0 C
1 5715448 -1 C
2 5715448 -2 C
3 5715448 -3 C
4 5715448 -4 C

We see that the dataset is quite large and only contains x3 columns.

  • SK_ID_BUREAU: the identifying key of the database (one group of rows per loan).
  • MONTHS_BALANCE: the month of each balance observation relative to the application date.
  • STATUS: the status of each loan in that month.

In [10]:
df_bureau_bal.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27299925 entries, 0 to 27299924
Data columns (total 3 columns):
SK_ID_BUREAU      int64
MONTHS_BALANCE    int64
STATUS            object
dtypes: int64(2), object(1)
memory usage: 624.8+ MB
In [11]:
df_bureau_bal['STATUS'].unique()
Out[11]:
array(['C', '0', 'X', '1', '2', '3', '5', '4'], dtype=object)

We group by SK_ID_BUREAU, which is the ID of each bureau loan, and give each new feature a prefix of bureau_bal. There are no columns to drop since there are only x3 columns, and aggregating any of them still yields useful features.

In [ ]:
df = df_bureau_bal
groupby_var = 'SK_ID_BUREAU'
df_name = 'bureau_bal'
drop_var_list = list()
In [17]:
df_bureau_agg_cnt = eda.cnt_cat(df,groupby_var,df_name)
df_bureau_agg_cnt.head()
Out[17]:
bureau_bal_STATUS_0_sum bureau_bal_STATUS_0_mean bureau_bal_STATUS_1_sum bureau_bal_STATUS_1_mean bureau_bal_STATUS_2_sum bureau_bal_STATUS_2_mean bureau_bal_STATUS_3_sum bureau_bal_STATUS_3_mean bureau_bal_STATUS_4_sum bureau_bal_STATUS_4_mean bureau_bal_STATUS_5_sum bureau_bal_STATUS_5_mean bureau_bal_STATUS_C_sum bureau_bal_STATUS_C_mean bureau_bal_STATUS_X_sum bureau_bal_STATUS_X_mean
SK_ID_BUREAU
5001709 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 86 0.886598 11 0.113402
5001710 5 0.060241 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 48 0.578313 30 0.361446
5001711 3 0.750000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 1 0.250000
5001712 10 0.526316 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 9 0.473684 0 0.000000
5001713 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 22 1.000000

We can see that performing the agg. functions on categorical columns has led to 16 columns as there are 8 unique entries for STATUS.

In [14]:
df_bureau_agg_cnt.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 817395 entries, 5001709 to 6842888
Data columns (total 16 columns):
bureau_bal_STATUS_0_sum     817395 non-null uint8
bureau_bal_STATUS_0_mean    817395 non-null float64
bureau_bal_STATUS_1_sum     817395 non-null uint8
bureau_bal_STATUS_1_mean    817395 non-null float64
bureau_bal_STATUS_2_sum     817395 non-null uint8
bureau_bal_STATUS_2_mean    817395 non-null float64
bureau_bal_STATUS_3_sum     817395 non-null uint8
bureau_bal_STATUS_3_mean    817395 non-null float64
bureau_bal_STATUS_4_sum     817395 non-null uint8
bureau_bal_STATUS_4_mean    817395 non-null float64
bureau_bal_STATUS_5_sum     817395 non-null uint8
bureau_bal_STATUS_5_mean    817395 non-null float64
bureau_bal_STATUS_C_sum     817395 non-null uint8
bureau_bal_STATUS_C_mean    817395 non-null float64
bureau_bal_STATUS_X_sum     817395 non-null uint8
bureau_bal_STATUS_X_mean    817395 non-null float64
dtypes: float64(8), uint8(8)
memory usage: 62.4 MB

We can see the process of performing statistical aggregation operations on the numeric columns of bureau_balance.csv has resulted in x6 new columns.

In [18]:
df_bureau_agg_num = eda.agg_num(df, groupby_var,[],df_name)
df_bureau_agg_num.head()
Out[18]:
bureau_bal_MONTHS_BALANCE_mean bureau_bal_MONTHS_BALANCE_median bureau_bal_MONTHS_BALANCE_sum bureau_bal_MONTHS_BALANCE_count bureau_bal_MONTHS_BALANCE_max bureau_bal_MONTHS_BALANCE_min
SK_ID_BUREAU
5001709 -48.0 -48.0 -4656 97 0 -96
5001710 -41.0 -41.0 -3403 83 0 -82
5001711 -1.5 -1.5 -6 4 0 -3
5001712 -9.0 -9.0 -171 19 0 -18
5001713 -10.5 -10.5 -231 22 0 -21
In [25]:
df_bureau_agg_num.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 817395 entries, 5001709 to 6842888
Data columns (total 6 columns):
bureau_bal_MONTHS_BALANCE_mean      817395 non-null float64
bureau_bal_MONTHS_BALANCE_median    817395 non-null float64
bureau_bal_MONTHS_BALANCE_sum       817395 non-null int64
bureau_bal_MONTHS_BALANCE_count     817395 non-null int64
bureau_bal_MONTHS_BALANCE_max       817395 non-null int64
bureau_bal_MONTHS_BALANCE_min       817395 non-null int64
dtypes: float64(2), int64(4)
memory usage: 83.7 MB

The first merge we perform is between df_bureau_agg_num and df_bureau_agg_cnt.

We merge them on the SK_ID_BUREAU column (i.e., the loan ID), which is used as the index of both dataframes, with an outer join so that rows appearing in either dataframe are added to the final dataframe. This ensures we do not lose any data from either dataframe.

In [51]:
df_bureau_loan = pd.merge(df_bureau_agg_num,df_bureau_agg_cnt, left_index=True, right_index=True, how='outer')
df_bureau_loan.head()
Out[51]:
bureau_bal_MONTHS_BALANCE_mean bureau_bal_MONTHS_BALANCE_median bureau_bal_MONTHS_BALANCE_sum bureau_bal_MONTHS_BALANCE_count bureau_bal_MONTHS_BALANCE_max bureau_bal_MONTHS_BALANCE_min bureau_bal_STATUS_0_sum bureau_bal_STATUS_0_mean bureau_bal_STATUS_1_sum bureau_bal_STATUS_1_mean ... bureau_bal_STATUS_3_sum bureau_bal_STATUS_3_mean bureau_bal_STATUS_4_sum bureau_bal_STATUS_4_mean bureau_bal_STATUS_5_sum bureau_bal_STATUS_5_mean bureau_bal_STATUS_C_sum bureau_bal_STATUS_C_mean bureau_bal_STATUS_X_sum bureau_bal_STATUS_X_mean
SK_ID_BUREAU
5001709 -48.0 -48.0 -4656 97 0 -96 0 0.000000 0 0.0 ... 0 0.0 0 0.0 0 0.0 86 0.886598 11 0.113402
5001710 -41.0 -41.0 -3403 83 0 -82 5 0.060241 0 0.0 ... 0 0.0 0 0.0 0 0.0 48 0.578313 30 0.361446
5001711 -1.5 -1.5 -6 4 0 -3 3 0.750000 0 0.0 ... 0 0.0 0 0.0 0 0.0 0 0.000000 1 0.250000
5001712 -9.0 -9.0 -171 19 0 -18 10 0.526316 0 0.0 ... 0 0.0 0 0.0 0 0.0 9 0.473684 0 0.000000
5001713 -10.5 -10.5 -231 22 0 -21 0 0.000000 0 0.0 ... 0 0.0 0 0.0 0 0.0 0 0.000000 22 1.000000

5 rows × 22 columns

The second merge appends SK_ID_CURR, the ID of each loan applicant, to our DataFrame of bureau loans (df_bureau_loan).

In [52]:
df_bureau_loan = pd.merge(df_bureau_loan,df_bureau[['SK_ID_CURR','SK_ID_BUREAU']],left_index=True, right_on='SK_ID_BUREAU',how='left').reset_index(drop=True)
df_bureau_loan.head()
Out[52]:
bureau_bal_MONTHS_BALANCE_mean bureau_bal_MONTHS_BALANCE_median bureau_bal_MONTHS_BALANCE_sum bureau_bal_MONTHS_BALANCE_count bureau_bal_MONTHS_BALANCE_max bureau_bal_MONTHS_BALANCE_min bureau_bal_STATUS_0_sum bureau_bal_STATUS_0_mean bureau_bal_STATUS_1_sum bureau_bal_STATUS_1_mean ... bureau_bal_STATUS_4_sum bureau_bal_STATUS_4_mean bureau_bal_STATUS_5_sum bureau_bal_STATUS_5_mean bureau_bal_STATUS_C_sum bureau_bal_STATUS_C_mean bureau_bal_STATUS_X_sum bureau_bal_STATUS_X_mean SK_ID_CURR SK_ID_BUREAU
0 -48.0 -48.0 -4656 97 0 -96 0 0.000000 0 0.0 ... 0 0.0 0 0.0 86 0.886598 11 0.113402 NaN 5001709
1 -41.0 -41.0 -3403 83 0 -82 5 0.060241 0 0.0 ... 0 0.0 0 0.0 48 0.578313 30 0.361446 162368.0 5001710
2 -1.5 -1.5 -6 4 0 -3 3 0.750000 0 0.0 ... 0 0.0 0 0.0 0 0.000000 1 0.250000 162368.0 5001711
3 -9.0 -9.0 -171 19 0 -18 10 0.526316 0 0.0 ... 0 0.0 0 0.0 9 0.473684 0 0.000000 162368.0 5001712
4 -10.5 -10.5 -231 22 0 -21 0 0.000000 0 0.0 ... 0 0.0 0 0.0 0 0.000000 22 1.000000 150635.0 5001713

5 rows × 24 columns

Since we have aggregated the data for each loan and associated the loans with clients, we can now group this data on a client basis.

In [53]:
df = df_bureau_loan.copy()
groupby_var = 'SK_ID_CURR'
drop_var_list = ['SK_ID_BUREAU']
df_name = 'client'


df_bureau_loan_by_client = eda.agg_num(df,groupby_var,drop_var_list,df_name)
df_bureau_loan_by_client.head()
Out[53]:
client_bureau_bal_MONTHS_BALANCE_mean_mean client_bureau_bal_MONTHS_BALANCE_mean_median client_bureau_bal_MONTHS_BALANCE_mean_sum client_bureau_bal_MONTHS_BALANCE_mean_count client_bureau_bal_MONTHS_BALANCE_mean_max client_bureau_bal_MONTHS_BALANCE_mean_min client_bureau_bal_MONTHS_BALANCE_median_mean client_bureau_bal_MONTHS_BALANCE_median_median client_bureau_bal_MONTHS_BALANCE_median_sum client_bureau_bal_MONTHS_BALANCE_median_count ... client_bureau_bal_STATUS_X_sum_sum client_bureau_bal_STATUS_X_sum_count client_bureau_bal_STATUS_X_sum_max client_bureau_bal_STATUS_X_sum_min client_bureau_bal_STATUS_X_mean_mean client_bureau_bal_STATUS_X_mean_median client_bureau_bal_STATUS_X_mean_sum client_bureau_bal_STATUS_X_mean_count client_bureau_bal_STATUS_X_mean_max client_bureau_bal_STATUS_X_mean_min
SK_ID_CURR
100001.0 -11.785714 -14.00 -82.5 7 -0.5 -25.5 -11.785714 -14.00 -82.5 7 ... 30.0 7 9 0 0.214590 0.241379 1.502129 7 0.500000 0.0
100002.0 -21.875000 -26.00 -175.0 8 -1.5 -39.5 -21.875000 -26.00 -175.0 8 ... 15.0 8 3 0 0.161932 0.187500 1.295455 8 0.500000 0.0
100005.0 -3.000000 -2.00 -9.0 3 -1.0 -6.0 -3.000000 -2.00 -9.0 3 ... 2.0 3 1 0 0.136752 0.076923 0.410256 3 0.333333 0.0
100010.0 -46.000000 -46.00 -92.0 2 -19.5 -72.5 -46.000000 -46.00 -92.0 2 ... 0.0 2 0 0 0.000000 0.000000 0.000000 2 0.000000 0.0
100013.0 -28.250000 -29.75 -113.0 4 -19.5 -34.0 -28.250000 -29.75 -113.0 4 ... 41.0 4 40 0 0.254545 0.009091 1.018182 4 1.000000 0.0

5 rows × 132 columns
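These client-level features could then be merged back into the training set in the same way as df_bureau_agg earlier. A minimal sketch (not run here), noting that SK_ID_CURR is the index of df_bureau_loan_by_client:

df_train = df_train.merge(df_bureau_loan_by_client, how='left',
                          left_on='SK_ID_CURR', right_index=True)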

Converting iPython notebook to Python code

This allows us to run the code in Spyder.

In [2]:
!jupyter nbconvert ml_kaggle-home-loan-credit-risk-feat-eng-p2.ipynb --to python
[NbConvertApp] Converting notebook ml_kaggle-home-loan-credit-risk-feat-eng-p2.ipynb to python
[NbConvertApp] Writing 22805 bytes to ml_kaggle-home-loan-credit-risk-feat-eng-p2.py