Kaggle: Credit risk (Feature Engineering: Part 2)
Feature engineering is an important part of machine learning: we try to engineer (i.e., modify/create) new features from our existing dataset that may be meaningful in predicting the TARGET.
In the kaggle home-credit-default-risk competition, we are given the following datasets:
application_train.csv
previous_application.csv
installments_payments.csv
bureau.csv
POS_CASH_balance.csv
bureau_balance.csv
credit_card_balance.csv
Each dataset provides more information about the loan applicant, such as how prompt they have been on their instalment payments, their credit history on other loans, the amount of cash or credit card balances they hold, etc. A data scientist/researcher should always investigate and create new features from all the information provided.
In this intermediate exercise, we analyze the other datasets to create new features to improve our model. We will:
- Apply aggregation functions (i.e., median, mean, max, min, count, etc.) to engineer new feature variables across numeric columns.
- Use correlation estimates and graphical analysis (i.e., KDE plots) to evaluate the predictive power of the newly engineered features.
Table of contents¶
Engineering a new feature (e.g., previous loan counts) from 'bureau.csv'
Engineering new features from all numeric columns from 'bureau.csv'
Evaluate usefulness of new features
Writing functions for aggregating numeric columns
Engineering new features from categorical columns
Writing functions for aggregating categorical columns
Applying agg. functions for bureau.csv
Loading in required modules¶
We mount Google Drive to our Colab instance so we can access any Python modules and datasets from this Colab notebook.
from google.colab import drive
drive.mount('/content/gdrive')
Add the Google Drive folder that is shared between the Crostini distribution and ChromeOS to the path. This allows any Python code/modules to be seamlessly integrated into this notebook.
# importing all system modules
import os
import sys
import warnings
from pathlib import Path
warnings.filterwarnings('ignore')
if sys.platform == 'linux':
    sys.path.append('/content/gdrive/My Drive/Colab Notebooks/modules/') # Google Cloud
    #sys.path.append('/home/randlow/github/blog/listings/machine-learning/') # Crostini
elif sys.platform == 'win32':
    sys.path.append('\\Users\\randl\\github\\blog2\\listings\\machine-learning\\') # win32
# importing data science modules
import pandas as pd
import numpy as np
import sklearn
import scipy as sp
import pickleshare
# importing graphics modules
import matplotlib.pyplot as plt
import seaborn as sns
# importing personal data science modules
import rand_eda as eda
Loading pickled dataframes¶
To see how the dataframes below were obtained, see the post Kaggle: Credit risk (Exploratory Data Analysis).
home = str(Path.home())
if sys.platform == 'linux':
    inputDir = '/content/gdrive/My Drive/Colab Notebooks/datasets/' # Google Cloud
    #inputDir = '/mnt/chromeOS/GoogleDrive/MyDrive/Colab Notebooks/datasets/' # Crostini
elif sys.platform == 'win32':
    inputDir = '\\datasets\\kaggle\\home-credit-default-risk\\' # windows
storeDir = inputDir+'kaggle/home-credit-default-risk/' # Google Cloud
#storeDir = home+inputDir+'kaggle/home-credit-default-risk/'
df_bureau = pd.read_csv(storeDir + 'bureau.csv')
df_bureau_bal = pd.read_csv(storeDir + 'bureau_balance.csv')
df_train = pd.read_csv(storeDir + 'application_train.csv')
'''
db = pickleshare.PickleShareDB(storeDir+'pickleshare')
print(db.keys())
df_app_test_align = db['df_app_test_align']
df_app_train_align = db['df_app_train_align']
'''
Engineering a new feature (e.g., previous loan counts) from 'bureau.csv'¶
Analyzing df_bureau, we see that for each loan applicant (e.g., 215354, 162297) we have a list of their historical loans. This data contains information such as whether each loan is closed or active, how many days the loan was overdue, the amount of credit taken, the credit type, and more.
We can assume that the number of previous loans can potentially impact the probability of default for each applicant. Based on the loan IDs (i.e., SK_ID_BUREAU) we can perform a number of aggregation operations such as mean, max, min, sum and evaluate whether these new features add any value to our model.
df_bureau.head(10)
For applicant 215354, we can see he has 7 active accounts and 5 closed accounts: x3 loans were credit cards, x1 was a car loan, and the rest were consumer credit loans. It is possible that the total number of previous loans held by the applicant has predictive power on whether the applicant should be approved for a new loan. After all, if an applicant has too many loans he might be in financial distress.
df_bureau[df_bureau['SK_ID_CURR']==215354]
We create a new dataframe where we group by the applicant ID (SK_ID_CURR) and apply a count operation to SK_ID_BUREAU. We also choose to use a standard index for our dataframe and rename the column to prev_loan_cnt.
df_prev_loan_cnt = df_bureau.groupby('SK_ID_CURR', as_index=False)['SK_ID_BUREAU'].count().rename(columns={'SK_ID_BUREAU':'prev_loan_cnt'})
df_prev_loan_cnt.head(10)
We merge our new dataframe of previous loan counts (df_prev_loan_cnt) with df_train on the applicant ID (SK_ID_CURR) using a left join. A left join means that any SK_ID_CURR that is in df_prev_loan_cnt but not in df_train will not be included in the new dataframe, and all rows in df_train will be retained. Where df_train has an applicant but df_prev_loan_cnt has no information, there will be a NaN. We see this is true for SK_ID_CURR = 100006.
df_train = df_train.merge(df_prev_loan_cnt, on='SK_ID_CURR', how='left')
df_train[['SK_ID_CURR','prev_loan_cnt']].head(10)
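To confirm the NaN for this specific applicant, we can filter for it directly (a quick check):
df_train.loc[df_train['SK_ID_CURR'] == 100006, ['SK_ID_CURR','prev_loan_cnt']]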
We fill all such entries (e.g., SK_ID_CURR = 100006) with 0.
df_train['prev_loan_cnt'] = df_train['prev_loan_cnt'].fillna(0)
df_train[['SK_ID_CURR','prev_loan_cnt']].head(10)
We see that our df_train now has the new column prev_loan_cnt appended to it.
df_train.head(10)
We use our function plot_kde_hist_var to evaluate the potential explanatory power of the prev_loan_cnt variable. By evaluating the plot and the correlation with TARGET, we find that prev_loan_cnt is likely to be a weak predictor of TARGET.
eda.plot_kde_hist_var(df_train,['EXT_SOURCE_3','prev_loan_cnt'])
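As a quick numeric check alongside the KDE plot (a minimal sketch), we can also compute the correlation of prev_loan_cnt with TARGET and compare its typical value across the two target classes:
# correlation of the new feature with the target
print(df_train['TARGET'].corr(df_train['prev_loan_cnt']))
# median previous loan count for each target class
print(df_train.groupby('TARGET')['prev_loan_cnt'].median())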
Engineering new features from all numeric columns from 'bureau.csv'¶
We initially only obtained prev_loan_cnt by grouping our dataset by the loan applicant ID (SK_ID_CURR) and using the count operation on the loan IDs (SK_ID_BUREAU).
However, in feature engineering we don't know in advance which variables might be important, and different aggregation calculations such as median, max, min, etc. applied to each numeric variable may lead to a better predictor.
We drop SK_ID_BUREAU because it is only important for prev_loan_cnt and has no meaning if we apply metrics such as median, min, max, etc.; SK_ID_BUREAU is only useful when applying count.
We perform the same groupby operation as before in 'Previous Loan Counts' on all numeric variables and reset the index.
df_bureau_agg = df_bureau.drop(columns=['SK_ID_BUREAU']).groupby(['SK_ID_CURR'], as_index=False).agg(['count','mean','median','min','max','sum']).reset_index()
df_bureau_agg.head()
We can see that using the groupby and agg functions leads to a multi-level index. The first level contains items such as ['DAYS_CREDIT','CREDIT_DAY_OVERDUE'] whereas the second level contains the aggregation functions such as ['count','median','mean','min','max','sum'].
To extract specific columns from the dataframe based on a MultiIndex, use the Level 0 name (e.g., DAYS_CREDIT, CREDIT_DAY_OVERDUE) and then the Level 1 name (e.g., mean, median). This returns a Series object. Use the loc function to return a DataFrame object.
df_bureau_agg['DAYS_CREDIT']['count'].head(3)
Use the loc function to return a DataFrame object.
df_bureau_agg.loc[:,[('DAYS_CREDIT','mean')]].head(3)
The xs command is another method for accessing multi-level indexes. It allows you to index specifically by listing the column names and the levels that you are referring to in the MultiIndex dataframe.
df_bureau_agg.xs(('DAYS_CREDIT','mean'),level=(0,1),axis=1).head(3)
We can see in the instance below that xs extracted the mean values for all the MultiIndex entries.
df_bureau_agg.xs(('mean'), level =(1),axis=1).head()
Evaluate usefulness of new features¶
We merge df_train and df_bureau_agg to evaluate the predictive power of our new features with respect to TARGET. To evaluate the predictive power of the new features we perform the following actions:
- Merge the new features with the training dataset.
- Calculate correlations between the new features and TARGET from the training dataset.
- Generate KDE plots of the features for the populations with TARGET=0 and TARGET=1.
df_train = df_train.merge(df_bureau_agg, how='left', on='SK_ID_CURR' )
df_train.head()
We can see that the merged dataframe contains only one level for the column index. Thus the MultiIndex columns from df_bureau_agg have been squashed into tuples.
df_train.columns
We convert these tuples into strings and update the column names of the training dataframe accordingly.
colName = list(df_train.columns)
colNewName = []
for col in colName:
    if isinstance(col,tuple):
        newCol = '_'.join(col)
        colNewName.append(newCol)
    else:
        colNewName.append(col)
df_train.columns = colNewName
print(df_train.columns)
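The same renaming can be written more compactly with a list comprehension (an equivalent one-liner sketch of the loop above):
df_train.columns = ['_'.join(col) if isinstance(col, tuple) else col for col in df_train.columns]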
We calculate the correlation of all variables in the dataframe with TARGET to evaluate whether our new bureau variables have a high magnitude of correlation with it.
corrVal = df_train.corr()['TARGET']
corrValabs = abs(corrVal).sort_values(ascending=False)
print('Highest Correlation:\n{}'.format(corrValabs.head(15)))
print('Lowest Correlation:\n{}'.format(corrValabs.tail(15)))
We see that a few of our new variables, such as (DAYS_CREDIT, x) and (DAYS_CREDIT_UPDATE, x), have higher correlations with TARGET, so we can run our KDE plots.
We plot the distributions of these new feature variables that exhibit higher correlations with TARGET. Bear in mind that the correlations are still around 0.06-0.09, which is quite low in the scheme of things.
The shapes of the KDEs of these features for the TARGET=0 and TARGET=1 populations are quite different, but this may be due to the imbalanced population of the dataset.
newFeat = ['DAYS_CREDIT_mean','DAYS_CREDIT_median','DAYS_CREDIT_min','DAYS_CREDIT_UPDATE_mean']
eda.plot_kde_hist_var(df_train,newFeat)
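To quantify how imbalanced the dataset is, we can check the class proportions of TARGET directly (a quick check):
print(df_train['TARGET'].value_counts(normalize=True))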
Functions for aggregating numeric columns¶
We write a function that aggregates all the numeric columns in a dataframe, and collapses the multi-index dataframe into a single index.
It is often easier to work with a single-index rather than a multi-index dataframe (e.g., when merging a multi-index with a single-index dataframe, the multi-index column names are converted into tuples).
We write a function in our rand_eda module to perform this operation so it can be repeated for other datasets.
df = df_bureau.copy()
groupby_var = 'SK_ID_CURR'
drop_var_list= ['SK_ID_BUREAU']
df_name = 'bureau'
df_bureau_agg_num = eda.agg_num(df, groupby_var, drop_var_list, df_name)
df_bureau_agg_num.head()
The code below is included in the rand_eda.py module. This is the module we use for all our data science activity.
'''
Takes a two-level multi-index and flattens it into a one-level DataFrame
Inputs
------
df (dataframe): a df with a multi_index column
Outputs
-------
df (dataframe): returns df with single-index columns
'''
def flatten_multi_index(df):
    colName = []
    for idx0,val0 in enumerate(df.columns.levels[0]):
        for idx1,val1 in enumerate(df.columns.levels[1]):
            combine = (val0,val1)
            colName.append('_'.join(combine))
    df.columns = colName
    return df
"""
Performs aggregation statistical functions on numeric columns of a dataset and
collapses the multi index into a single index.
Inputs
------
df (dataframe): dataframe to perform aggregation operations on
groupby_var (string): this is the column to group by
drop_var_list (list): list of columns to drop. These are columns that add no value when aggregation operations are applied (e.g., ID columns)
df_name (string): prefix added to the new aggregated column names
Outputs
-------
df_agg : a dataframe with a flattened multi_index and aggregate statistics across all numeric columns
"""
def agg_num(df, groupby_var, drop_var_list, df_name):
    for col in df:
        for drop_var in drop_var_list:
            if col == drop_var:
                df = df.drop(columns = col)
    # extract out the groupby ID variable, extract all numeric columns, then add back
    # the groupby ID variable
    groupby_ids = df[groupby_var]
    df_numeric = df.select_dtypes('number')
    df_numeric[groupby_var] = groupby_ids
    # group the dataframe accordingly
    df_agg = df_numeric.groupby(groupby_var).agg(['mean','median','sum','count','max','min'])
    # flatten the multi index naming convention to a single index
    # prefix the column names with a dataframe name (important if we merge this data into a training dataframe)
    df_agg = flatten_multi_index(df_agg)
    df_agg = df_agg.add_prefix('{}_'.format(df_name))
    return df_agg
Engineering new features from categorical columns¶
We print out basic information about the df_bureau dataframe and find that there are x3 categorical columns.
df_bureau.head()
info = eda.print_basic_info_df(df_bureau)
print(info)
A closer inspection shows that there are x3 categorical columns, such as CREDIT_ACTIVE and CREDIT_CURRENCY, that we may need to apply one-hot encoding to.
df_bureau.head()
We take a closer look at the categorical variables to evaluate how many unique entries there are, the names of these unique entries, and how often they occur in the dataset.
eda.extract_cat_var(df_bureau)
Using sklearn's OneHotEncoder¶
We apply scikit-learn's OneHotEncoder, which as of version 0.20 can work directly on string variables.
We can see from previous cells that CREDIT_ACTIVE consists of x4 unique values. Thus, we expect that applying OneHotEncoder would produce a matrix with x4 columns.
Tip
- OneHotEncoder expects a 2-D input (i.e., a DataFrame, not a Series), so use [[ ]] when selecting a single column.
- Use the handle_unknown='ignore' option of OneHotEncoder (see the sketch after this list) to ignore issues such as:
  - Unknown categories/missing values in the test set that are not in the training set.
  - OneHotEncoder can perform transform and ignore unknown categories/missing values on the test data.
  - OneHotEncoder cannot perform fit_transform on unknown categories/missing values for the training data.
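Below is a minimal sketch of the handle_unknown='ignore' behaviour, using a tiny made-up train/test split rather than the competition data:
from sklearn.preprocessing import OneHotEncoder
# toy example: 'Bad debt' appears only in the (made-up) test data
train_cat = pd.DataFrame({'CREDIT_ACTIVE': ['Active', 'Closed', 'Active']})
test_cat = pd.DataFrame({'CREDIT_ACTIVE': ['Closed', 'Bad debt']})
ohe_ignore = OneHotEncoder(sparse=False, handle_unknown='ignore')
ohe_ignore.fit(train_cat)              # fit only on the training categories
print(ohe_ignore.transform(test_cat))  # the unknown 'Bad debt' category becomes an all-zero row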
OneHotEncoder on a single DataFrame column¶
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False)
credit_act = df_bureau[['CREDIT_ACTIVE']]
credit_act_OHE = ohe.fit_transform(credit_act)
print(credit_act_OHE)
featnames = ohe.get_feature_names()
print(featnames)
df_credit_ohe = pd.DataFrame(credit_act_OHE,columns=featnames)
df_credit_ohe.head()
df_bureau['CREDIT_ACTIVE'].head()
We can see that OneHotEncoder performs an accurate operation by comparing the results of df_credit_ohe and df_bureau['CREDIT_ACTIVE'].
We also verify this by applying inverse_transform to the transformed OHE array to recover its original labelled values.
ohe.inverse_transform(credit_act_OHE)[:5]
OneHotEncoder on a full DataFrame¶
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False)
We perform one-hot encoding on all categorical columns of df_bureau. We see that 23 columns are created, with a 1 in the appropriate column.
df_bureau_categorical = ohe.fit_transform(df_bureau.select_dtypes('object'))
print(df_bureau_categorical.shape)
df_bureau_categorical[:5]
df_bureau_categorical_origin = ohe.inverse_transform(df_bureau_categorical)
print(df_bureau_categorical_origin)
featnames = ohe.get_feature_names()
print(featnames)
We can see above that scikit-learn's OneHotEncoder is useful and quite powerful. It is particularly useful if we are using the Pipeline functionality in scikit-learn.
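As an example, here is a minimal sketch of how OneHotEncoder could slot into a Pipeline via a ColumnTransformer; the LogisticRegression estimator is only an illustrative placeholder, and X_train/y_train are assumed to exist:
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
cat_cols = ['CREDIT_ACTIVE', 'CREDIT_CURRENCY', 'CREDIT_TYPE']
preprocess = ColumnTransformer([('ohe', OneHotEncoder(handle_unknown='ignore'), cat_cols)],
                               remainder='passthrough')
pipe = Pipeline([('encode', preprocess), ('model', LogisticRegression())])
# pipe.fit(X_train, y_train) would then encode the categorical columns and fit the model in one step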
df_bureau.head()
Using pandas' get_dummies¶
We use get_dummies from pandas to perform one-hot encoding.
df_categorical = pd.get_dummies(df_bureau.select_dtypes('object'))
df_categorical['SK_ID_CURR'] = df_bureau['SK_ID_CURR']
df_categorical.head()
df_categorical_grouped = df_categorical.groupby('SK_ID_CURR').agg(['sum','mean'])
df_categorical_grouped.head()
As the groupby function has led to a MultiIndex, we use our flatten_multi_index function to flatten it to a single index.
df_categorical_grouped = eda.flatten_multi_index(df_categorical_grouped)
df_categorical_grouped.head()
Functions for aggregating categorical columns¶
We create a function cnt_cat so we can perform this operation on other datasets easily.
df = df_bureau.copy()
groupby_var = 'SK_ID_CURR'
df_name = 'bureau'
df_bureau_cat_cnt = eda.cnt_cat(df,groupby_var,df_name)
df_bureau_cat_cnt.head()
The code below is included in the rand_eda.py module. This is the module we use for all our data science activity.
'''
Performs count aggregations on the categorical columns of a dataset and collapses the
multi-index into a single index.
Inputs
------
df (dataframe): input DataFrame
groupby_var (string): name of variable to group by
df_name (string): prefix added to the new aggregated column names
Outputs
-------
df_cat_agg (dataframe): output dataframe
'''
def cnt_cat(df,groupby_var,df_name):
    groupby_ids = df[groupby_var]
    df_cat = pd.get_dummies(df.select_dtypes('object'))
    df_cat[groupby_var] = groupby_ids
    df_cat_agg = df_cat.groupby(groupby_var).agg(['sum','mean'])
    df_cat_agg = flatten_multi_index(df_cat_agg)
    df_cat_agg = df_cat_agg.add_prefix('{}_'.format(df_name))
    return df_cat_agg
Applying agg. functions for bureau.csv¶
df_bureau.head()
Applying agg. functions for bureau_balance.csv¶
We apply the aggregation function to bureau balances. We see that this provides data on each loan and its status.
df_bureau_bal.head()
We see that the dataset is quite large and only contains x3 columns.
SK_ID_BUREAU: This is the identifying key of each loan in the bureau database.
MONTHS_BALANCE: These are the monthly balances of each loan.
STATUS: This is the status of each loan.
df_bureau_bal.info()
df_bureau_bal['STATUS'].unique()
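We can also check how often each STATUS value occurs (a quick check):
print(df_bureau_bal['STATUS'].value_counts())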
We group by SK_ID_BUREAU, which is the ID of each bureau loan, and give each new feature a prefix of bureau_bal. There are no columns to drop since there are only x3 columns, and aggregations of each of them produce useful features.
df = df_bureau_bal
groupby_var = 'SK_ID_BUREAU'
df_name = 'bureau_bal'
drop_var_list = list()
df_bureau_agg_cnt = eda.cnt_cat(df,groupby_var,df_name)
df_bureau_agg_cnt.head()
We can see that performing the agg. functions on the categorical columns has led to 16 columns, as there are 8 unique entries for STATUS.
df_bureau_agg_cnt.info()
We can see that performing the statistical aggregation operations on the numeric columns of bureau_balance.csv has resulted in x6 new columns.
df_bureau_agg_num = eda.agg_num(df, groupby_var,[],df_name)
df_bureau_agg_num.head()
df_bureau_agg_num.info()
The first merge we perform is between df_bureau_agg_num and df_bureau_agg_cnt.
We merge them on SK_ID_BUREAU (i.e., the loan ID), which is the index of both dataframes, using an outer join so that rows that appear in either dataframe are kept in the final dataframe. This ensures we do not lose any data from either dataframe.
df_bureau_loan = pd.merge(df_bureau_agg_num,df_bureau_agg_cnt, left_index=True, right_index=True, how='outer')
df_bureau_loan.head()
The second merge we perform appends SK_ID_CURR, the ID of each loan applicant, to our DataFrame of bureau loans (df_bureau_loan).
df_bureau_loan = pd.merge(df_bureau_loan,df_bureau[['SK_ID_CURR','SK_ID_BUREAU']],left_index=True, right_on='SK_ID_BUREAU',how='left').reset_index(drop=True)
df_bureau_loan.head()
Since we have aggregated the data for each loan, and associated the loans with clients, we can now group this data on a per-client basis.
df = df_bureau_loan.copy()
groupby_var = 'SK_ID_CURR'
drop_var_list = ['SK_ID_BUREAU']
df_name = 'client'
df_bureau_loan_by_client = eda.agg_num(df,groupby_var,drop_var_list,df_name)
df_bureau_loan_by_client.head()
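As with the earlier bureau aggregates, these client-level features could be merged back into the training set with a left join on the applicant ID (a sketch following the same pattern used above with df_bureau_agg; SK_ID_CURR is the index of df_bureau_loan_by_client):
df_train = df_train.merge(df_bureau_loan_by_client, how='left',
                          left_on='SK_ID_CURR', right_index=True)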
Converting the IPython notebook to Python code¶
This allows us to run the code in Spyder.
!jupyter nbconvert ml_kaggle-home-loan-credit-risk-feat-eng-p2.ipynb --to python