Kaggle: Credit risk (Feature Engineering: Part 3)

Feature engineering an important part of machine-learning as we try to modify/create (i.e., engineer) new features from our existing dataset that might be meaningful in predicting the TARGET.

We will go through two simple methodologies in feature engineering

  • Polynomial features
  • Expert features

Loading in required modules

In [22]:
# importing all system modules
import os
import sys
import warnings
from pathlib import Path
warnings.filterwarnings('ignore')
if sys.platform == 'linux':
    sys.path.append('/home/randlow/github/blog2/listings/machine-learning/') # linux
elif sys.platform == 'win32':
    sys.path.append('\\Users\\randl\\github\\blog2\\listings\\machine-learning\\') # win32

# importing data science modules
import pandas as pd
import numpy as np
import sklearn
import scipy as sp
import pickleshare

# importing graphics modules
import matplotlib.pyplot as plt
import seaborn as sns

# importing personal data science modules
import rand_eda as eda

Loading pickled dataframes

To see how the below dataframes were obtained see the post on the Kaggle: Credit risk (Exploratory Data Analysis)

In [3]:
home = str(Path.home())
if sys.platform == 'linux':
    inputDir = "/datasets/kaggle/home-credit-default-risk" # linux
elif sys.platform == 'win32':
    inputDir = "\datasets\kaggle\home-credit-default-risk" # windows

storeDir = home+inputDir+'/pickleshare'

db = pickleshare.PickleShareDB(storeDir)
print(db.keys())

df_app_test_align = db['df_app_test_align'] 
df_app_train_align = db['df_app_train_align'] 
df_app_train_corr_target = db['df_app_train_corr_target']
['df_app_test_align', 'df_app_train_align', 'df_app_train_corr_target', 'df_app_train_align_expert', 'df_app_test_align_expert', 'df_app_train_poly_align', 'df_app_test_poly_align']

We print out the highest +ve and -ve correlated features to the TARGET. We perform this to try generating auxilary features using the PolynomialFeatures function in Scikit-learn to see if this increases the explanatory/predictive power of these features that are highly correlated to TARGET.

In [4]:
print(df_app_train_corr_target.tail(10))
print(df_app_train_corr_target.head(10))
NAME_EDUCATION_TYPE_Secondary / secondary special    0.049824
REG_CITY_NOT_WORK_CITY                               0.050994
DAYS_ID_PUBLISH                                      0.051457
DAYS_LAST_PHONE_CHANGE                               0.055218
NAME_INCOME_TYPE_Working                             0.057481
REGION_RATING_CLIENT                                 0.058899
REGION_RATING_CLIENT_W_CITY                          0.060893
DAYS_EMPLOYED                                        0.074958
DAYS_BIRTH                                           0.078239
TARGET                                               1.000000
Name: TARGET, dtype: float64
EXT_SOURCE_3                           -0.178919
EXT_SOURCE_2                           -0.160472
EXT_SOURCE_1                           -0.155317
NAME_EDUCATION_TYPE_Higher education   -0.056593
NAME_INCOME_TYPE_Pensioner             -0.046209
ORGANIZATION_TYPE_XNA                  -0.045987
FLOORSMAX_AVG                          -0.044003
FLOORSMAX_MEDI                         -0.043768
FLOORSMAX_MODE                         -0.043226
EMERGENCYSTATE_MODE_No                 -0.042201
Name: TARGET, dtype: float64

Polynomial features

The variables that we select are EXT_SOURCE_1/2/3 (-ve), DAYS_BIRTH (+ve), and DAYS_EMPLOYED (+ve) that all have large correlation values to TARGET relative to the other features.

We create new poly_feat_x dataframes as both training and test datasets need to be equivalent. Thus, any polnomial features in the training dataset, must be created for the test dataset too.

In [5]:
imp_feat_list = ['EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3','DAYS_BIRTH'] # ,'DAYS_EMPLOYED'

poly_feat_train = df_app_train_align[imp_feat_list]
poly_feat_test = df_app_test_align[imp_feat_list]

We observed that several features often had NaN values. We use the SimpleImputer function in scikit-learn's impute toolkit where we replace all np.nan with median values in that column.

We fit on the training data, as that is all the in-sample data that we have. Then we perform the transformations on both the training and test datasets

In [6]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan,strategy='median')
imputer.fit(poly_feat_train)

poly_feat_train = imputer.transform(poly_feat_train)
poly_feat_test = imputer.transform(poly_feat_test)

We set PolynomialFeatures function to $N=3$. Based on the list of features provided, new features of diffenret powers (e.g., $x^2$, $x^3$) and interaction variables (i.e., $x \times y$) are created. We fit the PolynomialFeatures function on the training dataset, and then we transform the data from poly_feat_train and poly_feat_test.

We see that many new features are created (i.e., 56)

In [7]:
from sklearn.preprocessing import PolynomialFeatures
poly_transform = PolynomialFeatures(degree=3)
poly_transform.fit(poly_feat_train)


poly_transform_train = poly_transform.transform(poly_feat_train)
poly_transform_test = poly_transform.transform(poly_feat_test)

print('Shape of polynomial features (training): {}'.format(poly_transform_train.shape))
print('Shape of polynomial features (test): {}'.format(poly_transform_test.shape))
Shape of polynomial features (training): (307511, 35)
Shape of polynomial features (test): (48744, 35)

We can see a detailed list of the new polynomial and interaction features that have been created

In [8]:
poly_feat_name_list = poly_transform.get_feature_names(imp_feat_list)

We would like to see if any of these new features in the training dataset have higher correlations with the TARGET. Since the new polynomial features in the training dataset have a correlation magnitude greater than the original feature set, we should consider adding them into our model.

In [9]:
df_poly_feat_train = pd.DataFrame(poly_transform_train,columns=poly_feat_name_list)
df_poly_feat_train['TARGET'] = df_app_train_align['TARGET']
poly_feat_corr = df_poly_feat_train.corr()['TARGET'].sort_values()

print(poly_feat_corr.head(10))
print(poly_feat_corr.tail(10))
EXT_SOURCE_3^3                -0.005448
EXT_SOURCE_3^2                -0.004932
EXT_SOURCE_3                  -0.004023
EXT_SOURCE_1 EXT_SOURCE_3^2   -0.003921
EXT_SOURCE_3 DAYS_BIRTH^2     -0.003050
EXT_SOURCE_1 EXT_SOURCE_3     -0.002701
EXT_SOURCE_2 EXT_SOURCE_3^2   -0.002487
DAYS_BIRTH^2                  -0.001403
EXT_SOURCE_2 EXT_SOURCE_3     -0.001345
EXT_SOURCE_1^2 EXT_SOURCE_3   -0.001202
Name: TARGET, dtype: float64
EXT_SOURCE_1^3                          0.001574
EXT_SOURCE_1 EXT_SOURCE_2               0.001634
EXT_SOURCE_1 EXT_SOURCE_2^2             0.001659
DAYS_BIRTH                              0.001675
EXT_SOURCE_1^2 EXT_SOURCE_2             0.002161
EXT_SOURCE_1 EXT_SOURCE_3 DAYS_BIRTH    0.002667
EXT_SOURCE_3 DAYS_BIRTH                 0.003814
EXT_SOURCE_3^2 DAYS_BIRTH               0.004843
TARGET                                  1.000000
1                                            NaN
Name: TARGET, dtype: float64

We create a new DataFrame for the test data that includes the new polynomial features.

In [10]:
df_poly_feat_test = pd.DataFrame(poly_transform_test,columns=poly_feat_name_list)

We include the row identifiers into the polynomial feature training and test datasets (i.e., df_poly_feat_train/test). This allows us to merge them with the original feature training and test datasets (i.e., df_app_train/test_align)

In [11]:
df_poly_feat_train.index = df_app_train_align.index
df_poly_feat_test.index  = df_app_test_align.index
In [12]:
df_app_train_poly = df_app_train_align.merge(df_poly_feat_train,left_index=True,right_index=True)
df_app_test_poly = df_app_test_align.merge(df_poly_feat_test,left_index=True,right_index=True)

Now that we have two new dataframes that have the new polynomial features added to them, we need to make sure botht he training and test datasets are aligned correctly.

In [13]:
df_app_train_poly_align, df_app_test_poly_align = df_app_train_poly.align(df_app_test_poly,join='inner',axis=1)
In [14]:
df_app_train_poly_align.head()
Out[14]:
NAME_CONTRACT_TYPE FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH_x ... EXT_SOURCE_2^3 EXT_SOURCE_2^2 EXT_SOURCE_3 EXT_SOURCE_2^2 DAYS_BIRTH EXT_SOURCE_2 EXT_SOURCE_3^2 EXT_SOURCE_2 EXT_SOURCE_3 DAYS_BIRTH EXT_SOURCE_2 DAYS_BIRTH^2 EXT_SOURCE_3^3 EXT_SOURCE_3^2 DAYS_BIRTH EXT_SOURCE_3 DAYS_BIRTH^2 DAYS_BIRTH^3
SK_ID_CURR
100002 0 0 1 0 202500.0 406597.5 24700.5 351000.0 0.018801 -9461 ... 0.018181 0.009637 -654.152107 0.005108 -346.733022 2.353667e+07 0.002707 -183.785678 1.247560e+07 -8.468590e+11
100003 0 0 0 0 270000.0 1293502.5 35698.5 1129500.0 0.003541 -16765 ... 0.240927 0.207254 -6491.237078 0.178286 -5583.975307 1.748916e+08 0.153368 -4803.518937 1.504475e+08 -4.712058e+12
100004 1 1 1 0 67500.0 135000.0 6750.0 135000.0 0.010032 -19046 ... 0.171798 0.225464 -5885.942404 0.295894 -7724.580288 2.016572e+08 0.388325 -10137.567875 2.646504e+08 -6.908939e+12
100006 0 0 1 0 135000.0 312682.5 29686.5 297000.0 0.008019 -19005 ... 0.275185 0.226462 -8040.528832 0.186365 -6616.894625 2.349331e+08 0.153368 -5445.325225 1.933364e+08 -6.864416e+12
100007 0 0 1 0 121500.0 513000.0 21865.5 513000.0 0.028663 -19932 ... 0.033616 0.055754 -2076.117157 0.092471 -3443.335521 1.282190e+08 0.153368 -5710.929881 2.126570e+08 -7.918677e+12

5 rows × 271 columns

In [14]:
df_app_test_poly_align.head()
Out[14]:
NAME_CONTRACT_TYPE FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH_x ... EXT_SOURCE_2^3 EXT_SOURCE_2^2 EXT_SOURCE_3 EXT_SOURCE_2^2 DAYS_BIRTH EXT_SOURCE_2 EXT_SOURCE_3^2 EXT_SOURCE_2 EXT_SOURCE_3 DAYS_BIRTH EXT_SOURCE_2 DAYS_BIRTH^2 EXT_SOURCE_3^3 EXT_SOURCE_3^2 DAYS_BIRTH EXT_SOURCE_3 DAYS_BIRTH^2 DAYS_BIRTH^3
SK_ID_CURR
100001 0 0 1 0 135000.0 568800.0 20560.5 450000.0 0.018850 -19241 ... 0.492392 0.099469 -11997.802403 0.020094 -2423.698322 2.923427e+08 0.004059 -489.615795 5.905670e+07 -7.123328e+12
100005 0 0 1 0 99000.0 222768.0 17370.0 180000.0 0.035792 -18064 ... 0.024809 0.036829 -1536.577117 0.054673 -2281.043619 9.516956e+07 0.081161 -3386.201665 1.412789e+08 -5.894429e+12
100013 0 1 1 0 202500.0 663264.0 69777.0 630000.0 0.019101 -20038 ... 0.342687 0.299203 -9812.640816 0.261238 -8567.521115 2.809794e+08 0.228089 -7480.393855 2.453261e+08 -8.045687e+12
100028 0 0 1 2 315000.0 1575000.0 49018.5 1575000.0 0.026392 -13976 ... 0.132399 0.159163 -3630.555667 0.191336 -4364.443591 9.955450e+07 0.230013 -5246.681115 1.196786e+08 -2.729912e+12
100038 0 1 0 1 180000.0 625500.0 32067.0 625500.0 0.010032 -13040 ... 0.077139 0.094065 -2362.974127 0.114707 -2881.489762 7.238455e+07 0.139877 -3513.785087 8.826814e+07 -2.217342e+12

5 rows × 271 columns

We can see that our original dataset has 237 features, and our polynomial feature engineering resulted in 36 new features. Having an extended set of both original and polynomial features resulted in 273 features. When we aligned both the training and test dataets together, we eend up with 271 features.

In [18]:
print('Original features (train): {}'.format(df_app_train_align.shape))
print('Polynomial features(train):{}'.format(df_poly_feat_train.shape))
print('Original & polynomial features (train): {}'.format(df_app_train_poly.shape))
print('Original & polynomial features align (train): {}'.format(df_app_train_poly_align.shape))
Original features (train): (307511, 237)
Polynomial features(train):(307511, 36)
Original & polynomial features (train): (307511, 273)
Original & polynomial features align (train): (307511, 271)

We analyze what columns have been removed via the align process.

In [19]:
s1 = set(df_app_train_poly.columns)
s2 = set(df_app_train_poly_align.columns)

diff_s1_s2 = s1-s2
diff_s1_s2
Out[19]:
{'TARGET_x', 'TARGET_y'}

Expert knowledge features

Often, experts have domain knowledge about what combination of existing features have strong explanatory/predictive power. In this case we are looking at the following features

  • Percentage of days employed - How long a person has been employed as a percentage of his life is a stronger predictor of his ability to keep paying off his loans.
  • Available credit as a percentage of income - If a person has a very large amount of credit available as a percentage of income, this can impact his ability to pay off the loans
  • Annuity as a percentage of income - If a person receives an annuity, this is a more stable source of income thus if it is higher, you are less likely to default.
  • Annuity as a percentage of available credit - If a person receives an annuity, this is more stable source of income thus if it is a high percentage compared to his/her credit availability then the person is more likely be able to pay off his debts.
In [20]:
df_app_train_align_expert = df_app_train_align.copy()
df_app_test_align_expert = df_app_test_align.copy()

# Training dataset
df_app_train_align_expert['DAYS_EMPLOYED_PCT'] = df_app_train_align_expert['DAYS_EMPLOYED'] / df_app_train_align_expert['DAYS_BIRTH']
df_app_train_align_expert['CREDIT_INCOME_PCT'] = df_app_train_align_expert['AMT_CREDIT'] / df_app_train_align_expert['AMT_INCOME_TOTAL']
df_app_train_align_expert['ANNUITY_INCOME_PCT'] = df_app_train_align_expert['AMT_ANNUITY'] / df_app_train_align_expert['AMT_INCOME_TOTAL']
df_app_train_align_expert['CREDIT_TERM'] = df_app_train_align_expert['AMT_ANNUITY'] / df_app_train_align_expert['AMT_CREDIT']

# Test dataset
df_app_test_align_expert['DAYS_EMPLOYED_PCT'] = df_app_test_align_expert['DAYS_EMPLOYED'] / df_app_test_align_expert['DAYS_BIRTH']
df_app_test_align_expert['CREDIT_INCOME_PCT'] = df_app_test_align_expert['AMT_CREDIT'] / df_app_test_align_expert['AMT_INCOME_TOTAL']
df_app_test_align_expert['ANNUITY_INCOME_PCT'] = df_app_test_align_expert['AMT_ANNUITY'] / df_app_test_align_expert['AMT_INCOME_TOTAL']
df_app_test_align_expert['CREDIT_TERM'] = df_app_test_align_expert['AMT_ANNUITY'] / df_app_test_align_expert['AMT_CREDIT']

Graphically visualizing expert features

In [18]:
varList = ['DAYS_EMPLOYED_PCT','CREDIT_INCOME_PCT','ANNUITY_INCOME_PCT','CREDIT_TERM']
eda.plot_kde_hist_var(df_app_train_align_expert,varList,calcStat = True, drawAll = True)
Out[18]:
([True, True, True, True],
 [0.0, 7.99949230230649e-15, 6.495509444616207e-27, 4.99818549092097e-178])

We analyze the expert features and find that DAYS_EMPLOYED_PCT ranks highly as measured by correlation in relation to TARGET.

In [18]:
corr_exp = df_app_train_align_expert.corr()['TARGET'].sort_values()
print(corr_exp.head(10))
print(corr_exp.tail(10))
EXT_SOURCE_3                           -0.178919
EXT_SOURCE_2                           -0.160472
EXT_SOURCE_1                           -0.155317
DAYS_EMPLOYED_PCT                      -0.067955
NAME_EDUCATION_TYPE_Higher education   -0.056593
NAME_INCOME_TYPE_Pensioner             -0.046209
ORGANIZATION_TYPE_XNA                  -0.045987
FLOORSMAX_AVG                          -0.044003
FLOORSMAX_MEDI                         -0.043768
FLOORSMAX_MODE                         -0.043226
Name: TARGET, dtype: float64
NAME_EDUCATION_TYPE_Secondary / secondary special    0.049824
REG_CITY_NOT_WORK_CITY                               0.050994
DAYS_ID_PUBLISH                                      0.051457
DAYS_LAST_PHONE_CHANGE                               0.055218
NAME_INCOME_TYPE_Working                             0.057481
REGION_RATING_CLIENT                                 0.058899
REGION_RATING_CLIENT_W_CITY                          0.060893
DAYS_EMPLOYED                                        0.074958
DAYS_BIRTH                                           0.078239
TARGET                                               1.000000
Name: TARGET, dtype: float64

Pickling data

We end the feature engineering at this point and pickle all the necessary dataframes (polynomial and expert feature sets for training and testing) for the next step which is model selection.

In [23]:
db = pickleshare.PickleShareDB(storeDir)
db['df_app_train_align_expert'] = df_app_train_align_expert
db['df_app_test_align_expert'] = df_app_test_align_expert
db['df_app_train_poly_align'] = df_app_train_poly_align
db['df_app_test_poly_align'] = df_app_test_poly_align

Summary

Polynomial feature engineering

  • Evaluate which are the features with the largest +ve and -ve correlations with TARGET.
  • Extract those features and fill in any np.nan rows by imputing with the median of that column (i.e., sklearn.impute.SimpleImputer)
  • Create new polynomial and interactive features (i.e., sklearn.preprocessing.PolynomialFeatures).
  • Evalute whether these new polynomial and interactive features exhibit greater +ve and -ve correlations with TARGET compared to the original feature set. If so, consider creating a new dataset with these new polynomial and interactive features.
  • Include row key identifiers (i.e., index) into the new polynomial feature set for both the polynomial training and test datasets (i.e., df_poly_feat_train, df_poly_feat_test)
  • Merge this new polynomial feature dataset with the original feature dataset (i.e., merge df_poly_feat_train and df_app_train_align) for both training and test datasets.
  • Align the new training and test datasets together.

Expert feature engineering

  • These are features that are well known to domain knowledge to have high explanatory and predictive power.
  • They are useful in combining features in the original set together, thus making your model more parsimonious
  • Once you've created these expert features, compare their correlations with the TARGET and evaluate if they are greater than the individual features themselves.

Converting iPython notebook to Python code

This allows us to run the code in Spyder.

In [24]:
!jupyter nbconvert ml_kaggle-home-loan-credit-risk-feat-eng.ipynb --to python
[NbConvertApp] Converting notebook ml_kaggle-home-loan-credit-risk-feat-eng.ipynb to python
[NbConvertApp] Writing 11886 bytes to ml_kaggle-home-loan-credit-risk-feat-eng.py
In [ ]:
 

Comments

Comments powered by Disqus