Kaggle: Credit risk (Feature Engineering: Part 1)
Feature engineering is an important part of machine learning: we modify or create (i.e., engineer) new features from our existing dataset that might be meaningful in predicting the TARGET.
In the kaggle home-credit-default-risk competition, we are given the following datasets:
application_train.csv
previous_application.csv
installments_payments.csv
bureau.csv
POS_CASH_balance.csv
bureau_balance.csv
credit_card_balance.csv
Each dataset provides more information about the loan applicant, such as how prompt they have been on their instalment payments, their credit history on other loans, and the amount of cash or credit card balances they hold. A data scientist/researcher should always investigate and create new features from all the information provided.
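As a rough starting point, a minimal sketch for inspecting the size of each file could look like the following (the data_dir path is a hypothetical location for the raw competition CSVs, not one used in this post):
import pandas as pd
data_dir = 'home-credit-default-risk/' # hypothetical location of the raw CSVs
file_list = ['application_train.csv','previous_application.csv','installments_payments.csv',
             'bureau.csv','POS_CASH_balance.csv','bureau_balance.csv','credit_card_balance.csv']
for fname in file_list:
    df_tmp = pd.read_csv(data_dir + fname)
    print('{}: {} rows, {} columns'.format(fname, df_tmp.shape[0], df_tmp.shape[1]))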
In this basic exercise, we will focus on the main dataset, application_train.csv. We will go through two simple methodologies in feature engineering:
- Polynomial features
- Expert features
Loading in required modules¶
# importing all system modules
import os
import sys
import warnings
from pathlib import Path
warnings.filterwarnings('ignore')
if sys.platform == 'linux':
sys.path.append('/mnt/chromeos/GoogleDrive/MyDrive/Colab Notebooks/modules/') # linux
elif sys.platform == 'win32':
sys.path.append('\\Users\\randl\\github\\blog2\\listings\\machine-learning\\') # win32
# importing data science modules
import pandas as pd
import numpy as np
import sklearn
import scipy as sp
import pickleshare
# importing graphics modules
import matplotlib.pyplot as plt
import seaborn as sns
# importing personal data science modules
import rand_eda as eda
Loading pickled dataframes¶
To see how the dataframes below were obtained, see the post on the Kaggle: Credit risk (Exploratory Data Analysis).
home = str(Path.home())
if sys.platform == 'linux':
inputDir = '/mnt/chromeOS/GoogleDrive/MyDrive/Colab Notebooks/datasets/' # linux
#inputDir = "/datasets/kaggle/home-credit-default-risk" # linux
elif sys.platform == 'win32':
inputDir = '\\datasets\\kaggle\\home-credit-default-risk\\' # windows
storeDir = home+inputDir+'kaggle/home-credit-default-risk/pickleshare'
db = pickleshare.PickleShareDB(storeDir)
print(db.keys())
df_app_test_align = db['df_app_test_align']
df_app_train_align = db['df_app_train_align']
df_app_train_corr_target = db['df_app_train_corr_target']
We print out the features with the highest positive and negative correlations to TARGET. We do this so that we can try generating auxiliary features from them using the PolynomialFeatures function in scikit-learn, to see whether this increases the explanatory/predictive power of the features that are already highly correlated with TARGET.
print(df_app_train_corr_target.tail(10))
print(df_app_train_corr_target.head(10))
Polynomial features¶
The variables that we select are EXT_SOURCE_1/2/3 (-ve), DAYS_BIRTH (+ve), and DAYS_EMPLOYED (+ve), all of which have large correlation values with TARGET relative to the other features (although DAYS_EMPLOYED is left commented out in the feature list below).
We create new poly_feat_x dataframes because the training and test datasets need to stay equivalent: any polynomial feature created for the training dataset must be created for the test dataset too.
imp_feat_list = ['EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3','DAYS_BIRTH'] # ,'DAYS_EMPLOYED'
poly_feat_train = df_app_train_align[imp_feat_list]
poly_feat_test = df_app_test_align[imp_feat_list]
We observed that several features had NaN values. We use the SimpleImputer class in scikit-learn's impute toolkit to replace all np.nan entries with the median value of that column.
We fit on the training data, as that is all the in-sample data that we have, and then apply the transformation to both the training and test datasets.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan,strategy='median')
imputer.fit(poly_feat_train)
poly_feat_train = imputer.transform(poly_feat_train)
poly_feat_test = imputer.transform(poly_feat_test)
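As a quick sanity check (not part of the original notebook), we can confirm that the imputer has removed all missing values from both arrays:
print('NaNs remaining (train): {}'.format(np.isnan(poly_feat_train).sum()))
print('NaNs remaining (test): {}'.format(np.isnan(poly_feat_test).sum()))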
We set the PolynomialFeatures degree to $N=3$. Based on the list of features provided, new features of different powers (e.g., $x^2$, $x^3$) and interaction variables (i.e., $x \times y$) are created. We fit the PolynomialFeatures transformer on the training dataset, and then transform both poly_feat_train and poly_feat_test.
We see that many new features are created (i.e., 35).
from sklearn.preprocessing import PolynomialFeatures
poly_transform = PolynomialFeatures(degree=3)
poly_transform.fit(poly_feat_train)
poly_transform_train = poly_transform.transform(poly_feat_train)
poly_transform_test = poly_transform.transform(poly_feat_test)
print('Shape of polynomial features (training): {}'.format(poly_transform_train.shape))
print('Shape of polynomial features (test): {}'.format(poly_transform_test.shape))
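As a side note, the count of 35 follows from the number of monomials of total degree at most 3 in 4 variables (including the bias/constant term): $\binom{n+d}{d} = \binom{4+3}{3} = 35$. A small sketch to verify this (using scipy.special.comb, which is not used elsewhere in this notebook):
from scipy.special import comb
n_features, degree = len(imp_feat_list), 3
print(int(comb(n_features + degree, degree))) # 35, matching the shapes printed above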
We can see a detailed list of the new polynomial and interaction features that have been created.
poly_feat_name_list = poly_transform.get_feature_names(imp_feat_list) # in scikit-learn >= 1.0, use get_feature_names_out(imp_feat_list)
We would like to see whether any of these new features have higher correlations with TARGET in the training dataset. Since some of the new polynomial features have a correlation magnitude greater than the original features, we should consider adding them to our model.
df_poly_feat_train = pd.DataFrame(poly_transform_train,columns=poly_feat_name_list)
df_poly_feat_train['TARGET'] = df_app_train_align['TARGET']
poly_feat_corr = df_poly_feat_train.corr()['TARGET'].sort_values()
print(poly_feat_corr.head(10))
print(poly_feat_corr.tail(10))
We create a new DataFrame for the test data that includes the new polynomial features.
df_poly_feat_test = pd.DataFrame(poly_transform_test,columns=poly_feat_name_list)
We include the row identifiers in the polynomial feature training and test datasets (i.e., df_poly_feat_train/test). This allows us to merge them with the original feature training and test datasets (i.e., df_app_train/test_align).
df_poly_feat_train.index = df_app_train_align.index
df_poly_feat_test.index = df_app_test_align.index
df_app_train_poly = df_app_train_align.merge(df_poly_feat_train,left_index=True,right_index=True)
df_app_test_poly = df_app_test_align.merge(df_poly_feat_test,left_index=True,right_index=True)
Now that we have two new dataframes with the polynomial features added to them, we need to make sure both the training and test datasets are aligned correctly.
df_app_train_poly_align, df_app_test_poly_align = df_app_train_poly.align(df_app_test_poly,join='inner',axis=1)
df_app_train_poly_align.head()
df_app_test_poly_align.head()
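To make the behaviour of align with join='inner' on axis=1 concrete, here is a small toy illustration (not part of the original workflow): only the columns common to both dataframes are kept.
df_a = pd.DataFrame({'x': [1, 2], 'y': [3, 4], 'z': [5, 6]})
df_b = pd.DataFrame({'x': [7, 8], 'y': [9, 10], 'w': [11, 12]})
df_a_aligned, df_b_aligned = df_a.align(df_b, join='inner', axis=1)
print(list(df_a_aligned.columns)) # ['x', 'y'] -- 'z' and 'w' are dropped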
We can see that our original dataset has 237 features, and our polynomial feature engineering resulted in 36 new features. The extended set of both original and polynomial features has 273 features. When we align the training and test datasets together, we end up with 271 features.
print('Original features (train): {}'.format(df_app_train_align.shape))
print('Polynomial features(train):{}'.format(df_poly_feat_train.shape))
print('Original & polynomial features (train): {}'.format(df_app_train_poly.shape))
print('Original & polynomial features align (train): {}'.format(df_app_train_poly_align.shape))
We analyze what columns have been removed via the align process.
s1 = set(df_app_train_poly.columns)
s2 = set(df_app_train_poly_align.columns)
diff_s1_s2 = s1-s2
diff_s1_s2
Expert knowledge features¶
Often, experts have domain knowledge about which combinations of existing features have strong explanatory/predictive power. In this case we are looking at the following features:
- Percentage of days employed - How long a person has been employed as a percentage of their life is a strong predictor of their ability to keep paying off their loans.
- Available credit as a percentage of income - If a person has a very large amount of credit available relative to their income, this can impact their ability to pay off their loans.
- Annuity as a percentage of income - An annuity is a more stable source of income, so the higher it is relative to income, the less likely the person is to default.
- Annuity as a percentage of available credit - Since an annuity is a more stable source of income, a high ratio of annuity to available credit makes the person more likely to be able to pay off their debts.
df_app_train_align_expert = df_app_train_align.copy()
df_app_test_align_expert = df_app_test_align.copy()
# Training dataset
df_app_train_align_expert['DAYS_EMPLOYED_PCT'] = df_app_train_align_expert['DAYS_EMPLOYED'] / df_app_train_align_expert['DAYS_BIRTH']
df_app_train_align_expert['CREDIT_INCOME_PCT'] = df_app_train_align_expert['AMT_CREDIT'] / df_app_train_align_expert['AMT_INCOME_TOTAL']
df_app_train_align_expert['ANNUITY_INCOME_PCT'] = df_app_train_align_expert['AMT_ANNUITY'] / df_app_train_align_expert['AMT_INCOME_TOTAL']
df_app_train_align_expert['CREDIT_TERM'] = df_app_train_align_expert['AMT_ANNUITY'] / df_app_train_align_expert['AMT_CREDIT']
# Test dataset
df_app_test_align_expert['DAYS_EMPLOYED_PCT'] = df_app_test_align_expert['DAYS_EMPLOYED'] / df_app_test_align_expert['DAYS_BIRTH']
df_app_test_align_expert['CREDIT_INCOME_PCT'] = df_app_test_align_expert['AMT_CREDIT'] / df_app_test_align_expert['AMT_INCOME_TOTAL']
df_app_test_align_expert['ANNUITY_INCOME_PCT'] = df_app_test_align_expert['AMT_ANNUITY'] / df_app_test_align_expert['AMT_INCOME_TOTAL']
df_app_test_align_expert['CREDIT_TERM'] = df_app_test_align_expert['AMT_ANNUITY'] / df_app_test_align_expert['AMT_CREDIT']
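Since the same four ratios are computed for both the training and test dataframes, the logic could also be wrapped in a small helper and applied to each frame in turn. This is only a refactoring sketch (the function name add_expert_features is not part of the original notebook); it produces the same columns as the cells above.
def add_expert_features(df):
    # returns a copy of df with the four expert-knowledge ratio features added
    df = df.copy()
    df['DAYS_EMPLOYED_PCT'] = df['DAYS_EMPLOYED'] / df['DAYS_BIRTH']
    df['CREDIT_INCOME_PCT'] = df['AMT_CREDIT'] / df['AMT_INCOME_TOTAL']
    df['ANNUITY_INCOME_PCT'] = df['AMT_ANNUITY'] / df['AMT_INCOME_TOTAL']
    df['CREDIT_TERM'] = df['AMT_ANNUITY'] / df['AMT_CREDIT']
    return df
#df_app_train_align_expert = add_expert_features(df_app_train_align)
#df_app_test_align_expert = add_expert_features(df_app_test_align)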
We graphically visualize the expert features.
varList = ['DAYS_EMPLOYED_PCT','CREDIT_INCOME_PCT','ANNUITY_INCOME_PCT','CREDIT_TERM']
eda.plot_kde_hist_var(df_app_train_align_expert,varList,calcStat = True, drawAll = True)
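If the personal rand_eda module is not available, a rough substitute using seaborn directly would be the following sketch (an assumption about what plot_kde_hist_var shows; this only draws per-class KDEs of each expert feature):
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for ax, var in zip(axes.ravel(), varList):
    # plot the distribution of each expert feature separately for each TARGET class
    sns.kdeplot(df_app_train_align_expert.loc[df_app_train_align_expert['TARGET'] == 0, var], label='TARGET == 0', ax=ax)
    sns.kdeplot(df_app_train_align_expert.loc[df_app_train_align_expert['TARGET'] == 1, var], label='TARGET == 1', ax=ax)
    ax.set_title(var)
    ax.legend()
plt.tight_layout()
plt.show()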
We analyze the expert features and find that DAYS_EMPLOYED_PCT ranks highly as measured by correlation with TARGET.
corr_exp = df_app_train_align_expert.corr()['TARGET'].sort_values()
print(corr_exp.head(20))
print(corr_exp.tail(20))
Pickling data¶
We end the feature engineering at this point and pickle all the necessary dataframes (the polynomial and expert feature sets for training and test) for the next step, which is model selection.
db = pickleshare.PickleShareDB(storeDir)
db['df_app_train_align_expert'] = df_app_train_align_expert
db['df_app_test_align_expert'] = df_app_test_align_expert
db['df_app_train_poly_align'] = df_app_train_poly_align
db['df_app_test_poly_align'] = df_app_test_poly_align
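In the follow-up model-selection post, these dataframes can be read back from the same PickleShareDB (a sketch, assuming the same storeDir is used there):
db = pickleshare.PickleShareDB(storeDir)
df_app_train_align_expert = db['df_app_train_align_expert']
df_app_train_poly_align = db['df_app_train_poly_align']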
Summary¶
Polynomial feature engineering¶
- Evaluate which features have the largest +ve and -ve correlations with TARGET.
- Extract those features and fill in any np.nan rows by imputing with the median of that column (i.e., sklearn.impute.SimpleImputer).
- Create new polynomial and interaction features (i.e., sklearn.preprocessing.PolynomialFeatures).
- Evaluate whether these new polynomial and interaction features exhibit greater +ve and -ve correlations with TARGET compared to the original feature set. If so, consider creating a new dataset with these new polynomial and interaction features.
- Include row key identifiers (i.e., the index) in the new polynomial feature set for both the polynomial training and test datasets (i.e., df_poly_feat_train, df_poly_feat_test).
- Merge this new polynomial feature dataset with the original feature dataset (i.e., merge df_poly_feat_train and df_app_train_align) for both the training and test datasets.
- Align the new training and test datasets together.
Expert feature engineering¶
- These are features that are well known from domain knowledge to have high explanatory and predictive power.
- They are useful for combining features in the original set together, thus making your model more parsimonious.
- Once you've created these expert features, compare their correlations with TARGET and evaluate whether they are greater than those of the individual features themselves.
Converting iPython notebook to Python code¶
This allows us to run the code in Spyder.
!jupyter nbconvert ml_kaggle-home-loan-credit-risk-feat-eng.ipynb --to python