Kaggle: Credit risk (Model: Gradient Boosting Machine - LightGBM)

A more advanced model for solving a classification problem is the Gradient Boosting Machine (GBM). There are several popular implementations of GBM, namely:

  • XGBoost - Released by Tianqi Chen (March, 2014)
  • LightGBM - Released by Microsoft (Jan, 2017)
  • CatBoost - Released by Yandex (April, 2017)

The packages differ in how they choose to split the decision trees within the ensemble and in how categorical variables are treated. From my reading, the accuracy of these Gradient Boosting packages is broadly similar; they differ mainly in training speed. A crucial component of these packages is how they treat and process categorical variables.
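
For instance, LightGBM can split directly on a categorical column when it is flagged as such, instead of requiring one-hot encoding. Below is a minimal, standalone sketch of this behaviour; the toy DataFrame, column names, and values are hypothetical and are not part of this notebook's pipeline.

import numpy as np
import pandas as pd
import lightgbm as lgb

# Toy data: one categorical and one numeric feature (hypothetical values)
rng = np.random.RandomState(0)
df = pd.DataFrame({
    'NAME_CONTRACT_TYPE': rng.choice(['Cash loans', 'Revolving loans'], size=200),
    'AMT_CREDIT': rng.uniform(50_000, 500_000, size=200),
    'TARGET': rng.randint(0, 2, size=200),
})

# Mark the column as categorical; LightGBM detects pandas 'category' dtypes
# automatically (categorical_feature='auto') and splits on the raw categories.
df['NAME_CONTRACT_TYPE'] = df['NAME_CONTRACT_TYPE'].astype('category')

X, y = df.drop(columns='TARGET'), df['TARGET']
clf = lgb.LGBMClassifier(n_estimators=20)
clf.fit(X, y)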

We will explore the Light Gradient Boosting Machine (LightGBM) in the implementation below.

Loading in required modules

In [11]:
# importing all system modules
import os
import sys
import warnings
import ipdb
from IPython.core.debugger import set_trace
from pathlib import Path
warnings.filterwarnings('ignore')
if sys.platform == 'linux':
    sys.path.append('/home/randlow/github/blog2/listings/machine-learning/') # linux
elif sys.platform == 'win32':
    sys.path.append('\\Users\\randl\\github\\blog2\\listings\\machine-learning\\') # win32

# importing data science modules
import pandas as pd
import numpy as np
import sklearn
import scipy as sp
import pickleshare

# importing graphics modules
import matplotlib.pyplot as plt
import seaborn as sns


# importing personal data science modules
import rand_eda as eda

Loading pickled dataframes

To see how the dataframes below were obtained, see the post on Kaggle: Credit risk (Feature Engineering).

In [15]:
if sys.platform == 'linux':
    inputDir = "/mnt/chromeos/GoogleDrive/MyDrive/Colab Notebooks/datasets/kaggle/home-credit-default-risk" # linux
    storeDir = inputDir+'/pickleshare'
elif sys.platform == 'win32':
    home = str(Path.home())
    inputDir = "\datasets\kaggle\home-credit-default-risk" # windows
    storeDir = home+inputDir+'\pickleshare'

print(storeDir)
/mnt/chromeos/GoogleDrive/MyDrive/Colab Notebooks/datasets/kaggle/home-credit-default-risk/pickleshare
In [16]:
db = pickleshare.PickleShareDB(storeDir)
print(db.keys())
['df_app_test_align', 'df_app_train_align', 'df_app_train_corr_target', 'df_app_train_align_expert', 'df_app_test_align_expert', 'df_app_train_poly_align', 'df_app_test_poly_align']
In [17]:
df_app_test_align = db['df_app_test_align'] 
df_app_train_align = db['df_app_train_align'] 
#df_app_train_align_expert  = db['df_app_train_align_expert'] 
#df_app_test_align_expert = db['df_app_test_align_expert'] 
#df_app_train_poly_align = db['df_app_train_poly_align']
#df_app_test_poly_align = db['df_app_test_poly_align'] 

Assign whichever datasets you want to train and test. As part of feature engineering, you will often build new and different feature datasets and want to test each one to evaluate whether it improves model performance.

As the imputer is fitted on the training data and used to transform both the training and test datasets, the training data needs to have the same features as the test dataset. This means that the TARGET column must be removed from the training dataset and stored in labels for use later.

In [18]:
train = df_app_train_align.copy()
test = df_app_test_align.copy()

labels = train.pop('TARGET').values # store training labels
feat_names = list(train.columns) # store feature names
In [19]:
train_ids = train.index.values
test_ids = test.index.values

Feature set preprocessing

In [20]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan,strategy='median')

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range= (0,1))

We fit the imputer and scaler on the training data only, then apply the imputation and scaling transformations to both the training and test datasets.

Scikit-learn models only accept arrays. The imputer and scaler accept DataFrames as inputs, but they return the transformed train and test data as NumPy arrays, ready for use in Scikit-Learn's machine learning models.

In [21]:
imputer.fit(train)
train = imputer.transform(train)
test = imputer.transform(test)

scaler.fit(train)
train = scaler.transform(train)
test = scaler.transform(test)
In [22]:
train.shape
Out[22]:
(307511, 236)

Model implementation (Gradient Boosting Machine)

With the Gradient Boosting Machine, we perform an additional step of K-fold cross validation (i.e., KFold). For the other models (i.e., Logit, Random Forest) we only fitted the model on the training dataset and then evaluated its performance on the test dataset.

Using KFold, we split our training dataset into $K$ folds. We fit the model on $K-1$ folds and evaluate it on the remaining $K^{th}$ fold (i.e., out-of-fold), repeating the process until each of the $K$ folds has served as the validation fold. This gives a more reliable estimate of out-of-sample performance and helps guard against overfitting to a single validation split.

Thus, we copy our train (test) dataframes to feat (test_feat), as these are more accurate descriptors now that they are the feature and test-feature datasets. The training dataset is then split into further training and validation datasets based on the KFold.
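
To make the fold mechanics concrete, here is a small standalone illustration (on a toy array, not the notebook's data) of how KFold yields train/validation index pairs in which every row appears in a validation fold exactly once:

import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(-1, 1)  # 10 toy rows
kf = KFold(n_splits=5, shuffle=True, random_state=100)
for fold, (train_idx, valid_idx) in enumerate(kf.split(X_toy)):
    # each row index appears in valid_idx for exactly one fold
    print(f'fold {fold}: train rows {train_idx}, validation rows {valid_idx}')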

In [23]:
feat = train.copy()
test_feat = test.copy()
In [ ]:
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
import gc

Initialization of empty variables for the KFold operation

In [ ]:
feat_imp_vals = np.zeros(len(feat_names)) # feature importance
test_pred = np.zeros(test_feat.shape[0]) # test predictions
oof = np.zeros(feat.shape[0]) # out of fold predictions
valid_scores = [] # validation scores
train_scores = [] # training scores
n_folds = 5
In [ ]:
k_fold = KFold(n_splits = n_folds, shuffle=True,random_state=100)

With the KFold we take our training dataset and split it into further training and validation datasets. Within each fold we fit an LGBMClassifier with the following hyperparameters:

  • n_estimators: Number of boosted trees to fit (i.e., 1000 trees).
  • reg_alpha: L1 regularization on weights (i.e., LASSO).
  • reg_lambda: L2 regularization on weights (i.e., Ridge).
  • subsample: Subsample ratio of the training instances. Setting it to 0.5 means LightGBM randomly samples half of the training data before growing each tree. Subsampling occurs once in every boosting iteration and helps prevent overfitting.
  • learning_rate: Boosting learning rate.
  • n_jobs: Number of parallel threads to use. -1 means all available cores are used.
  • random_state: Allows the model fit to be replicated.
  • class_weight: Setting 'balanced' weights samples inversely proportional to class frequencies, which is useful here since defaults (TARGET = 1) are rare.
In [ ]:
for train_indices, valid_indices in k_fold.split(feat):
    
    train_feat, train_labels = feat[train_indices], labels[train_indices] # training data for the fold
    valid_feat, valid_labels = feat[valid_indices], labels[valid_indices] # validation data for the fold
    
    model = lgb.LGBMClassifier(n_estimators=1000, objective='binary',
                              class_weight='balanced',learning_rate=0.05,
                              reg_alpha = 0.1, reg_lambda=0.1,
                              subsample = 0.8, n_jobs = -1, random_state=50)


    model.fit(train_feat, train_labels, eval_metric = 'auc',
              eval_set = [(valid_feat, valid_labels), (train_feat, train_labels)],
              eval_names = ['valid','train'], early_stopping_rounds = 100, verbose = 200)
             # categorical_feature = cat_indices,
    
    # record best iteration of each fold
    best_iter = model.best_iteration_ 
    # record the most important features of each fold
    feat_imp_vals += model.feature_importances_/k_fold.n_splits 
    # record the out-of-fold predictions (each validation row is predicted exactly once, so no averaging is needed)
    oof[valid_indices] = model.predict_proba(valid_feat, num_iteration = best_iter)[:,1]
    
    # record the predictions on the test_feat dataset
    test_pred += model.predict_proba(test_feat, num_iteration =  best_iter)[:,1]/k_fold.n_splits 
    
    valid_score = model.best_score_['valid']['auc']
    train_score = model.best_score_['train']['auc']
    
    valid_scores.append(valid_score)
    train_scores.append(train_score)
    
    gc.enable()
    del model, train_feat, valid_feat
    gc.collect()
    
Training until validation scores don't improve for 100 rounds.

Creating a summary of validation and training AUC scores

In [15]:
valid_auc = roc_auc_score(labels,oof) # calculate the overall validation auc from the training labels and the out-of-fold predictions
valid_scores.append(valid_auc) # append the overall validation auc score
train_scores.append(np.mean(train_scores)) # append the overall average training auc score

fold_names = list(range(n_folds))
fold_names.append('overall')

metrics = pd.DataFrame({'fold': fold_names,
                      'train': train_scores,
                      'valid': valid_scores})

Showing important features of the dataset

In [26]:
feat_imp = pd.DataFrame({'Feature': feat_names,'Importance':feat_imp_vals}) # creating feature importance dataframe
eda.plot_feat_importance(feat_imp)

Creating submission dataframe

In [15]:
submission = pd.DataFrame({'SK_ID_CURR': test_ids, 'TARGET': test_pred}) # creating Kaggle submission dataframe

Kaggle submission

We create the submission dataframe as per the Kaggle home-credit-default-risk competition guidelines.

In [18]:
submit = submission
print(submit.head())
print(submit.shape)
   SK_ID_CURR    TARGET
0      100001  0.278467
1      100005  0.499217
2      100013  0.179013
3      100028  0.253097
4      100038  0.672102
(48744, 2)

Submit the csv file to Kaggle for scoring

In [22]:
submit.to_csv('lightgbm-home-loan-credit-risk.csv',index=False)
!kaggle competitions submit -c home-credit-default-risk -f lightgbm-home-loan-credit-risk.csv -m 'submitted'
100%|██████████████████████████████████████| 1.22M/1.22M [00:01<00:00, 1.07MB/s]
Successfully submitted to Home Credit Default Risk

We review our LightGBM submission on Kaggle and find that the score improves to 0.74, compared to 0.662 (logit) and 0.688 (random forest). Substantial improvements are obtained using LightGBM on the same dataset as the logit and random forest models, which helps explain why Gradient Boosted Machines are the machine learning model of choice for many data scientists.

In [31]:
!kaggle competitions submissions -c home-credit-default-risk
fileName                                 date                 description  status    publicScore  privateScore  
---------------------------------------  -------------------  -----------  --------  -----------  ------------  
lightgbm-home-loan-credit-risk.csv       2019-02-19 04:16:44  submitted    complete  0.74158      0.74351       
random-forest-home-loan-credit-risk.csv  2019-02-19 04:15:01  submitted    complete  0.74158      0.74351       
random-forest-home-loan-credit-risk.csv  2019-02-11 17:31:03  submitted    complete  0.68694      0.68886       
random-forest-home-loan-credit-risk.csv  2019-02-11 05:24:55  submitted    complete  0.68694      0.68886       
random-forest-home-loan-credit-risk.csv  2019-02-11 05:10:40  submitted    complete  0.68694      0.68886       
logit-home-loan-credit-risk.csv          2019-02-11 04:52:51  submitted    complete  0.66223      0.67583       
random-forest-home-loan-credit-risk.csv  2019-02-11 04:44:50  submitted    complete  0.68694      0.68886       
logit-home-loan-credit-risk.csv          2019-02-08 04:08:33  submitted    complete  0.66223      0.67583       

Converting iPython notebook to Python code

This allows us to run the code in Spyder.

In [32]:
!jupyter nbconvert ml_kaggle-home-loan-credit-risk-model-lightgbm.ipynb --output-dir='~/github/blog2/listings/machine-learning/' --to python
[NbConvertApp] Converting notebook ml_kaggle-home-loan-credit-risk-model-lightgbm.ipynb to python
[NbConvertApp] Writing 10474 bytes to /home/randlow/github/blog2/listings/machine-learning/ml_kaggle-home-loan-credit-risk-model-lightgbm.py
