
#!/usr/bin/env python
# coding: utf-8

# A more advanced model for solving a classification problem is the Gradient Boosting Machine (GBM).  There are several popular implementations of GBM, namely:
# 
# * [XGBoost](https://xgboost.readthedocs.io/en/latest/) - Released by Tianqi Chen (March, 2014)
# * [Light GBM](https://lightgbm.readthedocs.io/en/latest/) - Released by Microsoft (Jan, 2017)
# * [CatBoost](https://catboost.ai/) - Released by Yandex (April, 2017)
# 
# Each of these packages differs in how it chooses to split the decision trees within the ensemble and how categorical variables are treated.  My reading around the internet suggests that the accuracy of these different Gradient Boosting packages is broadly similar, and that they differ mainly in terms of implementation speed.  A crucial component of these packages is how they treat and process categorical variables (see the small sketch after this introduction).
# 
# We will explore the Light Gradient Boosting Machine (LGBM) in the implementation below.
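
# To make the categorical-variable point concrete, here is a minimal, self-contained sketch (the toy data and names are invented for illustration and are not part of the pipeline below): LightGBM can consume pandas `category` columns natively, with no one-hot encoding required.

import pandas as pd
import lightgbm as lgb

# Hypothetical toy dataset: one numeric and one categorical feature
toy = pd.DataFrame({
    'income': [50000, 64000, 23000, 81000, 42000, 58000],
    'contract_type': pd.Categorical(['cash', 'revolving', 'cash', 'cash', 'revolving', 'cash']),
    'default': [0, 0, 1, 0, 1, 0],
})

# With `category` dtype columns, LightGBM detects them and splits on them natively
toy_model = lgb.LGBMClassifier(n_estimators=10, min_child_samples=1)
toy_model.fit(toy[['income', 'contract_type']], toy['default'])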

# # Loading in required modules

# In[1]:


# importing all system modules
import os
import sys
import warnings
import ipdb
from IPython.core.debugger import set_trace
from pathlib import Path
warnings.filterwarnings('ignore')
if sys.platform == 'linux':
    sys.path.append('/home/randlow/github/blog2/listings/machine-learning/') # linux
elif sys.platform == 'win32':
    sys.path.append('\\Users\\randl\\github\\blog2\\listings\\machine-learning\\') # win32

# importing data science modules
import pandas as pd
import numpy as np
import sklearn
import scipy as sp
import pickleshare

# importing graphics modules
import matplotlib.pyplot as plt
import seaborn as sns


# importing personal data science modules
import rand_eda as eda


# # Loading pickled dataframes
# 
# To see how the below dataframes were obtained see the post on the [Kaggle: Credit risk (Feature Engineering)](/posts/machine-learning/kaggle-home-loan-credit-risk-feat-eng/)
# 

# In[2]:


home = str(Path.home())
if sys.platform == 'linux':
    inputDir = "/datasets/kaggle/home-credit-default-risk" # linux
elif sys.platform == 'win32':
    inputDir = "\datasets\kaggle\home-credit-default-risk" # windows

storeDir = home+inputDir+'/pickleshare'

db = pickleshare.PickleShareDB(storeDir)
print(db.keys())

df_app_test_align = db['df_app_test_align'] 
df_app_train_align = db['df_app_train_align'] 
#df_app_train_align_expert  = db['df_app_train_align_expert'] 
#df_app_test_align_expert = db['df_app_test_align_expert'] 
#df_app_train_poly_align = db['df_app_train_poly_align']
#df_app_test_poly_align = db['df_app_test_poly_align'] 


# Assign whichever datasets you want to `train` and `test`.  As part of feature engineering, you will often build new and different feature datasets and will want to test each one to evaluate whether it improves model performance.
# 
# As the imputer is fitted on the training data and used to transform both the training and test datasets, the training data needs to have the same features as the test dataset.  This means that the `TARGET` column must be removed from the training dataset and stored in `labels` for use later.

# In[3]:


train = df_app_train_align.copy()
test = df_app_test_align.copy()

labels = train.pop('TARGET').values # store training labels
feat_names = list(train.columns) # store feature names


# In[4]:


train_ids = train.index.values
test_ids = test.index.values


# # Feature set preprocessing

# In[5]:


from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan,strategy='median')

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range= (0,1))


# We fit the imputer and scaler on the training data, and apply the imputation and scaling transformations to both the training and test datasets.
# 
# Scikit-learn models only accept arrays.  The imputer and scaler can take DataFrames as inputs, but their `transform` methods return `train` and `test` as NumPy arrays ready for Scikit-Learn's machine learning models.

# In[6]:


imputer.fit(train)
train = imputer.transform(train)
test = imputer.transform(test)

scaler.fit(train)
train = scaler.transform(train)
test = scaler.transform(test)
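
# After these transforms, `train` and `test` are plain NumPy arrays, so the column names are gone.  If you want to keep them around for inspection, one optional approach (the `train_df`/`test_df` names are illustrative and not used in the pipeline below) is to wrap the arrays back into DataFrames:

train_df = pd.DataFrame(train, columns=feat_names, index=train_ids)
test_df = pd.DataFrame(test, columns=feat_names, index=test_ids)
print(train_df.iloc[:3, :5]) # quick sanity check of the imputed, scaled values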


# In[7]:


train.shape


# # Model implementation ([Gradient Boosting Machine](https://lightgbm.readthedocs.io/en/latest/))

# With the Gradient Boosting Machine, we are going to perform an additional step of using K-fold cross validation (i.e., `KFold`).  In the other models (i.e., Logit, Random Forest) we only fitted our model on the training dataset and then evaluated the model's performance on the test dataset.
# 
# Using `KFold`, we split our `train` dataset into multiple folds (i.e., $K$).  We fit our model on $K-1$ folds and evaluate it on the $K^{th}$ fold (i.e., out-of-fold), repeating the fitting process until each of the $K$ folds has served as the validation fold.  This gives a more reliable estimate of out-of-sample performance without overfitting to a single validation split.
# 
# Thus, we copy our `train` and `test` arrays to `feat` and `test_feat`, as these are more accurate names now that they are the feature and test-feature datasets.  The training dataset is split into further training and validation datasets by the `KFold`.
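
# As a quick self-contained sanity check (illustrative only, with a tiny made-up index range): under `KFold`, every row index lands in the validation fold exactly once, which is why the out-of-fold predictions below cover the entire training set.

from sklearn.model_selection import KFold
import numpy as np

demo_kf = KFold(n_splits=3, shuffle=True, random_state=0)
demo_valid_idx = np.concatenate([valid_idx for _, valid_idx in demo_kf.split(np.arange(9))])
print(np.sort(demo_valid_idx)) # [0 1 2 ... 8] -- each index appears exactly once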

# In[8]:


feat = train.copy()
test_feat = test.copy()


# In[9]:


from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
import gc


# Initialize the empty arrays and lists that will accumulate results across the `KFold` iterations

# In[10]:


feat_imp_vals = np.zeros(len(feat_names)) # feature importance
test_pred = np.zeros(test_feat.shape[0]) # test predictions
oof = np.zeros(feat.shape[0]) # out of fold predictions
valid_scores = [] # validation scores
train_scores = [] # training scores
n_folds = 5


# In[11]:


k_fold = KFold(n_splits = n_folds, shuffle=True,random_state=100)


# With the `KFold`, we take our training dataset and split it into further training and validation datasets for each fold.  Within each fold we fit an `LGBMClassifier` with the following parameters:
# 
# * `n_estimators`: Number of boosted trees to fit (i.e., up to 1000 trees, subject to early stopping).
# * `reg_alpha`: L1 regularization on weights (i.e., LASSO).
# * `reg_lambda`: L2 regularization on weights (i.e., Ridge).
# * `subsample`: Subsample ratio of the training instances.  Setting it to 0.8 means `lgb` will randomly sample 80% of the training data prior to growing each tree, which helps prevent overfitting.  Subsampling occurs once in every boosting iteration.
# * `learning_rate`: Boosting learning rate.
# * `n_jobs`: Number of parallel threads to use for the calculation.  `-1` means all available cores will be used.
# * `random_state`: Allows the model fit to be replicated.
# * `class_weight`: Setting it to `'balanced'` weights samples inversely proportional to class frequencies, which is useful here given how few loans default relative to those that are repaid.
# 
# 

# In[12]:


for train_indices, valid_indices in k_fold.split(feat):
    
    train_feat, train_labels = feat[train_indices], labels[train_indices] # training data for the fold
    valid_feat, valid_labels = feat[valid_indices], labels[valid_indices] # validation data for the fold
    
    model = lgb.LGBMClassifier(n_estimators=1000, objective='binary',
                              class_weight='balanced',learning_rate=0.05,
                              reg_alpha = 0.1, reg_lambda=0.1,
                              subsample = 0.8, n_jobs = -1, random_state=50)


    model.fit(train_feat, train_labels, eval_metric = 'auc',
              eval_set = [(valid_feat, valid_labels), (train_feat, train_labels)],
              eval_names = ['valid','train'], early_stopping_rounds = 100, verbose = 200)
             # optionally, categorical_feature = cat_indices could be passed to fit() if categorical columns were not one-hot encoded
    
    # record best iteration of each fold
    best_iter = model.best_iteration_ 
    # record the most important features of each fold
    feat_imp_vals += model.feature_importances_/k_fold.n_splits 
    # record the out-of-fold predictions (each row falls in the validation fold exactly once, so no averaging over folds is needed)
    oof[valid_indices] = model.predict_proba(valid_feat, num_iteration = best_iter)[:,1]
    
    # record the predictions on the test_feat dataset
    test_pred += model.predict_proba(test_feat, num_iteration =  best_iter)[:,1]/k_fold.n_splits 
    
    valid_score = model.best_score_['valid']['auc']
    train_score = model.best_score_['train']['auc']
    
    valid_scores.append(valid_score)
    train_scores.append(train_score)
    
    gc.enable()
    del model, train_feat, valid_feat
    gc.collect()
    

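# A compatibility note: passing `early_stopping_rounds` and `verbose` directly to `fit()` works in the older LightGBM releases this notebook was written against, but LightGBM 4.x removed those arguments in favour of callbacks.  A sketch of the equivalent `fit()` call under the newer API (same variables as in the loop above) would be:
# 
#     model.fit(train_feat, train_labels, eval_metric='auc',
#               eval_set=[(valid_feat, valid_labels), (train_feat, train_labels)],
#               eval_names=['valid', 'train'],
#               callbacks=[lgb.early_stopping(stopping_rounds=100),
#                          lgb.log_evaluation(period=200)])
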

# ## Creating summary of validation and training AUC scores

# In[15]:


valid_auc = roc_auc_score(labels,oof) # overall validation AUC, computed from the training labels and the out-of-fold predictions
valid_scores.append(valid_auc) # append the overall validation auc score
train_scores.append(np.mean(train_scores)) # append the average training auc score across the folds

fold_names = list(range(n_folds))
fold_names.append('overall')

metrics = pd.DataFrame({'fold': fold_names,
                      'train': train_scores,
                      'valid': valid_scores})


# ## Showing important features of the dataset

# In[26]:


feat_imp = pd.DataFrame({'Feature': feat_names,'Importance':feat_imp_vals}) # creating feature importance dataframe
eda.plot_feat_importance(feat_imp)


# ## Creating submission dataframe

# In[15]:


submission = pd.DataFrame({'SK_ID_CURR': test_ids, 'TARGET': test_pred}) # creating Kaggle submission dataframe


# # Kaggle submission

# We create the submission dataframe as per the Kaggle home-credit-default-risk competition guidelines.

# In[18]:


submit = submission
print(submit.head())
print(submit.shape)


# Submit the CSV file to Kaggle for scoring.

# In[22]:


submit.to_csv('lightgbm-home-loan-credit-risk.csv',index=False)
get_ipython().system("kaggle competitions submit -c home-credit-default-risk -f lightgbm-home-loan-credit-risk.csv -m 'submitted'")


# We review our Light GBM score from Kaggle and find that it improves to 0.74, compared to 0.662 (logit) and 0.688 (random forest).  Substantial improvements are obtained using LightGBM on the same dataset as the logit and random-forest models, leading us to understand why Gradient Boosting Machines are the machine learning model of choice for many data scientists.

# In[21]:


get_ipython().system('kaggle competitions submissions -c home-credit-default-risk')


# # Converting iPython notebook to Python code
# 
# This allows us to run the code in Spyder.

# In[8]:


get_ipython().system("jupyter nbconvert ml_kaggle-home-loan-credit-risk-model-lightgbm.ipynb --output-dir='~/github/blog2/listings/machine-learning/' --to python")