A commonly used model for exploring classification problems is the random forest classifier.

It is called a random forest because it builds an ensemble (i.e., multiple) of decision trees and merges their predictions to obtain a more accurate and stable result. Random forests tend to overfit less than a single decision tree, especially if there are sufficient trees in the forest. The 'random' part refers to the random subset of features that the algorithm considers each time a node is split. In addition, where a decision tree searches for the best possible threshold when splitting a node, a variant of the method (extremely randomized trees) draws the thresholds at random. Random forests are better suited as a predictive tool than as a descriptive tool. A single decision tree is more suitable if you are evaluating relationships within the data.

Random forests are usually trained using the "bagging" approach (i.e., bootstrap aggregation). Given an initial training dataset $D$ of size $n$, bagging generates $m$ new datasets $D_i$, each of size $n$, by sampling from $D$ uniformly with replacement. One model is fitted on each of the $m$ bootstrapped datasets, and the $m$ models are then combined by averaging their outputs (for regression) or by voting (for classification).
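To make the bagging idea concrete, here is a minimal sketch using only the Python standard library. The helper names `bootstrap_sample` and `majority_vote` are made up for illustration, and the toy "dataset" is just a list of integers rather than real training rows.

```python
import random
from collections import Counter

def bootstrap_sample(dataset, rng):
    """Draw len(dataset) items uniformly with replacement (one D_i)."""
    return [rng.choice(dataset) for _ in dataset]

def majority_vote(predictions):
    """Combine the class predictions of the m models by voting."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)      # fixed seed so the sketch is repeatable
D = list(range(10))         # toy dataset D of size n = 10
m = 5                       # number of bootstrapped datasets

bags = [bootstrap_sample(D, rng) for _ in range(m)]
for i, bag in enumerate(bags):
    # each bag has n items; some originals repeat, others are left out
    print(i, sorted(bag))

# three of the five hypothetical models vote for class 1
print(majority_vote([1, 0, 1, 1, 0]))
```

In a real forest each bag would be used to fit one decision tree, and the votes would come from those trees' predictions.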

Random forests are also useful because it is possible to measure the relative importance of each feature for the prediction. A feature's importance is estimated from how often tree nodes split on that feature across the forest and how much those splits reduce impurity. Understanding which features are important allows us to drop those that add little or no value to our classification problem.
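As a small illustration of feature importance, the sketch below fits a forest on synthetic data in which only the first column carries any signal. The data and the variable names are assumptions made for this example; only `RandomForestClassifier` and its `feature_importances_` attribute come from Scikit-Learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(500, 3)                 # three synthetic features
y = (X[:, 0] > 0.5).astype(int)      # only feature 0 determines the label

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X, y)

# impurity-based importances, normalised to sum to 1
importances = forest.feature_importances_
print(importances)
```

The informative feature should receive by far the largest share, while the two noise columns end up with small values and would be candidates for dropping.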

Loading in required modules¶

In [1]:

# importing all system modules
import os
import sys
import warnings
from pathlib import Path
warnings.filterwarnings('ignore')
if sys.platform == 'linux':
    sys.path.append('/home/randlow/github/blog2/listings/machine-learning/') # linux
elif sys.platform == 'win32':
    sys.path.append('\\Users\\randl\\github\\blog2\\listings\\machine-learning\\') # win32

# importing data science modules
import pandas as pd
import numpy as np
import sklearn
import scipy as sp
import pickleshare

# importing graphics modules
import matplotlib.pyplot as plt
import seaborn as sns
import bokeh as bk

# importing personal data science modules
import rand_eda as eda

Loading pickled dataframes¶

To see how the below dataframes were obtained see the post on the Kaggle: Credit risk (Feature Engineering)

In [2]:

home = str(Path.home())
if sys.platform == 'linux':
    inputDir = "/datasets/kaggle/home-credit-default-risk" # linux
elif sys.platform == 'win32':
    inputDir = r"\datasets\kaggle\home-credit-default-risk" # windows (raw string so the backslashes are not read as escapes)

storeDir = home+inputDir+'/pickleshare'

db = pickleshare.PickleShareDB(storeDir)
print(db.keys())

df_app_test_align = db['df_app_test_align'] 
df_app_train_align = db['df_app_train_align'] 
#df_app_train_align_expert  = db['df_app_train_align_expert'] 
#df_app_test_align_expert = db['df_app_test_align_expert'] 
#df_app_train_poly_align = db['df_app_train_poly_align']
#df_app_test_poly_align = db['df_app_test_poly_align']

['df_app_test_align', 'df_app_train_align', 'df_app_train_corr_target', 'df_app_train_align_expert', 'df_app_test_align_expert', 'df_app_train_poly_align', 'df_app_test_poly_align']

Selection of feature set for model training & testing¶

Assign whichever datasets you want to train and test on. As part of feature engineering, you will often build several different feature sets and want to test each one to evaluate whether it improves model performance.

As the imputer is being fitted on the training data and used to transform both the training and test datasets, the training data needs to have the same number of features as the test dataset. This means that the TARGET column must be removed from the training dataset, and stored in train_labels for use later.

In [3]:

train = df_app_train_align.copy()
test = df_app_test_align.copy()

train_labels = train.pop('TARGET')
feat_names = list(train.columns)

Feature set preprocessing¶

In [4]:

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan,strategy='median')

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range= (0,1))

We fit the imputer and scaler on the training data, and perform the imputer and scaling transformations on both the training and test datasets.

Scikit-Learn's imputers and scalers accept DataFrames as inputs and return NumPy arrays, so after these transformations the train and test variables are arrays ready to be passed to Scikit-Learn's machine learning models.
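The fit-on-train, transform-both pattern can be seen on a toy example. The two small DataFrames below are invented for illustration; the imputer and scaler are the same Scikit-Learn classes used in this post.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

toy_train = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [10.0, 20.0, 30.0]})
toy_test = pd.DataFrame({'a': [np.nan, 5.0], 'b': [40.0, 15.0]})

imputer = SimpleImputer(missing_values=np.nan, strategy='median')
scaler = MinMaxScaler(feature_range=(0, 1))

# statistics (medians, min, max) are learned from the training data only
train_arr = scaler.fit_transform(imputer.fit_transform(toy_train))
# the test data is transformed with those training statistics
test_arr = scaler.transform(imputer.transform(toy_test))

print(type(train_arr))   # a NumPy array, not a DataFrame
print(train_arr)
print(test_arr)
```

Note that the test values can fall outside the (0, 1) range, because the minimum and maximum come from the training data alone.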

In [5]:

imputer.fit(train)
train = imputer.transform(train)
test = imputer.transform(test)

scaler.fit(train)
train = scaler.transform(train)
test = scaler.transform(test)

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-5-e0eecd0cc321> in <module>
----> 1 imputer.fit(train)
      2 train = imputer.transform(train)
      3 test = imputer.transform(test)
      4 
      5 scaler.fit(train)

~/anaconda3/lib/python3.7/site-packages/sklearn/impute.py in fit(self, X, y)
    257                                                self.strategy,
    258                                                self.missing_values,
--> 259                                                fill_value)
    260 
    261         return self

~/anaconda3/lib/python3.7/site-packages/sklearn/impute.py in _dense_fit(self, X, strategy, missing_values, fill_value)
    315         # Median
    316         elif strategy == "median":
--> 317             median_masked = np.ma.median(masked_X, axis=0)
    318             # Avoid the warning "Warning: converting a masked element to nan."
    319             median = np.ma.getdata(median_masked)

~/anaconda3/lib/python3.7/site-packages/numpy/ma/extras.py in median(a, axis, out, overwrite_input, keepdims)
    692 
    693     r, k = _ureduce(a, func=_median, axis=axis, out=out,
--> 694                     overwrite_input=overwrite_input)
    695     if keepdims:
    696         return r.reshape(k)

~/anaconda3/lib/python3.7/site-packages/numpy/lib/function_base.py in _ureduce(a, func, **kwargs)
   3248         keepdim = (1,) * a.ndim
   3249 
-> 3250     r = func(a, **kwargs)
   3251     return r, keepdim
   3252 

~/anaconda3/lib/python3.7/site-packages/numpy/ma/extras.py in _median(a, axis, out, overwrite_input)
    713             asorted = a
    714     else:
--> 715         asorted = sort(a, axis=axis, fill_value=fill_value)
    716 
    717     if axis is None:

~/anaconda3/lib/python3.7/site-packages/numpy/ma/core.py in sort(a, axis, kind, order, endwith, fill_value)
   6710     if isinstance(a, MaskedArray):
   6711         a.sort(axis=axis, kind=kind, order=order,
-> 6712                endwith=endwith, fill_value=fill_value)
   6713     else:
   6714         a.sort(axis=axis, kind=kind, order=order)

~/anaconda3/lib/python3.7/site-packages/numpy/ma/core.py in sort(self, axis, kind, order, endwith, fill_value)
   5560 
   5561         sidx = self.argsort(axis=axis, kind=kind, order=order,
-> 5562                             fill_value=fill_value, endwith=endwith)
   5563 
   5564         self[...] = np.take_along_axis(self, sidx, axis=axis)

~/anaconda3/lib/python3.7/site-packages/numpy/ma/core.py in argsort(self, axis, kind, order, endwith, fill_value)
   5407 
   5408         filled = self.filled(fill_value)
-> 5409         return filled.argsort(axis=axis, kind=kind, order=order)
   5410 
   5411     def argmin(self, axis=None, fill_value=None, out=None):

MemoryError:
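The traceback above shows the imputer running out of memory while sorting a masked array to compute the medians. One possible workaround, sketched below, is to compute the training medians with pandas and fill both DataFrames directly. The function name `median_impute` is hypothetical, and whether this avoids the error on a given machine is an assumption that has not been tested here.

```python
import numpy as np
import pandas as pd

def median_impute(train_df, test_df):
    """Fill NaNs in both frames with the training-set column medians."""
    medians = train_df.median()     # NaNs are skipped by default
    return train_df.fillna(medians), test_df.fillna(medians)

# toy frames standing in for the real train/test DataFrames
toy_train = pd.DataFrame({'a': [1.0, np.nan, 3.0]})
toy_test = pd.DataFrame({'a': [np.nan]})

filled_train, filled_test = median_impute(toy_train, toy_test)
print(filled_train)
print(filled_test)
```

As with the imputer, the medians come from the training data only, so the test set is filled with the same values.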

Model implementation (Random Forest)¶

In this implementation of random forest, we use 100 trees (n_estimators=100) and all available processors (n_jobs=-1).

In [ ]:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=10, verbose=1, n_jobs=-1)

In [ ]:

rf.fit(train,train_labels)

Exploring random forest feature importances¶

Decision trees are non-parametric supervised learning models that infer the value of a target variable by learning decision rules from the features of the dataset. Since a random forest consists of many decision trees, it can be used to rank which features matter most for predicting the target variable, by analyzing across all the trees which features are used to split nodes.

We can see here that our random forest selected EXT_SOURCE_2, EXT_SOURCE_3 and DAYS_BIRTH as the top 3 most important features. These feature importances can be used for further feature engineering and for culling features that are of low importance (e.g., FLAG_DOCUMENT_x).

In [9]:

feat_importance_values = rf.feature_importances_
df_feat_importance = pd.DataFrame({'Feature':feat_names,'Importance': feat_importance_values})
eda.plot_feat_importance(df_feat_importance)

We apply our fitted random forest model to predict the TARGET outcomes from the test dataset

In [10]:

rf_pred = rf.predict_proba(test)[:,1]

[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.7s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    1.5s finished
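The `[:,1]` in the prediction cell selects the second column of `predict_proba`, which holds the probability of the class listed second in `classes_`. The toy data below is invented for illustration; it only shows how the column order can be checked rather than assumed.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[0.0], [0.1], [0.9], [1.0]])
y = np.array([0, 0, 1, 1])           # 1 plays the role of TARGET = 1

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
proba = clf.predict_proba(X)         # one column per class

print(clf.classes_)                  # column order of predict_proba
pos_col = list(clf.classes_).index(1)
p_positive = proba[:, pos_col]       # probability of the positive class
print(p_positive)
```

With labels 0 and 1 the positive class ends up in column 1, which is why `[:,1]` gives the probability submitted as TARGET.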

Kaggle submission¶

We create the submission dataframe as per the Kaggle home-credit-default-risk competition guidelines

In [11]:

submit = pd.DataFrame()
submit['SK_ID_CURR'] = df_app_test_align.index
submit['TARGET'] = rf_pred
print(submit.head())
print(submit.shape)

   SK_ID_CURR  TARGET
0      100001    0.09
1      100005    0.05
2      100013    0.02
3      100028    0.02
4      100038    0.07
(48744, 2)

Submit the csv file to Kaggle for scoring

In [12]:

submit.to_csv('random-forest-home-loan-credit-risk.csv',index=False)
!kaggle competitions submit -c home-credit-default-risk -f random-forest-home-loan-credit-risk.csv -m 'submitted'

100%|█████████████████████████████████████████| 567k/567k [00:00<00:00, 734kB/s]
Successfully submitted to Home Credit Default Risk

We review our random forest scores from Kaggle and find a slight improvement: a publicScore of 0.687, compared to 0.662 for the logit model. We will try other feature-engineered datasets and more sophisticated machine learning models in the next posts.

In [13]:

!kaggle competitions submissions -c home-credit-default-risk

fileName                                 date                 description  status    publicScore  privateScore  
---------------------------------------  -------------------  -----------  --------  -----------  ------------  
random-forest-home-loan-credit-risk.csv  2019-02-11 17:31:03  submitted    complete  0.68694      0.68886       
random-forest-home-loan-credit-risk.csv  2019-02-11 05:24:55  submitted    complete  0.68694      0.68886       
random-forest-home-loan-credit-risk.csv  2019-02-11 05:10:40  submitted    complete  0.68694      0.68886       
logit-home-loan-credit-risk.csv          2019-02-11 04:52:51  submitted    complete  0.66223      0.67583       
random-forest-home-loan-credit-risk.csv  2019-02-11 04:44:50  submitted    complete  0.68694      0.68886       
logit-home-loan-credit-risk.csv          2019-02-08 04:08:33  submitted    complete  0.66223      0.67583

Converting iPython notebook to Python code¶

This allows us to run the code in Spyder.

In [14]:

!jupyter nbconvert ml_kaggle-home-loan-credit-risk-model-random-forest.ipynb --output-dir='~/github/blog2/listings/machine-learning/' --to python

[NbConvertApp] Converting notebook ml_kaggle-home-loan-credit-risk-model-random-forest.ipynb to python
[NbConvertApp] Writing 7711 bytes to /home/randlow/github/blog2/listings/machine-learning/ml_kaggle-home-loan-credit-risk-model-random-forest.py

Kaggle: Credit risk (Model: Decision Tree)

Rand Low

2019-Jan-18 (updated 2019-Jan-21)


A commonly used model for exploring classification problems is the decision tree. A detailed explanation is given in the post "What are decision trees and CARTS?"

Decision trees are the building blocks for random forests and gradient boosted trees.

Loading in required modules¶

In [2]:

# importing all system modules
import os
import sys
import warnings
from pathlib import Path
warnings.filterwarnings('ignore')
if sys.platform == 'linux':
    sys.path.append('/home/randlow/github/blog/listings/machine-learning/') # linux
elif sys.platform == 'win32':
    sys.path.append('\\Users\\randl\\github\\blog2\\listings\\machine-learning\\') # win32

# importing data science modules
import pandas as pd
import numpy as np
import sklearn
import scipy as sp
import pickleshare

# importing graphics modules
import matplotlib.pyplot as plt
import seaborn as sns

# importing personal data science modules
import rand_eda as eda

Loading pickled dataframes¶

To see how the below dataframes were obtained see the post on the Kaggle: Credit risk (Feature Engineering)

In [3]:

home = str(Path.home())
if sys.platform == 'linux':
    inputDir = "/datasets/kaggle/home-credit-default-risk" # linux
elif sys.platform == 'win32':
    inputDir = r"\datasets\kaggle\home-credit-default-risk" # windows (raw string so the backslashes are not read as escapes)

storeDir = home+inputDir+'/pickleshare'

db = pickleshare.PickleShareDB(storeDir)
print(db.keys())

df_app_test_align = db['df_app_test_align'] 
df_app_train_align = db['df_app_train_align'] 
#df_app_train_align_expert  = db['df_app_train_align_expert'] 
#df_app_test_align_expert = db['df_app_test_align_expert'] 
#df_app_train_poly_align = db['df_app_train_poly_align']
#df_app_test_poly_align = db['df_app_test_poly_align']

['df_app_test_align', 'df_app_train_align', 'df_app_train_corr_target', 'df_app_train_align_expert', 'df_app_test_align_expert', 'df_app_train_poly_align', 'df_app_test_poly_align']

Selection of feature set for model training & testing¶

In [4]:

train = df_app_train_align.copy()
test = df_app_test_align.copy()

train_labels = train.pop('TARGET')
feat_names = list(train.columns)

Feature set preprocessing¶

In [5]:

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan,strategy='median')

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range= (0,1))

We fit the imputer and scaler on the training data, and perform the imputer and scaling transformations on both the training and test datasets.

In [6]:

imputer.fit(train)
train = imputer.transform(train)
test = imputer.transform(test)

scaler.fit(train)
train = scaler.transform(train)
test = scaler.transform(test)

Model implementation (Decision Tree)¶

In [7]:

from sklearn import tree
from sklearn.model_selection import train_test_split

dtree = tree.DecisionTreeClassifier(random_state=50,max_depth=5)
dtree = dtree.fit(train,train_labels)

Drawing the decision tree¶

export_graphviz exports the fitted decision tree classifier in DOT format.
graphviz.Source().render() accepts DOT source and renders it into the specified format.
Image() displays the rendered image in the Jupyter notebook.

In [8]:

import graphviz
from IPython.display import Image
dot_data = tree.export_graphviz(dtree,out_file=None,
                    filled=True, rounded=True,
                    special_characters=True, 
                    feature_names=feat_names,
                    class_names=['FALSE','TRUE'])
graphviz.Source(dot_data,format='png').render('dtree_render')  
Image('dtree_render.png')

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-8-8b7c3c3ed8dc> in <module>
----> 1 import graphviz
      2 from IPython.display import Image
      3 dot_data = tree.export_graphviz(dtree,out_file=None,
      4                     filled=True, rounded=True,
      5                     special_characters=True,

ModuleNotFoundError: No module named 'graphviz'
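The cell above fails because the graphviz Python package is not installed. If installing it is not an option, one alternative is Scikit-Learn's `export_text`, which prints the tree rules as plain text. It is only available in Scikit-Learn 0.21 and later, so it may not exist in the version used for this post. The toy data and feature names below are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.RandomState(0)
X = rng.rand(200, 2)
y = (X[:, 0] > 0.5).astype(int)      # label depends on the first feature only

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# text rendering of the split rules; no graphviz needed
rules = export_text(toy_tree, feature_names=['feat_a', 'feat_b'])
print(rules)
```

For the tree in this post, the equivalent call would pass dtree and feat_names instead of the toy objects.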

We apply our fitted decision tree to predict the TARGET outcomes from the test dataset

In [9]:

dtree_pred = dtree.predict_proba(test)[:,1]

Kaggle submission¶

We create the submission dataframe as per the Kaggle home-credit-default-risk competition guidelines

In [10]:

submit = pd.DataFrame()
submit['SK_ID_CURR'] = df_app_test_align.index
submit['TARGET'] = dtree_pred
print(submit.head())
print(submit.shape)

   SK_ID_CURR    TARGET
0      100001  0.079853
1      100005  0.103296
2      100013  0.022893
3      100028  0.034714
4      100038  0.088519
(48744, 2)

Submit the csv file to Kaggle for scoring

In [11]:

submit.to_csv('decision-tree-home-loan-credit-risk.csv',index=False)
!kaggle competitions submit -c home-credit-default-risk -f decision-tree-home-loan-credit-risk.csv -m 'submitted'

100%|██████████████████████████████████████| 1.25M/1.25M [00:00<00:00, 1.43MB/s]
Successfully submitted to Home Credit Default Risk

We review our decision tree scores from Kaggle and find a slight improvement: a publicScore of 0.697, compared to 0.662 for the logit model. We will try other feature-engineered datasets and more sophisticated machine learning models in the next posts.

In [12]:

!kaggle competitions submissions -c home-credit-default-risk

fileName                                 date                 description  status    publicScore  privateScore  
---------------------------------------  -------------------  -----------  --------  -----------  ------------  
decision-tree-home-loan-credit-risk.csv  2019-02-23 17:11:25  submitted    complete  0.69792      0.68776       
lightgbm-home-loan-credit-risk.csv       2019-02-19 04:16:44  submitted    complete  0.74158      0.74351       
random-forest-home-loan-credit-risk.csv  2019-02-19 04:15:01  submitted    complete  0.74158      0.74351       
random-forest-home-loan-credit-risk.csv  2019-02-11 17:31:03  submitted    complete  0.68694      0.68886       
random-forest-home-loan-credit-risk.csv  2019-02-11 05:24:55  submitted    complete  0.68694      0.68886       
random-forest-home-loan-credit-risk.csv  2019-02-11 05:10:40  submitted    complete  0.68694      0.68886       
logit-home-loan-credit-risk.csv          2019-02-11 04:52:51  submitted    complete  0.66223      0.67583       
random-forest-home-loan-credit-risk.csv  2019-02-11 04:44:50  submitted    complete  0.68694      0.68886       
logit-home-loan-credit-risk.csv          2019-02-08 04:08:33  submitted    complete  0.66223      0.67583

Converting iPython notebook to Python code¶

This allows us to run the code in Spyder.

In [13]:

!jupyter nbconvert ml_kaggle-home-loan-credit-risk-model-decision-tree.ipynb --output-dir='~/github/blog2/listings/machine-learning/' --to python

[NbConvertApp] WARNING | pattern 'ml_kaggle-home-loan-credit-risk-model-decision-tree.ipynb' matched no files

Kaggle: Credit risk (Model: Support Vector Machines)

Rand Low

2019-Jan-16 (updated 2019-Jan-20)


A more advanced tool for classification tasks than the logit model is the Support Vector Machine (SVM). SVMs are similar to logistic regression in that they both try to find the "best" line (i.e., optimal hyperplane) that separates two sets of points (i.e., classes).

11 minute read…

Kaggle: Credit risk (Model: Logit)

Rand Low

2019-Jan-15 (updated 2019-Jan-18)


A simple yet effective tool for classification tasks is the logit model. This model is often used as a baseline/benchmark approach before using more sophisticated machine learning models to evaluate the performance improvements.

6 minute read…

Kaggle: Credit risk (Feature Engineering: Automated)

Rand Low

2019-Jan-14 (updated 2019-Jan-20)

Feature engineering can be an onerous process, thus there are now several new libraries that help us automate the process:

9 minute read…

Kaggle: Credit risk (Feature Engineering: Part 3)

Rand Low

2019-Jan-14 (updated 2019-Jan-20)

Feature engineering is an important part of machine learning, as we try to modify/create (i.e., engineer) new features from our existing dataset that might be meaningful in predicting the TARGET.

9 minute read…

Kaggle: Credit risk (Feature Engineering: Part 2)

Rand Low

2019-Jan-13 (updated 2019-Jan-20)

Feature engineering is an important part of machine learning, as we try to engineer (i.e., modify/create) new features from our existing dataset that might be meaningful in predicting the TARGET.

35 minute read…

Kaggle: Credit risk (Feature Engineering: Part 1)

Rand Low

2019-Jan-12 (updated 2019-Jan-20)

Feature engineering is an important part of machine learning, as we try to modify/create (i.e., engineer) new features from our existing dataset that might be meaningful in predicting the TARGET.

10 minute read…