Steps for building a machine learning model
The steps required to build a machine learning model are as follows:
Exploratory Data Analysis
Feature Engineering
Model Selection
Model Hyperparameter Tuning
I will describe the major actions that need to be performed in each of these steps and include the Python modules I have written to assist me in building a machine learning model.
Exploratory Data Analysis (EDA)
- Evaluate the size of the dataset (i.e., number of rows and columns)
- Evaluate the datatypes of all features (e.g., string, float, int, categorical)
- Produce descriptive statistics for numeric variables
- Produce boxplots of all numeric variables
- Evaluate how many missing/erroneous values there are in each column
  - Fill missing/erroneous values with np.nan
- Evaluate how many features are categorical variables
  - Perform label encoding for 2-state categorical variables
  - Perform one-hot encoding for n-state categorical variables
- Perform correlation analysis between features and target
  - Estimate the Pearson correlation matrix
  - Visualize using pair-plots
- Examine feature differences between target populations
  - Produce a KDE of the distribution of each feature variable for each target state
  - Use statistical tests (e.g., Kolmogorov-Smirnov, t-test) to evaluate whether the populations are significantly different
- Ensure that both the test and training datasets have the same number of features (a minimal sketch of the encoding and alignment steps follows this list)
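As a bridge to the helper module below, here is a minimal sketch of a few of these steps using plain pandas/scikit-learn. It assumes two hypothetical DataFrames, df_train and df_test, where df_train contains a binary TARGET column; 'XNA' and 'some_feature' are placeholder names used purely for illustration.

import numpy as np
import pandas as pd
from scipy import stats
from sklearn import preprocessing

# size, datatypes and descriptive statistics
print(df_train.shape, df_train.dtypes.value_counts())
print(df_train.describe())

# treat a known sentinel value as missing (placeholder value 'XNA')
df_train.replace({'XNA': np.nan}, inplace=True)

# label encode 2-state categoricals; one-hot encode the rest
for col in df_train.select_dtypes('object'):
    if df_train[col].nunique(dropna=False) <= 2:
        df_train[col] = preprocessing.LabelEncoder().fit_transform(df_train[col].astype(str))
df_train = pd.get_dummies(df_train)
df_test = pd.get_dummies(df_test)

# correlation with the target, and a KS test on one placeholder feature
print(df_train.corr()['TARGET'].sort_values())
ks_stat, ks_pval = stats.ks_2samp(df_train.loc[df_train['TARGET'] == 1, 'some_feature'],
                                  df_train.loc[df_train['TARGET'] == 0, 'some_feature'])

# keep only the feature columns present in both datasets
target = df_train['TARGET']
df_train, df_test = df_train.align(df_test, join='inner', axis=1)
df_train['TARGET'] = target

In practice the same encoding has to be applied to the test set as well; the rand_eda module below packages these checks into reusable functions.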
machine-learning/rand_eda.py (Source)
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Thu Jan 24 17:06:06 2019

This code consists of helper functions to perform Exploratory Data Analysis (EDA)

@author: randlow
"""

import pandas as pd
import matplotlib.pyplot as plt
import sys
import seaborn as sns
import scipy as sp
import numpy as np
import shelve
from sklearn import preprocessing


'''
Takes a two-level multi-index and flattens it into a one-level index

inputs
------
df (dataframe): a df with a multi_index column

outputs
-------
df (dataframe): returns df with single index columns
'''
def flatten_multi_index(df):
    colName = []
    for idx0, val0 in enumerate(df.columns.levels[0]):
        for idx1, val1 in enumerate(df.columns.levels[1]):
            combine = (val0, val1)
            colName.append('_'.join(combine))
    df.columns = colName
    return df


'''
Performs aggregation statistical functions on numeric columns of a dataset
and collapses the multi index into a single index.

Inputs
------
df (dataframe): dataframe to perform aggregation operations on
groupby_var (string): this is the column to group by
drop_var_list (list): list of columns to drop. These are columns that add no value
when aggregation operations are applied (e.g., ID columns)

Outputs
-------
df_agg: a dataframe with a flattened multi_index and aggregate statistics
across all numeric columns
'''
def agg_num(df, groupby_var, drop_var_list=list(), df_name=''):
    for col in df:
        for drop_var in drop_var_list:
            if col == drop_var:
                df.drop(columns=col, inplace=True)

    # extract out the groupby ID variable, extract all numeric columns, then add back
    # the groupby ID variable
    groupby_ids = df[groupby_var]
    df_numeric = df.select_dtypes('number')
    df_numeric[groupby_var] = groupby_ids

    # group the dataframe accordingly
    df_agg = df_numeric.groupby(groupby_var).agg(['mean', 'median', 'sum', 'count', 'max', 'min'])

    # flatten the multi index naming convention to a single index
    # prefix the column names with a dataframe name (important if we merge this data into a training dataframe)
    # re-insert the groupby index data
    df_agg = flatten_multi_index(df_agg)
    df_agg = df_agg.add_prefix('{}_'.format(df_name))

    return df_agg


'''
Performs count functions on categorical columns of a dataset and collapses
the multi_index into a single index

Inputs
------
df (dataframe): input DataFrame
groupby_var (string): name of variable to groupby

Outputs
-------
df (dataframe): output dataframe
'''
def cnt_cat(df, groupby_var, df_name):
    groupby_ids = df[groupby_var]
    df_cat = pd.get_dummies(df.select_dtypes('object'))
    df_cat[groupby_var] = groupby_ids

    df_cat_agg = df_cat.groupby(groupby_var).agg(['sum', 'mean'])
    df_cat_agg = flatten_multi_index(df_cat_agg)
    df_cat_agg = df_cat_agg.add_prefix('{}_'.format(df_name))

    return df_cat_agg


'''
prints detailed categorical information from the dataframe

inputs: dataframe
outputs: nothing
'''
def extract_cat_var(df):
    cat_colnames = list(df.select_dtypes('object').columns)
    for col in cat_colnames:
        print('Categorical column: {}\n'.format(col))
        print('Number of unique entries: {}\n'.format(df[col].nunique()))
        print('Unique entry names:\n{}\n'.format(df[col].unique()))
        print('Value counts of each entry:\n{}\n'.format(df[col].value_counts(dropna=False)))
        print('---------------------------------------------------------')
    return


'''
Performs Label encoding with a default of two unique entries per category
'''
def label_encoding_df(df, cat_limit=2):
    le = preprocessing.LabelEncoder()
    le_count = 0
    label_encode_list = []
    for col in df:
        if df[col].dtype == 'object':
            if df[col].nunique(dropna=False) <= cat_limit:
                print(col)
                le_count += 1
                le.fit(df[col])
                df[col] = le.transform(df[col])
                label_encode_list.append(col)
    print('{0} columns were label encoded'.format(le_count))
    return df, label_encode_list


'''
Given a data frame and a list of feature variables,
* Produce the KDE and histogram plots for the Target=True and Target=False populations.
* Statistical differences between both populations.

Use this function to graphically evaluate whether certain feature variables exhibit
different characteristics for the Target=True and Target=False populations
'''
def plot_kde_hist_var(df, varList, calcStat=True, drawAll=False):
    numVar = len(varList)
    plt.figure(figsize=(10, numVar*4))
    ks_stat_list = []
    ks_pval_list = []
    ks_hval_list = []
    try:
        for i, var in enumerate(varList):
            tgt_true = df.loc[df['TARGET'] == 1, var]
            tgt_false = df.loc[df['TARGET'] == 0, var]

            # calculate statistical significance between both populations
            if calcStat == True:
                (ks_stat, ks_pval) = sp.stats.ks_2samp(tgt_true, tgt_false)
                ks_stat_list.append(ks_stat)
                ks_pval_list.append(ks_pval)
                ks_hval_list = [True for hyp in ks_pval_list if hyp < 0.05]

            # median values for each target population and correlation with the target
            median_tgt_true = tgt_true.median()
            median_tgt_false = tgt_false.median()
            corrVal = df['TARGET'].corr(df[var])
            print('Median Value of {} when Target (True): {:.6f}'.format(var, median_tgt_true))
            print('Median Value of {} when Target (False): {:.6f}'.format(var, median_tgt_false))
            print('Pearson Correlation of {} with Target (True): {:.6f}'.format(var, corrVal))

            # drawing KDE distributions
            tgt_true.dropna(inplace=True)  # require to dropna for sns.distplot function
            tgt_false.dropna(inplace=True)
            plt.subplot(numVar, 1, i+1)
            sns.distplot(tgt_true, rug=drawAll, kde=drawAll, label='Target: True')
            sns.distplot(tgt_false, rug=drawAll, kde=drawAll, label='Target: False')
            plt.legend()
            #plt.title(var)
    except TypeError as error:
        print(error)
        print('Features are objects. Need ints/floats')

    return ks_hval_list, ks_pval_list


'''
Given a dataframe and a list of feature variables, the histogram of the
feature variables is produced
'''
def plot_hist_var(df, varList):
    numVar = len(varList)
    plt.figure(figsize=(10, numVar*4))
    for i, var in enumerate(varList):
        df[var].hist()
    return


'''
Given a dataframe, information regarding the missing/null values of the
dataframe is produced.
'''
def print_tab_miss_val(df, miss_val_thresh=50, numColPrint=10, printData=False):
    # Evaluate missing values in the data
    num_miss_val = df.isnull().sum()
    pct_miss_val = num_miss_val/df.shape[0]*100
    tab_miss_val = pd.concat([num_miss_val, pct_miss_val], axis=1)
    tab_miss_val.columns = ['Missing Values', 'Percentage']
    tab_miss_val = tab_miss_val[tab_miss_val['Missing Values'] > 0]
    tab_miss_val['Percentage'] = tab_miss_val['Percentage'].round(1)
    tab_miss_val.sort_values(['Percentage'], ascending=False, inplace=True)

    numCol_miss_val = tab_miss_val.shape[0]
    numCol_total = df.shape[1]
    pctCol_miss_val = round((numCol_miss_val/numCol_total)*100)
    numCol_crit_miss_val = tab_miss_val[tab_miss_val['Percentage'] > miss_val_thresh].shape[0]
    pctCol_crit_miss_val = round(numCol_crit_miss_val/numCol_total*100)

    info_miss_val = pd.Series(data=[numCol_miss_val, pctCol_miss_val, numCol_crit_miss_val, pctCol_crit_miss_val],
                              index=['Cols Missing Values', 'Cols Missing Values (%)',
                                     'Cols Critical Missing Values', 'Cols Critical Missing Values (%)'])

    if printData == True:
        print(info_miss_val)
        print('\n Top {} columns with missing values is as follows:'.format(numColPrint))
        print(tab_miss_val['Percentage'].head(numColPrint))

    return info_miss_val, tab_miss_val


# basic helper function to help print values that are in a series dataformat
def convSeries2Str(seriesData):
    strList = ''
    for idx, val in seriesData.iteritems():
        strVal = '{}({}), '.format(idx, val)
        strList = strList + strVal
    return strList


'''
prints basic information regarding the dataframe
'''
def print_basic_info_df(df, bal_thresh=30):
    (numRow, numCol) = df.shape
    memory = int(sys.getsizeof(df)/(10**6))
    dtypeVals = df.dtypes.value_counts()
    dtypeStr = convSeries2Str(dtypeVals)

    # Extract the unique variables of each column that are strings, and extract the unique variables including NaNs
    catVals = df.select_dtypes('object').nunique(dropna=False)
    catStr = convSeries2Str(catVals)

    # Is the dataframe balanced?
    if 'TARGET' in df:
        (numRow, numCol) = df.shape
        pctTarget_true = int(df['TARGET'].sum()/numRow*100)
        if pctTarget_true > 100-bal_thresh or pctTarget_true < bal_thresh:
            isBalanced = 'No'
        else:
            isBalanced = 'True'
    else:
        isBalanced = 'N/A'
        pctTarget_true = 'N/A'

    series_data = [numRow, numCol, dtypeStr, memory, pctTarget_true, isBalanced, catStr]
    series_idx = ['Num rows', 'Num cols', 'Dtype', 'Memory (MB)', 'True (%)', 'Is Balanced', 'Categorical cols']
    series_info = pd.Series(series_data, index=series_idx)

    dict_info = [{'Num rows': numRow, 'Num cols': numCol, 'Dtype': dtypeStr,
                  'Memory (MB)': memory, 'True (%)': pctTarget_true, 'Is Balanced': isBalanced,
                  'Category cols': catStr}]

    return series_info


'''
Provides a comparison of two dataframes.
Used to compare characteristics between a test and training dataset.
'''
def print_compare_df(df1, df2, miss_val_thresh=50, bal_thresh=30, printCompareData=False):
    # Prints combined basic data of each dataframe
    df1_basicinfo = print_basic_info_df(df1)
    df2_basicinfo = print_basic_info_df(df2)
    comb_basic_info = pd.concat([df1_basicinfo, df2_basicinfo], axis=1)

    # Compare missing value data
    miss_val_info_df1, miss_val_tab_df1 = print_tab_miss_val(df1)
    miss_val_info_df2, miss_val_tab_df2 = print_tab_miss_val(df2)
    comb_miss_val_info = pd.concat([miss_val_info_df1, miss_val_info_df2], axis=1)

    s1 = set(df1.dtypes)
    s2 = set(df2.dtypes)

    # Compare two dataframes for number of missing categories, and values in each category.
    # As the training and test datasets are of different sizes, the training dataset may have values
    # in the feature columns that are not in the test datasets.
    # This code flags feature columns whose number of unique values differs (by fewer than 5)
    # between the test and training datasets.
    if s1 == s2:
        for x in list(s1):
            df1_catCols = df1.select_dtypes(x).nunique(dropna=False)
            df2_catCols = df2.select_dtypes(x).nunique(dropna=False)
            diff_catColsList = df1_catCols - df2_catCols
            diff_catCols = diff_catColsList[(diff_catColsList < 5) & (diff_catColsList > -5) & (diff_catColsList != 0)]
            for y in diff_catCols.index:
                df1_valCnt = df1[y].value_counts()
                df1_valCnt.name = df1_valCnt.name + '_DF1'
                df2_valCnt = df2[y].value_counts()
                df2_valCnt.name = df2_valCnt.name + '_DF2'
                comb_valCnt = pd.concat([df1_valCnt, df2_valCnt], axis=1)
                if printCompareData == True:
                    print(comb_valCnt)
                    plt.figure()
                    comb_valCnt.plot.bar(rot=60, title=y)

    return comb_basic_info, comb_miss_val_info, miss_val_tab_df1, miss_val_tab_df2


'''
Returns the column name if a certain value occurs in any column of the dataframe.
Returns data on the frequency of that value in the column.
Used when dataframes contain certain types of values to denote NaNs.

Inputs:
    df
    val

Outputs:
    df_errCol
    errCol_list
'''
def chk_val_col(df, val):
    errCol_list = [x for x in df if val in df[x].unique()]
    errPct_list = []
    for errCol in errCol_list:
        numAll = df.shape[0]
        numErr = df[df[errCol] == val].shape[0]
        errPct_list.append(numErr/numAll*100)

    df_errCol = pd.DataFrame(data=errPct_list, index=errCol_list, columns=['Error val %'])
    errCol_Pct_list = list(zip(errCol_list, errPct_list))

    return df_errCol, errCol_list


'''
Replaces all error values in a specified list of columns in a dataframe with np.NaN

Inputs:
    df: DataFrame
    errCol_list: List of column names in the DataFrame where the error values are
    errVal: The error value

Outputs:
    df: Returns a dataframe with all the error values in each specified column replaced with np.NaN
'''
def fill_errorVal_df(df, errCol_list, errVal):
    for errCol in errCol_list:
        df[errCol].replace({errVal: np.nan}, inplace=True)
    return df


'''
Plots a bar chart of the most/least important features in a dataset after a Random Forest/GBT model fit.

Inputs:
    df: DataFrame with a column named `Importance` that was extracted from the Random Forest/GBT feature importance
    numFeat: Number of top/bottom features to produce in the plot

Outputs:
    Produces the most important and least important features in the DataFrame.
'''
def plot_feat_importance(df, numFeat=10):
    df = df.sort_values('Importance', ascending=False).reset_index()
    top_feat = df.head(numFeat)
    bottom_feat = df.tail(numFeat)

    fig, axes = plt.subplots(1, 2, figsize=(15, 10))

    ax0 = sns.barplot(x='Feature', y='Importance', data=top_feat, ax=axes[0])
    ax0.set_title('Top {} features'.format(numFeat))
    for item in ax0.get_xticklabels():
        item.set_rotation(90)

    ax1 = sns.barplot(x='Feature', y='Importance', data=bottom_feat, ax=axes[1])
    for item in ax1.get_xticklabels():
        item.set_rotation(90)
    ax1.set_title('Bottom {} features'.format(numFeat))

    return
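As a usage illustration (not part of the module), a typical call sequence might look like the sketch below. It assumes rand_eda.py is importable as rand_eda, that app_train/app_test are hypothetical training and test DataFrames with a binary TARGET column in the training set, and that -999 and the feature names are placeholder values.

import rand_eda as eda

# basic shape/dtype/balance information, plus a train-vs-test comparison
basic_info = eda.print_basic_info_df(app_train)
comb_info, comb_miss, miss_train, miss_test = eda.print_compare_df(app_train, app_test)

# find columns containing a sentinel error value and replace it with np.nan
err_tab, err_cols = eda.chk_val_col(app_train, -999)
app_train = eda.fill_errorVal_df(app_train, err_cols, -999)

# label encode two-state categorical columns, then compare feature distributions across target states
app_train, encoded_cols = eda.label_encoding_df(app_train)
ks_h, ks_p = eda.plot_kde_hist_var(app_train, ['feature_1', 'feature_2'])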
Feature Engineering
Use expert knowledge to create additional features.
Use sklearn.preprocessing.PolynomialFeatures to create additional interaction and polynomial features (a minimal sketch is shown after this list).
Ensure that both the test and training datasets have the same number of features
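Here is a minimal sketch of the interaction/polynomial step, assuming the same hypothetical df_train/df_test DataFrames and two illustrative numeric columns. The transformer is fitted on the training data only, and the two datasets are aligned afterwards so they keep the same feature columns (set the TARGET column aside before aligning). Note that newer scikit-learn versions replace get_feature_names with get_feature_names_out.

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

poly_cols = ['feature_1', 'feature_2']                  # illustrative column names
poly = PolynomialFeatures(degree=2, include_bias=False)

poly_train = poly.fit_transform(df_train[poly_cols])    # fit on the training data only
poly_test = poly.transform(df_test[poly_cols])
poly_names = poly.get_feature_names(poly_cols)

df_train = df_train.join(pd.DataFrame(poly_train, columns=poly_names, index=df_train.index), rsuffix='_poly')
df_test = df_test.join(pd.DataFrame(poly_test, columns=poly_names, index=df_test.index), rsuffix='_poly')

# ensure both datasets end up with the same feature columns
df_train, df_test = df_train.align(df_test, join='inner', axis=1)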