Model validation for linear regression models

This is an overview of the diagnostic and performance tests needed to ensure the validity of a linear regression model with one or more continuous or categorical independent variables. It applies to linear regression models estimated using Ordinary Least Squares (OLS) or Maximum Likelihood Estimation (MLE). The equation for the linear regression model is as follows:

\begin{equation*} Y_{i} = \alpha + \beta_{1} x_{1i} + \beta_{2} x_{2i} + \beta_{3} x_{3i} + \ldots + \epsilon_{i} \text{, where } \epsilon_{i} \sim N(0,\sigma) \end{equation*}
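As a minimal sketch of the model above, the following fits the coefficients by OLS on simulated data. NumPy, the simulated values, and every variable name here are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: 200 observations, two independent variables.
n = 200
x = rng.normal(size=(n, 2))
alpha_true, beta_true = 1.0, np.array([2.0, -0.5])
y = alpha_true + x @ beta_true + rng.normal(scale=0.3, size=n)

# Design matrix with a leading column of ones for the intercept alpha.
X = np.column_stack([np.ones(n), x])

# OLS estimate: the coefficients that minimize the sum of squared residuals.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ coef

print("alpha, beta_1, beta_2:", coef)
```

The residuals from this fit are the raw material for most of the diagnostics discussed in the rest of this overview.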

The assumptions of the linear regression model are:


Relationship between dependent and independent variables is linear.


Error term is normally distributed.


Error terms are statistically independent.


Error term has constant variance for all observations.

There is no multicollinearity, i.e., no excessive correlation between independent variables.

Data Diagnostics

The data used for modelling should be evaluated for the following:

  1. Compliance with relevant regulatory requirements

    Often these requirements refer to data length requirements for different types of portfolios, ensuring the data length is representative of the economic cycle, and requirements for use of data proxies (e.g., BCC13-5 [Conservatism to risk parameters in Advanced Approaches], BCC14-3 [Selection of reference data periods and data deficiencies]).

  2. Outliers, missing or special values.

    Outliers or influential data points should be identified (e.g., using Cook's distance), and model performance should be evaluated with these points excluded.
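One way to carry out the outlier check is sketched below: compute Cook's distance for every observation, flag the influential ones, and refit without them. The data are simulated with one planted outlier, and the 4/n cut-off is a common rule of thumb assumed here, not a requirement.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each row of an OLS fit; X must include the intercept column."""
    n, p = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    # Leverage h_ii: diagonal of the hat matrix X (X'X)^-1 X'.
    leverage = np.einsum("ij,ji->i", X, np.linalg.pinv(X))
    s2 = resid @ resid / (n - p)  # residual variance estimate
    return resid ** 2 / (p * s2) * leverage / (1.0 - leverage) ** 2

rng = np.random.default_rng(1)
n = 50
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
x[0], y[0] = 4.0, -10.0  # planted influential point (hypothetical)

X = np.column_stack([np.ones(n), x])
d = cooks_distance(X, y)
flagged = np.flatnonzero(d > 4.0 / n)  # assumed rule-of-thumb threshold

# Compare the fit with and without the flagged observations.
coef_all, *_ = np.linalg.lstsq(X, y, rcond=None)
keep = np.setdiff1d(np.arange(n), flagged)
coef_clean, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
print("slope with all points:", coef_all[1], "without flagged:", coef_clean[1])
```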

Model Diagnostics

Goodness of fit

These tests evaluate how well a regression model fits the data. They comprise formal regression statistics and descriptive fit statistics, which together assess the statistical significance of the independent variables individually and as a whole.



Adj. $R^2$

The Adj. $R^2$ measures the strength of the relationship between the independent and dependent variables: the share of the variation in the dependent variable that is explained by the independent variables, adjusted for the number of independent variables (degrees of freedom). It cannot exceed 100% and, unlike the unadjusted $R^2$, can be negative for a very poorly fitting model. Generally an Adj. $R^2$ below 15% is considered very low. It is a useful metric for comparing different model specifications.

Root Mean Square Error (RMSE)

RMSE is the standard deviation of the error term, that is, the square root of the mean square error. It measures the typical size of the difference between the predicted and actual values. Because it has the same unit of measurement as the quantity being estimated by the regression function, RMSE can be directly linked to business usefulness. It is a useful metric for comparing different model specifications.
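Both fit statistics can be computed directly from the actual and predicted values. The sketch below assumes the degrees-of-freedom-corrected form of RMSE (dividing by n - k - 1, where k is the number of independent variables); the function names and the five-point data set are made up for illustration.

```python
import numpy as np

def adjusted_r2(y, y_hat, k):
    """Adjusted R^2 for a model with k independent variables plus an intercept."""
    n = len(y)
    sse = np.sum((y - y_hat) ** 2)          # unexplained variation
    sst = np.sum((y - np.mean(y)) ** 2)     # total variation
    r2 = 1.0 - sse / sst
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

def rmse(y, y_hat, k):
    """Square root of the mean square error, using n - k - 1 degrees of freedom."""
    n = len(y)
    sse = np.sum((y - y_hat) ** 2)
    return np.sqrt(sse / (n - k - 1))

# Tiny illustrative data set with one independent variable.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
slope, intercept = np.polyfit(x, y, 1)  # polyfit returns the highest degree first
y_hat = intercept + slope * x

print(adjusted_r2(y, y_hat, k=1))  # about 0.467
print(rmse(y, y_hat, k=1))         # about 0.894
```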


t-test

Tests the significance of each independent variable via the null hypothesis that the respective coefficient is equal to zero. Higher absolute t-values suggest rejection of the null hypothesis and indicate that the variable is appropriately included. Critical values are based on the t distribution and the selected confidence level. A high F-value with low t-values can be suggestive of multicollinearity across the independent variables.


F-test

Tests the null hypothesis that none of the independent variables explain the variation of the dependent variable. Higher F-values suggest rejection of the null hypothesis and indicate good model fit. Critical values are based on the F distribution and the selected confidence level (e.g., 95%).
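The t-values for individual coefficients and the overall F-value can be computed from the same OLS fit. The following is a sketch on simulated data in which the first variable matters and the second does not; the function name and data are assumptions. The statistics would then be compared against critical values of the t and F distributions, which are not computed here.

```python
import numpy as np

def t_and_f_statistics(X, y):
    """Return (t_values, f_value) for an OLS fit; X must hold the intercept in column 0."""
    n, p = X.shape                       # p counts the intercept
    xtx_inv = np.linalg.inv(X.T @ X)
    coef = xtx_inv @ X.T @ y
    resid = y - X @ coef
    sse = resid @ resid
    s2 = sse / (n - p)                   # residual variance estimate
    # t-value: coefficient divided by its standard error.
    t_values = coef / np.sqrt(s2 * np.diag(xtx_inv))
    # F-value: explained variation per regressor over residual variance.
    sst = np.sum((y - y.mean()) ** 2)
    f_value = ((sst - sse) / (p - 1)) / s2
    return t_values, f_value

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                  # irrelevant variable by construction
y = 0.5 + 2.0 * x1 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
t_values, f_value = t_and_f_statistics(X, y)
print("t-values:", t_values)
print("F-value:", f_value)
```

In this setup the t-value for `x1` should be large in absolute terms, while the one for `x2` should usually stay near zero.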


p-values

The p-value is the probability, assuming the null hypothesis is true, of obtaining a test statistic at least as extreme as the one computed from the data. For the overall model, the null hypothesis is that none of the independent variables explain the variation of the dependent variable, i.e., that the model does not have predictive value. Small p-values are desired, as they suggest rejection of the null hypothesis; with high p-values the null hypothesis cannot be rejected. A significance level of 0.05 is often used to reject the null hypothesis.


Independent variables may also be added to or omitted from the model in order to evaluate the fit diagnostics of the resulting model. Each independent variable's individual contribution is assessed by examining the statistical significance of its coefficient (i.e., t-test), and the overall model fit is assessed via the Adj. $R^2$, RMSE, or F-test.

Linearity tests



Ramsey RESET Test

The null hypothesis is that the regression relationships are linear. An additional regression is run of the dependent variable against the independent variables and second-order powers of the predicted values. An F-test is applied to the additional regression. If the F-statistic exceeds a threshold (i.e., one or more nonlinear terms are significant), the null hypothesis can be rejected. This test is not to be used when the independent variables are categorical.
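The RESET procedure can be sketched as two regressions and one F-statistic. The version below adds only the squared fitted values (one extra term) and is run on simulated data, once with a truly linear relationship and once with a quadratic one; the names and data are assumptions for illustration.

```python
import numpy as np

def fit_sse(X, y):
    """Return (sum of squared residuals, fitted values) of an OLS fit."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ coef
    resid = y - fitted
    return resid @ resid, fitted

def reset_f_statistic(X, y):
    """F-statistic for adding the squared fitted values to the regression."""
    n, p = X.shape
    sse_restricted, y_hat = fit_sse(X, y)
    X_aug = np.column_stack([X, y_hat ** 2])   # augmented regression
    sse_unrestricted, _ = fit_sse(X_aug, y)
    q = 1                                       # number of added terms
    return ((sse_restricted - sse_unrestricted) / q) / (sse_unrestricted / (n - p - q))

rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
noise = rng.normal(scale=0.5, size=n)

f_linear = reset_f_statistic(X, 1.0 + x + noise)
f_nonlinear = reset_f_statistic(X, 1.0 + x + 0.5 * x ** 2 + noise)
print("F (linear data):", f_linear)
print("F (quadratic data):", f_nonlinear)
```

The statistic would be compared with the critical value of an F distribution with (q, n - p - q) degrees of freedom; for one added term and a large sample, the 5% critical value is roughly 3.84.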

Chow Test

The null hypothesis is that there is no structural break in the data. On a graphical or theoretical basis, the data is split into two samples and a regression is run on each sample. The Chow test is used to evaluate whether the model parameters from the two data samples are statistically similar. Evidence of a structural break means that the model may need to be estimated using a different specification (e.g., spline functions) or different data (e.g., data subsets, data exclusions).
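A sketch of the Chow statistic follows: it compares the residual sum of squares of one pooled regression with the combined residual sum of squares of the two sub-sample regressions. The split point, function names, and simulated data (a slope that changes halfway through) are assumptions.

```python
import numpy as np

def sse(X, y):
    """Sum of squared residuals of an OLS fit."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return resid @ resid

def chow_f_statistic(X, y, split):
    """Chow F-statistic for a break at row index `split`; X includes the intercept."""
    n, k = X.shape                       # k parameters per regression
    sse_pooled = sse(X, y)
    sse_split = sse(X[:split], y[:split]) + sse(X[split:], y[split:])
    return ((sse_pooled - sse_split) / k) / (sse_split / (n - 2 * k))

rng = np.random.default_rng(4)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
noise = rng.normal(scale=0.5, size=n)

# No break: the same slope throughout. Break: the slope changes from 1 to 3 at row 100.
y_stable = 1.0 + 1.0 * x + noise
y_break = 1.0 + np.where(np.arange(n) < 100, 1.0, 3.0) * x + noise

f_stable = chow_f_statistic(X, y_stable, split=100)
f_break = chow_f_statistic(X, y_break, split=100)
print("F (no break):", f_stable)
print("F (break):", f_break)
```

The statistic would be compared with an F distribution with (k, n - 2k) degrees of freedom.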

Dependent vs Independent variable plot

Scatter plots of the dependent variable on the y-axis against an independent variable on the x-axis can indicate non-linear relationships in the data. Sometimes it may be necessary to apply sampling, binning, or moving averages to make the relationship easier to see and evaluate for non-linearities. A boxplot works well when plotting a categorical independent variable against a continuous dependent variable. A categorical independent variable with predictive characteristics will have different dependent variable means/distributions across categories.

Residual Plot

Scatter plots of the residuals of the regression model on the y-axis against an independent variable on the x-axis can indicate non-linear relationships in the data. A boxplot works well for a categorical independent variable, showing the residual distribution of each category. Residuals should have means that are close to zero and constant variance across categories.
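The same idea can be checked numerically without a plot by binning the residuals along the independent variable and comparing the bin means. The sketch below deliberately fits a straight line to simulated quadratic data, so the residual means should be positive in the outer bins and negative in the middle one; the data and the choice of three bins are assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 300
x = rng.uniform(-2.0, 2.0, size=n)
y = x ** 2 + rng.normal(scale=0.3, size=n)   # truly non-linear relationship

# Fit a (misspecified) straight line and take the residuals.
X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef

# Split the observations into three equal-sized bins ordered by x.
order = np.argsort(x)
bins = np.array_split(order, 3)
bin_means = [float(resid[b].mean()) for b in bins]
bin_vars = [float(resid[b].var()) for b in bins]

print("mean residual per bin:", bin_means)
print("residual variance per bin:", bin_vars)
```

For a well-specified model, the bin means would all be close to zero and the bin variances roughly equal.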

Multicollinearity tests

Multicollinearity is more important when you are trying to explain or find a relationship between the dependent and independent variables. It is less important if you are focused on prediction.

Variance Inflation Factors

A VIF between 1 and 5 indicates moderate correlation; a VIF above 5 indicates excessive correlation. Where correlation is excessive, the options are:

  • Remove some of the highly correlated independent variables

  • Combine the independent variables linearly (e.g., adding together)

  • Use principal components analysis or partial least squares regression
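The VIF of an independent variable is 1 / (1 - R²), where R² comes from regressing that variable on all the other independent variables. A sketch under assumed, simulated data (two nearly identical variables and one unrelated variable) is shown below; the function name is hypothetical.

```python
import numpy as np

def variance_inflation_factors(x):
    """VIF per column of x, an (n, k) array of independent variables without intercept."""
    n, k = x.shape
    vifs = np.empty(k)
    for j in range(k):
        target = x[:, j]
        # Regress column j on an intercept plus all remaining columns.
        others = np.column_stack([np.ones(n), np.delete(x, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

rng = np.random.default_rng(5)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # nearly a copy of x1
x3 = rng.normal(size=n)                   # unrelated to the others

vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print("VIFs:", vifs)
```

Here `x1` and `x2` should show VIFs well above 5, while `x3` should stay close to 1.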

Stationarity testing

Stationarity testing is important for both the independent and dependent variables, because regressing one non-stationary variable on another can lead to spurious regressions.

For non-stationary variables, apply co-integration.

A stationary time series is one whose statistical properties, such as mean, variance, and autocorrelation, are all constant over time. Common tests for stationarity are:

  1. Augmented Dickey–Fuller (ADF)

  2. Kwiatkowski–Phillips–Schmidt–Shin (KPSS)

  3. Phillips–Perron test (PP)
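To illustrate the mechanics behind the first test, the sketch below implements the simple (non-augmented) Dickey-Fuller regression with a constant: the change in the series is regressed on its lagged level, and the t-value of the lag coefficient is the test statistic. This is a simplification of ADF (no lagged differences are included), and the simulated series and function name are assumptions. In practice a statistics library's implementation of ADF, KPSS, or PP would normally be used.

```python
import numpy as np

def dickey_fuller_stat(series):
    """t-value of gamma in: delta_y[t] = a + gamma * y[t-1] + error."""
    y = np.asarray(series, dtype=float)
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])  # constant and lagged level
    xtx_inv = np.linalg.inv(X.T @ X)
    coef = xtx_inv @ X.T @ dy
    resid = dy - X @ coef
    s2 = resid @ resid / (len(dy) - 2)
    return coef[1] / np.sqrt(s2 * xtx_inv[1, 1])

rng = np.random.default_rng(7)
white_noise = rng.normal(size=500)               # stationary
random_walk = np.cumsum(rng.normal(size=500))    # non-stationary (unit root)

stat_wn = dickey_fuller_stat(white_noise)
stat_rw = dickey_fuller_stat(random_walk)
print("white noise:", stat_wn)
print("random walk:", stat_rw)
```

The statistic does not follow the usual t distribution. With a constant in the regression, the large-sample 5% Dickey-Fuller critical value is about -2.86, and a statistic below it rejects the null hypothesis of a unit root.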


Co-integration

When you have two non-stationary processes (e.g., $X_1$ and $X_2$), there may be a vector (the co-integration vector) that combines them into a stationary process. Essentially, the stochastic trends in $X_1$ and $X_2$ are the same and cancel out in that combination.
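A two-step sketch in the spirit of the Engle-Granger approach illustrates this: regress one series on the other, then check whether the residuals are stationary. The two simulated series share one random-walk trend by construction, the stationarity check is a simple non-augmented Dickey-Fuller regression, and all names are assumptions.

```python
import numpy as np

def dickey_fuller_stat(series):
    """t-value of gamma in: delta_y[t] = a + gamma * y[t-1] + error."""
    y = np.asarray(series, dtype=float)
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])
    xtx_inv = np.linalg.inv(X.T @ X)
    coef = xtx_inv @ X.T @ dy
    resid = dy - X @ coef
    s2 = resid @ resid / (len(dy) - 2)
    return coef[1] / np.sqrt(s2 * xtx_inv[1, 1])

rng = np.random.default_rng(8)
n = 500
trend = np.cumsum(rng.normal(size=n))             # shared stochastic trend
x1 = trend + rng.normal(scale=0.5, size=n)
x2 = 2.0 * trend + rng.normal(scale=0.5, size=n)

# Step 1: estimate the co-integrating relationship x1 = a + b * x2.
X = np.column_stack([np.ones(n), x2])
coef, *_ = np.linalg.lstsq(X, x1, rcond=None)
spread = x1 - X @ coef                            # trend should cancel here

# Step 2: test the residual (spread) for stationarity.
stat_x1 = dickey_fuller_stat(x1)
stat_spread = dickey_fuller_stat(spread)
print("estimated b:", coef[1])
print("x1 alone:", stat_x1, "spread:", stat_spread)
```

Because the spread comes from an estimated regression, the appropriate critical values are more negative than the standard Dickey-Fuller ones, so the statistic should be judged against co-integration-specific tables.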

