Model validation for linear regression models

This is an overview of the diagnostic and performance tests that should be performed to ensure the validity of a linear regression/ordinary least squares model with one or more continuous or categorical independent variables. These tests apply to linear regression models estimated using Ordinary Least Squares (OLS) or Maximum Likelihood Estimation (MLE). The equation for the linear regression model is as follows:

\begin{equation*} Y_{i} = \alpha + \beta_{1} x_{1i} + \beta_{2} x_{2i} + \beta_{3} x_{3i} + \ldots + \epsilon_{i} \text{, where } \epsilon_{i} \sim N(0,\sigma^{2}) \end{equation*}

The assumptions of the linear regression model are:

Linearity

Relationship between dependent and independent variables is linear.

Normality

Error term is normally distributed.

Independence

Error terms are statistically independent.

Homoscedasticity

Error term has constant variance for all observations.

Lack of multicollinearity

No excessive correlation between independent variables.

Fitting the regression line

We can use a simple model of the following form to fit a straight line to measured data.

\begin{equation*} \hat{y}=b_{0}+b_{1}x \end{equation*}

The fitted line is determined by the method of "least squares". The least squares fit minimizes the sum of squared deviations from the fitted line.

\begin{equation*} \text{minimize} \sum (y_{i}-\hat{y}_{i})^2 \end{equation*}

The deviations are also called "residuals", so we are minimizing the sum of squared residuals (i.e., the residual sum of squares). Combining the two equations, we obtain the following:

\begin{equation*} \text{minimize} \sum (y_{i}-(b_{0}+b_{1}x_{i}))^2 \end{equation*}

We can also write the regression in matrix form and solve it in closed form.

\begin{equation*} y = X \beta + \epsilon \end{equation*}

\(X\beta\) is the systematic component and \(\epsilon\) is the stochastic (i.e., random) component, and we are trying to find the parameters of the \(\beta\) vector. We are trying to minimize the sum of squared residuals, so let's express the equation using \(\epsilon\).

\begin{equation*} \epsilon = y-X \hat{\beta} \end{equation*}

The sum of squared residuals is as follows:

\begin{equation*} \epsilon ^\intercal \epsilon = (y-X \hat{\beta})^\intercal(y-X\hat{\beta}) \end{equation*}
\begin{equation*} \epsilon ^\intercal \epsilon = y ^\intercal y -\hat{\beta}^\intercal X^\intercal y - y^\intercal X \hat{\beta} + \hat{\beta}^\intercal X^\intercal X \hat{\beta} \end{equation*}
\begin{equation*} \epsilon ^\intercal \epsilon = y ^\intercal y - 2 \hat{\beta}^\intercal X^\intercal y + \hat{\beta}^\intercal X^\intercal X \hat{\beta} \end{equation*}

To obtain \(\hat{\beta}\), we differentiate the sum of squared residuals with respect to \(\hat{\beta}\) and set the result equal to zero as follows:

\begin{equation*} \frac{\partial \epsilon^\intercal \epsilon}{\partial \hat{\beta}} = -2 X^\intercal y + 2 X^\intercal X \hat{\beta} \end{equation*}
\begin{equation*} -2 X^\intercal y + 2 X^\intercal X \hat{\beta} = 0 \end{equation*}
\begin{equation*} 2 X^\intercal X \hat{\beta} = 2 X^\intercal y \end{equation*}

Solving for \(\hat{\beta}\), we obtain the following:

\begin{equation*} \hat{\beta} = (X^\intercal X)^{-1} X^\intercal y \end{equation*}
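
As a quick illustration, here is a minimal NumPy sketch of the closed-form solution on simulated data (the variable names and simulated coefficients are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated data: y = 2 + 3*x1 - 1.5*x2 + noise
n = 200
x = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), x])          # add intercept column
beta_true = np.array([2.0, 3.0, -1.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Closed-form OLS: beta_hat = (X'X)^{-1} X'y
# (np.linalg.solve is preferred over an explicit inverse for numerical stability)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # should be close to [2.0, 3.0, -1.5]
```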

Note:

  1. There are also non-linear least squares problems, where numerical algorithms are used to find the values of the \(\beta\) parameters that minimize the objective function. The Jacobian, the matrix of all first-order partial derivatives, is used in the optimization of the objective function.

  2. Weighted least squares (WLS) regression is used when there is heteroscedasticity in the error terms of the model.

  3. Generalized Least Squares (GLS) allows for estimation of \(\beta\) when there is heteroscedasticity or correlation amongst the error terms (i.e., the residuals are not iid). To handle heteroscedasticity when error terms are uncorrelated, GLS minimizes a weighted analogue of the sum of squared residuals from OLS regression, where the weight for the \(i^{th}\) observation is inversely proportional to \(\sigma^2(\epsilon_{i})\). The GLS solution to the estimation problem is

\begin{equation*} \hat{\beta} = (X^\intercal \Omega^{-1} X)^{-1} X^\intercal \Omega^{-1} y, \end{equation*}

where \(\Omega\) is the covariance matrix of the errors. GLS can be viewed as applying a linear transformation to the data so that the OLS assumptions are met for the transformed data. A WLS/GLS estimation sketch follows these notes.
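
As a sketch of WLS and GLS estimation using statsmodels (the weights and covariance structure below are illustrative assumptions, not prescribed values):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=(n, 1))
X = sm.add_constant(x)

# Illustrative heteroscedastic errors: variance grows with |x|
sigma_i = 0.5 + np.abs(x[:, 0])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=sigma_i)

# WLS: weights inversely proportional to the error variance
wls_res = sm.WLS(y, X, weights=1.0 / sigma_i**2).fit()

# GLS: pass the (assumed known) error covariance matrix Omega
omega = np.diag(sigma_i**2)
gls_res = sm.GLS(y, X, sigma=omega).fit()

print(wls_res.params, gls_res.params)
```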

Data Diagnostics

The data used for modelling should be evaluated for the following:

  1. Compliance with relevant regulatory requirements

    Often these requirements cover minimum data lengths for different types of portfolios, ensuring the data are representative of the economic cycle, and the use of data proxies (e.g., BCC13-5 [Conservatism to risk parameters in Advanced Approaches], BCC14-3 [Selection of reference data periods and data deficiencies]).

  2. Outliers, missing, or special values.

    Outliers or influential data points should be identified (e.g., using Cook's distance), and model performance should be evaluated with these points excluded. A Cook's distance sketch follows this list.
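
A minimal sketch of flagging influential points with Cook's distance via statsmodels; the simulated data and the 4/n cut-off are illustrative choices, not firm requirements:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=(n, 1))
X = sm.add_constant(x)
y = X @ np.array([0.5, 1.5]) + rng.normal(scale=1.0, size=n)
y[0] += 10.0                      # inject one influential point for illustration

results = sm.OLS(y, X).fit()
influence = results.get_influence()
cooks_d, _ = influence.cooks_distance

# Flag observations above a common rule-of-thumb threshold of 4/n
flagged = np.where(cooks_d > 4.0 / n)[0]
print(flagged)
```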

Model Diagnostics

Goodness of fit

These tests evaluate how well a regression model fits the data. They include formal regression statistics, which assess the statistical significance of the independent variables individually and jointly, as well as descriptive fit statistics.

Adj. $R^2$

The Adj. $R^2$ is a measure of the strength of the relationship between the independent and dependent variables. It measures the proportion of variation in the dependent variable that can be explained by the independent variables, adjusted for the number of independent variables/degrees of freedom. It generally lies between 0 and 100% (it can be negative for very poorly fitting models); an Adj. $R^2$ of < 15% is considered very low. Useful metric for comparing different model specifications.

Akaike Information Criterion (AIC)

Likelihood function based criterion for model selection, where model with the lowest AIC is preferred. Accounts for the trade-off between goodness-of-fit and model complexity by including a penalty term for the number of model parameters.

Bayesian Information Criterion (BIC)

Similar to AIC, except has a larger penalty term for model parameters.

Root Mean Square Error (RMSE)

The square root of the mean squared error, i.e., an estimate of the standard deviation of the error term. Measures the average difference between the predicted and actual values. As it has the same unit of measurement as the quantity being estimated by the regression function, RMSE can be directly linked to business usefulness. Useful metric for comparing different model specifications.

T-Test

Tests the significance of each independent variable under the null hypothesis that its coefficient is equal to zero. Higher absolute t-values suggest rejection of the null hypothesis and indicate that the variable is appropriately included. Critical values are based on the t distribution and the selected confidence level. A high F-value combined with low t-values can be suggestive of multicollinearity across the independent variables.

F-test

Tests the null hypothesis that none of the independent variables explain the variation of the dependent variable. Higher F-values suggest rejection of the null hypothesis and an indication of good model fit. Critical values are based on the F distribution and the selected confidence level (e.g., 95%).

P-values

Measures the probability of obtaining a test statistic at least as extreme as the one estimated from the data, given that the null hypothesis is true; here, the null hypothesis is that none of the independent variables explain the variation of the dependent variable (i.e., the model has no predictive value). Small p-values suggest rejection of the null hypothesis, while large p-values suggest it cannot be rejected. P-values are often compared against a 0.05 significance level.

Drop-out

In this test, independent variables are added to or omitted from the model and the fit diagnostics of the new model are evaluated. Each independent variable's individual contribution is assessed via the statistical significance of its coefficient (i.e., t-test), and the overall model fit via the Adj. $R^2$, RMSE, or F-test.
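
A minimal statsmodels sketch, on illustrative simulated data, that reports most of the goodness-of-fit statistics above:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=(n, 3))
X = sm.add_constant(x)
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(scale=1.0, size=n)

results = sm.OLS(y, X).fit()

print(results.summary())          # coefficients, t-stats, p-values, F-test, R^2
print("Adj. R^2:", results.rsquared_adj)
print("AIC:", results.aic, "BIC:", results.bic)
print("RMSE:", np.sqrt(results.mse_resid))   # square root of the residual mean squared error
```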

Linearity tests

Non-linearities in the relationship between the dependent and independent variables can lead to significant prediction errors, especially for predictions beyond the range of the model development data.

Ramsey RESET Test

The null hypothesis is that the regression relationship is linear. An auxiliary regression of the dependent variable is run against the independent variables and higher-order powers (e.g., squares) of the fitted values, and an F-test is applied to the added terms. If the F statistic exceeds its critical value (i.e., one or more nonlinear terms are significant), the null hypothesis can be rejected. This test should not be used when the independent variables are categorical.

Chow Test

The null hypothesis is that there is no structural break in the data. On a graphical or theoretical basis, the data are split into two samples and regressions are run on each sample. The Chow test evaluates whether the model parameters from the two data samples are statistically similar. Evidence of a structural break means that the model may need to be estimated using different specifications (e.g., spline functions) or data (e.g., data subsets, data exclusions).

Dependent vs Independent variable plot

Scatter plots of the dependent variable on the y-axis against the independent variable on the x-axis can reveal non-linear relationships in the data. Sometimes it may be necessary to apply sampling, binning, or moving averages to enhance the visualization of the relationship and evaluate non-linearities. A boxplot is preferable when plotting a categorical independent variable against a continuous dependent variable. A categorical independent variable with predictive power will show different dependent variable means/distributions across categories.

Residual Plot

Scatter plots of the residuals of the regression model on the y-axis vs the independent variable on the x-axis can indicate non-linear relationships in the data. A boxplot is preferable when plotting a categorical independent variable against the residual distribution within each category. Residuals should have means close to zero and constant variance across categories.
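
A sketch of the Ramsey RESET test, assuming a recent statsmodels version that provides `linear_reset`; the simulated quadratic relationship is purely illustrative:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import linear_reset

rng = np.random.default_rng(3)
n = 300
x = rng.uniform(-2, 2, size=n)
X = sm.add_constant(x)
y = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(scale=0.5, size=n)   # true relationship is quadratic

results = sm.OLS(y, X).fit()

# RESET: add powers of the fitted values and test their joint significance
reset_test = linear_reset(results, power=2, use_f=True)
print(reset_test)   # a small p-value indicates the linear specification is rejected
```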

Heteroscedasticity tests

Heteroscedasticity causes OLS to give undue weight to observations with large error variances, resulting in inefficient estimates, wider confidence intervals, and less reliable hypothesis testing. This makes it difficult to reliably assess the appropriateness of the regression model by simply evaluating the statistical tests and confidence intervals, and it leads to poor predictive performance, especially when inputs differ from the observed data.

Serial Residual Plot

Plot residuals on the Y-axis against the predicted values or the independent variables on the X-axis. If the dispersion of the residuals varies with the predicted values or the independent variables, this may indicate heteroscedasticity. For categorical independent variables, generate a box plot of the residual distribution within each category; in the absence of heteroscedasticity, the residuals should have zero mean and constant variance across the categories.

Breusch-Pagan

Performs an auxiliary regression of the squared residuals against the independent variables. If the test statistic (the explained sum of squares of this regression divided by 2) exceeds its chi-squared critical value, the null hypothesis of constant variance is rejected. The higher the value, the more likely that heteroscedasticity exists. Only linear forms of heteroscedasticity are detected.

White

Performs an auxiliary regression of the squared residuals against the independent variables, their squares, and their cross-products (i.e., \(X_{1} \times X_{2}, X_{1} \times X_{3}, X_{1}^{2}\)). The higher the test statistic, the more likely that heteroscedasticity exists, and non-linear forms of heteroscedasticity can be detected. If there are many independent variables, the auxiliary regression consumes many degrees of freedom, which the dataset may not be able to support, making interpretation challenging.
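
A sketch of the Breusch-Pagan and White tests with statsmodels, on simulated heteroscedastic data:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan, het_white

rng = np.random.default_rng(4)
n = 400
x = rng.normal(size=(n, 2))
X = sm.add_constant(x)
# Error variance grows with |x1| -> heteroscedastic by construction
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5 + np.abs(x[:, 0]))

results = sm.OLS(y, X).fit()

bp_lm, bp_lm_pval, bp_f, bp_f_pval = het_breuschpagan(results.resid, X)
w_lm, w_lm_pval, w_f, w_f_pval = het_white(results.resid, X)

print("Breusch-Pagan LM p-value:", bp_lm_pval)   # small p-value -> reject constant variance
print("White LM p-value:", w_lm_pval)
```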

Independence tests

Independence tests evaluate the assumption that the errors of different observations are uncorrelated or independent. Violation of independence leads to less reliable confidence intervals and hypothesis testing. Serial correlation usually occurs in time series data, but it can occur along other dimensions such as size, geography, category, portfolio, product, etc. Thus, the dimensions that may cause serial correlation need to be identified, and the observations ordered accordingly, to evaluate the existence of serial correlation.

ACF of the residuals

Plot the autocorrelation function (ACF) of the residuals ordered along the driving dimension. Significant spikes at non-zero lags indicate serial correlation.

Serial Residual Plot

Order observations according to a driving dimension (i.e., time, geography, category) on the X-axis and plot the corresponding residuals on the Y-axis. Systematic patterns (e.g., runs or spikes) indicate correlation of the error term across the driving dimension.

Ljung-Box

Tests the null hypothesis that the residuals are not autocorrelated up to a specified number of lags. The test statistic is based on the squared sample autocorrelations of the residuals and is compared against a chi-squared critical value.

Durbin-Watson

Test statistic computed on the residuals of the OLS regression under the null hypothesis that no serial correlation exists. Values range from 0 to 4, with 2 indicating no serial correlation; values below (above) 2 indicate positive (negative) serial correlation. The DW test cannot be applied if the regression contains lagged dependent variables, as this biases the statistic towards 2 even when the errors are serially correlated.
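
A sketch of the Durbin-Watson and Ljung-Box checks with statsmodels, using simulated AR(1) errors for illustration:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(5)
n = 300
x = rng.normal(size=n)
# AR(1) errors -> serial correlation by construction
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.7 * e[t - 1] + rng.normal(scale=0.5)
y = 1.0 + 2.0 * x + e

X = sm.add_constant(x)
results = sm.OLS(y, X).fit()

print("Durbin-Watson:", durbin_watson(results.resid))     # well below 2 -> positive autocorrelation
print(acorr_ljungbox(results.resid, lags=[5, 10]))        # small p-values -> reject independence

# For a visual check, plot the ACF of the residuals:
# from statsmodels.graphics.tsaplots import plot_acf
# plot_acf(results.resid, lags=20)
```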

Normality tests

Normal Probability Plot

Plots the ordered residuals against the corresponding theoretical quantiles of the normal distribution (a Q-Q plot). Departures from a straight line indicate non-normality of the error term.

Histogram of residuals

The histogram of the residuals should be roughly symmetric and bell-shaped around zero. Skewness or heavy tails indicate non-normality.

Shapiro-Wilk (SW)

Tests the null hypothesis that the residuals are drawn from a normal distribution; generally has good power for small to moderate sample sizes.

Anderson-Darling (AD)

Tests the null hypothesis of normality, placing more weight on the tails of the distribution than the Kolmogorov-Smirnov test.

Kolmogorov-Smirnov (KS)

Compares the empirical distribution of the residuals against a fully specified normal distribution (i.e., with known mean and variance) using the maximum distance between the two distribution functions.

Lilliefors (LS)

A variant of the Kolmogorov-Smirnov test for the case where the mean and variance of the normal distribution are estimated from the data rather than specified in advance.
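
A sketch of the numerical normality tests on OLS residuals, using scipy for Shapiro-Wilk, Anderson-Darling, and KS, and statsmodels for Lilliefors; the data are simulated:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.diagnostic import lilliefors

rng = np.random.default_rng(6)
n = 300
x = rng.normal(size=n)
X = sm.add_constant(x)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=n)

resid = sm.OLS(y, X).fit().resid

print("Shapiro-Wilk:", stats.shapiro(resid))
print("Anderson-Darling:", stats.anderson(resid, dist="norm"))
# KS against a normal with parameters estimated from the residuals
print("KS:", stats.kstest(resid, "norm", args=(resid.mean(), resid.std(ddof=1))))
print("Lilliefors:", lilliefors(resid, dist="norm"))

# For the normal probability plot: stats.probplot(resid, dist="norm")
```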

Multicollinearity tests

Multicollinearity matters most when the goal is to explain or quantify the relationship between the dependent and independent variables (i.e., to interpret individual coefficients). It is less of a concern when the focus is purely on prediction.

Correlation matrix

Pairwise correlations between the independent variables. High absolute correlations (e.g., above 0.8) indicate potential multicollinearity between the corresponding pairs of variables.

Condition Index

Based on the eigenvalues of the scaled design matrix; each condition index is the square root of the ratio of the largest eigenvalue to the respective eigenvalue. Condition indices above roughly 30 indicate potentially serious multicollinearity.

Variance Inflation Factor (VIF) test

The VIF for each independent variable is \(1/(1-R_{j}^{2})\), where \(R_{j}^{2}\) is obtained by regressing that variable against the remaining independent variables. A VIF of 1 indicates no correlation, VIFs between 1 and 5 indicate moderate correlation, and VIFs of > 5 indicate excessive correlation.

VIF Dropout test

If excessive VIFs are detected, re-estimate the model after applying one or more of the following remedies and confirm that the VIFs and the model fit diagnostics improve (a VIF computation sketch follows this list):

  • Remove some of the highly correlated independent variables

  • Combine the independent variables linearly (e.g., adding together)

  • Use principal components analysis or partial least squares regression
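
A sketch of computing VIFs with statsmodels; the design matrix below is illustrative and includes an intercept column, which is skipped when reporting VIFs:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(7)
n = 500
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=n)   # strongly correlated with x1
x3 = rng.normal(size=n)
X = sm.add_constant(np.column_stack([x1, x2, x3]))

# VIF for each independent variable (skip column 0, the intercept)
for j, name in enumerate(["x1", "x2", "x3"], start=1):
    print(name, variance_inflation_factor(X, j))
```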

Co-integration

When you have two non-stationary processes (i.e., $X_1$ and $X_2$), there may exist a vector (the co-integrating vector) that combines them into a stationary process. Essentially, the stochastic trends in $X_1$ and $X_2$ are the same and cancel out in the linear combination.
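
A sketch of an Engle-Granger style co-integration check using statsmodels' `coint` on simulated series that share a common stochastic trend:

```python
import numpy as np
from statsmodels.tsa.stattools import coint

rng = np.random.default_rng(8)
n = 500
# Common stochastic trend (random walk) shared by both series
trend = np.cumsum(rng.normal(size=n))
x1 = trend + rng.normal(scale=0.5, size=n)
x2 = 2.0 * trend + rng.normal(scale=0.5, size=n)

t_stat, p_value, crit_values = coint(x1, x2)
print(p_value)   # a small p-value indicates the two series are co-integrated
```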

Performance testing

Forecast error

Compare the model's predictions against actual outcomes on data not used in estimation (out-of-sample or out-of-time), using metrics such as RMSE or mean absolute error, and compare against the in-sample fit.

Model stability

Re-estimate the model on different subsamples or time windows and evaluate whether the coefficient estimates, their significance, and the model's performance remain stable.
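
A minimal sketch of an out-of-sample forecast error check; the 70/30 split and the RMSE metric are illustrative choices:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n = 500
x = rng.normal(size=(n, 2))
X = sm.add_constant(x)
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=1.0, size=n)

# Hold out the last 30% of observations for out-of-sample testing
split = int(0.7 * n)
results = sm.OLS(y[:split], X[:split]).fit()

pred = results.predict(X[split:])
rmse_oos = np.sqrt(np.mean((y[split:] - pred) ** 2))
print("In-sample RMSE:", np.sqrt(results.mse_resid))
print("Out-of-sample RMSE:", rmse_oos)   # a large gap suggests overfitting or instability
```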
