Model validation for linear regression models
This is an overview of the diagnostic and performance tests used to validate a linear regression model with one or more continuous or categorical independent variables, where the parameters are estimated by Ordinary Least Squares (OLS) or Maximum Likelihood Estimation (MLE). The equation for the linear regression model is as follows:

\(y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \epsilon\)
The assumptions of the linear regression model are:

- **Linearity**: the relationship between the dependent and independent variables is linear.
- **Normality**: the error term is normally distributed.
- **Independence**: the error terms are statistically independent.
- **Homoscedasticity**: the error term has constant variance for all observations.
- **Lack of multicollinearity**: there is no excessive correlation between the independent variables.
Fitting the regression line
We can use a simple model as follows to fit a straight line onto measured data:

\(y_i = \beta_0 + \beta_1 x_i + \epsilon_i\)

The fitted line is determined by the method of "least squares": the fit minimizes the sum of squared deviations of the observations from the fitted line. These deviations are also called "residuals", so we are minimizing the sum of squared residuals (the residual sum of squares). Combining both equations, we obtain the following:

\(\text{RSS} = \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2\)
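As a minimal sketch of the above, the straight-line least-squares fit can be computed with `numpy.polyfit`. The data here is synthetic (a known line plus noise), so the recovered coefficients should land close to the true values:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=x.size)  # true line plus noise

# polyfit with deg=1 returns the least-squares slope and intercept
slope, intercept = np.polyfit(x, y, deg=1)

residuals = y - (intercept + slope * x)
rss = np.sum(residuals**2)  # the quantity least squares minimizes
```

The fitted `slope` and `intercept` should be close to the true values 1.5 and 2.0 used to generate the data.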
We can also write the regression in matrix form and solve it as a closed-form solution:

\(y = X\beta + \epsilon\)

\(X\beta\) is the systematic component and \(\epsilon\) is the stochastic (i.e., random) component, and we are trying to find the parameter vector \(\beta\). Since we are minimizing the sum of squared residuals, let's express the equation in terms of \(\epsilon\):

\(\epsilon = y - X\beta\)

The sum of squared residuals is as follows:

\(\epsilon^{T}\epsilon = (y - X\beta)^{T}(y - X\beta)\)

To obtain \(\beta\), we differentiate the sum of squared residuals with respect to \(\beta\) and equate to zero as follows:

\(\frac{\partial (\epsilon^{T}\epsilon)}{\partial \beta} = -2X^{T}y + 2X^{T}X\beta = 0\)

Solving for \(\beta\), we obtain the following:

\(\hat{\beta} = (X^{T}X)^{-1}X^{T}y\)
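The closed-form solution can be sketched directly in numpy on synthetic data. `np.linalg.solve` is used instead of an explicit matrix inverse, which is the standard numerically stable way to evaluate \((X^{T}X)^{-1}X^{T}y\):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
# Design matrix with an intercept column and two random regressors
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Closed-form OLS estimate: solve (X'X) beta = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With low noise and 200 observations, `beta_hat` recovers the true coefficients to within a small tolerance.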
Note:
There are also nonlinear least squares problems, where numerical algorithms are used to find the values of the \(\beta\) parameters that minimize the objective function. The Jacobian, the matrix of all first-order partial derivatives, is used in the optimization of the objective function.
Weighted least squares (WLS) regression is used when there is heteroscedasticity in the error terms of the model.
Generalized Least Squares (GLS) allows for estimation of \(\beta\) when there is heteroscedasticity or correlation amongst the error terms (i.e., the residuals are not iid). To handle heteroscedasticity when the error terms are uncorrelated, GLS minimizes a weighted analogue of the sum of squared residuals from OLS regression, where the weight for the \(i^{th}\) observation is inversely proportional to \(\sigma^2(\epsilon_{i})\). The GLS solution to the estimation problem is

\(\hat{\beta} = (X^{T}\Omega^{-1}X)^{-1}X^{T}\Omega^{-1}y\)
where \(\Omega\) is the covariance matrix of errors. GLS can be viewed as applying a linear transformation to the data so that the OLS assumptions are met for the transformed data.
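A minimal numpy sketch of the GLS estimator for the uncorrelated-but-heteroscedastic case, where \(\Omega\) is diagonal. The data and the variance function `np.exp(X[:, 1])` are synthetic assumptions chosen so that the weights matter:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])
sigma2 = np.exp(X[:, 1])               # assumed heteroscedastic error variances
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=np.sqrt(sigma2))

# Diagonal Omega: errors are uncorrelated, each with its own variance
Omega_inv = np.diag(1.0 / sigma2)

# GLS estimator: beta_hat = (X' Omega^{-1} X)^{-1} X' Omega^{-1} y
beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)
```

Equivalently, one can multiply both `y` and `X` by \(\Omega^{-1/2}\) and run plain OLS on the transformed data, which is the linear-transformation view described above.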
Data Diagnostics
The data used for modelling should be evaluated for the following:
- **Compliance with relevant regulatory requirements**: often these requirements refer to data length requirements for different types of portfolios, ensuring the data length is representative of the economic cycle, and requirements for the use of data proxies (e.g., BCC135 [Conservatism to risk parameters in Advanced Approaches], BCC143 [Selection of reference data periods and data deficiencies]).
- **Outliers, missing or special values**: outliers or influential data points should be identified (e.g., via Cook's distance) and model performance should be evaluated with these outliers excluded.
Model Diagnostics
Goodness of fit
These tests evaluate how well a regression model fits the data. They include formal regression statistics and descriptive fit statistics that assess the statistical significance of the independent variables, both individually and jointly.
| Test | Description |
| --- | --- |
| Adj. $R^2$ | A measure of the strength of the relationship between the independent and dependent variables: the degree of variation in the dependent variable that can be explained by the independent variables, adjusted for the number of independent variables/degrees of freedom. Has an output between 0 and 100%. Generally an Adj. $R^2$ of < 15% is considered very low. A useful metric for comparing different model specifications. |
| Akaike Information Criterion (AIC) | Likelihood-based criterion for model selection, where the model with the lowest AIC is preferred. Accounts for the trade-off between goodness-of-fit and model complexity by including a penalty term for the number of model parameters. |
| Bayesian Information Criterion (BIC) | Similar to AIC, except it has a larger penalty term for model parameters. |
| Root Mean Square Error (RMSE) | Standard deviation of the error term; that is, the square root of the mean square error. Measures the average difference between the predicted and actual values. As it has the same unit of measurement as the quantity being estimated by the regression function, RMSE can be directly linked to business usefulness. A useful metric for comparing different model specifications. |
| t-test | Tests the significance of each independent variable via the null hypothesis that the respective coefficient is equal to zero. Higher t-values suggest rejection of the null hypothesis and indicate that the variable is appropriately included. Critical values are based on the t-distribution and the selected confidence level. A high F-value with low t-values can be suggestive of multicollinearity across the independent variables. |
| F-test | Tests the null hypothesis that none of the independent variables explain the variation of the dependent variable. Higher F-values suggest rejection of the null hypothesis and indicate good model fit. Critical values are based on the F-distribution and the selected confidence level (e.g., 95%). |
| p-values | The probability of obtaining parameter estimates at least as extreme as the model estimates, given that the null hypothesis (that none of the independent variables explain the variation of the dependent variable, i.e., that the model has no predictive value) is true. Small p-values suggest rejecting the null hypothesis; p-values are often compared against a 0.05 significance level. |
| Dropout | Independent variables are added to or omitted from the model, and the fit diagnostics of the new model are evaluated: each variable's individual contribution via the statistical significance of its coefficient (i.e., the t-test), and the overall model fit via the Adj. $R^2$, RMSE, or F-test. |
Linearity tests
Nonlinearities in the relationship between the dependent and independent variables can lead to significant prediction errors, especially for predictions beyond the model development data.
| Test | Description |
| --- | --- |
| Ramsey RESET test | The null hypothesis is that the regression relationship is linear. An additional regression of the dependent variable against the independent variables and second-order powers of the predicted values is run, and an F-test is applied to it. If the F-statistic exceeds a threshold (i.e., one or more nonlinear terms are significant), the null hypothesis can be rejected. This test should not be used when the independent variables are categorical. |
| Chow test | The null hypothesis is that there is no structural break in the data. On a graphical or theoretical basis, the data is split into two samples and regressions are run on each sample. The Chow test evaluates whether the model parameters from the two data samples are statistically similar. Evidence of a structural break means that the model may need to be estimated using different specifications (e.g., spline functions) or data (e.g., data subsets, data exclusions). |
| Dependent vs independent variable plot | Scatter plots of the dependent variable on the y-axis and the independent variable on the x-axis can indicate nonlinear relationships in the data. Sometimes it may be necessary to apply sampling, binning, or moving averages to enhance the visualization of the relationship. A boxplot is better suited for plotting a categorical independent variable against a continuous dependent variable. A categorical independent variable with predictive characteristics will have different dependent variable means/distributions across categories. |
| Residual plot | Scatter plots of the residuals of the regression model on the y-axis against the independent variable on the x-axis can indicate nonlinear relationships in the data. For categorical independent variables, a boxplot of the residual distribution of each category is better suited. Residuals should have means close to zero and constant variance across categories. |
Heteroscedasticity tests
Heteroscedasticity causes OLS to place excessive weight on observations with large error variances, resulting in wider confidence intervals and less reliable hypothesis testing. This makes it difficult to reliably assess the appropriateness of the regression model by simply evaluating the statistical tests and confidence intervals, and leads to poor predictive power, especially when inputs differ from the observed data.
| Test | Description |
| --- | --- |
| Serial residual plot | Plot the residuals on the y-axis against the predicted values or the independent variables on the x-axis. If the residual dispersion varies with the dependent/independent variables, this may indicate heteroscedasticity. For categorical independent variables, generate a box plot of the residual distribution of each category. If there is no heteroscedasticity, the residuals should have zero mean and constant variance across the categories. |
| Breusch-Pagan | Performs a regression of the squared residuals against the independent variables. If the explained sum of squares divided by 2 is above a threshold, the null hypothesis of constant variance is rejected. The higher the value, the more likely it is that heteroscedasticity exists. Only linear heteroscedasticity is detected. |
| White | Performs a regression of the squared residuals against the independent variables, their squares, and their second-order interactions (e.g., \(X_{1} \times X_{2}, X_{1} \times X_{3}, X_{1}^{2}\)). The higher the value, the more likely it is that heteroscedasticity exists. Nonlinear heteroscedasticity is detected. If there are many independent variables, interpretation may be challenging as the dataset may not be able to offset the loss of degrees of freedom. |
Independence tests
Independence tests evaluate the assumption that errors of different observations are uncorrelated or independent. Violation of independence leads to less reliable confidence intervals and hypothesis testing. Serial correlation usually occurs in time series data, but it can occur in other dimensions such as size, geography, category, portfolio, product, etc. Thus, the dimensions that may cause serial correlation need to be identified, and the observations ordered accordingly, to evaluate the existence of serial correlation.
| Test | Description |
| --- | --- |
| ACF of the residuals | Plot the autocorrelation function (ACF) of the residuals across lags. Autocorrelations that fall outside the confidence bounds indicate serial correlation at those lags. |
| Serial residual plot | Order the observations according to a driving dimension (e.g., time, geography, category) on the x-axis and plot the corresponding residuals on the y-axis. Systematic patterns (e.g., spikes) can indicate positive correlation of the error term across the driving dimension. |
| Ljung-Box | Tests the null hypothesis that the residuals exhibit no autocorrelation up to a specified number of lags. The test statistic aggregates the squared residual autocorrelations across lags and is compared against a chi-squared distribution; small p-values indicate serial correlation. |
| Durbin-Watson | Test statistic on the residuals of the OLS regression for the null hypothesis that no serial correlation exists. Values range from 0 to 4. Positive (negative) serial correlation is indicated by values lower (higher) than 2. The DW test cannot be applied if the regression contains lagged dependent variables, as this biases the statistic towards 2 even when the errors are serially correlated. |
Normality tests
| Test | Description |
| --- | --- |
| Normal probability plot | Plots the ordered residuals against the corresponding theoretical quantiles of the normal distribution (a Q-Q plot). Points falling along a straight line indicate normally distributed residuals; systematic curvature indicates skewness or heavy tails. |
| Histogram of residuals | The histogram of the residuals should be roughly symmetric and bell-shaped around zero. Pronounced skewness, heavy tails, or multiple modes indicate non-normality. |
| Shapiro-Wilk (SW) | Tests the null hypothesis that the residuals are normally distributed; small p-values reject normality. Has good power for small to moderate sample sizes. |
| Anderson-Darling (AD) | A distribution-distance test of normality that places more weight on the tails than the Kolmogorov-Smirnov test. The test statistic is compared against critical values at the selected significance level. |
| Kolmogorov-Smirnov (KS) | Compares the empirical distribution of the residuals against a fully specified normal distribution using the maximum distance between the two cumulative distribution functions. |
| Lilliefors (LS) | A variant of the Kolmogorov-Smirnov test for the case where the mean and variance of the normal distribution are estimated from the data. |
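The four formal tests are available in scipy and statsmodels. The sketch below runs them on deliberately skewed synthetic "residuals" (an exponential sample), so the tests should all reject normality:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import lilliefors

rng = np.random.default_rng(8)
# Deliberately skewed stand-in for regression residuals
resid = rng.exponential(size=200) - 1.0

sw_stat, sw_pvalue = stats.shapiro(resid)                  # Shapiro-Wilk
standardized = (resid - resid.mean()) / resid.std()
ks_stat, ks_pvalue = stats.kstest(standardized, "norm")    # Kolmogorov-Smirnov
ad_result = stats.anderson(resid, dist="norm")             # Anderson-Darling
lf_stat, lf_pvalue = lilliefors(resid, dist="norm")        # Lilliefors
```

Note that the plain KS test above uses sample-estimated mean and variance, which is exactly the situation the Lilliefors variant corrects for.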
Multicollinearity tests
Multicollinearity matters most when you are trying to explain or quantify the relationship between the dependent and independent variables; it is less important if you are focused only on prediction.
| Test | Description |
| --- | --- |
| Correlation matrix | Pairwise correlations between the independent variables. Correlations with large absolute values indicate potential multicollinearity between the corresponding variable pairs. |
| Condition index | Based on the eigenvalues of the scaled design matrix; large condition indices (a common rule of thumb is above 30) indicate severe multicollinearity. |
| Variance Inflation Factor (VIF) test | The VIF for variable $j$ is $1/(1-R_j^2)$, where $R_j^2$ is obtained by regressing variable $j$ on the remaining independent variables. A VIF between 1 and 5 indicates moderate correlation; a VIF above 5 indicates excessive correlation. |
| VIF dropout test | If excessive correlation is found, remedies include: removing some of the highly correlated independent variables; combining the independent variables linearly (e.g., adding them together); or using principal components analysis or partial least squares regression. |
Cointegration
When you have two nonstationary processes (i.e., $X_1$ and $X_2$), there may be a vector (the cointegration vector) that combines the two processes into a stationary process. In essence, the stochastic trends in $X_1$ and $X_2$ are the same and can be cancelled out.
Performance testing
| Test | Description |
| --- | --- |
| Forecast error | Compare model predictions against actual outcomes on data not used for model development (out-of-sample and/or out-of-time samples), using metrics such as RMSE or mean absolute error. |
| Model stability | Re-estimate the model on different subsamples or time windows and evaluate whether the coefficient estimates and overall performance remain stable. |
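A minimal sketch of an out-of-sample forecast-error check on synthetic data: fit on the first 80% of observations and measure RMSE on the held-out 20% (the split fraction is an illustrative choice, not a standard):

```python
import numpy as np

rng = np.random.default_rng(11)
n = 250
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

# Hold out the last 20% of observations as an out-of-sample test set
split = int(0.8 * n)
X_train = np.column_stack([np.ones(split), x[:split]])
beta_hat = np.linalg.solve(X_train.T @ X_train, X_train.T @ y[:split])

y_pred = beta_hat[0] + beta_hat[1] * x[split:]
forecast_rmse = np.sqrt(np.mean((y[split:] - y_pred) ** 2))  # out-of-sample error
```

For model stability, the same fit would be repeated on rolling or expanding windows, comparing `beta_hat` across windows.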