Model validation for linear regression models
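This is an overview of the diagnostic and performance tests needed to establish the validity of a linear regression model with one or more continuous or categorical independent variables. It applies to linear regression models estimated by Ordinary Least Squares (OLS) or Maximum Likelihood Estimation (MLE). The equation for the linear regression model is as follows:
$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} + \varepsilon_i$$
where $y_i$ is the dependent variable, $x_{i1}, \dots, x_{ik}$ are the independent variables, $\beta_0, \dots, \beta_k$ are the coefficients, and $\varepsilon_i$ is the error term.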
The assumptions of the linear regression model are:
 Linearity

Relationship between dependent and independent variables is linear.
 Normality

Error term is normally distributed.
 Independence

Error terms are statistically independent.
 Homoscedasticity

Error term has constant variance for all observations.
 Lack of multicollinearity

No excessive correlation between independent variables.
Data Diagnostics
The data used for modelling should be evaluated for the following:
 Compliance with relevant regulatory requirements

Often these requirements refer to data length requirements for different types of portfolios, ensuring the data length is representative of the economic cycle, and requirements for use of data proxies (e.g., BCC135 [Conservatism to risk parameters in Advanced Approaches], BCC143 [Selection of reference data periods and data deficiencies]).
 Outliers, missing or special values.

Outliers or influential data points should be identified (e.g., using Cook's distance), and model performance should also be evaluated with these points excluded.
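As a rough illustration of the influence check described above, the sketch below computes Cook's distance directly from the hat matrix with NumPy. The function name `cooks_distance`, the synthetic data, and the 4/n cut-off are assumptions made for this example; the cut-off is a common rule of thumb rather than a formal test.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance per observation for an OLS fit (X includes the intercept column)."""
    n, p = X.shape
    # Hat matrix H = X (X'X)^-1 X'; its diagonal holds the leverages h_ii.
    H = X @ np.linalg.pinv(X.T @ X) @ X.T
    h = np.diag(H)
    resid = y - H @ y
    mse = resid @ resid / (n - p)
    # D_i = e_i^2 / (p * MSE) * h_ii / (1 - h_ii)^2
    return resid**2 / (p * mse) * h / (1.0 - h) ** 2

# Synthetic data (illustrative only) with one deliberately contaminated point.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=50)
x[0], y[0] = 6.0, -10.0  # high leverage and far from the true line

X = np.column_stack([np.ones_like(x), x])
d = cooks_distance(X, y)
flagged = np.where(d > 4 / len(y))[0]  # rule-of-thumb threshold (assumption)
print("most influential observation:", int(np.argmax(d)))
print("flagged observations:", flagged)
```

Refitting the model without the flagged rows and comparing the coefficients and fit statistics is one way to carry out the exclusion check.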
Model Diagnostics
Goodness of fit
These tests evaluate how well a regression model fits the data. They comprise formal regression statistics and descriptive fit statistics, which together assess the overall fit and the statistical significance of the independent variables, individually and as a whole.
Test 
Description 

Adj. $R^2$ 
The Adj. $R^2$ is a measure of the strength of the relationship between the independent and dependent variables. It measures the share of variation in the dependent variable that is explained by the independent variables, adjusted for the number of independent variables (degrees of freedom). It typically lies between 0 and 100%, although it can be negative for a very poorly fitting model. Generally an Adj. $R^2$ of < 15% is considered very low. It is a useful metric for comparing different model specifications.
Root Mean Square Error (RMSE) 
Standard deviation of the error term, that is, the square root of the mean square error (MSE). It measures the typical size of the difference between the predicted and actual values. As it has the same unit of measurement as the quantity being estimated by the regression function, RMSE can be directly linked to business usefulness. It is a useful metric for comparing different model specifications.
T-test
Tests the significance of each independent variable against the null hypothesis that its coefficient is equal to zero. Higher absolute t-values suggest rejection of the null hypothesis and indicate that the variable is appropriately included. Critical values are based on the t distribution and the selected confidence level. A high F-value combined with low t-values can be suggestive of multicollinearity across the independent variables.
F-test
Tests the null hypothesis that none of the independent variables explain the variation of the dependent variable. Higher F-values suggest rejection of the null hypothesis and indicate good model fit. Critical values are based on the F distribution and the selected confidence level (e.g., 95%).
P-values
The p-value is the probability of obtaining a test statistic at least as extreme as the one observed, given that the null hypothesis is true. For the overall model, the null hypothesis is that none of the independent variables explain the variation of the dependent variable, i.e., that the model has no predictive value. Small p-values are desired, as they suggest rejecting the null hypothesis; large p-values mean the null hypothesis cannot be rejected. P-values are often compared against a 0.05 significance level.
Dropout 
In this test, independent variables are added to or omitted from the model and the fit diagnostics of the new model are evaluated. Each independent variable's individual contribution is assessed by examining the statistical significance of its coefficient (i.e., t-test) and the change in overall model fit via the Adj. $R^2$, RMSE, or F-test.
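The statistics in the table above can be computed from a single OLS fit. The following is a minimal sketch using NumPy only; the helper name `ols_fit_stats` and the synthetic data are assumptions for illustration. It returns statistics rather than p-values, so they still need to be compared against critical values from the t and F distributions.

```python
import numpy as np

def ols_fit_stats(X, y):
    """Fit OLS and return basic fit statistics. X must include an intercept column."""
    n, p = X.shape                        # p counts the intercept
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    rss = resid @ resid                   # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)     # total sum of squares
    mse = rss / (n - p)
    r2 = 1.0 - rss / tss
    return {
        "beta": beta,
        "adj_r2": 1.0 - (1.0 - r2) * (n - 1) / (n - p),
        "rmse": np.sqrt(mse),
        "t": beta / np.sqrt(mse * np.diag(xtx_inv)),  # one t-value per coefficient
        "f": ((tss - rss) / (p - 1)) / mse,           # overall F statistic
    }

# Synthetic data (illustrative only): x2 is irrelevant by construction.
rng = np.random.default_rng(1)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])

full = ols_fit_stats(X, y)
no_x1 = ols_fit_stats(X[:, [0, 2]], y)    # dropout: omit x1
no_x2 = ols_fit_stats(X[:, [0, 1]], y)    # dropout: omit x2
for name, s in [("full", full), ("no x1", no_x1), ("no x2", no_x2)]:
    print(f"{name:6s} adj_r2={s['adj_r2']:.3f} rmse={s['rmse']:.3f} f={s['f']:.1f}")
```

In this setup, dropping `x1` should degrade the fit sharply while dropping `x2` should leave it almost unchanged, which is the pattern the dropout test looks for.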
Linearity tests
Test 
Description 

Ramsey RESET Test 
The null hypothesis is that the regression relationships are linear. An additional regression is run of the dependent variable against the independent variables and second order powers of the predicted variable. An F-test is applied to the additional regression. If the F-statistic exceeds a threshold (i.e., one or more nonlinear terms are significant), the null hypothesis can be rejected. This test is not to be used when the independent variables are categorical.
Chow Test 
The null hypothesis is that there is no structural break in the data. On a graphical or theoretical basis, the data is split into two samples and regressions are run on each sample. The Chow test is used to evaluate whether the model parameters from the two data samples are statistically similar. Evidence of a structural break means that the model may need to be estimated using different specifications (e.g., spline functions) or data (e.g., data subsets, data exclusions).
Dependent vs Independent variable plot 
Scatter plots of the dependent variable on the y-axis and an independent variable on the x-axis can indicate nonlinear relationships in the data. Sometimes it may be necessary to apply sampling, binning, or moving averages to enhance the visualization of the relationship when evaluating nonlinearities. A boxplot is better suited to plotting a categorical independent variable against a continuous dependent variable. A categorical independent variable with predictive characteristics will have different dependent variable means/distributions across categories.
Residual Plot 
Scatter plots of the regression residuals on the y-axis vs an independent variable on the x-axis can indicate nonlinear relationships in the data. A boxplot is better suited to a categorical independent variable, showing the residual distribution of each category. Residuals should have means that are close to zero and constant variance across categories.
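Both the Ramsey RESET test and the Chow test reduce to an F statistic built from residual sums of squares. The sketch below is one possible NumPy implementation under simplifying assumptions: the function names, the synthetic data, and the break point at the middle of the sample are all hypothetical, and the returned statistics still have to be compared with F critical values.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of an OLS fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

def reset_f(X, y, max_power=2):
    """RESET F statistic: add powers 2..max_power of the fitted values."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    extra = np.column_stack([fitted**k for k in range(2, max_power + 1)])
    q = extra.shape[1]
    rss_r, rss_u = rss(X, y), rss(np.column_stack([X, extra]), y)
    return ((rss_r - rss_u) / q) / (rss_u / (n - p - q))

def chow_f(X, y, split):
    """Chow F statistic for a break after the first `split` rows."""
    n, p = X.shape
    rss_pooled = rss(X, y)
    rss_parts = rss(X[:split], y[:split]) + rss(X[split:], y[split:])
    return ((rss_pooled - rss_parts) / p) / (rss_parts / (n - 2 * p))

# Synthetic data (illustrative only).
rng = np.random.default_rng(2)
n = 200
x = rng.uniform(-2, 2, size=n)
X = np.column_stack([np.ones(n), x])
noise = rng.normal(scale=0.5, size=n)
y_linear = 1 + 2 * x + noise
y_curved = 1 + 2 * x + 1.5 * x**2 + noise                         # nonlinear
y_break = np.where(np.arange(n) < n // 2, 1 + 2 * x, 4 - x) + noise  # regime change

print("RESET F, linear vs curved:", reset_f(X, y_linear), reset_f(X, y_curved))
print("Chow F, no break vs break:", chow_f(X, y_linear, n // 2), chow_f(X, y_break, n // 2))
```

The RESET statistic is referred to an F distribution with (q, n - p - q) degrees of freedom and the Chow statistic to one with (p, n - 2p), where p is the number of coefficients and q the number of added power terms.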
Multicollinearity tests
Multicollinearity is more important when you are trying to explain or find a relationship between the dependent and independent variables. It is less important if you are focused on prediction.
Variance Inflation Factors
VIFs of 1 to 5 indicate moderate correlation; VIFs of > 5 indicate excessive correlation. Possible remedies for excessive correlation:
Remove some of the highly correlated independent variables
Combine the independent variables linearly (e.g., adding together)
Use principal components analysis or partial least squares regression
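The variance inflation factor for a variable is $1/(1-R_j^2)$, where $R_j^2$ comes from regressing that variable on all the other independent variables. A minimal NumPy sketch follows; the function name `vif` and the synthetic data are assumptions for illustration.

```python
import numpy as np

def vif(X):
    """VIF for each column of X (independent variables only, no intercept column)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        # Regress column j on an intercept plus all remaining columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic data (illustrative only): x2 is nearly a copy of x1, x3 is unrelated.
rng = np.random.default_rng(5)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
x3 = rng.normal(size=n)
vifs = vif(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 1))
```

Here the first two variables should show VIFs far above 5 while the third stays close to 1, so one of the first two would be a candidate for removal or combination.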
Stationarity testing
Stationarity testing is important for both the independent and dependent variables, because regressing one nonstationary variable on another can lead to spurious regressions.
For nonstationary variables, test for cointegration.
A stationary time series is one whose statistical properties such as mean, variance, autocorrelation, etc. are all constant over time.
Augmented Dickey–Fuller (ADF)
Kwiatkowski–Phillips–Schmidt–Shin (KPSS)
Phillips–Perron test (PP)
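To show the mechanics behind the first of these tests, the sketch below computes an (augmented) Dickey-Fuller t-statistic with a constant term using NumPy. The function name `adf_tstat`, the lag choice, and the synthetic series are assumptions for this example. The statistic does not follow a standard t distribution; for the constant-only case in large samples the 5% critical value is approximately -2.86, and values below it reject the unit-root null. Note that the KPSS test reverses the hypotheses: its null is stationarity.

```python
import numpy as np

def adf_tstat(y, lags=1):
    """t-statistic on the lagged level in: dy_t = a + g*y_{t-1} + sum_i d_i*dy_{t-i} + e_t."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    target = dy[lags:]
    cols = [np.ones(len(target)), y[lags:-1]]      # constant and lagged level
    for i in range(1, lags + 1):
        cols.append(dy[lags - i:len(dy) - i])      # lagged differences
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    mse = resid @ resid / (len(target) - X.shape[1])
    se = np.sqrt(mse * np.linalg.inv(X.T @ X)[1, 1])
    return beta[1] / se

# Synthetic series (illustrative only).
rng = np.random.default_rng(3)
eps = rng.normal(size=500)
random_walk = np.cumsum(eps)           # nonstationary (unit root)
ar1 = np.zeros(500)
for t in range(1, 500):
    ar1[t] = 0.5 * ar1[t - 1] + eps[t]  # stationary AR(1)

print("random walk:", adf_tstat(random_walk))
print("AR(1):      ", adf_tstat(ar1))
```

In practice a tested library implementation with proper critical values and lag selection would normally be preferred over a hand-rolled version like this one.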
Cointegration
Two nonstationary processes (e.g., $X_1$ and $X_2$) are cointegrated when there is a vector (the cointegration vector) that combines them into a stationary process. Intuitively, the stochastic trends in $X_1$ and $X_2$ are the same and cancel out in that combination.
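One common way to check for cointegration is the two-step Engle-Granger procedure, which is swapped in here as an illustration and is not named in the text above: regress one series on the other, then test the residuals for a unit root. The sketch below uses NumPy only; the helper `df_tstat` and the synthetic data are assumptions. Because the residuals come from an estimated regression, the usual Dickey-Fuller critical values do not apply; for two variables the 5% critical value is roughly -3.34.

```python
import numpy as np

def df_tstat(u):
    """Dickey-Fuller t-statistic without constant: regress diff(u) on lagged u."""
    du, lag = np.diff(u), u[:-1]
    gamma = (lag @ du) / (lag @ lag)
    resid = du - gamma * lag
    se = np.sqrt(resid @ resid / (len(du) - 1) / (lag @ lag))
    return gamma / se

# Synthetic data (illustrative only): both series share one stochastic trend.
rng = np.random.default_rng(4)
n = 500
trend = np.cumsum(rng.normal(size=n))
x1 = 2.0 * trend + rng.normal(scale=0.5, size=n)
x2 = trend + rng.normal(scale=0.5, size=n)

# Step 1: cointegrating regression x1 = a + b * x2 + u.
Z = np.column_stack([np.ones(n), x2])
(a, b), *_ = np.linalg.lstsq(Z, x1, rcond=None)
u = x1 - (a + b * x2)

# Step 2: unit-root test on the residuals.
stat = df_tstat(u)
print("estimated cointegration vector: (1, %.2f)" % -b)
print("residual DF statistic:", stat, "| x1 alone:", df_tstat(x1))
```

With this construction the estimated slope should be close to 2, so the combination $X_1 - 2X_2$ removes the shared trend and leaves a stationary residual.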