## Overview

Diagnostics for regression models are tools that assess a model's compliance with its assumptions and investigate whether a single observation or group of observations is not well represented by the model. These tools allow practitioners to evaluate whether a model appropriately represents the data of their study. Practitioners typically run diagnostics as part of the model selection process. In this article we separate diagnostics from the other parts of model selection to focus on this important topic; this separation is not meant to imply that these tools are used separately from other regression modeling tools.

This article provides an overview of diagnostics for regression models. The article also includes an example model that is used to demonstrate running these diagnostics. Both R and Stata code for the diagnostic examples is provided. This article should not be taken as complete coverage of the theory of model diagnostics or as an exhaustive set of diagnostics for all models. The reader is responsible for learning the theory and gaining the experience needed to properly diagnose a regression model.

The following topics are not covered in this article.

• Corrective actions for issues identified by diagnostics. Concerns raised by diagnostics are resolved by model selection, with the additional information obtained through the diagnostics.

• Mixed models (models with random effects)

• Issues associated with decisions on which regressors to include in a model. These include, but are not limited to:

  • Choosing between alternative sets of regressors that produce similar models.

  • Multicollinearity and other issues of redundancy among regressors.

  • Omitted variable bias and overfitting.

• Quality of model evaluations using such measures as AIC, BIC, (adjusted) $$R^2$$, etc.

## Model diagnostics

The diagnostics in this article apply to Ordinary Least Squares (OLS) models and Generalized Linear Models (GLMs). (Most of the diagnostics can be applied to other regression models, often with some modification.) We specify the OLS model as

$\text{E}(Y) = \boldsymbol{X\beta}, \qquad \qquad \text{var}(Y) = \boldsymbol{I}_n \sigma^2, \qquad \qquad \text{E}(Y - \boldsymbol{X\beta}) = 0, \qquad \qquad Y \sim N(\boldsymbol{X\beta}, \boldsymbol{I}_n \sigma^2).$

The GLM is specified as

$\text{E}(Y) = g^{-1}(\boldsymbol{X\beta}) \ \Rightarrow \ g(\text{E}(Y)) = \boldsymbol{X\beta}, \qquad \qquad \text{E}(Y - g^{-1}(\boldsymbol{X\beta})) = 0, \qquad \qquad \text{var}(Y) = \boldsymbol{I}_n v(g^{-1}(\boldsymbol{X\beta}))$

where $$g(\cdot)$$ is the link function and $$v(\cdot)$$ is the variance function, which defines the relationship between the expected value of Y and the variance of Y. Both of these model forms fit the expected means of a set of observations using the functional form given by the regressors ($$\boldsymbol{X\beta}$$). The OLS model assumes the means are a linear function of the regressors, whereas a GLM assumes that a transformation (the link function) of the means is a linear function of the regressors. The OLS model assumes that the residual part of the model, the part of the response that is unexplained by the regressors, has constant variance and is normally distributed. The variance of the residuals of a GLM is determined by the $$v(\cdot)$$ function. The GLM residuals may have a distributional assumption, depending on the response variable being modeled. For example, a response variable that is the number of successes in a fixed number of trials would be expected to follow a binomial distribution. A separate distributional assumption for the errors is not always required for GLMs.
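As a rough illustration of the link and variance functions, the following R sketch inspects the family objects that base R's `glm()` uses. The binomial and gaussian families and the values in `mu` are chosen only for illustration; they are not taken from the example model of this article.

```r
# Family objects in base R bundle the link function g() and the
# variance function v() described above.
fam <- binomial(link = "logit")

mu <- c(0.2, 0.5, 0.8)          # illustrative expected values of Y

eta <- fam$linkfun(mu)          # g(mu) = log(mu / (1 - mu)), the link scale
mu_back <- fam$linkinv(eta)     # g^{-1}(eta) recovers mu
v_mu <- fam$variance(mu)        # v(mu) = mu * (1 - mu) for the binomial family

# OLS corresponds to the gaussian family with an identity link and a
# variance function that does not depend on the mean.
ols <- gaussian(link = "identity")
ols_eta <- ols$linkfun(mu)      # identity link: returns mu unchanged
ols_v <- ols$variance(mu)       # constant: 1 for every mean

print(data.frame(mu, eta, mu_back, v_mu, ols_eta, ols_v))
```

The constant returned by the gaussian variance function is scaled by $$\sigma^2$$ in the fitted model, which matches the constant-variance assumption of OLS.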

The model assumptions to be checked are organized as follows in this article.

• Correct functional form of the expected means.

  • Linearity of the response variable with respect to the predicted response (on the link scale for a GLM).

  • Linearity of each of the individual regressors.

  • The link function of a GLM provides linearity in the regressors.

• Homoscedasticity (equal variance) of residuals (after accounting for the variance function) with respect to the predicted response.

• Distributional assumptions of residuals.

• Independence of the error term.

  • Mostly to be checked by understanding the data collection method.

  • Individual regressors are independent of residuals.

  • No autocorrelation or other sequencing issues, such as those induced by a series-type regressor, time, or any other order-inducing regressor.

• Isolated departures from the model. (Diagnostics for observations that are not well represented by the model.)

  • Checks for changes in effects due to one or a small group of observations.

These diagnostics are not formal tests.
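
As a minimal sketch of what checks of this kind can look like in R, the following code fits an OLS model to simulated data and extracts quantities commonly used for the checks listed above. The simulated data, the variable names (`x`, `y`, `fit`), and the seed are assumptions made for illustration; this is not the example model of this article.

```r
set.seed(42)                         # arbitrary seed for reproducibility
n <- 100
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n, sd = 1)    # simulated to satisfy the OLS assumptions
y[n] <- y[n] + 15                    # one artificially shifted observation

fit <- lm(y ~ x)

fitted_vals <- fitted(fit)           # predicted response
std_resid <- rstandard(fit)          # standardized residuals
lever <- hatvalues(fit)              # leverage of each observation
cooks <- cooks.distance(fit)         # influence of each observation on the fit

# Functional form and constant variance: look for trends or funnels.
plot(fitted_vals, std_resid,
     xlab = "Fitted values", ylab = "Standardized residuals")
abline(h = 0, lty = 2)

# Distributional assumption: points should follow the reference line.
qqnorm(std_resid)
qqline(std_resid)

# Isolated departures: the shifted observation should stand out.
which.max(abs(std_resid))
which.max(cooks)
```

The plots are read by eye rather than compared against a critical value, which is consistent with treating these diagnostics as informal checks.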

## Example data set and model

The model used in this article is based on PGA data from 2004. This data set was found at http://users.stat.ufl.edu/~winner/datasets.html. The purpose of the model is to understand the relationships between the “on greens in regulation” variable and the other variables of the data set. There are a number of models which are reasonably similar in their ability to explain the greens in regulation. From these possible models we have chosen one which demonstrates several issues that diagnostics are useful in exploring. Not all models of this data set exhibit the same issues. In that regard the example model is a little artificial, since it was chosen to demonstrate difficulties with fit. While we acknowledge this, we also note that these issues can and do occur with some frequency in models created by practitioners who use regression models.

The following code reads in the data set and prepares a few variables for use.

• In R

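    library(tidyverse)
    library(car)

    # Read the fixed-width data file; columns are named V1, V2, ... by default.
    pga_in <- read.fwf("pga2004.txt", widths = c(22, 3, 8, 8, 8, 8, 8, 8, 8, 9, 8))

    pga <-
      pga_in %>%
      # Keep and rename the variables used in this article.
      select(
        name = V1,
        age = V2,
        ave_drive = V3,
        drive_accur = V4,
        greens_regul = V5,
        save_perc = V7,
        money_rank = V8,
        ave_prize = V11
      ) %>%
      mutate(
        greens_regul = greens_regul / 100,   # percent to proportion
        ave_prize = ave_prize / 1000,
        ave_prize_sqr = ave_prize^2,
        # logit of the response, used for plots on the link scale
        greens_regul_link = log(greens_regul / (1 - greens_regul))
      )

• In Stata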

    infix using pga.dct // dictionary file not shown

    replace greens_regul = greens_regul / 100
    replace ave_prize = ave_prize / 1000
    gen ave_prize_sqr = ave_prize^2
    gen greens_regul_link = ln(greens_regul / (1 - greens_regul)) // logit of the response

    save pga2004, replace

The response variable, greens_regul, is a proportion. (The range is $$[0, 1]$$.) It will be modeled using a logit transformation. The greens_regul_link variable, the logit of greens_regul, was created to assist in viewing the relationships between the regressors and the response. The transformation can be applied directly here since no observation has a greens_regul value of 0 or 1.
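
The logit transformation described above can be sketched in base R with `qlogis()` and its inverse `plogis()`. The proportions below are made-up values for illustration, not values from the PGA data.

```r
p <- c(0.55, 0.65, 0.75)        # hypothetical proportions strictly inside (0, 1)

link <- qlogis(p)               # logit: log(p / (1 - p))
back <- plogis(link)            # inverse logit returns the original proportions

# The logit is infinite at the boundaries, which is why it can only be
# applied directly when no proportion is exactly 0 or 1.
boundary <- qlogis(c(0, 1))     # -Inf and Inf

print(data.frame(p, link, back))
print(boundary)
```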

The following pairwise plots display the relationships between the regressors and the response on the link scale.

• In R

    pga %>%
      select(greens_regul_link, ave_drive, drive_accur, save_perc, ave_prize, age) %>%
      plot()

• In Stata

    use pga2004
    graph matrix greens_regul_link ave_drive drive_accur save_perc ave_prize age