Overview

Diagnostics for regression models are tools that assess how well a model complies with its assumptions and investigate whether a single observation or a group of observations is not well represented by the model. These tools allow practitioners to evaluate whether a model appropriately represents the data of their study. Practitioners typically run diagnostics as part of the model selection process. In this article we separate diagnostics from the other parts of model selection to focus on this important topic; this separation is not meant to imply that these tools are used separately from other regression modeling tools.

This article provides an overview of diagnostics for regression models. It also includes an example model that is used to demonstrate running these diagnostics. Both R and Stata code are provided for the diagnostic examples. This article should not be taken as complete coverage of the theory of model diagnostics or as an exhaustive set of diagnostics for all models. The reader is responsible for learning the theory and gaining the experience needed to properly diagnose a regression model.

This article does not cover

Model diagnostics

The diagnostics in this article apply to Ordinary Least Squares (OLS) models and Generalized Linear Models (GLMs). (Most of the diagnostics can be applied to other regression models, often with some modification.) We specify the OLS model as

\[\text{E}(Y) = \boldsymbol{X\beta}, \qquad \qquad \text{var}(Y) = \boldsymbol{I}_n \sigma^2, \qquad \qquad \text{E}(Y - \boldsymbol{X\beta}) = 0, \qquad \qquad Y \sim N(\boldsymbol{X\beta}, \boldsymbol{I}_n \sigma^2). \]

The GLM is specified as

\[\text{E}(Y) = g^{-1}(\boldsymbol{X\beta}) \ \Rightarrow \ g(\text{E}(Y)) = \boldsymbol{X\beta}, \qquad \qquad \text{E}(Y - g^{-1}(\boldsymbol{X\beta})) = 0, \qquad \qquad \text{var}(Y) = \boldsymbol{I}_n v(g^{-1}(\boldsymbol{X\beta})) \]

where \(g(\cdot)\) is the link function and \(v(\cdot)\) is the variance function, which defines the relationship between the expected value of Y and the variance of Y. Both of these model forms fit the expected means of a set of observations using the functional form given by the regressors (\(\boldsymbol{X\beta}\)). The OLS model assumes the means are a linear function of the regressors, whereas GLMs assume that a transformation of the means (the link function) is a linear function of the regressors. The OLS model assumes that the residual part of the model, the part of the response which is unexplained by the regressors, has a constant variance and is normally distributed. The variance of the residuals of a GLM is determined by the \(v(\cdot)\) function. The GLM residuals may have a distributional assumption, depending on the response variable being modeled. For example, a response variable that is the number of successes in a fixed number of trials would be expected to follow a binomial distribution. A separate distributional assumption for the errors is not always required for GLMs.
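To make the roles of \(g(\cdot)\) and \(v(\cdot)\) concrete, the following is a minimal Python sketch of one common pairing: the logit link with the binomial variance function \(v(\mu) = \mu(1 - \mu)\), stated per trial. The function names and the linear predictor values are illustrative assumptions; this is not the article's R or Stata code.

```python
import math

def logit(mu):
    """Link function g: maps a mean in (0, 1) to the linear-predictor scale."""
    return math.log(mu / (1.0 - mu))

def inv_logit(eta):
    """Inverse link g^{-1}: maps a linear predictor back to a mean in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def binomial_variance(mu):
    """Variance function v for a binomial proportion, per trial: mu * (1 - mu)."""
    return mu * (1.0 - mu)

# Hypothetical linear predictor values (x'beta) for three observations.
linear_predictors = [-1.0, 0.0, 2.0]

for eta in linear_predictors:
    mu = inv_logit(eta)          # E(Y) = g^{-1}(x'beta)
    var = binomial_variance(mu)  # var(Y) is determined by the mean
    print(f"eta = {eta:5.2f}  mean = {mu:.3f}  variance = {var:.3f}")
```

Unlike the OLS model, the variance here changes with the mean: it is largest when the mean is 0.5 and shrinks as the mean approaches 0 or 1.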

The model assumptions to be checked are organized as follows in this article.

These diagnostics are not formal tests.

Example data set and model

The model used in this article is based on PGA data from 2004. This data set was found at http://users.stat.ufl.edu/~winner/datasets.html. The purpose of the model is to understand the relationships between the “on greens in regulation” variable and the other variables of the data set. A number of models are reasonably similar in their ability to explain greens in regulation. From these possible models we have chosen one that demonstrates several issues that diagnostics are useful in exploring. Not all models of this data set exhibit the same issues. In that regard the example model is a little artificial, since it was chosen to demonstrate difficulties with fit. While we acknowledge this, we also note that these issues do occur with some frequency in regression models created by practitioners.

The following code reads in the data set and prepares a few variables for use.

The response variable, greens_regul, is a proportion, so its range is \([0, 1]\). It will be modeled using a logit transformation. The greens_regul_link variable was created to assist in viewing the relationships between the regressors and the response. The logit of 0 or 1 is not finite, so the transformation can be applied directly here only because no observation has a greens_regul of 0 or 1.
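The logit transformation described above can be sketched as follows. This is an illustration in Python using only the standard library, not the article's own R or Stata code. The proportions are made-up values rather than the PGA data, and the `logit` helper is a hypothetical name introduced for this example.

```python
import math

def logit(p):
    """Return log(p / (1 - p)); undefined when p is 0 or 1."""
    if not 0.0 < p < 1.0:
        raise ValueError(f"logit is undefined for p = {p}")
    return math.log(p / (1.0 - p))

# Made-up proportions standing in for greens_regul (not the PGA data).
greens_regul = [0.58, 0.65, 0.71]

# Response on the link scale, analogous to greens_regul_link.
greens_regul_link = [logit(p) for p in greens_regul]

for p, link in zip(greens_regul, greens_regul_link):
    print(f"greens_regul = {p:.2f}  greens_regul_link = {link:.3f}")
```

The explicit range check shows why observations of exactly 0 or 1 would need special handling before the response could be viewed on the link scale.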

The following pairwise plots display the relationships between the regressors and the response on the link scale.