This article is designed for individuals who need only a cursory understanding of R. This might be someone who has to do a single assignment in R or needs to read someone else's R code. Graduate students and other researchers, and those who hope to someday be graduate students or researchers, should read R for Researchers.

This article does not describe the theory behind the regression tools it covers; it is assumed that the reader already has this background. While the theory is not directly covered in R for Researchers either, that article offers more discussion of the results and their meaning, which can be helpful for a reader who does not have a theoretical background in regression.

The response variable and model variables are specified as an R formula. The basic form of a formula is

\[response \sim term_1 + \cdots + term_p.\]

The \(\sim\) separates the response variable, on the left, from the **terms** of the model, which are on the right. Terms are either the main effect of a regressor or an interaction of regressors. Model variables are constructed from terms. For example, a factored regressor, R's implementation of a categorical variable, can be used as a term in a model, and it results in multiple model variables, one for each non-reference level. There is one coefficient for each model variable, so a single term may have multiple coefficients.

The mathematical symbols +, -, *, and ^ do not have their usual mathematical meaning in a formula. The + symbol adds a term to a formula, and the - symbol removes a term from a formula. The * and ^ symbols are described with the formula shortcuts below.

Some common terms are

- Numeric regressor: All numeric variable types result in a single continuous model variable.
- Logical regressor: Results in a single indicator variable, also known as a dummy variable, in the model.
- Factored regressor: Results in a set of indicator variables in the model. The first level of the factor is the reference level and is not coded as an indicator variable. All other levels are encoded as indicator variables.
- \(term_1:term_2\): Creates a term from the interaction between terms \(term_1\) and \(term_2\). The \(:\) operator does not force the main effect terms, \(term_1\) and \(term_2\), into the model.
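The expansion of terms into model variables can be inspected with base R's model.matrix() function. A minimal sketch, using a made-up data frame with a numeric regressor and a three-level factor:

```r
# A small, made-up data frame: x is numeric, g is a three-level factor
d <- data.frame(x = 1:6, g = factor(rep(c("a", "b", "c"), 2)))

# model.matrix() shows the model variables a formula generates.
# The factor g contributes one indicator column per non-reference level.
colnames(model.matrix(~ x + g, data = d))
# "(Intercept)" "x" "gb" "gc"
```

Note that the reference level "a" has no column of its own; its effect is absorbed into the intercept.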

Some formula shortcuts for specifying terms are listed below.

- \(term_1*term_2\): This results in the same interaction term as \(term_1:term_2\). The \(*\) operator forces the main effect terms \(term_1\) and \(term_2\) into the model.
- \((term_1 + \cdots + term_j)^k\): Creates a term for each of the \(j\) terms and for all interactions, up to order \(k\), which can be formed from the \(j\) terms.
- I(\(expression\)): The I() function is used when you need to use +, -, *, or ^ as math symbols to construct a term in the formula. This is commonly used to construct a quadratic term.
- poly(*x*, degree = *k*): Creates a term which is a *k*th order polynomial of the regressor *x*. poly() constructs *k* model variables which are orthogonal to each other. A poly term, like a factor, is a single term which translates into multiple model variables. These behaviors, the orthogonality of the model variables and the grouping of the *k* model variables into one term, are advantages when doing variable selection. The orthogonal model variables are difficult to interpret since they are not on the scale of the original regressor. Therefore it is common practice to use poly() for model selection and then, after the variable selection process is complete, rerun the selected model using polynomial terms constructed from the regressor.
- -1: Removes the intercept from the model. The intercept is included in a model by default in R.
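The relationship between I() and poly() can be sketched with made-up data: both formulas below fit the same quadratic model, but poly() uses orthogonal model variables while I(x^2) stays on the scale of the regressor.

```r
set.seed(1)
x <- 1:10
y <- 2 + 3 * x + 0.5 * x^2 + rnorm(10)

# I() protects ^ so it is treated as squaring, not formula expansion
m1 <- lm(y ~ x + I(x^2))

# poly() builds orthogonal polynomial model variables of the same degree
m2 <- lm(y ~ poly(x, degree = 2))

# The two parameterizations span the same model space,
# so the fitted values are identical
all.equal(fitted(m1), fitted(m2))
# TRUE
```

The coefficients of m1 are interpretable on the original scale of x; the coefficients of m2 are not, which is why poly() is typically reserved for variable selection.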

As an example, consider the formula \(y \sim A + B*C\), where \(A\) and \(B\) are numeric regressors and \(C\) is a categorical regressor with three levels. This formula results in the following terms and model variables.

- (Intercept): The constant term.
- A: A single continuous model variable.
- B: A single continuous model variable.
- C: This term results in two model variables.
  - CLevel1: An indicator variable that represents a change in the intercept.
  - CLevel2: An indicator variable that represents a change in the intercept.
- B:C: This term results in two model variables.
  - B:CLevel1: An indicator variable that represents a change in the B slope value.
  - B:CLevel2: An indicator variable that represents a change in the B slope value.
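This expansion can be checked with model.matrix() on a small, made-up data frame in which A and B are numeric and C is a three-level factor whose reference level is named Level0:

```r
# Made-up data for illustration; Level0 is the reference level of C
d <- data.frame(A = rnorm(6),
                B = rnorm(6),
                C = factor(rep(c("Level0", "Level1", "Level2"), 2)))

# The columns of the design matrix are the model variables
colnames(model.matrix(~ A + B * C, data = d))
# "(Intercept)" "A" "B" "CLevel1" "CLevel2" "B:CLevel1" "B:CLevel2"
```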

The lm() function fits OLS models.

Syntax and use of the **lm()** function

`lm(formula, weights = w, data = dataFrame)`

- Returns a model object. This is a list of objects which result from fitting the model.
- The *formula* parameter is of the form described above.
- The *data* parameter is optional. *dataFrame* specifies the data.frame which contains the response and/or regressors to be fit. R will look in the current environment for any variables which are not found in the data.frame.
- The *weights* parameter is optional. When present, a weighted fit is done using the *w* vector as the weights.

The following example uses lm() to fit the sleepstudy data from the lme4 package.

Enter the following commands in your script and run them.

```r
library(lme4)
data(sleepstudy)
mod <- lm(Reaction ~ Days + Subject, data = sleepstudy)
```

Other than a package startup message, there are no console results from these commands.

`Loading required package: Matrix`

The **family** parameter specifies the **variance** and **link** functions which are used in the model fit. The **variance** function specifies the relationship of the variance to the mean. The **link** function defines the transformation that connects the mean of the response variable to the linear predictor. As an example, the poisson family uses the "log" link function and "\(\mu\)" as the variance function. A GLM model is defined by both the formula and the family.

The default link function (the canonical link) for a family can be changed by specifying a link in the family function. For example, if the response variable is non-negative and the variance is proportional to the mean, you would use the "identity" link with the "quasipoisson" family function. This would be specified as

`family = quasipoisson(link = "identity")`

The variance function is specified by the family.

Syntax and use of the **glm()** function

`glm(formula, family = family, weights = w, data = dataFrame)`

- Returns a model object. This is a list of objects which result from fitting the model.
- The *formula*, *dataFrame*, and *w* parameters are as defined for the lm() function above.
- The *family* parameter defines the family used by glm(). Some common families are binomial, poisson, gaussian, quasi, quasibinomial, and quasipoisson.

The following example uses glm() to fit the Cowles data set from the car package.

Enter the following commands in your script and run them.

```r
library(car)
data(Cowles)
modglm <- glm(volunteer ~ sex + extraversion * neuroticism,
              family = binomial, data = Cowles)
```

There are no console results from these commands.

The lmer() and glmer() functions from the lme4 package are used when random effects are included in a model.

Random effects are specified through an extension of R formulas. The specification of random effects is done in two parts. The first part identifies the intercepts and slopes which are to be modelled as random. The second part of the specification identifies the sampled levels. The sampled levels are given by a categorical variable, or a formula expression which evaluates to a categorical variable. The categorical variable given in the random effect specification is the **groups** identifier for the random parameter.

These two parts are placed inside parentheses, (), and are separated by a vertical bar, "|". The syntax for including a random effect in a formula is shown below.

`+ (effect expression | groups)`

A random intercept or slope will be modelled for each term provided in the effect expression. R will automatically include a random intercept when a random slope is provided.
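A few illustrative random-effect specifications, written with the sleepstudy variable names used in this article (these are just formula objects; they do not fit anything on their own):

```r
# Random intercept only: each Subject gets its own intercept
f1 <- Reaction ~ Days + (1 | Subject)

# Random slope for Days; the random intercept is included automatically
f2 <- Reaction ~ Days + (Days | Subject)

# Random slope for Days with the automatic random intercept suppressed
f3 <- Reaction ~ Days + (0 + Days | Subject)
```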

Linear mixed models can be fit using REML in place of maximum likelihood, and REML is the default setting for lmer(). When maximum likelihood is needed, the REML parameter of lmer() can be set to FALSE.

The glmer() function is used to include random effects in generalized models. The family information is specified to the glmer() function using the same family parameter definition as the glm() function.

The following example uses lmer() to fit the sleepstudy data with subject as the groups for a random variable.

Enter the following command in your script and run it.

`modre <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)`

There are no console results from this command.

R supports a number of commonly used criteria for selecting variables. These include BIC, AIC, F-tests, likelihood ratio tests, and adjusted R squared. Adjusted R squared is returned in the summary of the model object and will be covered with the summary() function below.

The drop1() function compares all models that can be constructed by dropping a single model term. The add1() function compares all models that can be constructed by adding a single term. The step() function repeats drop1() and add1() steps until the optimal AIC value is reached.
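As a self-contained sketch (using the built-in mtcars data rather than the models above), step() can automate the repeated drop1()/add1() search:

```r
# Start from a model with several candidate regressors
full <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Repeatedly drop and add terms until AIC stops improving;
# trace = 0 suppresses the step-by-step console output
best <- step(full, direction = "both", trace = 0)

# The selected model is an ordinary lm object
formula(best)
```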

Syntax for the **drop1()**, **add1()**, and **step()** functions

`drop1(modelObj, test = test)`

`add1(modelObj, scope = scope, test = test)`

`step(modelObj, scope = scope, test = test, direction = direction)`

- The *scope* parameter is a formula that provides the largest model to consider when adding terms.
- The *test* parameter can be set to "F" or "LRT". The default test is AIC.
- The *direction* parameter can be "both", "backward", or "forward".

The following example does an F-test of the terms of the OLS model from above, and a likelihood ratio test of several candidate additions to the GLM model from above.

Enter the following command in your script and run it.

```r
drop1(mod, test = "F")
add1(modglm, scope = ~ (sex + extraversion + neuroticism)^2, test = "LRT")
```

The console results from these commands.

```
Single term deletions

Model:
Reaction ~ Days + Subject
        Df Sum of Sq    RSS    AIC F value    Pr(>F)    
<none>                154634 1254.0                      
Days     1    162703 317336 1381.5 169.401 < 2.2e-16 ***
Subject 17    250618 405252 1393.5  15.349 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

```
Single term additions

Model:
volunteer ~ sex + extraversion * neuroticism
                 Df Deviance    AIC      LRT Pr(>Chi)
<none>                1897.4 1907.4                  
sex:extraversion  1   1897.2 1909.2 0.232369   0.6298
sex:neuroticism   1   1897.4 1909.4 0.008832   0.9251
```

The anova() function is similar to the drop1() and add1() functions. The difference is that anova() takes the models to be compared as parameters instead of generating them itself. That is, to use anova(), the models must be built by the user first.
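A self-contained sketch of anova() with user-built nested models, using the built-in mtcars data:

```r
# The user builds both models first; anova() only compares them
small <- lm(mpg ~ wt, data = mtcars)
large <- lm(mpg ~ wt + hp, data = mtcars)

# F-test of the nested models: does adding hp significantly improve the fit?
anova(small, large)
```

The smaller model must be nested inside the larger one for the F-test (or likelihood ratio test, for GLMs) to be meaningful.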

The AIC() and BIC() functions provide the AIC and BIC values for a model. The user can then compare these values to the values from other models being considered.

Syntax for the **AIC()** and **BIC()** functions

`AIC(modelObj)`

`BIC(modelObj)`

The following example calculates the AIC and BIC for the OLS model from above.

Enter the following commands in your script and run them.

`AIC(mod) BIC(mod)`

The console results from these commands.

`AIC(mod)`

`[1] 1766.872`

`BIC(mod)`

`[1] 1830.731`

A few basic diagnostics are provided in this article. A more complete coverage of diagnostics can be found in the Regression Diagnostics article.

The plot() function provides a set of diagnostic plots for model objects.

Syntax and use of the **plot()** function for model objects

`plot(modObj, which = plotId)`

- There is no returned object. Diagnostic plots are generated by this function.
- *plotId* can take values from 1 through 6. The three plots we will examine are 1, the residual plot; 2, the normal Q-Q plot of the residuals; and 5, the residuals versus leverage plot. The default for *plotId* is c(1, 2, 3, 5). When multiple plots are produced from one call to plot(), you will be prompted in the console pane to hit Enter before each plot is displayed. This allows you time to view, and possibly save, each plot before the next one is displayed.

This set of prepared diagnostic plots provides an initial look at all the major model assumptions and is a good place to start the evaluation of the fit of a model.

The following example produces the residual, normality, and leverage plots for the OLS model.

Enter the following commands in your script and run them.

```r
plot(mod, which = 1)
plot(mod, which = 2)
plot(mod, which = 5)
```

There are no console results from these commands. The residual, normal Q-Q, and residuals versus leverage plots are produced.


The summary() function provides a nice summary of a model object. You could also use the str() function to see the details of what is included in the model object.

The following examples display the summary for the three models created above.

Enter the following command in your script and run it.

`summary(mod)`

The results of the above command are shown below.

```
Call:
lm(formula = Reaction ~ Days + Subject, data = sleepstudy)

Residuals:
     Min       1Q   Median       3Q      Max 
-100.540  -16.389   -0.341   15.215  131.159 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 295.0310    10.4471  28.240  < 2e-16 ***
Days         10.4673     0.8042  13.015  < 2e-16 ***
Subject309 -126.9008    13.8597  -9.156 2.35e-16 ***
Subject310 -111.1326    13.8597  -8.018 2.07e-13 ***
Subject330  -38.9124    13.8597  -2.808 0.005609 ** 
Subject331  -32.6978    13.8597  -2.359 0.019514 *  
Subject332  -34.8318    13.8597  -2.513 0.012949 *  
Subject333  -25.9755    13.8597  -1.874 0.062718 .  
Subject334  -46.8318    13.8597  -3.379 0.000913 ***
Subject335  -92.0638    13.8597  -6.643 4.51e-10 ***
Subject337   33.5872    13.8597   2.423 0.016486 *  
Subject349  -66.2994    13.8597  -4.784 3.87e-06 ***
Subject350  -28.5311    13.8597  -2.059 0.041147 *  
Subject351  -52.0361    13.8597  -3.754 0.000242 ***
Subject352   -4.7123    13.8597  -0.340 0.734300    
Subject369  -36.0992    13.8597  -2.605 0.010059 *  
Subject370  -50.4321    13.8597  -3.639 0.000369 ***
Subject371  -47.1498    13.8597  -3.402 0.000844 ***
Subject372  -24.2477    13.8597  -1.750 0.082108 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 30.99 on 161 degrees of freedom
Multiple R-squared:  0.7277,	Adjusted R-squared:  0.6973 
F-statistic: 23.91 on 18 and 161 DF,  p-value: < 2.2e-16
```

The summary display starts with the call to lm which generated the model object.

The residual summary is the five number summary for the residuals. This can be used as a quick check for skewed residuals.

The coefficient's summary shows the estimated value, standard error, and p-value for each coefficient. The p-values are from Wald tests of each coefficient being equal to zero. For OLS models this is equivalent to an F-test of nested models with the variable of interest being removed in the nested model.

The display ends with information on the model fit. This is the residual standard error, the R squared of the model, and the F-test of the significance of the model versus the null model.

Enter the following command in your script and run it.

`summary(modglm)`

The results of the above command are shown below.

```
Call:
glm(formula = volunteer ~ sex + extraversion * neuroticism, 
    family = binomial, data = Cowles)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.4749  -1.0602  -0.8934   1.2609   1.9978  

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)              -2.358207   0.501320  -4.704 2.55e-06 ***
sexmale                  -0.247152   0.111631  -2.214  0.02683 *  
extraversion              0.166816   0.037719   4.423 9.75e-06 ***
neuroticism               0.110777   0.037648   2.942  0.00326 ** 
extraversion:neuroticism -0.008552   0.002934  -2.915  0.00355 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1933.5  on 1420  degrees of freedom
Residual deviance: 1897.4  on 1416  degrees of freedom
AIC: 1907.4

Number of Fisher Scoring iterations: 4
```

The summary display for glm models includes similar call, residuals, and coefficient sections. The glm model fit summary includes dispersion, deviance, and iteration information.

Enter the following command in your script and run it.

`summary(modre)`

The results of the above command are shown below.

```
Linear mixed model fit by REML ['lmerMod']
Formula: Reaction ~ Days + (Days | Subject)
   Data: sleepstudy

REML criterion at convergence: 1743.6

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.9536 -0.4634  0.0231  0.4634  5.1793 

Random effects:
 Groups   Name        Variance Std.Dev. Corr
 Subject  (Intercept) 612.09   24.740       
          Days         35.07    5.922   0.07
 Residual             654.94   25.592       
Number of obs: 180, groups:  Subject, 18

Fixed effects:
            Estimate Std. Error t value
(Intercept)  251.405      6.825   36.84
Days          10.467      1.546    6.77

Correlation of Fixed Effects:
     (Intr)
Days -0.138
```

The summary display for random effects models includes similar call, residuals, and coefficient sections.

The summary includes a random effects section. This section displays the estimated variances organized by groups. Within each group there is a row for each random intercept and slope, giving the variance and standard deviation of the random parameter. The correlation between random effects is also provided when it is estimated. For linear models, the residual variance is included. For generalized models whose families have a defined error variance, such as the logistic and poisson families, no residual information is provided since the model is scaled to the defined error variance.

The mixed model fit summary includes convergence information. Here this is the REML criterion at convergence.

The confint() function can be applied to all of the above model object types. The confint() function will calculate a profiled confidence interval when it is appropriate.

The following example displays the confidence intervals for the three models created above.

Enter the following commands in your script and run them.

`confint(mod) confint(modglm) confint(modre)`

The results of the above command are shown below.

`confint(mod)`

```
                 2.5 %     97.5 %
(Intercept)  274.399941 315.66215
Days           8.879103  12.05547
Subject309  -154.271100 -99.53060
Subject310  -138.502810 -83.76231
Subject330   -66.282660 -11.54216
Subject331   -60.068030  -5.32753
Subject332   -62.202010  -7.46151
Subject333   -53.345770   1.39473
Subject334   -74.202030 -19.46153
Subject335  -119.434040 -64.69354
Subject337     6.216930  60.95743
Subject349   -93.669610 -38.92911
Subject350   -55.901400  -1.16090
Subject351   -79.406330 -24.66583
Subject352   -32.082540  22.65796
Subject369   -63.469440  -8.72894
Subject370   -77.802310 -23.06181
Subject371   -74.520040 -19.77954
Subject372   -51.617950   3.12255
```

`confint(modglm)`

`Waiting for profiling to be done...`

```
                               2.5 %       97.5 %
(Intercept)              -3.35652914 -1.389154923
sexmale                  -0.46642058 -0.028694911
extraversion              0.09374678  0.241771712
neuroticism               0.03744357  0.185227757
extraversion:neuroticism -0.01434742 -0.002833714
```

`confint(modre)`

`Computing profile confidence intervals ...`

```
                  2.5 %     97.5 %
.sig01       14.3815822  37.715996
.sig02       -0.4815007   0.684986
.sig03        3.8011641   8.753383
.sigma       22.8982669  28.857997
(Intercept) 237.6806955 265.129515
Days          7.3586533  13.575919
```

The predict() function is used to determine the predicted values for a particular set of values of the regressors. The predict() function can also return the confidence interval or prediction interval with the predictions for OLS models.

Syntax and use of the **predict()** function

`predict(modelObj, newObs, interval = intervaltype, level = level, type = type)`

- *modelObj* is an object returned from a regression function.
- The *newObs* parameter is optional. If it is not provided, the predictions will be for the observed values the model was fit to. *newObs* is a data.frame with the same columns as were used in *modelObj*.
- The *intervaltype* parameter is available for OLS models. It can be set to "none", "confidence", or "prediction". The default is "none", no interval; alternatively a confidence interval or a prediction interval can be returned.
- The *level* parameter is the confidence or prediction level.
- The *type* parameter is used with generalized linear models. The default value is "link", for predictions on the scale of the linear predictor. It can be set to "response" for predictions on the scale of the response variable.

The following example makes predictions for each of the three models from above.

Enter the following commands in your script and run them.

```r
newObs <- data.frame(Days = c(10, 8),
                     Subject = c("331", "372"))
predict(mod, newObs, interval = "prediction")

newObsGlm <- data.frame(sex = c("male"),
                        extraversion = c(8),
                        neuroticism = c(15))
predict(modglm, newObsGlm, type = "response")

predict(modre, newObs, type = "response")
```

The results of the above commands are shown below.

`predict(mod, newObs, interval = "prediction")`

```
       fit      lwr      upr
1 367.0061 302.2256 431.7867
2 354.5216 290.0925 418.9508
```

`predict(modglm, newObsGlm, type = "response")`

```
        1 
0.3462704 
```

`predict(modre, newObs, type = "response")`

```
       1        2 
347.6392 357.7302 
```

Extractor functions are the preferred method for retrieving information on the model. Some commonly used extractor functions are listed below.

- fitted(): Returns the predicted values for the observations in the data set used to fit the model.
- residuals(): Returns the residuals from the fitted model.
- hatvalues(): Returns the hat values, leverage measures, that result from fitting the model.
- cooks.distance() and influence(): Return the Cook's distances or a set of influence measures that result from fitting the model.
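A self-contained sketch of these extractor functions, applied to an OLS model fit to the built-in mtcars data:

```r
fit <- lm(mpg ~ wt, data = mtcars)

head(fitted(fit))          # predicted values for the fitted observations
head(residuals(fit))       # residuals from the fitted model
head(hatvalues(fit))       # leverage of each observation
head(cooks.distance(fit))  # influence of each observation on the fit

# The fitted values and residuals together reconstruct the response
all.equal(unname(fitted(fit) + residuals(fit)), mtcars$mpg)
# TRUE
```

Using these extractors is preferred over reaching into the model object directly (e.g. `fit$residuals`), since they work consistently across lm, glm, and merMod objects.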

To learn more about fitting regression models in R see R for Researchers.

Last Revised: 3/28/2017