Page not found – Social Science Computing Cooperative – UW–Madison

It looks like nothing was found at this location. Maybe try a search?

4 Regression

This chapter is under construction.

4.1 Overview

This chapter is designed for individuals who need only a cursory introduction to regression using R. Included is how to specify model formulas, run Ordinary Least Squares and Generalized Linear Models, and examine results and diagnostics. This does not cover the theory for regressions.

The examples in this chapter use the sleepstudy data frame from the lme4 library and the Cowles data frame from the car library. You will need both of these packages loaded on your computer to run these examples. If you are on an SSCC computer, these packages are already loaded, if you have installed R from the software center.

# eval must be FALSE here, otherwise functions from the car package
# overwrite functions from the tidyverse
library(car)
library(lme4)
data(sleepstudy)
data(Cowles)

4.2 Specifying a model

4.2.1 Formula

The response variable and model variables are specified as an R formula. The basic form of a formula is

\[response \sim term_1 + \cdots + term_p.\]

The tilda, ~, separates the response variable, on the left, from the terms of the model, which are on the right. Terms are are either the main effect of a regressor or the interaction of regressors. Model variables are constructed from terms. For example, a factored regressor, R's implementation of a categorical variable, can be used as term in a model and it results in multiple model variables, one for each non reference level category. There is one coefficient for each model variable and possibly multiple coefficients for a term.

The mathematical symbols, +, -, *, and ^ do not have their normal mathematical meaning in a formula. The + symbol is used to add additional terms to a fromula, The - symbol is used to remove a term from a formula. The decriptions for * and ^ are included with the formula shortcuts below.

Some common terms are

  • Numeric regressor: All numeric variable types result in a single continuous model variable.
  • Logical regressor: Results in a single indicator, also known as dummy variable, in the model.
  • Factored regressor: Results in a set of indicator variables in the model. The first level of the factor is the reference level and is not coded as an indicator variable. All other levels are encoded as indicator variables.
  • \(term_1:term_2\): Creates a term from the interaction between terms \(term_1\) and \(term_2\). The : operator does not force the main effect terms, \(term_1\) and \(term_2\), into the model.

Some formula short cuts for specifying terms.

  • \(term_1*term_2\): This results in the same interaction term as \(term_1:term_2\). The * operator forces the main effect terms \(term_1\) and \(term_2\) into the model.
  • \((term_1 + \cdots + term_j)\)^\(k\): Creates a term for each of the \(j\) terms and all interactions up to order \(k\) which can be formed from the \(j\) terms.
  • I(\(expression\)): The I() function is used when you need to use +, -, *, or ^ as math symbols to construct a term in the formula. This is commonly used to construct a quadratic term.
  • poly(x, degree=k): Creates a term which is a jth order polynomial of the regressor x. Poly() constructs k model variables which are orthogonal to each other. A poly term, like a factor, is a single term which translates into multiple model variables. These behaviors, orthogonality of the model variables and grouping the k model variables, are advantages when doing variable selection. These orthogonal model variables are difficult to interpret since they are not on the scale of the original regressors. Therefore it is common practice to use poly for model selection and then rerun the selected model using polynomial terms constructed from the regressor after the variable selection process is complete.
  • -1: Removes the intercept from the model. The intercept is included in a model by default in R.

As an example consider the formula, y ~ A + B*C. Where A and B are numeric regressors and C is a categorical regressors with three levels, level1, level2, and level3. This formula results in the following terms and model variables.

  • (Intercept): This term is a model constant.
    • (Intercept): The constant term.
  • A: This term results in a single continuous model variable.
    • A: A continuous model variable.
  • B: This term results in a single continuous model variable.
    • B: A continuous model variable.
  • C: This term results in two model variables.
    • Clevel2: An indicator variable that represents a change in the intercept.
    • Clevel3: An indicator variable that represents a change in the intercept.
  • B:C: This term results in two model variables.
    • B:Clevel2: An indicator variable that represents a change in the B slope value.
    • B:Clevel3: An indicator variable that represents a change in the B slope value.

4.3 Running regressions

4.3.1 Linear models - Ordinary Least Squares models (OLS)

The lm() function fits OLS models.

  • Syntax and use of the lm() function

    lm(formula, weights=w, data=dataFrame)

    Returns a model object. This is a list of objects which result from fitting the model.

    The formula parameter is of the form described above.

    The data parameter is optional. dataFrame specifies the data.frame which contains the response and/or regressors to be fit. R will look in the current environment for variables which are not found in the data.frame.

    The weights parameter is optional. When present, a weighted fit is done using the w vector as the weights.

The following example uses lm() to fit the sleepstudy data from the lme4 package.

  1. Fit an OLS model.

    mod <- lm(Reaction ~ Days + Subject, data = sleepstudy)

4.3.2 Generalized linear models (GLM)

The glm() function fits a GLM model. The family parameter specifies the variance and link functions which are used in the model fit. The variance function specifies the relationship of the variance to the mean. The link defines the transformation to be applied to the response variable. As an example, the family poisson uses the "log" link function and "\(\mu\)" as the variance function. A GLM model is defined by both the formula and the family.

The default link function (the canonical link) for a family can be changed by specifying a link in the family function. For example, if the response variable is non negative and the variance is proportional to the mean, you migh use the "log" link with the "quasipoisson" family function. This would be specified as

family = quasipoisson()

The variance function is specified by the family and log is the canonical link for poisson.

  • Syntax and use of the glm() function

    glm(formula, family = family, weights = w, data = dataFrame)

    Returns a model object. This is a list of objects which result from fitting the model.

    The formula, dataFrame, and w parameters are as defined for the lm() function above.

    The family parameter defines the family used by glm(). Some common families are binomial, poisson, gaussian, quasi, quasibinomial, and quasipoisson.

The following example uses glm() to fit the Cowles data set from the car package.

  1. Fit an GLM model.

    modglm <- glm(volunteer ~ sex + extraversion * neuroticism,
                  family = binomial,
                  data = Cowles
              )

4.4 Variable selection functions

R supports a number of commonly used criteria for selecting variables. These include BIC, AIC, F-tests, likelihood ratio tests and adjusted R squared. Adjusted R squared is returned in the summary of the model object and will be cover with the summary() function below.

The drop1() function compares all possible models that can be constructed by dropping a single model term. The add1() function compares all possible models that can be constructed by adding a term. The step() function does repeated drop1() and add1() until the optimal AIC value is reached.

  • Syntax for the drop1(), add1(), and step() functions.

    drop1(modelObj, test = test)
    add1(modelObj, scope = scope, test = test) step(modelObj, scope = scope, test = test, direction = direction )

    The scope parameter is a formula that provides the largest model to consider for adding a term.

    The test parameter can be set to "F" or "LRT". The default test is AIC.

    The direction parameter can be "both", "backward", or "forward".

The following example does an F-test of the terms of the OLS model from above and a likelihood ratio test for several possible terms to the GLM model from above.

  1. Using drop1() and add1().

    drop1(mod, test = "F")
    Single term deletions
    
    Model:
    Reaction ~ Days + Subject
            Df Sum of Sq    RSS    AIC F value    Pr(>F)    
    <none>               154634 1254.0                      
    Days     1    162703 317336 1381.5 169.401 < 2.2e-16 ***
    Subject 17    250618 405252 1393.5  15.349 < 2.2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    add1(modglm, scope = ~ (sex + extraversion + neuroticism)^2, test = "LRT")
    Single term additions
    
    Model:
    volunteer ~ sex + extraversion * neuroticism
                     Df Deviance    AIC      LRT Pr(>Chi)
    <none>                1897.4 1907.4                  
    sex:extraversion  1   1897.2 1909.2 0.232369   0.6298
    sex:neuroticism   1   1897.4 1909.4 0.008832   0.9251

The anova() function is similar to the drop1() and add1() functions. The difference is that anova takes as parameters the models to be compared instead of the models being generated by the function. That is, to use anova(), the models must be built by the user first.

The AIC() and BIC() functions provide the AIC and BIC values for a model. The user can then compare these values to the values from other models being considered.

  • Syntax for the AIC() and BIC() functions.

    AIC(modelObj)
    BIC(modelObj)

The following example calculates the AIC and BIC for the OLS model from above.

  1. Getting AIC and BIC.

    AIC(mod)
    [1] 1766.872
    BIC(mod)
    [1] 1830.731

4.5 Diagnostics

A few basic diagnostics are provided in this article. A more complete coverage of diagnostics can be found in the Regression Diagnostics article.

The plot() function provided a set of diagnostic plots for model objects.

  • Syntax and use of the plot() function for model objects.

    plot(modObj, which=plotId)

    There is no returned object. Diagnostics plots are generated by this function.

    plotId can take values from 1 through 6. The three plots we will examine are, 1 for a residual plot, 2 for the normal q-q of residuals, and 5 for the residual versus leverage plots. The default for plotId is c(1,2,3,5). When there are multiple plots produced from one call to plot, you will be prompted in the console pane to hit enter for each of these plots. This is to allow you time to view and possibly save each plot before the next plot is displayed.

This set of prepared diagnostic plots provides an initial look at all the major model assumptions and is a good place to start the evaluation of the fit of a model.

The following example produces the residual, normality, and leverage plots for the OLS model.

  1. Creating a residual plot.

    plot(mod, which = 1)

  2. Creating a normal q-q plot.

    A q-q plot is a plot of the emperical quantiles to the theoritical quantiles.

    plot(mod, which = 2)

  3. Creating a leverage and influence plot.

    plot(mod, which = 5)

4.6 Results

4.6.1 summary() function

The summary() function provides a nice summary of a model object. You could also use the str() function to see the details of what is included in the model object.

The following examples display the summary for the three models created above.

  • Displaying the summary of the linear model from above.

    
    Call:
    lm(formula = Reaction ~ Days + Subject, data = sleepstudy)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -100.540  -16.389   -0.341   15.215  131.159 
    
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
    (Intercept)  295.0310    10.4471  28.240  < 2e-16 ***
    Days          10.4673     0.8042  13.015  < 2e-16 ***
    Subject309  -126.9008    13.8597  -9.156 2.35e-16 ***
    Subject310  -111.1326    13.8597  -8.018 2.07e-13 ***
    Subject330   -38.9124    13.8597  -2.808 0.005609 ** 
    Subject331   -32.6978    13.8597  -2.359 0.019514 *  
    Subject332   -34.8318    13.8597  -2.513 0.012949 *  
    Subject333   -25.9755    13.8597  -1.874 0.062718 .  
    Subject334   -46.8318    13.8597  -3.379 0.000913 ***
    Subject335   -92.0638    13.8597  -6.643 4.51e-10 ***
    Subject337    33.5872    13.8597   2.423 0.016486 *  
    Subject349   -66.2994    13.8597  -4.784 3.87e-06 ***
    Subject350   -28.5311    13.8597  -2.059 0.041147 *  
    Subject351   -52.0361    13.8597  -3.754 0.000242 ***
    Subject352    -4.7123    13.8597  -0.340 0.734300    
    Subject369   -36.0992    13.8597  -2.605 0.010059 *  
    Subject370   -50.4321    13.8597  -3.639 0.000369 ***
    Subject371   -47.1498    13.8597  -3.402 0.000844 ***
    Subject372   -24.2477    13.8597  -1.750 0.082108 .  
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 30.99 on 161 degrees of freedom
    Multiple R-squared:  0.7277,    Adjusted R-squared:  0.6973 
    F-statistic: 23.91 on 18 and 161 DF,  p-value: < 2.2e-16

    The summary display starts with the call to lm which generated the model object.

    The residual summary is the five number summary for the residuals. This can be used as a quick check for skewed residuals.

    The coefficient's summary shows the estimated value, standard error, and p-value for each coefficient. The p-values are from Wald tests of each coefficient being equal to zero. For OLS models this is equivalent to an F-test of nested models with the variable of interest being removed in the nested model.

    The display ends with information on the model fit. This is the residual standard error, R squared of the model, and the F-test of the significance of the model verse the null model.

  • Displaying the summary of the GLM model from above.

    summary(modglm)
    
    Call:
    glm(formula = volunteer ~ sex + extraversion * neuroticism, family = binomial, 
        data = Cowles)
    
    Deviance Residuals: 
        Min       1Q   Median       3Q      Max  
    -1.4749  -1.0602  -0.8934   1.2609   1.9978  
    
    Coefficients:
                              Estimate Std. Error z value Pr(>|z|)    
    (Intercept)              -2.358207   0.501320  -4.704 2.55e-06 ***
    sexmale                  -0.247152   0.111631  -2.214  0.02683 *  
    extraversion              0.166816   0.037719   4.423 9.75e-06 ***
    neuroticism               0.110777   0.037648   2.942  0.00326 ** 
    extraversion:neuroticism -0.008552   0.002934  -2.915  0.00355 ** 
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    (Dispersion parameter for binomial family taken to be 1)
    
        Null deviance: 1933.5  on 1420  degrees of freedom
    Residual deviance: 1897.4  on 1416  degrees of freedom
    AIC: 1907.4
    
    Number of Fisher Scoring iterations: 4

    The summary display for glm models includes similar call, residuls, and coefficient sections. The glm model fit summary includes dispersion, deviance, and iteration information.

4.6.2 confint() function

The confint() function can be applied to all of the above model object types. The confint() function will calculate a profiled confidence interval when it is appropriate.

The following example displays the confidence intervals for the three models created above.

  • Displaying the confidence intervals for both of the models.

    confint(mod)
                      2.5 %    97.5 %
    (Intercept)  274.399941 315.66215
    Days           8.879103  12.05547
    Subject309  -154.271100 -99.53060
    Subject310  -138.502810 -83.76231
    Subject330   -66.282660 -11.54216
    Subject331   -60.068030  -5.32753
    Subject332   -62.202010  -7.46151
    Subject333   -53.345770   1.39473
    Subject334   -74.202030 -19.46153
    Subject335  -119.434040 -64.69354
    Subject337     6.216930  60.95743
    Subject349   -93.669610 -38.92911
    Subject350   -55.901400  -1.16090
    Subject351   -79.406330 -24.66583
    Subject352   -32.082540  22.65796
    Subject369   -63.469440  -8.72894
    Subject370   -77.802310 -23.06181
    Subject371   -74.520040 -19.77954
    Subject372   -51.617950   3.12255
    confint(modglm)
    Waiting for profiling to be done...
                                   2.5 %       97.5 %
    (Intercept)              -3.35652914 -1.389154923
    sexmale                  -0.46642058 -0.028694911
    extraversion              0.09374678  0.241771712
    neuroticism               0.03744357  0.185227757
    extraversion:neuroticism -0.01434742 -0.002833714

4.6.3 predict() function

The predict() function is used to determine the predicted values for a particular set of values of the regressors. The predict() function can also return the confidence interval or prediction interval with the predictions for OLS models.

  • Syntax and use of the predict() function

    predict(modelObj, newObs, interval = intervaltype, level = level, type = type)

    The modelObj is an an object returned from a regression function.

    The newObs parameter is optional. If it is not provided, the predictions will be for the observed values the model was fit to. The form of newObs is a data.frame with the same columns as used in modelObj.

    The intervaltype parameter is available for OLS models. It can be set to "none", "confidence", or "prediction". The default is none, no interval, and alternatively it can be a confidence interval or a prediction interval.

    The level parameter is the confidence or prediction level.

    The type parameter is used with geralized linear models. The default value is "link", for the linear predictor scale. It can be set to "response" for predictions on the scale of the response variable.

The following example makes predictions for each of the three models from above.

  • Predicting new observations.

    Predicting subjects at 331 at 10 days and 372 at 8 days.

    newObs <- data.frame(Days = c(10, 8),
                         Subject = c("331", "372")
                        )
    predict(mod, newObs, interval = "prediction")
           fit      lwr      upr
    1 367.0061 302.2256 431.7867
    2 354.5216 290.0925 418.9508

    Predicting a male with an 8 for extraversion and 15 for neuroticism.

    newObsGlm <- data.frame(sex = c("male"),
                            extraversion = c(8),
                            neuroticism = c(15)
                           )
    predict(modglm, newObsGlm, type = "response")
            1 
    0.3462704 

4.6.4 Extractors

Extractor functions are the preferred method for retrieving information on the model. Some commonly used extractor functions are listed below.

  • fitted()

    The fitted() function returns the predicted values for the observation in the data set used to fit the model.

  • residual()

    The residual() function returns the residual values from the fitted model.

  • hatvalues()

    The hatvalues() function returns the hat values, leverage measures, that result from fitting the model.

  • Influence measures

    The cooks.distance() and influence() functions returns the Cook's distance or a set of influence measures that resulted from fitting the model.

Page not found – Social Science Computing Cooperative – UW–Madison

It looks like nothing was found at this location. Maybe try a search?