SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

20 Formulas

Many commands in R are specified in terms of formulas. A formula has a tilde ( ~ ), left-hand side terms and right-hand side terms.

y ~ x

The terms are composed of data object names or expressions. These are usually variables in a data frame, but they can also include matrices, or the results of embedded expressions like log(x). Terms are connected by a variety of math-like symbols that have their own algebra.

y ~ x + log(z)

One caveat is that long expressions in place of short variable names makes output very difficult to read!

Formulas have their own distinct class, and can even be saved as “formula”-class objects.

20.1 Plot Formula

For example, a scatterplot can be specified as a formula with the plot() function. Where we can specify a formula we can almost always also specify a data= parameter to point to the source of the variables.

First, load some example data and look at it.

load(url("http://www.ssc.wisc.edu/~hemken/Rworkshops/hsb.RData"))
head(hsb)
   id gender  race    ses schtyp     prog read write math
1  70   male white    low public  general   57    52   41
2 121 female white middle public vocation   68    59   53
3  86   male white   high public  general   44    33   54
4 141   male white   high public vocation   63    44   47
5 172   male white middle public academic   47    52   57
6 113   male white middle public academic   44    52   51
  science socst
1      47    57
2      63    61
3      58    31
4      53    56
5      53    61
6      63    61
plot(math ~ read, data=hsb)

class(math ~ read)
[1] "formula"

20.2 T-test Formula

We have already seen formulas used in t-tests.

f <- formula(read ~ gender)
t.test(f, data=hsb)

    Welch Two Sample t-test

data:  read by gender
t = -0.74506, df = 188.46, p-value = 0.4572
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.976725  1.796263
sample estimates:
mean in group female   mean in group male 
            51.73394             52.82418 
plot(f, data=hsb)

20.3 Linear Model Formula

Formulas are the central element in specifying regression models, using lm() and a variety of other modeling functions.

20.3.1 Regression

lm(math ~ read, data=hsb)

Call:
lm(formula = math ~ read, data = hsb)

Coefficients:
(Intercept)         read  
    21.0382       0.6051  

20.3.2 ANOVA

Anova models work in the same way.

plot(math ~ prog, data=hsb)

fit2 <- lm(math ~ prog, data=hsb)
summary(fit2)

Call:
lm(formula = math ~ prog, data = hsb)

Residuals:
     Min       1Q   Median       3Q      Max 
-18.7333  -6.4200  -0.5767   6.2667  28.5800 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   56.7333     0.8068  70.321  < 2e-16 ***
proggeneral   -6.7111     1.4730  -4.556 9.12e-06 ***
progvocation -10.3133     1.4205  -7.260 8.73e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.267 on 197 degrees of freedom
Multiple R-squared:  0.2291,    Adjusted R-squared:  0.2213 
F-statistic: 29.28 on 2 and 197 DF,  p-value: 7.364e-12

Notice that the intercept is automatically included. While prog is a single variable in our data it contributes two terms to our model. R model formulae are similar to mathematical formulae, but they are not the same thing.

20.3.3 Adding Variables Adds Terms

A model with more than one variable on the right-hand side:

library(magrittr)
lm(math ~ prog + read, data=hsb) %>%
  summary

Call:
lm(formula = math ~ prog + read, data = hsb)

Residuals:
     Min       1Q   Median       3Q      Max 
-21.7994  -4.6484  -0.8686   4.8846  19.9834 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  27.99519    2.96929   9.428  < 2e-16 ***
proggeneral  -3.43297    1.24908  -2.748  0.00655 ** 
progvocation -5.21581    1.27015  -4.106  5.9e-05 ***
read          0.51170    0.05155   9.927  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.761 on 196 degrees of freedom
Multiple R-squared:  0.487, Adjusted R-squared:  0.4792 
F-statistic: 62.03 on 3 and 196 DF,  p-value: < 2.2e-16

20.3.4 Interaction Terms

Interactions may be specified a couple of different ways. An asterisk, *, means to include the higher order term plus all the related lower order terms. Alternatively, specific interaction terms may be specified with the colon, :.

lm(math ~ prog * read, data=hsb)

Call:
lm(formula = math ~ prog * read, data = hsb)

Coefficients:
      (Intercept)        proggeneral       progvocation  
          21.3612            12.8386             6.2033  
             read   proggeneral:read  progvocation:read  
           0.6298            -0.3118            -0.2217  
# the same model, written out
lm(math ~ prog + read + prog:read, data=hsb)

Call:
lm(formula = math ~ prog + read + prog:read, data = hsb)

Coefficients:
      (Intercept)        proggeneral       progvocation  
          21.3612            12.8386             6.2033  
             read   proggeneral:read  progvocation:read  
           0.6298            -0.3118            -0.2217  

Here again, the variable prog contributes multiple terms, and the interaction prog:read contributes multiple interaction terms.

20.3.5 Updates for Formulas

Models may be modified - terms added or removed - through the use of the update() function. In this context, a period, ., represents all the terms included in the previous model.

m1 <- lm(math ~ prog * read, data=hsb)
m1

Call:
lm(formula = math ~ prog * read, data = hsb)

Coefficients:
      (Intercept)        proggeneral       progvocation  
          21.3612            12.8386             6.2033  
             read   proggeneral:read  progvocation:read  
           0.6298            -0.3118            -0.2217  
update(m1, . ~ . -prog:read) # removes two model terms

Call:
lm(formula = math ~ prog + read, data = hsb)

Coefficients:
 (Intercept)   proggeneral  progvocation          read  
     27.9952       -3.4330       -5.2158        0.5117  

In a formula, a minus sign, -, is used to remove a term from a model.

20.3.6 Inhibit Formula Interpretation

In formulas, the plus sign, +, means to add terms, but we may also want to represent simple element-wise addition of vectors in a formula.

For example, one way to constrain the coefficients of several model terms to be equal is to combine the variables into a single model term. To do this, we use the inhibit function, I(). Expressions within the I() function are interpreted as general R expressions, and not as terms within the formula.

m3 <- lm(math ~ read+write+science+socst, hsb)
coefficients(m3) # to see why we might think some are equal
(Intercept)        read       write     science       socst 
 8.96274058  0.27068453  0.22616140  0.25389639  0.08480508 
m4 <- lm(math ~ I(read+write+science)+socst, hsb)
anova(m4, m3) # An F test for difference between models
Analysis of Variance Table

Model 1: math ~ I(read + write + science) + socst
Model 2: math ~ read + write + science + socst
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    197 7699.2                           
2    195 7691.3  2    7.9174 0.1004 0.9046

20.3.7 Polynomials

R does not use the asterisk or colon to form tensor products like other software does. This limitation means we cannot create polynomial terms the same way we create interaction terms

lm(math ~ read + read:read, data=hsb) # no good

Call:
lm(formula = math ~ read + read:read, data = hsb)

Coefficients:
(Intercept)         read  
    21.0382       0.6051  
lm(math ~ read + I(read*read), data=hsb) # but this works

Call:
lm(formula = math ~ read + I(read * read), data = hsb)

Coefficients:
   (Intercept)            read  I(read * read)  
     24.056044        0.487359        0.001106  
lm(math ~ read + I(read^2), data=hsb) # and this works

Call:
lm(formula = math ~ read + I(read^2), data = hsb)

Coefficients:
(Intercept)         read    I(read^2)  
  24.056044     0.487359     0.001106  

(Orthogonal polynomials and centered data are often used when polynomial terms are of interest.)

20.3.8 Limiting higher order interactions

Often second-order interaction terms are of interest, while higher order terms are not.

lm(math ~ read*write*science, data=hsb) %>%
  summary

Call:
lm(formula = math ~ read * write * science, data = hsb)

Residuals:
     Min       1Q   Median       3Q      Max 
-20.9229  -4.2197  -0.0111   4.3823  15.4383 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         4.859e+01  6.982e+01   0.696    0.487
read               -2.603e-02  1.505e+00  -0.017    0.986
write              -7.682e-01  1.373e+00  -0.560    0.576
science             1.251e-02  1.382e+00   0.009    0.993
read:write          1.143e-02  2.851e-02   0.401    0.689
read:science       -4.338e-03  2.873e-02  -0.151    0.880
write:science       1.027e-02  2.632e-02   0.390    0.697
read:write:science -2.295e-05  5.301e-04  -0.043    0.966

Residual standard error: 6.237 on 192 degrees of freedom
Multiple R-squared:  0.5724,    Adjusted R-squared:  0.5568 
F-statistic: 36.71 on 7 and 192 DF,  p-value: < 2.2e-16
lm(math ~ (read+write+science)^2, data=hsb) %>%
  summary

Call:
lm(formula = math ~ (read + write + science)^2, data = hsb)

Residuals:
     Min       1Q   Median       3Q      Max 
-20.8834  -4.2061  -0.0184   4.3747  15.4367 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)   
(Intercept)   45.647268  15.589767   2.928  0.00382 **
read           0.036840   0.396119   0.093  0.92600   
write         -0.711032   0.371207  -1.915  0.05691 . 
science        0.070291   0.359881   0.195  0.84535   
read:write     0.010230   0.006507   1.572  0.11755   
read:science  -0.005555   0.005968  -0.931  0.35318   
write:science  0.009163   0.006329   1.448  0.14931   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.221 on 193 degrees of freedom
Multiple R-squared:  0.5724,    Adjusted R-squared:  0.5591 
F-statistic: 43.05 on 6 and 193 DF,  p-value: < 2.2e-16

20.4 Table Formulas

The terms in a table formula serve to define the combinations of categories to be counted.

In individual level data, there is no left-hand side.

xtabs( ~ gender + schtyp, data=hsb)
        schtyp
gender   private public
  female      18     91
  male        14     77

In grouped data, the frequency counts are given on the left-hand side.

gender <- c("f","f","m","m")
schtyp <- c("pri","pub","pri","pub")
freq   <- c(18, 91, 14, 77)

df <- data.frame(gender, schtyp, freq)
df
  gender schtyp freq
1      f    pri   18
2      f    pub   91
3      m    pri   14
4      m    pub   77
xtabs(freq ~ gender + schtyp, data=df)
      schtyp
gender pri pub
     f  18  91
     m  14  77