7 Model Results

If you are starting from this page, please run the code at Libraries and Data Setup before proceeding.

Data visualizations can help us with checking our model assumptions and with understanding what our model means.

Fit a simple model that regresses income on age, race, and edu using the acs_sample dataset.

Then, add the residuals, both raw and scaled, as columns to the dataframe. ggplot() acts on a single dataframe, so it is best to have everything we would like to plot gathered into one dataframe.

mod <- lm(income ~ age + race + edu, data = acs_sample)

acs_sample$mod_res <- residuals(mod)
acs_sample$mod_res_scaled <- scale(residuals(mod))

7.1 Residuals

Ordinary least squares (OLS) regression assumed that the model residuals are normally destributed. To assess what that is true of our fitted model, we can use a QQ plot, where “QQ” stands for “quantile quantile.” A QQ plot shows the quantiles of one dataset plotted against the quantiles of another distribution. If the two distributions are the same, the points should fall along a 45-degree line. We can make a QQ plot with geom_qq(), which specifies a normal distribution by default (distribution = stats::qnorm), and we can pass our residuals to the sample argument. Then, we can add a y = x line to the plot with geom_abline(), and let’s make it red. Deviation from this red line will help us diagnose non-normality.

The coord_fixed() function makes the x and y axes’ scales equal, so that our line (and hopefully our residuals) are 45 degrees.

ggplot(acs_sample) +
  geom_abline(color = "red") +
  geom_qq(aes(sample = mod_res_scaled),
          size = .5) +
  coord_fixed()

The points in our QQ plot do not fall along our red 45-degree line, so we have reason to suspect that the residuals are non-normal. The particular deviation of our residuals suggests that our residuals are positively skewed. (For help interpreting QQ plots, check out A Q-Q Plot Dissection kit.) We used income, which tends to be positively skewed, as our outcome, so this violation of the normality assumption makes sense. We might consider transforming our outcome variable and re-running our model.

We can also create a scatterplot of residuals against a predictor to assess linearity (residual mean is zero at every level of x) and homoscedasticity (residual variance is equal at every level of x). We can add a red geom_hline() (horizontal line) in addition to a smoothed line with geom_smooth().

ggplot(acs_sample, aes(x = age, y = mod_res)) +
  geom_hline(yintercept = 0, color = "red") +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

We can also visually identify outliers in our model. There are many outlier metrics and many more rules-of-thumb for each one, so let’s just use Cook’s distance and consider something an outlier if it has a value of more than 4/n.

Add a column that contains the value from household if an observation is an outlier, according to our model and residuals. Supply this variable as the label aesthetic, which is passed onto geom_text() or geom_label().

acs_sample %>% 
  mutate(cooks = cooks.distance(mod),
         outlier_household = ifelse(cooks > 4/nrow(.), household, NA)) %>% 
  ggplot(aes(x = age, y = income, label = outlier_household)) +
  geom_point() +
  geom_label()
## Warning: Removed 225 rows containing missing values (geom_label).

ggplot’s geom_text() and geom_label() functions are centered on the datapoint, and they will overlap if datapoints are close together. Here, this makes our plot hard to read.

To correct this, use the geom_text_repel() or geom_label_repel() function from the ggrepel package to create labels that automatically avoid (i.e, are repelled by) each other. In some cases, a line will connect our label to the datapoint, as seen below with household 584480.

acs_sample %>% 
  mutate(cooks = cooks.distance(mod),
         outlier_household = ifelse(cooks > 4/nrow(.), household, NA)) %>% 
  ggplot(aes(x = age, y = income, label = outlier_household)) +
  geom_point() +
  geom_label_repel()
## Warning: Removed 225 rows containing missing values (geom_label_repel).

7.2 Margins Plots

If the residuals do not show a violation of our model assumptions (so let’s pretend we did not see the non-normal residual plot), we can produce margins plots, which show the marginal effect of a predictor, holding other variables constant.

The ggpredict() function from the ggeffects package creates a nice dataframe with marginal effects.

ggpredict(mod)
## $age
## # Predicted values of income
## 
## age | Predicted |               95% CI
## --------------------------------------
##  10 |   5204.00 | [-5467.25, 15875.24]
##  20 |   7837.91 | [-1907.05, 17582.87]
##  30 |  10471.83 | [ 1357.73, 19585.93]
##  40 |  13105.74 | [ 4263.61, 21947.87]
##  50 |  15739.66 | [ 6777.87, 24701.45]
##  60 |  18373.57 | [ 8915.35, 27831.79]
##  70 |  21007.49 | [10730.51, 31284.46]
##  90 |  26275.32 | [13666.82, 38883.82]
## 
## Adjusted for:
## * race =                 White
## *  edu = Less than High School
## 
## $race
## # Predicted values of income
## 
## race     | Predicted |               95% CI
## -------------------------------------------
## White    |  14159.31 | [ 5316.58, 23002.03]
## Black    |   5606.96 | [-7035.53, 18249.45]
## Other    |  17383.83 | [-2662.47, 37430.14]
## Hispanic |   9452.84 | [ -953.17, 19858.86]
## Asian    |  14684.09 | [-8822.55, 38190.74]
## 
## Adjusted for:
## * age =                 44.00
## * edu = Less than High School
## 
## $edu
## # Predicted values of income
## 
## edu                   | Predicted |               95% CI
## --------------------------------------------------------
## Less than High School |  14159.31 | [ 5316.58, 23002.03]
## High School           |  24520.17 | [18428.06, 30612.28]
## Some College          |  29886.60 | [23212.82, 36560.39]
## Bachelors             |  46146.77 | [37344.65, 54948.88]
## Advanced Degree       |  58790.63 | [45193.58, 72387.69]
## 
## Adjusted for:
## *  age = 44.00
## * race = White
## 
## attr(,"class")
## [1] "ggalleffects" "list"        
## attr(,"model.name")
## [1] "mod"

To select one term, specify it with the terms argument.

ggpredict(mod, terms = "age")
## # Predicted values of income
## 
## age | Predicted |               95% CI
## --------------------------------------
##  10 |   5204.00 | [-5467.25, 15875.24]
##  20 |   7837.91 | [-1907.05, 17582.87]
##  30 |  10471.83 | [ 1357.73, 19585.93]
##  40 |  13105.74 | [ 4263.61, 21947.87]
##  50 |  15739.66 | [ 6777.87, 24701.45]
##  60 |  18373.57 | [ 8915.35, 27831.79]
##  70 |  21007.49 | [10730.51, 31284.46]
##  90 |  26275.32 | [13666.82, 38883.82]
## 
## Adjusted for:
## * race =                 White
## *  edu = Less than High School

We will need to pick variables by name while plotting, so double-check the names of the columns of this dataframe:

ggpredict(mod, terms = "age") %>% names()
## [1] "x"         "predicted" "std.error" "conf.low"  "conf.high" "group"

The names differ from the printed output, so it is good we checked. We can use x as our x variable, predicted as y, and conf.low and conf.high as the lower and upper limits of our confidence interval.

To plot the marginal effect of age, we can use geom_errorbar(). We saw that race was held at “White”, and edu at “Less than High School,” so we should add a caption saying this, as well as that our error bars are 95% confidence intervals. Using \n in a character string will add a line break.

The labs() function allows us to set labels for various elements of the plot, including axes, titles, and captions.

ggpredict(mod, terms = "age") %>%
  ggplot(aes(x = x, y = predicted)) +
  geom_point() +
  geom_line() +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high)) +
  labs(x = "age",
       y = "predicted income",
       caption = "for White individuals with education less than high school\nerror bars represent 95% confidence interval")

We can choose to use error bars with barplots as well, being sure to set stat = "identity".

ggpredict(mod, terms = "edu") %>% 
  ggplot(aes(x = x, y = predicted)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high)) +
  labs(x = "education",
       y = "predicted income",
       caption = "for White individuals with education less than high school\nerror bars represent 95% confidence interval")

Or, we can lose the bars in favor of points, make our error bar limits a little bit narrower with a width argument, and add text with the rounded predicted value with geom_label().

ggpredict(mod, terms = "edu") %>% 
  ggplot(aes(x = x, y = predicted, label = round(predicted))) +
  geom_point() +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = .5) +
  geom_label(hjust = -.2) +
  labs(x = "education",
       y = "predicted income",
       caption = "for White individuals with education less than high school\nerror bars represent 95% confidence interval")