2 Libraries and Data Setup

Data manipulation (calculating means, filtering observations, etc.) is typically handled outside the ggplot() call, so the examples below use dplyr’s data manipulation functions and the pipe operator (%>%) to prepare the data and pass it to ggplot(). For a review, read our chapter on First Steps with Dataframes from the Data Wrangling with R course.
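As a minimal sketch of this prepare-then-plot pattern, the example below uses R's built-in mtcars dataset rather than the course data. The object names mpg_by_cyl and p are illustrative only, and the example assumes dplyr and ggplot2 are already installed.

```r
library(dplyr)
library(ggplot2)

# mtcars ships with R and is used here only for illustration
mpg_by_cyl <- mtcars %>%
  filter(!is.na(mpg)) %>%            # drop rows with missing mpg
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg))    # one row per cylinder count

# the summarized dataframe is piped into ggplot() as its data argument
p <- mpg_by_cyl %>%
  ggplot(aes(x = factor(cyl), y = mean_mpg)) +
  geom_col()

p
```

Note that the data steps are chained with %>%, while the plot layers after ggplot() are added with +.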

Load libraries and the dataset, which includes data from the 2000 American Community Survey.

You will gain the most by running the code yourself. To load libraries with library(), you must first install them with install.packages() and supply the package name in quotes (e.g., install.packages("ggplot2")). The dataset is available for download by clicking here.
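One possible way to install only the packages that are missing is sketched below. The vector names pkgs and missing_pkgs are illustrative, and the interactive() check is an assumption added here so that nothing is downloaded when the script runs unattended.

```r
# packages loaded in this chapter
pkgs <- c("ggplot2", "dplyr", "forcats", "scales", "ggeffects", "ggrepel")

# names of packages from the list that are not yet installed
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))

# install only what is missing, and only in an interactive session
# (installation requires an internet connection)
if (interactive() && length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```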

library(ggplot2)   # for plotting
library(dplyr)     # for dataframe manipulation
library(forcats)   # for handling factors
library(scales)    # for axis scale formatting
library(ggeffects) # for marginal effects plots
library(ggrepel)   # for better-positioned labels

acs <- read.csv("acs.csv", na.strings = "") # missing values are blanks
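To see what na.strings = "" changes, the sketch below reads a small made-up CSV (not the ACS file) with and without the argument. By default, read.csv() treats blank fields as missing only in numeric and logical columns, so a blank in a character column is kept as an empty string.

```r
# hypothetical three-row CSV: row 2 has a blank edu, row 3 a blank income
csv_text <- "id,edu,income
1,Bachelors,100
2,,200
3,High School,
"

default_read <- read.csv(text = csv_text)                   # default na.strings = "NA"
blank_as_na  <- read.csv(text = csv_text, na.strings = "")  # blanks become NA

default_read$edu[2] == ""      # blank character field kept as ""
is.na(blank_as_na$edu[2])      # blank character field read as NA
is.na(default_read$income[3])  # blank numeric field is NA either way
```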

First, inspect the structure of the dataframe with str().

str(acs)
## 'data.frame':    27410 obs. of  9 variables:
##  $ household    : int  37 37 37 241 242 377 418 465 465 484 ...
##  $ person       : int  1 2 3 1 1 1 1 1 2 1 ...
##  $ age          : int  20 19 19 50 29 69 59 55 47 33 ...
##  $ maritalStatus: chr  "Never married" "Never married" "Never married" "Never married" ...
##  $ income       : int  10000 5300 4700 32500 30000 51900 12200 0 2600 16800 ...
##  $ female       : int  1 1 1 1 1 1 1 0 1 0 ...
##  $ hispanic     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ race         : chr  "White" "White" "Black" "White" ...
##  $ edu          : chr  "Some College" "Some College" "Some College" "Advanced Degree" ...
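Because rows with missing values are filtered out later, it can also help to count missing values in each column. The sketch below does this on a small made-up dataframe; toy and na_counts are illustrative names, and the same call should work on acs.

```r
# hypothetical data with a few missing values
toy <- data.frame(
  age    = c(20, 19, 50, 29),
  income = c(10000, NA, 32500, 30000),
  edu    = c("Some College", NA, "Advanced Degree", NA)
)

# is.na() returns a logical matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy))
na_counts
```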

When plotting, character vectors (here, maritalStatus, race, and edu) are automatically ordered alphabetically. For unordered categorical variables, such as state names, this default may be fine. For categorical variables with a natural order, we may wish to specify that order. To do this with the edu (education) variable, first turn it into a factor, and then use fct_relevel() from the forcats package to reorder the levels.

acs$edu <- as.factor(acs$edu)

levels(acs$edu)
## [1] "Advanced Degree"       "Bachelors"             "High School"          
## [4] "Less than High School" "Some College"
acs$edu <- fct_relevel(acs$edu, 
                       "Less than High School", 
                       "High School",
                       "Some College", 
                       "Bachelors", 
                       "Advanced Degree")

levels(acs$edu)
## [1] "Less than High School" "High School"           "Some College"         
## [4] "Bachelors"             "Advanced Degree"
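The same ordering can also be set in base R by passing the levels argument to factor(). The sketch below is an alternative to fct_relevel(), not the approach used in this chapter, and it runs on a made-up character vector; edu_chr, edu_order, and edu_fct are illustrative names.

```r
# hypothetical character vector holding the five education categories
edu_chr <- c("Some College", "Bachelors", "High School",
             "Advanced Degree", "Less than High School")

# without a levels argument, levels are sorted alphabetically
levels(factor(edu_chr))

# supplying levels fixes the order explicitly
edu_order <- c("Less than High School", "High School", "Some College",
               "Bachelors", "Advanced Degree")
edu_fct <- factor(edu_chr, levels = edu_order)

levels(edu_fct)
```

With this approach, any value that does not appear in levels becomes NA, so the spelling of each level must match the data exactly.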

Our dataframe has 27,410 observations. Some plot types work better with fewer points, so let’s draw a random sample of 300 rows from acs and then filter out rows without education or income data. Because the filtering happens after the sampling, acs_sample may end up with fewer than 300 rows.

The set.seed() function allows us to obtain the same results when we generate random numbers with sample(). Any number we give to set.seed() is fine, as long as you and I use the same number. If we supply different numbers, or if we forgo set.seed() altogether, we will (most likely) end up with different rows in our acs_sample dataframes.

set.seed(123)

acs_sample <- 
  acs[sample(1:nrow(acs), 300), ] %>% 
  filter(!is.na(edu), !is.na(income))
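The effect of set.seed() can be checked on a small example. The sketch below draws from the integers 1 to 100 instead of from acs, and the object names are illustrative.

```r
set.seed(123)
first_draw <- sample(1:100, 5)

set.seed(123)                    # reset to the same seed
second_draw <- sample(1:100, 5)

third_draw <- sample(1:100, 5)   # no reset: continues the random number stream

identical(first_draw, second_draw)  # TRUE: same seed, same draw
first_draw
third_draw
```

Resetting the seed immediately before sample() is what makes the draw repeatable. Any random number generation between set.seed() and sample() would change which rows are selected.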