11  First Steps with Dataframes

11.1 Warm-Up

Once you get a dataset, what are the first things you do with it? This could be data you or your lab collected, or a secondary dataset you downloaded.

11.2 Outcomes

Objective: To examine and modify datasets’ variables and values.

Why it matters: Before analyzing a dataset, you must first have an understanding of its structure and values. Then you may need to create additional variables from existing ones, modify values in existing variables, and change variable names for ease of use.

Learning outcomes:

Fundamental skills:

  • Create and save scripts and datasets for reproducibility.

  • Examine a dataset’s metadata and univariate summary statistics.

  • Rename individual variables.

  • Create new variables.

  • Change values to other values.

Extended skills:

  • Rename multiple variables.

Key functions and operators:

|>
nrow()
ncol()
colnames()
summary()
glimpse()
rename()
mutate()
ifelse()

11.3 Working with Dataframes

The examples on this page require an example file, which you can download by clicking on the link below. (If the data file opens up in your browser instead of downloading, right-click on the link and select “Save link as…”)

Click here to download

This dataset is a subset of the American Community Survey (ACS) from 2000. Much of your data wrangling and statistical work will use dataframes. For a review of this data structure, read the chapter on structure and class.

11.4 The Pipe |>

The base R pipe |> is an operator that allows us to write our code left-to-right, top-to-bottom, instead of inside-to-outside. It does this by inserting the expression on its left as the first argument of the function on its right.

You may be more familiar with the tidy pipe %>%.

The base R pipe |> was introduced in R 4.1.0. The advantages of |> are that you do not need to load any tidyverse packages to use it, and that it runs faster than the tidy pipe.

The advantage of %>% is that it allows a piped expression to be placed in any argument, not just the first argument, with the . placeholder:

mtcars %>%
  lm(mpg ~ wt, data = .)
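
As an aside not made in the original comparison: since R 4.2.0, the base pipe also has a placeholder, _, although it can only be used in a named argument:

mtcars |>
  lm(mpg ~ wt, data = _)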

Instead of writing:

sqrt(4)
[1] 2

We can write:

4 |> sqrt()
[1] 2

With more than one function, instead of writing:

sd(rnorm(10))
[1] 1.154927

We can write:

10 |> rnorm() |> sd()
[1] 0.9091363

If we have more than one argument in our functions, the pipe proves its utility in improving readability. Instead of writing:

fct_recode(as.character(sample(1:3, 5, replace = T)), "One" = "1", "Two" = "2", "Three" = "3")
[1] Two   Three One   Three Three
Levels: One Two Three

(Which function does each argument belong to?)

We can write:

1:3 |> 
  sample(5,
         replace = T) |> 
  as.character() |> 
  fct_recode("One" = "1", 
             "Two" = "2",
             "Three" = "3")
[1] Two   Three One   Three One  
Levels: One Two Three

Inserting line breaks after each pipe and comma organizes our code so that we can quickly see:

  • The initial data object: 1:3
  • Each function: sample(), as.character(), and fct_recode()
  • Which arguments belong to each function
Tip

If you change parts of your code or have extra or missing parentheses, commas, etc., reindent your code.

Highlight the code and click Code > Reindent Lines. This makes code easier to read, and helps us identify typos if the code does not seem to be indented correctly. See how RStudio indents lines when the closing parenthesis ()) is missing from as.character(:

1:3 |> 
  sample(5,
         replace = T) |> 
  as.character( |> 
                  fct_recode("One" = "1", 
                             "Two" = "2"
                             "Three" = "3")

We will use this piped approach to writing code as we work with dataframes. We can rewrite some vector operations with pipes, as we see above, but not always. With dataframes, all of our tidyverse functions take a dataframe in the first argument, and they return a dataframe as a result. That means we can string together a series of data wrangling operations into one long pipe, renaming variables, creating and modifying variables, subsetting, merging, aggregating, reshaping, and more!
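
As a preview, here is a sketch of that pattern (dat, old_name, new_name, x, and z are hypothetical names; the functions themselves are introduced in the sections below):

dat |> 
  rename(new_name = old_name) |> 
  mutate(z = x * 5) |> 
  glimpse()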

11.5 Start Your Script

Now, create a new script with File > New File > R Script. Save this script with a sensible name, such as “01_cleaning.R”. We can imagine a series of scripts we might run after this one, such as “02_descriptive_statistics.R”, “03_regression.R”, “04_plots.R”, and so on.

Scripts help ensure reproducibility. Reproducibility means that, if I have your script and original data file, I should be able to run your code and get the same result. To do this, we will practice these four principles:

  1. Scripts contain all code needed to run. If a script needs a package, load it in the script. If a script needs a data object, create it in the script.
  2. Code can be run top-to-bottom. Earlier code should not depend on later code.
  3. Code runs without error. If something does not work, fix it, delete it, or comment it out.
  4. Code uses relative paths whenever possible. This allows you to move your project within and between computers. See section 10.4 to learn about relative paths.

Reproducibility also includes issues we will not discuss here of version management, publishing code and data, and transparency in writeups.

The first few lines of a script should load packages and read in our data. All the packages we will need for this chapter are found in the tidyverse. Load that and then import the ACS dataset.

library(tidyverse)

acs <- read.csv("2000_acs_sample.csv")
Loading Packages

Load all of your packages at the very beginning of a script. This has two advantages:

  1. Anybody else who uses your script can immediately see which packages are needed. “Anybody else” includes future-you, who may need to install new packages after updating R or getting a new computer.

  2. This prevents issues where earlier code requires packages that are loaded later in the script. This breaks the rule that scripts should be able to be run top-to-bottom.

11.6 Browse the Data

When you start working with a data set, especially if it was created by somebody else (that includes past-you!), resist the temptation to start running models immediately. First, take time to understand the data. What information does it contain? What is the structure of the data set? What is the data type of each column? Is there anything strange in the data set? It’s better to find out now, and not when you’re in the middle of modeling!

A dataframe consists of rows called observations and columns called variables. The data recorded for an individual observation are stored as values in the corresponding variables.

              Variable   Variable   Variable
Observation   Value      Value      Value
Observation   Value      Value      Value
Observation   Value      Value      Value

If you have a dataset already in this format, you are in luck. However, we might run into datasets that need a little work, or a lot of work, before we can use them. A single row might have multiple observations, or a single variable might be spread across multiple columns. Organizing, or tidying, datasets is the focus of the remainder of this book.

Now that we have the acs dataset loaded, we need to take a look at the number of rows and columns, type of each column, values, and summary statistics. Do all of this with the glimpse() and summary() functions:

glimpse(acs)
Rows: 28,172
Columns: 16
$ year             <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000,…
$ datanum          <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
$ serial           <int> 37, 37, 37, 241, 242, 296, 377, 418, 465, 465, 484, 4…
$ hhwt             <int> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100…
$ gq               <chr> "Households under 1970 definition", "Households under…
$ us2000c_serialno <int> 365663, 365663, 365663, 2894822, 2896802, 3608029, 47…
$ pernum           <int> 1, 2, 3, 1, 1, 1, 1, 1, 1, 2, 1, 2, 3, 4, 1, 2, 3, 4,…
$ perwt            <int> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100…
$ us2000c_pnum     <int> 1, 2, 3, 1, 1, 1, 1, 1, 1, 2, 1, 2, 3, 4, 1, 2, 3, 4,…
$ us2000c_sex      <int> 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 1,…
$ us2000c_age      <int> 20, 19, 19, 50, 29, 20, 69, 59, 55, 47, 33, 26, 4, 2,…
$ us2000c_hispan   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ us2000c_race1    <int> 1, 1, 2, 1, 1, 6, 1, 1, 2, 2, 6, 6, 6, 6, 6, 6, 6, 6,…
$ us2000c_marstat  <int> 5, 5, 5, 5, 5, 5, 5, 2, 4, 5, 1, 1, 5, 5, 1, 1, 5, 5,…
$ us2000c_educ     <int> 11, 11, 11, 14, 13, 9, 1, 8, 12, 1, 9, 7, 1, 0, 4, 8,…
$ us2000c_inctot   <chr> "10000", "5300", "4700", "32500", "30000", "3000", "5…
summary(acs)
      year         datanum      serial             hhwt          gq           
 Min.   :2000   Min.   :4   Min.   :     37   Min.   :100   Length:28172      
 1st Qu.:2000   1st Qu.:4   1st Qu.: 323671   1st Qu.:100   Class :character  
 Median :2000   Median :4   Median : 617477   Median :100   Mode  :character  
 Mean   :2000   Mean   :4   Mean   : 624234   Mean   :100                     
 3rd Qu.:2000   3rd Qu.:4   3rd Qu.: 937528   3rd Qu.:100                     
 Max.   :2000   Max.   :4   Max.   :1236779   Max.   :100                     
 us2000c_serialno      pernum           perwt      us2000c_pnum   
 Min.   :     92   Min.   : 1.000   Min.   :100   Min.   : 1.000  
 1st Qu.:2395745   1st Qu.: 1.000   1st Qu.:100   1st Qu.: 1.000  
 Median :4905730   Median : 2.000   Median :100   Median : 2.000  
 Mean   :4951676   Mean   : 2.208   Mean   :100   Mean   : 2.208  
 3rd Qu.:7444248   3rd Qu.: 3.000   3rd Qu.:100   3rd Qu.: 3.000  
 Max.   :9999402   Max.   :16.000   Max.   :100   Max.   :16.000  
  us2000c_sex     us2000c_age     us2000c_hispan  us2000c_race1  
 Min.   :1.000   Min.   :  0.00   Min.   : 1.00   Min.   :1.000  
 1st Qu.:1.000   1st Qu.: 17.00   1st Qu.: 1.00   1st Qu.:1.000  
 Median :2.000   Median : 35.00   Median : 1.00   Median :1.000  
 Mean   :1.512   Mean   : 35.92   Mean   : 1.77   Mean   :1.935  
 3rd Qu.:2.000   3rd Qu.: 51.00   3rd Qu.: 1.00   3rd Qu.:1.000  
 Max.   :2.000   Max.   :933.00   Max.   :24.00   Max.   :9.000  
 us2000c_marstat  us2000c_educ    us2000c_inctot    
 Min.   :1.000   Min.   : 0.000   Length:28172      
 1st Qu.:1.000   1st Qu.: 4.000   Class :character  
 Median :3.000   Median : 9.000   Mode  :character  
 Mean   :2.973   Mean   : 7.871                     
 3rd Qu.:5.000   3rd Qu.:11.000                     
 Max.   :5.000   Max.   :16.000                     

We should also spend a few minutes just scrolling through the dataset. Open it in the viewer by clicking on acs in the environment, or run this line of code in the console:

View(acs)

We can learn a few things about the data:

  • gq and us2000c_inctot are character, while all others are numeric
  • year, hhwt, and perwt seem to always be the same value (note that the summary statistics are all a single number)
  • us2000c_sex, us2000c_race1, and several other variables are numeric, but their names suggest categorical variables
  • us2000c_age has a maximum value of 933, which sounds impossibly high if age is in years
  • us2000c_inctot appears to be numeric but is coded as a character
    • Browsing the data reveals that this variable has values of “BBBBBBB”, the Census code for missing. Combining numeric and character data resulted in the numeric data being implicitly coerced to character.

To check that year and the other variables are always the same value, we can check whether the column has only a single unique value. In other words, is the length of a vector’s unique values equal to one? Use the pull() function to extract a column from a dataframe, unique() to get a vector of unique values, and length() to count the number of elements in that vector:

acs |> 
  pull(year) |> 
  unique() |> 
  length()
[1] 1

That “1” means year only has a single value in it; it does not vary. We may choose to drop these columns later on with skills we will acquire in the chapter on subsetting, or we may choose to retain year if we plan to combine multiple years of data. Without a column identifying the year of the data, we would not be able to differentiate data from different years.
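
To scan every column at once rather than one at a time, one option (a sketch, not part of the original workflow) is dplyr's n_distinct(), a shorthand for the unique-then-length pattern above, applied across all columns:

acs |> 
  summarise(across(everything(), n_distinct))

Columns where the result is 1, such as year, do not vary.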

11.7 Rename Variables

Renaming variables in dataframes makes them easier to use. Variables may have names that are not readily understandable, too short to have meaning, too long to type repeatedly, or in an undesirable case (if your variables are all UPPERCASE, then your code and error messages will yell at you).

One of the first things to do with a dataset is to change its variable names, which can be done individually or across multiple variables with a function.

11.7.1 Individual Variables

The rename() function allows for easy renaming of individual columns. The pattern is new_name = old_name. Change pernum to person and serial to household.

The rename() function’s first argument is the name of our dataset, so pipe the dataset into this function.

acs <- 
  acs |> 
  rename(person = pernum, 
         household = serial)

Confirm that the names changed:

colnames(acs)
 [1] "year"             "datanum"          "household"        "hhwt"            
 [5] "gq"               "us2000c_serialno" "person"           "perwt"           
 [9] "us2000c_pnum"     "us2000c_sex"      "us2000c_age"      "us2000c_hispan"  
[13] "us2000c_race1"    "us2000c_marstat"  "us2000c_educ"     "us2000c_inctot"  

11.7.2 Multiple Variables

Several columns have the prefix us2000c_, which is redundant since the data is all from the US and from the year 2000. Instead of renaming them one-by-one, we can rename several columns at once with rename_with() and a function. Here, we want to remove a prefix, so use the str_remove() function.

We can use the anonymous function \(x) str_remove(x, "us2000c_").

The \(x) defines a function that takes an argument x (we could have named it anything). rename_with() passes each column name to our function as x, and str_remove() then removes "us2000c_" from it. Learn more about writing functions here.

acs <-
  acs |> 
  rename_with(\(x) str_remove(x, "us2000c_"))

Confirm the prefix was removed:

colnames(acs)
 [1] "year"      "datanum"   "household" "hhwt"      "gq"        "serialno" 
 [7] "person"    "perwt"     "pnum"      "sex"       "age"       "hispan"   
[13] "race1"     "marstat"   "educ"      "inctot"   

If we wanted to make all variables lowercase, we would run:

acs <-
  acs |> 
  rename_with(\(x) str_to_lower(x))
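
Since rename_with() expects a function as its second argument, we could equivalently pass the function name itself, without the anonymous-function wrapper:

acs |> 
  rename_with(str_to_lower)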

11.8 Create Variables

Most of this book up to this point has used vectors. Now, all of those vector operations can be used within the context of dataframes. All that hard work is about to pay off!

Before, if we wanted to create some vector, x, that was the vector y times 5, we would have written:

x <- y * 5

Now, if x and y were two columns in some dataframe, dat, we would use the mutate() function.

mutate() is used for variable creation and replacement. If a new variable name is supplied, a new variable is created. If an existing variable name is supplied, that variable is replaced (without any warning).

The first argument in mutate() is the dataframe, so we would pipe dat into the function:

dat |> 
  mutate(x = y * 5)

Compare the two lines of code:

x <- y * 5
dat |> mutate(x = y * 5)

The only differences are that we use an equals sign (=) instead of the assignment operator (<-), and that we pipe the dataframe into mutate() so it knows where to look for those variables (vectors!).

That means all those skills working with numeric, logical, character, date, and categorical vectors can be easily transferred into the context of dataframes.

Recycling Revisited

mutate() can take either a single value, which gets recycled to fill every row, or a vector whose length equals the number of rows in the dataframe. We do not need to worry about shorter vectors being silently recycled, much less unevenly recycled, which is usually undesirable. mutate() will be happy to return an error:

acs |> 
  mutate(x = 1:2)
Error in `mutate()`:
ℹ In argument: `x = 1:2`.
Caused by error:
! `x` must be size 28172 or 1, not 2.

Now, we can create variables of several types.

11.8.1 Numeric

One way we can create a numeric variable is by multiplying a single existing column by a constant. Multiplying age (in years) by 12 results in age in months. The variable age_months does not currently exist in acs, so a new variable is created.

acs <-
  acs |> 
  mutate(age_months = age * 12)

Variables can also be created from multiple existing columns, through addition, multiplication, logarithms, averages, minimums, or any combination of functions and operators.
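
For example, a quick sketch (the new variable names here are invented purely for illustration):

acs |> 
  mutate(age_decades = age / 10,          # rescale an existing column
         weight_product = hhwt * perwt)   # combine two existing columns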

11.8.2 Logical

Create an indicator variable called teenager that indicates whether an individual’s age is in the range 13-19:

acs <-
  acs |> 
  mutate(teenager = age >= 13 & age <= 19)
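
An equivalent way to write this range check (an aside, not the chapter's code) is dplyr's between(), which is inclusive on both ends:

acs |> 
  mutate(teenager = between(age, 13, 19))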

11.8.3 Character

In our dataset, the identifier is currently spread out across two variables: household and person. We can put these two together with paste() so that we have a single variable that uniquely identifies observations.

acs <-
  acs |> 
  mutate(id = paste(household, person, sep = "_"))
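
If we want to confirm that id really does uniquely identify observations (a check not in the original), we can compare the number of distinct values of id to the number of rows; the result is TRUE only if every row has its own id:

(acs |> pull(id) |> n_distinct()) == nrow(acs)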

11.8.4 Date

We do not have any date information more specific than the year, but if we did, we could create a meaningful date variable, such as the date of survey response. For practice, create a date variable that just pastes together year with January 1 for everybody:

acs <-
  acs |> 
  mutate(date = paste0(year, "0101") |> ymd())
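
We can verify that the new column is stored as a Date rather than a character string:

acs |> pull(date) |> class()
[1] "Date"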

11.8.5 Categorical

Categorical variables can be of type numeric or factor. Numeric coding works when we have only two categories, coded as 0 and 1; these are also called dummy or indicator variables. (A variable with only two categories is conceptually categorical, even though its type may be numeric or logical.) Factors work with any number of categories.

We can create a new indicator variable called female that contains 0 for male and 1 for female. The sex column is 1s and 2s. According to the codebook, these correspond to male and female, respectively. If sex == 2, we can assign the value 1, and if not, 0.

acs <-
  acs |> 
  mutate(female = ifelse(sex == 2, 1, 0))
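
A quick cross-tabulation (not in the original) confirms the recoding: every 1 in sex should be paired with female == 0, and every 2 with female == 1:

acs |> count(sex, female)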

Changing categorical variables that start at 1 to start at 0 is a good idea because it makes the intercept in a regression model meaningful. If sex were a predictor, the intercept would be the expected value when sex == 0, which is not a possible value.

Now, recode the gq column. First, look at the values this variable can take:

acs |> pull(gq) |> unique()
[1] "Households under 1970 definition"           
[2] "Other group quarters"                       
[3] "Group quarters--Institutions"               
[4] "Additional households under 1990 definition"

Perhaps we only want two categories: Households and Group Quarters. We can use fct_collapse() to reduce the number of categories:

acs <-
  acs |> 
  mutate(gq_recode = fct_collapse(gq,
                                  "Households" = c("Households under 1970 definition", 
                                                   "Additional households under 1990 definition"),
                                  "Group Quarters" = c("Group quarters--Institutions", 
                                                       "Other group quarters")))

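To confirm that each original category was mapped where we intended (a check not in the original), we can again cross-tabulate the old and new variables:

acs |> count(gq, gq_recode)
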
Sometimes we may have a numeric variable that can take on many values, and we want to collapse and recode the variable. A multiple-level categorical variable can be created with case_when(). Each argument within case_when() follows the pattern condition ~ value_if_TRUE. Any values that return FALSE for every condition are assigned a value of NA:

x <- 1:5

case_when(x < 3 ~ "Less than 3",
          x == 3 ~ "Equal to 3",
          x > 3 ~ "Greater than 3")
[1] "Less than 3"    "Less than 3"    "Equal to 3"     "Greater than 3"
[5] "Greater than 3"

Using the definitions from the 2000 ACS codebook (available here), we can recode educ into a categorical variable with levels less than high school (education codes 1-8), high school (9), some college (10-12), bachelors (13), and advanced degree (14-16). A value of 0 for education means not applicable and is only used for individuals less than three years old. If we do not include 0 in any of our case_when() statements, it will be assigned a value of NA.

We can then use fct_relevel() within the same mutate() call to specify an order for our categorical variable since the default is alphabetical order.

acs <-
  acs |> 
  mutate(educ_categories = case_when(educ >= 1 & educ <= 8 ~ "Less than High School", 
                                     educ == 9 ~ "High School", 
                                     educ >=  10 & educ <= 12 ~ "Some College", 
                                     educ == 13 ~ "Bachelors",
                                     educ >= 14 ~ "Advanced Degree"),
         educ_categories = fct_relevel(educ_categories, 
                                       "Less than High School", 
                                       "High School", 
                                       "Some College", 
                                       "Bachelors",
                                       "Advanced Degree"))

We can confirm the coding of educ was successful by examining the output of table():

table(acs$educ, acs$educ_categories, useNA = "ifany")
    
     Less than High School High School Some College Bachelors Advanced Degree
  0                      0           0            0         0               0
  1                   1317           0            0         0               0
  2                   2508           0            0         0               0
  3                   1304           0            0         0               0
  4                   1648           0            0         0               0
  5                    923           0            0         0               0
  6                   1059           0            0         0               0
  7                    906           0            0         0               0
  8                    889           0            0         0               0
  9                      0        5959            0         0               0
  10                     0           0         1578         0               0
  11                     0           0         3191         0               0
  12                     0           0         1221         0               0
  13                     0           0            0      2960               0
  14                     0           0            0         0            1068
  15                     0           0            0         0             347
  16                     0           0            0         0             168
    
     <NA>
  0  1126
  1     0
  2     0
  3     0
  4     0
  5     0
  6     0
  7     0
  8     0
  9     0
  10    0
  11    0
  12    0
  13    0
  14    0
  15    0
  16    0

We can see the original values on the left and the values they now correspond to at the top.

We can imagine using this educ_categories variable as a predictor in a statistical model or for creating bar graphs of income by educational attainment.

11.9 Change Values

In addition to modifying whole variables, mutate() can change some values within a variable.

11.9.1 To Missing

In our exploration of the data, recall that inctot is a character vector, but the first few values shown by glimpse() appear to be numbers.

Open the data set with View(acs) to see why it is a character vector. Some of the values are BBBBBBB, the Census code for missing data. We can recode the B’s as missing values with the ifelse() function while leaving the other values as they are.

acs <-
  acs |> 
  mutate(inctot = ifelse(inctot == "BBBBBBB", NA, inctot))
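
An equivalent alternative (an aside, not the chapter's approach) is dplyr's na_if(), which replaces every occurrence of a specified value with NA:

acs |> 
  mutate(inctot = na_if(inctot, "BBBBBBB"))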

At this step, the column is still a character vector, so we need to convert it into a numeric vector.

acs |> pull(inctot) |> typeof()
[1] "character"
acs <-
  acs |> 
  mutate(inctot = as.numeric(inctot))

acs |> pull(inctot) |> typeof()
[1] "double"

Another approach we could have taken, if we knew our missing code in advance, would have been to specify it when importing the data. If you take this approach, you will need to re-run the renaming code above in order to follow along with the remainder of this chapter. This approach also assumes that “BBBBBBB” means missing in every column, not just inctot.

acs <- read.csv("2000_acs_sample.csv", na.strings = "BBBBBBB")

11.9.1.1 Quantify Missing Data

We should now check how much data is missing from the dataframe.

We can calculate how much data is missing from acs as a whole. To do so, first use is.na() to test whether each value is missing. This turns the entire dataframe into a matrix of TRUE and FALSE values, where TRUE means the value is missing. Then, take the sum or the mean. In doing so, TRUE and FALSE will be coerced to 1 and 0, respectively.

acs |> is.na() |> sum()
[1] 7283
acs |> is.na() |> mean()
[1] 0.01123996

A total of 7283 values are missing, equal to 1.1% of the dataset.

To calculate missingness by column, run the same code as above but with colSums() or colMeans() instead of sum() or mean():

acs |> is.na() |> colSums()
           year         datanum       household            hhwt              gq 
              0               0               0               0               0 
       serialno          person           perwt            pnum             sex 
              0               0               0               0               0 
            age          hispan           race1         marstat            educ 
              0               0               0               0               0 
         inctot      age_months        teenager              id            date 
           6157               0               0               0               0 
         female       gq_recode educ_categories 
              0               0            1126 
acs |> is.na() |> colMeans()
           year         datanum       household            hhwt              gq 
     0.00000000      0.00000000      0.00000000      0.00000000      0.00000000 
       serialno          person           perwt            pnum             sex 
     0.00000000      0.00000000      0.00000000      0.00000000      0.00000000 
            age          hispan           race1         marstat            educ 
     0.00000000      0.00000000      0.00000000      0.00000000      0.00000000 
         inctot      age_months        teenager              id            date 
     0.21855033      0.00000000      0.00000000      0.00000000      0.00000000 
         female       gq_recode educ_categories 
     0.00000000      0.00000000      0.03996876 

We now see that all the missing data is in the columns inctot and educ_categories.

If you work with missing data, read about Blimp, an easy-to-use software for estimating models with missing data.

11.9.2 To Other Values

Earlier we saw that age had a maximum value of 933. If we assume this variable is in years, this value seems way too high. Use count() to tabulate the values of age, then view the counts for its highest values with slice_max():

acs |> count(age) |> slice_max(age, n = 3)
  age   n
1 933   1
2  93 149
3  92   1
Note

The original ACS dataset did not have this value of 933 for age. The value was intentionally edited from 93 to 933 for this example.

Only one observation has a value of 933, and the next highest value is 93. We could take at least three approaches to deal with this number:

  1. Change it to 93 because we think it is a typo. (Or maybe 33? Or 9?)
  2. Change the value to missing because we are not sure what value it should be.
  3. Drop this observation because we are unsure about its quality.

If we know the 933 should be 93, we can change it with ifelse():

acs <- 
  acs |> 
  mutate(age = ifelse(age == 933, 93, age))

Verify that this variable no longer has a 933, and that there is one more 93 than before:

acs |> count(age) |> slice_max(age, n = 3)
  age   n
1  93 150
2  92   1
3  89  39

933 no longer appears in the dataset, and the count of 93 increased by 1. Success.

11.10 Save Your Dataframe and Script

Now that we have cleaned up the ACS data set, it is a good idea to end the script by saving the cleaned data set. Save it as an RDS to preserve the formatting of dates and factors.

saveRDS(acs, "acs_cleaned.rds")

By saving the resulting data set, you can now begin the next script (“02_…”) with acs <- readRDS("acs_cleaned.rds"). This first script is your record of how you changed the raw data: a reminder for future-you, and an answer for colleagues and journal reviewers who have questions about your work.

11.11 Exercises

11.11.1 Fundamental

  1. Start a script that loads the tidyverse. Use the built-in penguins_raw dataset.

If we are working with external data files and make a mistake like overwriting the data, all we need to do is re-import the data.

With built-in datasets, our process looks a little different. Built-in datasets live in the datasets package or in other packages. When we reference one of those objects, like penguins_raw, R will first look in our environment and then in the packages we have loaded.

That means, if we have something else called penguins_raw in our environment, R will find that one instead of the copy in the datasets package:

penguins_raw <- 1
str(penguins_raw)
 num 1

The solution is simple. Just remove the copy in the environment with rm(), so that R will find the copy in the datasets package:

rm(penguins_raw)
str(penguins_raw)
'data.frame':   344 obs. of  17 variables:
 $ studyName          : chr  "PAL0708" "PAL0708" "PAL0708" "PAL0708" ...
 $ Sample Number      : num  1 2 3 4 5 6 7 8 9 10 ...
 $ Species            : chr  "Adelie Penguin (Pygoscelis adeliae)" "Adelie Penguin (Pygoscelis adeliae)" "Adelie Penguin (Pygoscelis adeliae)" "Adelie Penguin (Pygoscelis adeliae)" ...
 $ Region             : chr  "Anvers" "Anvers" "Anvers" "Anvers" ...
 $ Island             : chr  "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
 $ Stage              : chr  "Adult, 1 Egg Stage" "Adult, 1 Egg Stage" "Adult, 1 Egg Stage" "Adult, 1 Egg Stage" ...
 $ Individual ID      : chr  "N1A1" "N1A2" "N2A1" "N2A2" ...
 $ Clutch Completion  : chr  "Yes" "Yes" "Yes" "Yes" ...
 $ Date Egg           : Date, format: "2007-11-11" "2007-11-11" ...
 $ Culmen Length (mm) : num  39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ Culmen Depth (mm)  : num  18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ Flipper Length (mm): num  181 186 195 NA 193 190 181 195 193 190 ...
 $ Body Mass (g)      : num  3750 3800 3250 NA 3450 ...
 $ Sex                : chr  "MALE" "FEMALE" "FEMALE" NA ...
 $ Delta 15 N (o/oo)  : num  NA 8.95 8.37 NA 8.77 ...
 $ Delta 13 C (o/oo)  : num  NA -24.7 -25.3 NA -25.3 ...
 $ Comments           : chr  "Not enough blood for isotopes." NA NA "Adult not sampled." ...

If you make a mistake with penguins_raw in the exercises below, just run rm(penguins_raw)!

  2. Read the documentation at ?penguins_raw.

  3. Examine the data. What type is each column? How are the data distributed (use summary())? Is any data missing?

  4. Rename Flipper Length (mm) to flipper_len_mm. Note: to reference a non-syntactic column name, use back ticks ` around the column name: `Flipper Length (mm)`

  5. Create a new variable called flipper_len_in that is flipper_len_mm divided by 25.4 to rescale it to inches.

  6. Create a new binary variable called not_enough_blood that indicates whether the Comments column contains the text “Not enough blood for isotopes.” Hint: use str_detect().

  7. Make a factor from the Island column where “Dream” is the reference (first) level. Replace the existing Island column.

  8. Recode Sex to have “M” for “MALE” and “F” for “FEMALE.”

  9. Save the dataset as an RDS file, and save your script.

11.11.2 Extended

  1. Use rename_with() to make all of the column names in penguins_raw syntactic by removing spaces and punctuation.