11 First Steps with Dataframes
11.1 Warm-Up
Once you get a dataset, what are the first things you do with it? This could be data you or your lab collected, or a secondary dataset you downloaded.
11.2 Outcomes
Objective: To examine and modify datasets’ variables and values.
Why it matters: Before analyzing a dataset, you must first have an understanding of its structure and values. Then you may need to create additional variables from existing ones, modify values in existing variables, and change variable names for ease of use.
Learning outcomes:
Fundamental Skills | Extended Skills
---|---
Key functions and operators:
|>
nrow()
ncol()
colnames()
summary()
glimpse()
rename()
mutate()
ifelse()
11.3 Working with Dataframes
The examples on this page require an example file, which you can download by clicking on the link below. (If the data file opens up in your browser instead of downloading, right-click on the link and select “Save link as…”)
This dataset is a subset of the American Community Survey (ACS) from 2000. Much of your data wrangling and statistical work will use dataframes. For a review of this data structure, read the chapter on structure and class.
11.4 The Pipe |>
The base R pipe |>
is an operator that allows us to write our code left-to-right, top-to-bottom, instead of inside-to-outside. It does this by inserting an expression as the first argument of the following function.
You may be more familiar with the tidy pipe %>%.
The base R pipe |>
was introduced in R 4.1.0. The advantages of |>
are that you do not need to load any tidyverse
packages to use it, and that it runs faster than the tidy pipe.
The advantage of %>% is that it allows a piped expression to be placed in any argument, not just the first argument, with the . placeholder:
mtcars %>% lm(mpg ~ wt, data = .)
Instead of writing:
sqrt(4)
[1] 2
We can write:
4 |> sqrt()
[1] 2
With more than one function, instead of writing:
sd(rnorm(10))
[1] 1.154927
We can write:
10 |> rnorm() |> sd()
[1] 0.9091363
If we have more than one argument in our functions, the pipe proves its utility in improving readability. Instead of writing:
fct_recode(as.character(sample(1:3, 5, replace = T)), "One" = "1", "Two" = "2", "Three" = "3")
[1] Two Three One Three Three
Levels: One Two Three
(Which function does each argument belong to?)
We can write:
1:3 |>
sample(5,
replace = T) |>
as.character() |>
fct_recode("One" = "1",
"Two" = "2",
"Three" = "3")
[1] Two Three One Three One
Levels: One Two Three
Inserting line breaks after each pipe and comma organizes our code so that we can quickly see:
- The initial data object: 1:3
- Each function: sample(), as.character(), and fct_recode()
- Which arguments belong to each function
If you change parts of your code or have extra or missing parentheses, commas, etc., reindent your code.
Highlight the code and click Code > Reindent Lines. This makes code easier to read, and helps us identify typos if the code does not seem to be indented correctly. See how RStudio indents lines when the closing parenthesis ) is missing from as.character(:
1:3 |>
sample(5,
replace = T) |>
as.character( |>
fct_recode("One" = "1",
"Two" = "2"
"Three" = "3")
We will use this piped approach to writing code as we work with dataframes. We can rewrite some vector operations with pipes, as we see above, but not always. With dataframes, all of our tidyverse
functions take a dataframe in the first argument, and they return a dataframe as a result. That means we can string together a series of data wrangling operations into one long pipe, renaming variables, creating and modifying variables, subsetting, merging, aggregating, reshaping, and more!
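As a preview, here is a minimal sketch of what such a chain might look like. The dataframe dat and its columns old_name, x, and y are hypothetical, invented only for illustration; the rest of this chapter works through these functions with real data.
dat |>
  rename(new_name = old_name) |>   # rename a column
  mutate(z = x * y)                # create a new column from existing ones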
11.5 Start Your Script
Now, create a new script with File > New File > R Script. Save this script with a sensible name, such as “01_cleaning.R”. We can imagine a series of scripts we might run after this one, such as “02_descriptive_statistics.R”, “03_regression.R”, “04_plots.R”, and so on.
Scripts help ensure reproducibility. Reproducibility means that, if I have your script and original data file, I should be able to run your code and get the same result. To do this, we will practice these four principles:
- Scripts contain all code needed to run. If a script needs a package, load it in the script. If a script needs a data object, create it in the script.
- Code can be run top-to-bottom. Earlier code should not depend on later code.
- Code runs without error. If something does not work, fix it, delete it, or comment it out.
- Code uses relative paths whenever possible. This allows you to move your project within and between computers. See section 10.4 to learn about relative paths.
Reproducibility also includes issues we will not discuss here of version management, publishing code and data, and transparency in writeups.
The first few lines of a script should load packages and read in our data. All the packages we will need for this chapter are found in the tidyverse
. Load that and then import the ACS dataset.
library(tidyverse)
acs <- read.csv("2000_acs_sample.csv")
Load all of your packages at the very beginning of a script. This has two advantages:
Anybody else who uses your script can immediately see which packages are needed. “Anybody else” includes future-you, who may need to install new packages after updating R or getting a new computer.
This prevents issues where earlier code requires packages that are loaded later in the script. This breaks the rule that scripts should be able to be run top-to-bottom.
11.6 Browse the Data
When you start working with a data set, especially if it was created by somebody else (that includes past-you!), resist the temptation to start running models immediately. First, take time to understand the data. What information does it contain? What is the structure of the data set? What is the data type of each column? Is there anything strange in the data set? It’s better to find out now, and not when you’re in the middle of modeling!
A dataframe consists of rows called observations and columns called variables. The data recorded for an individual observation are stored as values in the corresponding variables.
 | Variable | Variable | Variable | …
---|---|---|---|---
Observation | Value | Value | Value | …
Observation | Value | Value | Value | …
Observation | Value | Value | Value | …
… | | | |
If you have a dataset already in this format, you are in luck. However, we might run into datasets that need a little work, or a lot of work, before we can use them. A single row might have multiple observations, or a single variable might be spread across multiple columns. Organizing, or tidying, datasets is the focus of the remainder of this book.
Now that we have the acs
dataset loaded, we need to take a look at the number of rows and columns, type of each column, values, and summary statistics. Do all of this with the glimpse()
and summary()
functions:
glimpse(acs)
Rows: 28,172
Columns: 16
$ year <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000,…
$ datanum <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
$ serial <int> 37, 37, 37, 241, 242, 296, 377, 418, 465, 465, 484, 4…
$ hhwt <int> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100…
$ gq <chr> "Households under 1970 definition", "Households under…
$ us2000c_serialno <int> 365663, 365663, 365663, 2894822, 2896802, 3608029, 47…
$ pernum <int> 1, 2, 3, 1, 1, 1, 1, 1, 1, 2, 1, 2, 3, 4, 1, 2, 3, 4,…
$ perwt <int> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100…
$ us2000c_pnum <int> 1, 2, 3, 1, 1, 1, 1, 1, 1, 2, 1, 2, 3, 4, 1, 2, 3, 4,…
$ us2000c_sex <int> 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 1,…
$ us2000c_age <int> 20, 19, 19, 50, 29, 20, 69, 59, 55, 47, 33, 26, 4, 2,…
$ us2000c_hispan <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ us2000c_race1 <int> 1, 1, 2, 1, 1, 6, 1, 1, 2, 2, 6, 6, 6, 6, 6, 6, 6, 6,…
$ us2000c_marstat <int> 5, 5, 5, 5, 5, 5, 5, 2, 4, 5, 1, 1, 5, 5, 1, 1, 5, 5,…
$ us2000c_educ <int> 11, 11, 11, 14, 13, 9, 1, 8, 12, 1, 9, 7, 1, 0, 4, 8,…
$ us2000c_inctot <chr> "10000", "5300", "4700", "32500", "30000", "3000", "5…
summary(acs)
year datanum serial hhwt gq
Min. :2000 Min. :4 Min. : 37 Min. :100 Length:28172
1st Qu.:2000 1st Qu.:4 1st Qu.: 323671 1st Qu.:100 Class :character
Median :2000 Median :4 Median : 617477 Median :100 Mode :character
Mean :2000 Mean :4 Mean : 624234 Mean :100
3rd Qu.:2000 3rd Qu.:4 3rd Qu.: 937528 3rd Qu.:100
Max. :2000 Max. :4 Max. :1236779 Max. :100
us2000c_serialno pernum perwt us2000c_pnum
Min. : 92 Min. : 1.000 Min. :100 Min. : 1.000
1st Qu.:2395745 1st Qu.: 1.000 1st Qu.:100 1st Qu.: 1.000
Median :4905730 Median : 2.000 Median :100 Median : 2.000
Mean :4951676 Mean : 2.208 Mean :100 Mean : 2.208
3rd Qu.:7444248 3rd Qu.: 3.000 3rd Qu.:100 3rd Qu.: 3.000
Max. :9999402 Max. :16.000 Max. :100 Max. :16.000
us2000c_sex us2000c_age us2000c_hispan us2000c_race1
Min. :1.000 Min. : 0.00 Min. : 1.00 Min. :1.000
1st Qu.:1.000 1st Qu.: 17.00 1st Qu.: 1.00 1st Qu.:1.000
Median :2.000 Median : 35.00 Median : 1.00 Median :1.000
Mean :1.512 Mean : 35.92 Mean : 1.77 Mean :1.935
3rd Qu.:2.000 3rd Qu.: 51.00 3rd Qu.: 1.00 3rd Qu.:1.000
Max. :2.000 Max. :933.00 Max. :24.00 Max. :9.000
us2000c_marstat us2000c_educ us2000c_inctot
Min. :1.000 Min. : 0.000 Length:28172
1st Qu.:1.000 1st Qu.: 4.000 Class :character
Median :3.000 Median : 9.000 Mode :character
Mean :2.973 Mean : 7.871
3rd Qu.:5.000 3rd Qu.:11.000
Max. :5.000 Max. :16.000
We should also spend a few minutes just scrolling through the dataset. Open it in the viewer by clicking on acs
in the environment, or run this line of code in the console:
View(acs)
We can learn a few things about the data:
- gq and us2000c_inctot are character, while all others are numeric
- year, hhwt, and perwt seem to always be the same value (note that the summary statistics are all a single number)
- us2000c_sex, us2000c_race1, and several other variables are numeric, but their names suggest categorical variables
  - Here we should refer to the 2000 ACS codebook to recode these variables
- us2000c_age has a maximum value of 933, which sounds impossibly high if age is in years
- us2000c_inctot appears to be numeric but is coded as a character
  - Browsing the data reveals that this variable has values of “BBBBBBB”, the Census code for missing. Combining numeric and character data resulted in the numeric data being implicitly coerced to character.
To check that year
and the other variables are always the same value, we can check whether the column has only a single unique value. In other words, is the length of a vector’s unique values equal to one? Use the pull()
function to extract a column from a dataframe, unique()
to get a vector of unique values, and length()
to count the number of elements in that vector:
acs |>
  pull(year) |>
  unique() |>
  length()
[1] 1
That “1” means year only has a single value in it; it does not vary. We may choose to drop these columns later on with skills we will acquire in the chapter on subsetting, or we may choose to retain year if we plan to combine multiple years of data together. Without a column identifying the year of the data, we will not be able to differentiate data from different years.
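If we wanted to check every column at once instead of one at a time, one option (a sketch, not the only approach) is to count the distinct values in each column with across() and n_distinct(); columns that return 1 never vary:
acs |>
  summarize(across(everything(), n_distinct))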
11.7 Rename Variables
Renaming variables in dataframes makes them easier to use. Variables may have names that are not readily understandable, too short to have meaning, too long to type repeatedly, or in an undesirable case (if your variables are all UPPERCASE
, then your code and error messages will yell at you).
The first thing to do with a dataset is change variable names, which can be done individually or across multiple variables with a function.
11.7.1 Individual Variables
The rename()
function allows for easy renaming of individual columns. The pattern is new_name = old_name
. Change pernum
to person
and serial
to household
.
The rename()
function’s first argument is the name of our dataset, so pipe the dataset into this function.
acs <-
  acs |>
  rename(person = pernum,
         household = serial)
Confirm that the names changed:
colnames(acs)
[1] "year" "datanum" "household" "hhwt"
[5] "gq" "us2000c_serialno" "person" "perwt"
[9] "us2000c_pnum" "us2000c_sex" "us2000c_age" "us2000c_hispan"
[13] "us2000c_race1" "us2000c_marstat" "us2000c_educ" "us2000c_inctot"
11.7.2 Multiple Variables
Several columns have the prefix us2000c_
, which is redundant since the data is all from the US and from the year 2000. Instead of renaming them one-by-one, we can rename several columns at once with rename_with()
and a function. Here, we want to remove a prefix, so use the str_remove()
function.
We can use the anonymous function \(x) str_remove(x, "us2000c_")
.
The \(x)
defines a function that takes an argument x
(we could have named it anything). rename_with()
will pass the column names to str_remove()
as x
, which acts on that x
to remove "us2000c_"
. Learn more about writing functions here.
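If the \(x) shorthand is unfamiliar, it is just a compact way of writing an anonymous function; the two forms below are equivalent (shown only as a sketch):
# Shorthand lambda syntax (R 4.1.0 and later)
\(x) str_remove(x, "us2000c_")
# The same function written the classic way
function(x) str_remove(x, "us2000c_")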
acs <-
  acs |>
  rename_with(\(x) str_remove(x, "us2000c_"))
Confirm the prefix was removed:
colnames(acs)
[1] "year" "datanum" "household" "hhwt" "gq" "serialno"
[7] "person" "perwt" "pnum" "sex" "age" "hispan"
[13] "race1" "marstat" "educ" "inctot"
If we wanted to make all variables lowercase, we would run:
acs <-
  acs |>
  rename_with(\(x) str_to_lower(x))
11.8 Create Variables
Most of this book up to this point has used vectors. Now, all of those vector operations can be used within the context of dataframes. All that hard work is about to pay off!
Before, if we wanted to create some vector, x
, that was the vector y
times 5, we would have written:
x <- y * 5
Now, if x
and y
were two columns in some dataframe, dat
, we would use the mutate()
function.
mutate()
is used for variable creation and replacement. If a new variable name is supplied, a new variable is created. If an existing variable name is supplied, that variable is replaced (without any warning).
The first argument in mutate()
is the dataframe, so we would pipe dat
into the function:
dat |>
  mutate(x = y * 5)
Compare the two lines of code:
x <- y * 5
dat |> mutate(x = y * 5)
The only differences are that we use an equals sign (=
) instead of the assignment operator (<-
), and that we pass the name of the dataframe into mutate()
so it knows where to look for those variables (vectors!).
That means all those skills working with numeric, logical, character, date, and categorical vectors can be easily transferred into the context of dataframes.
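A single mutate() call can also create several variables at once, separated by commas, and later arguments can use variables created earlier in the same call. A minimal sketch, reusing the hypothetical dat, x, and y from above:
dat |>
  mutate(x5    = x * 5,       # new column from x
         total = x + y,       # new column from two existing columns
         half  = total / 2)   # uses the column created on the previous line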
mutate() can either take a single value that gets recycled, or a vector with the same length as the dataframe. We do not need to worry about shorter vectors being silently recycled, much less unevenly recycled, which is usually undesirable; mutate() will instead return an error:
acs |>
  mutate(x = 1:2)
Error in `mutate()`:
ℹ In argument: `x = 1:2`.
Caused by error:
! `x` must be size 28172 or 1, not 2.
Now, we can create variables of several types.
11.8.1 Numeric
One way we can create a numeric variable is by multiplying a single existing column by a constant. Multiplying age (in years) by 12 results in age in months. The variable age_months
does not currently exist in acs
, so a new variable is created.
acs <-
  acs |>
  mutate(age_months = age * 12)
Variables can also be created from multiple existing columns, through addition, multiplication, logarithms, averages, minimums, or any combination of functions and operators.
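For example, if a dataframe had height and weight columns (hypothetical here; they are not in acs), we could compute body mass index from the two:
# Hypothetical columns: height_m in meters, weight_kg in kilograms
dat |>
  mutate(bmi = weight_kg / height_m^2)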
11.8.2 Logical
Create an indicator variable called teenager
that indicates whether an individual’s age is in the range 13-19:
acs <-
  acs |>
  mutate(teenager = age >= 13 & age <= 19)
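A quick way to check the new variable is to tabulate it with count(); because teenager is logical, taking its mean also gives the proportion of teenagers:
acs |> count(teenager)
acs |> pull(teenager) |> mean()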
11.8.3 Character
In our dataset, the identifier is currently spread out across two variables: household
and person
. We can put these two together with paste()
so that we have a single variable that uniquely identifies observations.
acs <-
  acs |>
  mutate(id = paste(household, person, sep = "_"))
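To confirm that id really does uniquely identify observations, we can compare the number of distinct ids to the number of rows (a quick sketch):
acs |>
  summarize(rows = n(), distinct_ids = n_distinct(id))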
11.8.4 Date
We do not have any date information more specific than the year, but if we did, we could create a meaningful date variable, such as the date of survey response. For practice, create a date variable that just pastes together year
with January 1 for everybody:
acs <-
  acs |>
  mutate(date = paste0(year, "0101") |> ymd())
11.8.5 Categorical
Categorical variables can be of type numeric or factor. Numeric categorical variables work when we have only two categories, and we code them as 0 and 1. These are also called dummy or indicator variables. (A variable with only two categories is conceptually categorical, even though its type may be numeric or logical.) Factor categorical variables work with any number of categories.
We can create a new indicator variable called female
that contains 0 for male and 1 for female. The sex
column is 1s and 2s. According to the codebook, these correspond to male and female, respectively. If sex == 2
, we can assign the value 1, and if not, 0.
acs <-
  acs |>
  mutate(female = ifelse(sex == 2, 1, 0))
Changing categorical variables that start at 1 to start at 0 is a good idea because it makes the intercept in a regression model meaningful. If sex
were a predictor, the intercept would be the expected value when sex == 0
, which is not a possible value.
Now, recode the gq
column. First, look at the values this variable can take:
acs |> pull(gq) |> unique()
[1] "Households under 1970 definition"
[2] "Other group quarters"
[3] "Group quarters--Institutions"
[4] "Additional households under 1990 definition"
Perhaps we only want two categories: Households and Group Quarters. We can use fct_collapse()
to reduce the number of categories:
acs <-
  acs |>
  mutate(gq_recode = fct_collapse(gq,
                                  "Households" = c("Households under 1970 definition",
                                                   "Additional households under 1990 definition"),
                                  "Group Quarters" = c("Group quarters--Institutions",
                                                       "Other group quarters")))
Sometimes we may have a numeric variable that can take on many values, and we want to collapse and recode the variable. A multiple-level categorical variable can be created with case_when()
. Each argument within case_when()
follows the pattern condition ~ value_if_TRUE
. Any values that return FALSE for every condition are assigned a value of NA
:
x <- 1:5
case_when(x < 3 ~ "Less than 3",
          x == 3 ~ "Equal to 3",
          x > 3 ~ "Greater than 3")
[1] "Less than 3" "Less than 3" "Equal to 3" "Greater than 3"
[5] "Greater than 3"
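In recent versions of dplyr (1.1.0 and later), case_when() also accepts a .default argument for values that match none of the conditions, instead of returning NA. A sketch using the same x:
case_when(x < 3 ~ "Less than 3",
          x == 3 ~ "Equal to 3",
          .default = "Greater than 3")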
Using the definitions from the 2000 ACS codebook (available here), we can recode educ
into a categorical variable with levels less than high school (education codes 1-8), high school (9), some college (10-12), bachelors (13), and advanced degree (14-16). A value of 0 for education means not applicable and is only used for individuals less than three years old. If we do not include 0 in any of our case_when()
statements, it will be assigned a value of NA
.
We can then use fct_relevel()
within the same mutate()
call to specify an order for our categorical variable since the default is alphabetical order.
acs <-
  acs |>
  mutate(educ_categories = case_when(educ >= 1 & educ <= 8 ~ "Less than High School",
                                     educ == 9 ~ "High School",
                                     educ >= 10 & educ <= 12 ~ "Some College",
                                     educ == 13 ~ "Bachelors",
                                     educ >= 14 ~ "Advanced Degree"),
         educ_categories = fct_relevel(educ_categories,
                                       "Less than High School",
                                       "High School",
                                       "Some College",
                                       "Bachelors",
                                       "Advanced Degree"))
We can confirm the coding of educ
was successful by examining the output of table()
:
table(acs$educ, acs$educ_categories, useNA = "ifany")
Less than High School High School Some College Bachelors Advanced Degree
0 0 0 0 0 0
1 1317 0 0 0 0
2 2508 0 0 0 0
3 1304 0 0 0 0
4 1648 0 0 0 0
5 923 0 0 0 0
6 1059 0 0 0 0
7 906 0 0 0 0
8 889 0 0 0 0
9 0 5959 0 0 0
10 0 0 1578 0 0
11 0 0 3191 0 0
12 0 0 1221 0 0
13 0 0 0 2960 0
14 0 0 0 0 1068
15 0 0 0 0 347
16 0 0 0 0 168
<NA>
0 1126
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
We can see the original values on the left and the values they now correspond to at the top.
We can imagine using this educ_categories
variable as a predictor in a statistical model or for creating bar graphs of income by educational attainment.
11.9 Change Values
In addition to modifying whole variables, mutate()
can change some values within a variable.
11.9.1 To Missing
In our exploration of the data, recall that inctot
is a character vector, but the first few values shown by glimpse()
appear to be numbers.
Open the data set with View(acs)
to see why it is a character vector. Some of the values are BBBBBBB
, the Census code for missing data. We can recode the B’s as missing values with the ifelse()
function while leaving the other values as they are.
acs <-
  acs |>
  mutate(inctot = ifelse(inctot == "BBBBBBB", NA, inctot))
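As an aside, dplyr's na_if() is a shorthand for exactly this pattern: it replaces one specified value with NA and leaves everything else unchanged. An equivalent sketch:
acs |>
  mutate(inctot = na_if(inctot, "BBBBBBB"))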
At this step, the column is still a character vector, so we need to convert it into a numeric vector.
acs |> pull(inctot) |> typeof()
[1] "character"
acs <-
  acs |>
  mutate(inctot = as.numeric(inctot))
acs |> pull(inctot) |> typeof()
[1] "double"
Another approach we could have taken if we knew our missing code in advance would have been to specify it when importing the data. If you take this approach, you will need to re-run the above code of renaming columns in order to follow along for the remainder of this chapter. This approach also assumes that “BBBBBBB” means missing in every column, not just inctot.
acs <- read.csv("2000_acs_sample.csv", na.strings = "BBBBBBB")
11.9.1.1 Quantify Missing Data
We should now check how much data is missing from the dataframe.
We can calculate how much data is missing from acs
as a whole. To do so, first use is.na()
as a test of whether the data is missing. This will turn the entire dataframe into TRUE
and FALSE
values, where TRUE
means the data is missing. Then, take the sum or the mean. In doing so, TRUE
and FALSE
will be coerced into 1 and 0, respectively.
acs |> is.na() |> sum()
[1] 7283
acs |> is.na() |> mean()
[1] 0.01123996
A total of 7283 values are missing, equal to 1.1% of the dataset.
To calculate missingness by column, run the same code above but with colSums()
or colMeans()
instead of sum()
or mean():
acs |> is.na() |> colSums()
year datanum household hhwt gq
0 0 0 0 0
serialno person perwt pnum sex
0 0 0 0 0
age hispan race1 marstat educ
0 0 0 0 0
inctot age_months teenager id date
6157 0 0 0 0
female gq_recode educ_categories
0 0 1126
acs |> is.na() |> colMeans()
year datanum household hhwt gq
0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
serialno person perwt pnum sex
0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
age hispan race1 marstat educ
0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
inctot age_months teenager id date
0.21855033 0.00000000 0.00000000 0.00000000 0.00000000
female gq_recode educ_categories
0.00000000 0.00000000 0.03996876
We now see that all the missing data is in the columns inctot
and educ_categories
.
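If we want to inspect the observations with missing values rather than just count them, one option is to filter on is.na() (filtering is covered in more depth in the chapter on subsetting); a sketch:
acs |>
  filter(is.na(inctot)) |>
  summary()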
If you work with missing data, read about Blimp, an easy-to-use software for estimating models with missing data.
11.9.2 To Other Values
Earlier we saw that age
had a maximum value of 933. If we assume this variable is in years, this value seems way too high. Use count()
to tabulate the values of age
, sort its values, and then view counts for the highest values of age
with slice_max()
:
acs |> count(age) |> slice_max(age, n = 3)
age n
1 933 1
2 93 149
3 92 1
The original ACS dataset did not have this value of 933 for age. The value was intentionally edited from 93 to 933 for this example.
Only one observation has a value of 933, and the next highest value is 93. We could take at least three approaches to deal with this number:
- Change it to 93 because we think it is a typo. (Or maybe 33? Or 9?)
- Change the value to missing because we are not sure what value it should be.
- Drop this observation because we are unsure about its quality.
If we know the 933 should be 93, we can change it with ifelse()
:
acs <-
  acs |>
  mutate(age = ifelse(age == 933, 93, age))
Verify that this variable no longer has a 933, and that there is one more 93 than before:
acs |> count(age) |> slice_max(age, n = 3)
age n
1 93 150
2 92 1
3 89 39
933 no longer appears in the dataset, and the count of 93 increased by 1. Success.
11.10 Save Your Dataframe and Script
Now that we have cleaned up the ACS data set, it is a good idea to end the script by saving the cleaned data set. Save it as an RDS to preserve the formatting of dates and factors.
saveRDS(acs, "acs_cleaned.rds")
By saving the resulting data set, you can now begin the next script (“02_…”) with acs <- readRDS("acs_cleaned.rds")
. This first script is your record of how you made changes to the raw data. It serves as a record to future-you, to remind you of what you did, and to colleagues and journal reviewers who have questions.
11.11 Exercises
11.11.1 Fundamental
- Start a script that loads the tidyverse. Use the built-in penguins_raw dataset.
If we are working with external data files and make a mistake like overwriting the data, all we need to do is re-import the data.
With built-in datasets, our process looks a little different. Built-in datasets exist in the datasets
or other packages. When we reference one of those objects, like penguins_raw
, R will first look in our environment and then in the other packages we have loaded.
That means, if we have something else called penguins_raw
in our environment, R will find that one instead of the copy in the datasets
package:
penguins_raw <- 1
str(penguins_raw)
num 1
The solution is simple. Just remove the copy in the environment with rm()
, so that R will find the copy in the datasets
package:
rm(penguins_raw)
str(penguins_raw)
'data.frame': 344 obs. of 17 variables:
$ studyName : chr "PAL0708" "PAL0708" "PAL0708" "PAL0708" ...
$ Sample Number : num 1 2 3 4 5 6 7 8 9 10 ...
$ Species : chr "Adelie Penguin (Pygoscelis adeliae)" "Adelie Penguin (Pygoscelis adeliae)" "Adelie Penguin (Pygoscelis adeliae)" "Adelie Penguin (Pygoscelis adeliae)" ...
$ Region : chr "Anvers" "Anvers" "Anvers" "Anvers" ...
$ Island : chr "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
$ Stage : chr "Adult, 1 Egg Stage" "Adult, 1 Egg Stage" "Adult, 1 Egg Stage" "Adult, 1 Egg Stage" ...
$ Individual ID : chr "N1A1" "N1A2" "N2A1" "N2A2" ...
$ Clutch Completion : chr "Yes" "Yes" "Yes" "Yes" ...
$ Date Egg : Date, format: "2007-11-11" "2007-11-11" ...
$ Culmen Length (mm) : num 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ Culmen Depth (mm) : num 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ Flipper Length (mm): num 181 186 195 NA 193 190 181 195 193 190 ...
$ Body Mass (g) : num 3750 3800 3250 NA 3450 ...
$ Sex : chr "MALE" "FEMALE" "FEMALE" NA ...
$ Delta 15 N (o/oo) : num NA 8.95 8.37 NA 8.77 ...
$ Delta 13 C (o/oo) : num NA -24.7 -25.3 NA -25.3 ...
$ Comments : chr "Not enough blood for isotopes." NA NA "Adult not sampled." ...
If you make a mistake with penguins_raw
in the exercises below, just run rm(penguins_raw)
!
- Read the documentation at ?penguins_raw.
- Examine the data. What type is each column? How are the data distributed (use summary())? Is any data missing?
- Rename Flipper Length (mm) to flipper_len_mm. Note: to reference a non-syntactic column name, use back ticks (`) around the column name: `Flipper Length (mm)`
- Create a new variable called flipper_len_in that is flipper_len_mm divided by 25.4 to rescale it to inches.
- Create a new binary variable called not_enough_blood if the Comments column contains the text "Not enough blood for isotopes." Hint: use str_detect().
- Make a factor from the Island column where "Dream" is the reference (first) level. Replace the existing Island column.
- Recode Sex to have "M" for "MALE" and "F" for "FEMALE."
- Save the dataset as an RDS file, and save your script.
11.11.2 Extended
- Use rename_with() to make all of the column names in penguins_raw syntactic by removing spaces and punctuation.