11 First Steps with Dataframes

The examples on this page require an example file, which you can download by clicking on the link below. (If the data file opens up in your browser instead of downloading, right-click on the link and select “Save link as…”)

Click here to download

The majority of your statistical work will revolve around data sets. For statistical work, a data set is a rectangular (matrix-like) arrangement of measurements (variables, columns) collected on units of observation that are arranged in rows. This structure is crucial to keep the correct measurements (data values) connected with each observation.

The typical data analysis project begins by reading a data set into R from some external file.

In R this structure is called a dataframe. A dataframe is an ordered list of vectors, where the vectors are all the same length. A dataframe also always has column names (variable names) and row names (often just the observation number as a character value). When indexing the elements of a dataframe we always have all three methods available: by position, by name, or by condition.

A variation on the dataframe is the tidyverse tibble class. Tibbles have their own methods for some generic functions (like print). They also have data.frame as a secondary class - any tibble can be used as a dataframe.

The following examples use a subset of the American Community Survey (ACS) from 2000. This is the same data set used in the Data Wrangling in Stata curriculum. To follow along, download the dataset here.

When you start working with a data set, especially if it was created by somebody else (that includes past-you!), resist the temptation to start running models immediately. First, take time to understand the data. What information does it contain? What is the structure of the data set? What is the data type of each column? Is there anything strange in the data set? It’s better to find out now, and not when you’re in the middle of modeling!

11.1 Start Your Script

Now, create a new script with File - New File - New R Script or by clicking on the New File icon in the toolbar. Save this script with a sensible name, such as “01_cleaning.R”. We can imagine a series of scripts we might run after this one, such as “02_descriptive_statistics.R”, “03_regression.R”, “04_plots.R”, and so on.

The first few lines of a script should load libraries and read in our data. The dplyr package includes a wide range of functions for manipulating dataframes that are essential for basic and advanced data wrangling, and forcats has several functions we will use for working with categorical variables.

library(dplyr)
library(forcats)

acs <- read.csv("2000_acs_sample.csv")

11.1.1 Piping

As of version 4.1.0, R has a native pipe operator, |>. The pipe uses the result of an expression as the first argument of the following expression.

We can write our code without pipes in a nested manner:

sqrt(mean(seq(1, 5, 1)))
[1] 1.732051

Or we can use pipes to take the seq(1, 5, 1), give it to mean() to calculate the average, and then take that average and pass it to sqrt() to calculate the square root:

seq(1, 5, 1) |> mean() |> sqrt()
[1] 1.732051

Using pipes makes writing and reading code easier. As we write strings of functions, we are less likely to misplace or miscount parentheses. As we read ours and others’ code, we see the operations in the order they are done. The nested example above starts with sqrt(), which is the last operation we carry out, while the piped code reveals that the first thing we do is create a vector of one through five.

11.2 Look at the Data

A dataframe consists of rows called observations and columns called variables. The data recorded for an individual observation are stored as values in the corresponding variable.

Variable Variable Variable
Observation Value Value Value
Observation Value Value Value
Observation Value Value Value

If you have a dataset already in this format, you are in luck. However, we might run into datasets that need a little work before we can use them. A single row might have multiple observations, or a single variable might be spread across multiple columns. Organizing, or tidying, datasets is the focus of the remainder of this book.

Now that we have the acs dataset loaded, a first step in looking at our data is checking its dimensions, row names (if it has any), and column names.

nrow(acs)
[1] 28172
rownames(acs)[1:10] # just look at the first 10
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
ncol(acs)
[1] 16
colnames(acs)
 [1] "year"             "datanum"          "serial"           "hhwt"             "gq"               "us2000c_serialno"
 [7] "pernum"           "perwt"            "us2000c_pnum"     "us2000c_sex"      "us2000c_age"      "us2000c_hispan"  
[13] "us2000c_race1"    "us2000c_marstat"  "us2000c_educ"     "us2000c_inctot"  
dim(acs) # nrow() and ncol() together
[1] 28172    16

Next, the summary() function provides simple summary statistics for numeric vectors, and str() will, in the case of dataframes, tell us the data type and the first few values of each column.

summary(acs)
      year         datanum      serial             hhwt          gq            us2000c_serialno      pernum      
 Min.   :2000   Min.   :4   Min.   :     37   Min.   :100   Length:28172       Min.   :     92   Min.   : 1.000  
 1st Qu.:2000   1st Qu.:4   1st Qu.: 323671   1st Qu.:100   Class :character   1st Qu.:2395745   1st Qu.: 1.000  
 Median :2000   Median :4   Median : 617477   Median :100   Mode  :character   Median :4905730   Median : 2.000  
 Mean   :2000   Mean   :4   Mean   : 624234   Mean   :100                      Mean   :4951676   Mean   : 2.208  
 3rd Qu.:2000   3rd Qu.:4   3rd Qu.: 937528   3rd Qu.:100                      3rd Qu.:7444248   3rd Qu.: 3.000  
 Max.   :2000   Max.   :4   Max.   :1236779   Max.   :100                      Max.   :9999402   Max.   :16.000  
     perwt      us2000c_pnum     us2000c_sex     us2000c_age     us2000c_hispan  us2000c_race1   us2000c_marstat
 Min.   :100   Min.   : 1.000   Min.   :1.000   Min.   :  0.00   Min.   : 1.00   Min.   :1.000   Min.   :1.000  
 1st Qu.:100   1st Qu.: 1.000   1st Qu.:1.000   1st Qu.: 17.00   1st Qu.: 1.00   1st Qu.:1.000   1st Qu.:1.000  
 Median :100   Median : 2.000   Median :2.000   Median : 35.00   Median : 1.00   Median :1.000   Median :3.000  
 Mean   :100   Mean   : 2.208   Mean   :1.512   Mean   : 35.92   Mean   : 1.77   Mean   :1.935   Mean   :2.973  
 3rd Qu.:100   3rd Qu.: 3.000   3rd Qu.:2.000   3rd Qu.: 51.00   3rd Qu.: 1.00   3rd Qu.:1.000   3rd Qu.:5.000  
 Max.   :100   Max.   :16.000   Max.   :2.000   Max.   :933.00   Max.   :24.00   Max.   :9.000   Max.   :5.000  
  us2000c_educ    us2000c_inctot    
 Min.   : 0.000   Length:28172      
 1st Qu.: 4.000   Class :character  
 Median : 9.000   Mode  :character  
 Mean   : 7.871                     
 3rd Qu.:11.000                     
 Max.   :16.000                     
str(acs)
'data.frame':   28172 obs. of  16 variables:
 $ year            : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
 $ datanum         : int  4 4 4 4 4 4 4 4 4 4 ...
 $ serial          : int  37 37 37 241 242 296 377 418 465 465 ...
 $ hhwt            : int  100 100 100 100 100 100 100 100 100 100 ...
 $ gq              : chr  "Households under 1970 definition" "Households under 1970 definition" "Households under 1970 definition" "Households under 1970 definition" ...
 $ us2000c_serialno: int  365663 365663 365663 2894822 2896802 3608029 4720742 5176658 5856346 5856346 ...
 $ pernum          : int  1 2 3 1 1 1 1 1 1 2 ...
 $ perwt           : int  100 100 100 100 100 100 100 100 100 100 ...
 $ us2000c_pnum    : int  1 2 3 1 1 1 1 1 1 2 ...
 $ us2000c_sex     : int  2 2 2 2 2 2 2 2 1 2 ...
 $ us2000c_age     : int  20 19 19 50 29 20 69 59 55 47 ...
 $ us2000c_hispan  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ us2000c_race1   : int  1 1 2 1 1 6 1 1 2 2 ...
 $ us2000c_marstat : int  5 5 5 5 5 5 5 2 4 5 ...
 $ us2000c_educ    : int  11 11 11 14 13 9 1 8 12 1 ...
 $ us2000c_inctot  : chr  "10000" "5300" "4700" "32500" ...

We can learn a few things about the data: - it seems like year, hhwt, and perwt are always the same values (note that the summary statistics are all a single number) - several variables, such as us2000c_sex and us2000c_race1, are integers but their names suggest categorical variables; here we should refer to the 2000 ACS codebook to recode these variables - gq and us2000c_inctot are character vectors, while all others are numeric - us2000c_age has a maximum value of 933, which sounds impossibly high if age is in years

To check that year and the other variables are always the same, use the table() function to reveal that the value 2000 occurs 28172 times.

table(acs$year)

 2000 
28172 

Alternatively, the unique() function in conjunction with the length() function to find the number of unique values in a vector. One unique value means that every value is identical.

unique(acs$year) # 2000 is the only value
[1] 2000
acs$year |> unique() |> length() # 1 unique value
[1] 1

We may choose to drop these columns later on, and this can be done with subset(acs, select = -year) or select(acs, -year) (see the chapters on Subsetting).

11.3 Renaming Variables

11.3.1 Rename Individual Columns

The rename() function in dplyr allows for easy renaming of individual columns. The pattern is new_name = old_name. We can change pernum to “person” and serial to “household”.

acs <- 
  acs |> 
  rename(person = pernum, 
         household = serial)

11.3.2 Rename Multiple Columns

Several columns have the prefix “us2000c_”, which is a bit redundant since the data is all from the US and from the year 2000. Instead of renaming them one-by-one, we can rename several columns at once with rename_with() and the sub() function. In Substitution, we learned that sub() allwos us to substitute one character string for another. Substituting “us2000c_” for “” (nothing) will delete this prefix.

sub() contains three arguments. It looks through the third and replaces the first with the second. Here, go through the column names of acs and replace the first occurrence of “us2000c_” with ““. Within rename_with(), we can reference the column names as .x.

colnames(acs)
 [1] "year"             "datanum"          "household"        "hhwt"             "gq"               "us2000c_serialno"
 [7] "person"           "perwt"            "us2000c_pnum"     "us2000c_sex"      "us2000c_age"      "us2000c_hispan"  
[13] "us2000c_race1"    "us2000c_marstat"  "us2000c_educ"     "us2000c_inctot"  
acs <-
  acs |>
  rename_with(~ sub("us2000c_", "", .x))

colnames(acs)
 [1] "year"      "datanum"   "household" "hhwt"      "gq"        "serialno"  "person"    "perwt"     "pnum"     
[10] "sex"       "age"       "hispan"    "race1"     "marstat"   "educ"      "inctot"   

11.4 Creating Variables

mutate() is a variable creation and replacement function. If a new variable name is supplied, a new variable is created. If an existing variable name is supplied, that variable is replaced without any warning.

11.4.1 Numeric

One way we can create a numeric variable is by multiplying a single existing column by a constant. Multiplying age (assumed to be in years) by 12 results in age in months. The variable age_months does not currently exist in acs, so a new variable is created.

acs <-
  acs |> 
  mutate(age_months = age * 12)

Variables can also be created from multiple existing columns, through addition, multiplication, logarithms, averages, minimums, or any combination of functions and operators.

11.4.2 Character

In our dataset, the identifier is currently spread out across two variables: household and person. We can put these two together with paste() so that we have a single variable that uniquely identifies observations.

acs <-
  acs |> 
  mutate(id = paste(household, person, sep = "_"))

11.4.3 Categorical

Categorical variables can be of type numeric or factor. Numeric categorical variables work when we have only two categories, and we code them as 0 and 1. These are also called dummy or indicator variables. Factor categorical variables work with any number of categories.

We can create a new indicator variable called female that contains 0 for male and 1 for female. The sex column is 1s and 2s. According to the codebook, these correspond to male and female, respectively. If sex == 2, we can assign the value 1, and if not, 0.

acs <-
  acs |> 
  mutate(female = ifelse(sex == 2, 1, 0))

We could also have recoded our original sex variable to take values Male and Female by supplying character values in ifelse()’s second and third arguments. Here we assign the result to sex, a column that already exists, so the original column is replaced.

acs <-
  acs |> 
  mutate(sex = ifelse(sex == 2, "Female", "Male"))

Let’s recode the gq column too. First, look at the values this variable can take.

unique(acs$gq)
[1] "Households under 1970 definition"            "Other group quarters"                       
[3] "Group quarters--Institutions"                "Additional households under 1990 definition"

Perhaps we only want two categories: Households and Group Quarters. We can use fct_collapse() to reduce the number of categories:

acs <-
  acs |> 
  mutate(gq_recode = fct_collapse(gq,
                                  "Households" = c("Households under 1970 definition", 
                                                   "Additional households under 1990 definition"),
                                  "Group Quarters" = c("Group quarters--Institutions", 
                                                       "Other group quarters")))

We can also create indicators for continuous variables using other logical comparisons. Let’s create an indicator called adult that has a 1 if an individual is at least 18 years old, and a 0 if not.

acs <-
  acs |> 
  mutate(adult = ifelse(age >= 18, 1, 0))

All of our categorical variables so far have had only two categories. A multiple-level categorical variable can be created with case_when(). Each argument within case_when() follows the pattern condition ~ value_if_TRUE. Any values that return FALSE for every condition are assigned a value of NA.

x <- 1:5

case_when(x < 3 ~ "Less than 3",
          x == 3 ~ "Equal to 3",
          x > 3 ~ "Greater than 3")
[1] "Less than 3"    "Less than 3"    "Equal to 3"     "Greater than 3" "Greater than 3"

Using the definitions from the 2000 ACS codebook (available here), we can recode educ into a categorical variable with levels less than high school (education codes 1-8), high school (9), some college (10-12), bachelors (13), and advanced degree (14-16). A value of 0 for education means not applicable and is only used for individuals less than three years old. If we do not include 0 in any of our case_when() statements, it will be assigned a value of NA.

We can then use fct_relevel() within the same mutate() call to specify an order for our categorical variable since the default is alphabetical order.

acs <-
  acs |> 
  mutate(educ_categories = case_when(educ >= 1 & educ <= 8 ~ "Less than High School", 
                                     educ == 9 ~ "High School", 
                                     educ >=  10 & educ <= 12 ~ "Some College", 
                                     educ == 13 ~ "Bachelors",
                                     educ >= 14 ~ "Advanced Degree"),
         educ_categories = fct_relevel(educ_categories, 
                                       "Less than High School", 
                                       "High School", 
                                       "Some College", 
                                       "Bachelors",
                                       "Advanced Degree"))

We can confirm the coding of educ was successful by examining the output of table():

table(acs$educ, acs$educ_categories, useNA = "ifany")
    
     Less than High School High School Some College Bachelors Advanced Degree <NA>
  0                      0           0            0         0               0 1126
  1                   1317           0            0         0               0    0
  2                   2508           0            0         0               0    0
  3                   1304           0            0         0               0    0
  4                   1648           0            0         0               0    0
  5                    923           0            0         0               0    0
  6                   1059           0            0         0               0    0
  7                    906           0            0         0               0    0
  8                    889           0            0         0               0    0
  9                      0        5959            0         0               0    0
  10                     0           0         1578         0               0    0
  11                     0           0         3191         0               0    0
  12                     0           0         1221         0               0    0
  13                     0           0            0      2960               0    0
  14                     0           0            0         0            1068    0
  15                     0           0            0         0             347    0
  16                     0           0            0         0             168    0

We can imagine using this educ_categories variable as a predictor in a statistical model or for creating bar graphs of income by educational attainment.

11.5 Changing Values

In addition to modifying whole variables, we can also use mutate() to change some values within a variable.

11.5.1 Change Values to Missing

In our exploration of the data, recall that inctot is a character vector, but the first few values shown by str() appear to be numbers.

Open the data set with View(acs) to see why it is a character vector. Some of the values are BBBBBBB. The Census uses this code for missing data. We can recode the B’s as missing values with the na_if() function while leaving the other values as they are.

acs <-
  acs |> 
  mutate(inctot = ifelse(inctot == "BBBBBBB", NA, inctot))

At this step, the column is still a character vector, so we need to convert it into a numeric vector.

typeof(acs$inctot)
[1] "character"
acs <-
  acs |> 
  mutate(inctot = as.numeric(inctot))

Another approach we could take if we knew our missing code in advance is to specify this when reading in the data. If you take this approach, you will need to re-run the above code of renaming columns in order to follow along for the remainder of this chapter.

acs <- read.csv("2000_acs_sample.csv", na.strings = "BBBBBBB")

11.5.1.1 Quantify Missing Data

We should now check how much data is missing from the dataframe.

We can calculate how much data is missing from acs as a whole. To do so, first use is.na() as a test of whether the data is missing. This will turn the entire dataframe into TRUE and FALSE values, where TRUE means the data is missing. Then, take the sum or the mean. In doing so, TRUE and FALSE will be coerced into 1 and 0, respectively.

acs |> is.na() |> sum()
[1] 7283
acs |> is.na() |> mean()
[1] 0.01175087

A total of 6157 values are missing, 0.7% of our dataset.

To calculate missingness by individual columns, first turn the dataframe into logical values with is.na() as above, and then take column sums or means.

acs |> is.na() |> colSums()
           year         datanum       household            hhwt              gq        serialno          person 
              0               0               0               0               0               0               0 
          perwt            pnum             sex             age          hispan           race1         marstat 
              0               0               0               0               0               0               0 
           educ          inctot      age_months              id          female       gq_recode           adult 
              0            6157               0               0               0               0               0 
educ_categories 
           1126 
acs |> is.na() |> colMeans()
           year         datanum       household            hhwt              gq        serialno          person 
     0.00000000      0.00000000      0.00000000      0.00000000      0.00000000      0.00000000      0.00000000 
          perwt            pnum             sex             age          hispan           race1         marstat 
     0.00000000      0.00000000      0.00000000      0.00000000      0.00000000      0.00000000      0.00000000 
           educ          inctot      age_months              id          female       gq_recode           adult 
     0.00000000      0.21855033      0.00000000      0.00000000      0.00000000      0.00000000      0.00000000 
educ_categories 
     0.03996876 

We now see that all of the missing values are in inctot, and 21.9% of this variable is missing.

If you work with missing data and imputation, be sure to check out this excellent resource by Stef van Buren on using the mice package.

11.5.2 Change Values to Other Values

Earlier we saw that age had a maximum value of 933. If we assume this variable is in years, this value seems way too high. Use table() to see the values age takes.

table(acs$age)

  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29 
371 391 364 423 390 388 376 392 426 450 444 403 461 437 441 396 400 367 375 407 415 362 363 412 345 352 376 378 374 397 
 30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59 
441 397 373 402 387 440 466 454 462 435 466 473 454 421 431 410 440 425 395 363 361 365 357 351 285 279 275 288 266 225 
 60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89 
234 226 227 230 236 185 195 187 182 203 175 187 188 186 183 155 150 138 141 121 118 109 109  98  90  74  71  67  54  39 
 92  93 933 
  1 149   1 

Only one observation has a value of 933, and the next highest value is 93. We could take at least three approaches to deal with this number. We might think it should be 93 and whoever entered the data made a typo, we could change this value to missing because we are not certain what the value should have been, or we could drop this case altogether.

(Note: The original ACS dataset did not have this value of 933 for age. The value was intentionally edited from 93 to 933 for this exercise.)

If we take the third approach, we can drop this particular row by taking a subset, either with subset(acs, subset = !age %in% 933) or filter(acs, !age %in% 933) (see chapters on Subsetting for more). If we want to make it missing, we can use the na_if() approach above. If we think, rather, if we know, it should be 93, we can use mutate() in conjunction with ifelse().

The arguments of ifelse() are condition, value if TRUE, and value if FALSE. The code below checks if a value of age is 933. If it is, it changes it to 93. If not, it uses the value of that observation from age.

acs <- 
  acs |> 
  mutate(age = ifelse(age %in% 933, 93, age))

11.6 Save Your Dataframe and Script

Now that we have cleaned up the ACS data set, it is a good idea to end the script by saving the cleaned data set.

saveRDS(acs, "acs_cleaned.rds")

By saving the resulting data set, you can now begin the next script (“02_…”) with acs <- readRDS("acs_cleaned.rds"). This first script is your record of how you made changes to the raw data. It serves as a record to future-you, to remind you of what you did, and to colleagues and journal reviewers who have questions.

11.7 Exercises

  1. Start a script that loads dplyr and the sleep dataset.

  2. Read the documentation at help(sleep).

  3. Examine the data. What type is each column? How are the data distributed? Is any data missing?

  4. Add a new column that says “One” if group is 1, and “Two” if group is 2.

  5. Replace extra with NA if it is below zero.

  6. Multiply extra by 60 so that it is minutes rather than hours.

  7. Change the name of extra to extra_minutes.

  8. Save the dataset as an RDS file, and save your script.