11 First Steps with Dataframes

The majority of your statistical work will revolve around data sets. For statistical work, a data set is a rectangular (matrix-like) arrangement of measurements (variables, columns) collected on units of observation that are arranged in rows. This structure is crucial to keep the correct measurements (data values) connected with each observation.

The typical data analysis project begins by reading a data set into R from some external file.

In R this structure is called a dataframe. A dataframe is an ordered list of vectors, where the vectors are all the same length. A dataframe also always has column names (variable names) and row names (often just the observation number as a character value). When indexing the elements of a dataframe we always have all three methods available: by position, by name, or by condition.

A variation on the dataframe is the tidyverse tibble class. Tibbles have their own methods for some generic functions (like print). They also have data.frame as a secondary class - any tibble can be used as a dataframe.

The following examples use a subset of the American Community Survey (ACS) from 2000. This is the same data set used in the Data Wrangling in Stata curriculum. To follow along, download the dataset here.

When you start working with a data set, especially if it was created by somebody else (that includes past-you!), resist the temptation to start running models immediately. First, take time to understand the data. What information does it contain? What is the structure of the data set? What is the data type of each column? Is there anything strange in the data set? It’s better to find out now, and not when you’re in the middle of modeling!

11.1 Start Your Script

Now, create a new script with File - New File - New R Script or by clicking on the New File icon in the toolbar. Save this script with a sensible name, such as “01_cleaning.R”. We can imagine a series of scripts we might run after this one, such as “02_descriptive_statistics.R”, “03_regression.R”, “04_plots.R”, and so on.

The first few lines of a script should load libraries and read in our data. the dplyr package includes a wide range of functions for manipulating dataframes that are essential for basic and advanced data wrangling.

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
acs <- read.csv("2000_acs_sample.csv")

11.1.1 Piping

dplyr loads the basic pipe operator %>% (shortcut: ctrl + shift + m). (Other pipe operators are available in the magrittr package.) The pipe uses the result of an expression as the first argument of the following expression.

We can write our code without pipes in a nested manner:

sqrt(mean(1:5))
[1] 1.732051

Or we can use pipes to take the 1:5, give it to mean() to calculate the average, and then take that average and pass it to sqrt() to calculate the square root:

1:5 %>% mean() %>% sqrt()
[1] 1.732051

Using pipes makes writing and reading code easier. As we write strings of functions, we are less likely to misplace or miscount parentheses. As we read ours and others’ code, we see the operations in the order they are done. The nested example above starts with sqrt(), which is the last operation we carry out, while the piped code reveals that the first thing we do is create a vector of one through five.

11.2 Look at the Data

A dataframe consists of rows called observations and columns called variables. The data recorded for an individual observation are stored as values in the corresponding variable.

Variable Variable Variable
Observation Value Value Value
Observation Value Value Value
Observation Value Value Value

If you have a dataset already in this format, you are in luck. However, we might run into datasets that need a little work before we can use them. A single row might have multiple observations, or a single variable might be spread across multiple columns. Organizing, or tidying, datasets is the focus of the remainder of this book.

Now that we have the acs dataset loaded, a first step in looking at our data is checking its dimensions, row names (if it has any), and column names.

nrow(acs)
[1] 28172
row.names(acs)[1:10] # just look at the first 10
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
ncol(acs)
[1] 16
colnames(acs)
 [1] "year"             "datanum"          "serial"           "hhwt"             "gq"              
 [6] "us2000c_serialno" "pernum"           "perwt"            "us2000c_pnum"     "us2000c_sex"     
[11] "us2000c_age"      "us2000c_hispan"   "us2000c_race1"    "us2000c_marstat"  "us2000c_educ"    
[16] "us2000c_inctot"  
dim(acs) # nrow() and ncol() together
[1] 28172    16

Next, the summary() function provides simple summary statistics for numeric vectors, and str() will, in the case of dataframes, tell us the data type and the first few values of each column.

summary(acs)
      year         datanum      serial             hhwt          gq            us2000c_serialno 
 Min.   :2000   Min.   :4   Min.   :     37   Min.   :100   Length:28172       Min.   :     92  
 1st Qu.:2000   1st Qu.:4   1st Qu.: 323671   1st Qu.:100   Class :character   1st Qu.:2395745  
 Median :2000   Median :4   Median : 617477   Median :100   Mode  :character   Median :4905730  
 Mean   :2000   Mean   :4   Mean   : 624234   Mean   :100                      Mean   :4951676  
 3rd Qu.:2000   3rd Qu.:4   3rd Qu.: 937528   3rd Qu.:100                      3rd Qu.:7444248  
 Max.   :2000   Max.   :4   Max.   :1236779   Max.   :100                      Max.   :9999402  
     pernum           perwt      us2000c_pnum     us2000c_sex     us2000c_age     us2000c_hispan 
 Min.   : 1.000   Min.   :100   Min.   : 1.000   Min.   :1.000   Min.   :  0.00   Min.   : 1.00  
 1st Qu.: 1.000   1st Qu.:100   1st Qu.: 1.000   1st Qu.:1.000   1st Qu.: 17.00   1st Qu.: 1.00  
 Median : 2.000   Median :100   Median : 2.000   Median :2.000   Median : 35.00   Median : 1.00  
 Mean   : 2.208   Mean   :100   Mean   : 2.208   Mean   :1.512   Mean   : 35.92   Mean   : 1.77  
 3rd Qu.: 3.000   3rd Qu.:100   3rd Qu.: 3.000   3rd Qu.:2.000   3rd Qu.: 51.00   3rd Qu.: 1.00  
 Max.   :16.000   Max.   :100   Max.   :16.000   Max.   :2.000   Max.   :933.00   Max.   :24.00  
 us2000c_race1   us2000c_marstat  us2000c_educ    us2000c_inctot    
 Min.   :1.000   Min.   :1.000   Min.   : 0.000   Length:28172      
 1st Qu.:1.000   1st Qu.:1.000   1st Qu.: 4.000   Class :character  
 Median :1.000   Median :3.000   Median : 9.000   Mode  :character  
 Mean   :1.935   Mean   :2.973   Mean   : 7.871                     
 3rd Qu.:1.000   3rd Qu.:5.000   3rd Qu.:11.000                     
 Max.   :9.000   Max.   :5.000   Max.   :16.000                     
str(acs)
'data.frame':   28172 obs. of  16 variables:
 $ year            : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
 $ datanum         : int  4 4 4 4 4 4 4 4 4 4 ...
 $ serial          : int  37 37 37 241 242 296 377 418 465 465 ...
 $ hhwt            : int  100 100 100 100 100 100 100 100 100 100 ...
 $ gq              : chr  "Households under 1970 definition" "Households under 1970 definition" "Households under 1970 definition" "Households under 1970 definition" ...
 $ us2000c_serialno: int  365663 365663 365663 2894822 2896802 3608029 4720742 5176658 5856346 5856346 ...
 $ pernum          : int  1 2 3 1 1 1 1 1 1 2 ...
 $ perwt           : int  100 100 100 100 100 100 100 100 100 100 ...
 $ us2000c_pnum    : int  1 2 3 1 1 1 1 1 1 2 ...
 $ us2000c_sex     : int  2 2 2 2 2 2 2 2 1 2 ...
 $ us2000c_age     : int  20 19 19 50 29 20 69 59 55 47 ...
 $ us2000c_hispan  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ us2000c_race1   : int  1 1 2 1 1 6 1 1 2 2 ...
 $ us2000c_marstat : int  5 5 5 5 5 5 5 2 4 5 ...
 $ us2000c_educ    : int  11 11 11 14 13 9 1 8 12 1 ...
 $ us2000c_inctot  : chr  "10000" "5300" "4700" "32500" ...

We can learn a few things about the data: - it seems like year, hhwt, and perwt are always the same values (note that the summary statistics are all a single number) - several variables, such as us2000c_sex and us2000c_race1, are integers but their names suggest categorical variables - gq and us2000c_inctot are character vectors, while all others are numeric - us2000c_age has a maximum value of 933, which seems implausibly high

To check that year and the other variables are always the same, use the table() function to reveal that the value 2000 occurs 28172 times.

table(acs$year)

 2000 
28172 

Alternatively, the unique() function in conjunction with the length() function to find the number of unique values in a vector. One unique value means that every value is identical.

unique(acs$year) # 2000 is the only value
[1] 2000
acs$year %>% unique() %>% length() # 1 unique value
[1] 1

We may choose to drop these columns later on, and this can be done with subset(acs, select = -year) or select(acs, -year) (see the chapters on Subsetting).

11.3 Renaming Variables

11.3.1 Rename Individual Columns

The rename() function in dplyr allows for easy renaming of individual columns. The pattern is new_name = old_name. We can change pernum to “person” and serial to “household”.

acs <- 
  acs %>% 
  rename(person = pernum, 
         household = serial)

11.3.2 Rename Multiple Columns

Several columns have the prefix “us2000c_”, which is a bit redundant since the data is all from the US and from the year 2000. Instead of renaming them one-by-one, we can rename several columns at once according to a pattern.

A useful function here is sub() which allows us to substitute one character string for another. In this case, we can substitute “us2000c_” for ""(nothing), effectively deleting the prefix.

sub() contains three arguments. It looks through the third and replaces the first with the second. Here, go through the column names of acs and replace the first occurrence of “us2000c_” with "".

colnames(acs)
 [1] "year"             "datanum"          "household"        "hhwt"             "gq"              
 [6] "us2000c_serialno" "person"           "perwt"            "us2000c_pnum"     "us2000c_sex"     
[11] "us2000c_age"      "us2000c_hispan"   "us2000c_race1"    "us2000c_marstat"  "us2000c_educ"    
[16] "us2000c_inctot"  
colnames(acs) <- sub("us2000c_", "", colnames(acs))

colnames(acs)
 [1] "year"      "datanum"   "household" "hhwt"      "gq"        "serialno"  "person"    "perwt"    
 [9] "pnum"      "sex"       "age"       "hispan"    "race1"     "marstat"   "educ"      "inctot"   

11.4 Creating Variables

mutate() is a variable creation and replacement function. If a new variable name is supplied, a new variable is created. If an existing variable name is supplied, that variable is replaced without any warning.

11.4.0.1 Numeric

One way we can create a numeric variable is by multiplying a single existing column by a constant. Multiplying education, assumed to be in years, by 12 results in education in months (assuming year-round school!). The variable educ_months does not currently exist in acs, so a new variable is created.

acs <-
  acs %>% 
  mutate(educ_months = educ * 12)

Variables can also be created from multiple existing columns, through addition, multiplication, averages, minimums, or any other number and combination of functions.

acs <-
  acs %>% 
  mutate(nonsense1 = educ * marstat,
         nonsense2 = (educ + marstat) ^ person,
         nonsense3 = ifelse(educ %% 36 == 0, marstat, person))

11.4.1 Character

Currently, the identifier is spread out across two variables: household and person. We can put these two together with paste() so that we have a single variable that uniquely identifies observations.

acs <-
  acs %>% 
  mutate(id = paste(household, person, sep = "_"))

If we are working with multiple datasets which we plan to join together, it might be good to add a note identifying the source of the data. We can supply a single value to mutate(), and this is recycled down the column. We can also supply a character vector with length greater than one (c("odd", "even")), one or more numbers to be repeated down a column (1:2, 2000), or even today’s date (Sys.Date()).

Recycling a vector with length greater than one requires the use of rep() with the argument len = nrow(.), where . is shorthand for the object piped into the current function. In other words, odd_even = rep(c("odd", "even"), len = nrow(.)) means, “Create (or replace) a column called ‘odd_even’ that repeats the vector c("odd", "even") until it reaches a length equal to the number of rows in our acs dataframe.”

acs <-
  acs %>% 
  mutate(data_source = "acs",
         odd_even = rep(c("odd", "even"), len = nrow(.)),
         one_two = rep(1:2, len = nrow(.)),
         year_again = 2000,
         last_edited = Sys.Date())

11.4.1.1 Categorical

Dummy coded variables can be created from other variables with two or more values. Dummy coded variables have one where the variable has a certain value, and a zero for all other values.

The sex column is 1s and 2s. Let’s assume these correspond to male and female, respectively. We can create a new column called female that contains 0 for male and 1 for female. One way we might do this is by simply subtracting 1 from the sex column, but this only works in cases like this where we have two levels. A slightly longer but more adaptable way is to use a logical test. Check if sex %in% 2, and then convert this to numeric. Remember: TRUE is equal to 1, and FALSE is equal to 0.

acs <-
  acs %>% 
  mutate(female = as.numeric(sex %in% 2))

acs <-
  acs %>% 
  mutate(female = ifelse(sex %in% 2, 1, 0))

We can also recode the gq column. Let’s look at the values this variable can take.

table(acs$gq)

Additional households under 1990 definition                Group quarters--Institutions 
                                         71                                         406 
           Households under 1970 definition                        Other group quarters 
                                      27339                                         356 

Perhaps we do not need all of these categories. Maybe we only care if the individual lives in a household under the 1970 definition. We can adapt the above code to create a new column:

acs <-
  acs %>% 
  mutate(households1970 = as.numeric(gq %in% "Households under 1970 definition"))

Unlike the discrete variables of sex and gq, age is a continuous variable, but we can adapt the above code to dummy code it. Create a dichotomous variable of whether an individual is an adult (age is 18 or more).

acs <-
  acs %>% 
  mutate(adult = as.numeric(age >= 18))

A multiple-level categorical variable can be created with case_when(). Each argument within case_when() is condition ~ value, where if a condition is met, a value is assigned. Anything not covered by any of the conditions we provide is assigned a value of NA.

We might want to turn a continuous variable like educ into distinct categories of less than high school (educ < 12), high school (educ == 12), some college (educ > 12 & educ < 16), and college graduate (educ >= 16). We can assign these the values 0-3.

acs <-
  acs %>% 
  mutate(educ_categories = case_when(educ <  12 ~ 0, 
                                     educ == 12 ~ 1, 
                                     educ >  12 & educ < 16 ~ 2, 
                                     educ >= 16 ~ 3))

We can imagine using this educ_categories variable as a predictor in a statistical model or for creating histograms of income by educational attainment.

11.5 Changing Values

Instead of acting at the whole-variable level, we can also use mutate() to change some values within a variable.

11.5.1 Change Values to Missing

In our exploration of the data, recall that inctot is a character vector, but the first few values shown by str() appear to be numbers.

Open the data set with View(acs) to see why it is a character vector. Some of the values are BBBBBBB. The Census uses this code for missing data. We can recode the B’s as missing values with the na_if() function while leaving the other values as they are.

At this step, the column is still a character vector, so we need to convert it into a numeric vector.

acs <-
  acs %>% 
  mutate(inctot = na_if(inctot, "BBBBBBB"))

At this step, the column is still a character vector, so we need to convert it into a numeric vector.

class(acs$inctot)
[1] "character"
acs <-
  acs %>% 
  mutate(inctot = as.numeric(inctot))

Another approach we could take if we knew our missing code in advance is to specify this when reading in the data. If you take this approach, you will need to re-run the above code of renaming columns in order to follow along for the remainder of this chapter.

acs <- read.csv("2000_acs_sample.csv", na.strings = "BBBBBBB")

11.5.1.1 Quantify Missing Data

We should now check how much data is missing from the dataframe.

We can calculate how much data is missing from acs as a whole. To do so, first use is.na() as a test of whether the data is missing. This will turn the entire dataframe into TRUE and FALSE values, where TRUE means the data is missing. Then, take the sum or the mean. In doing so, TRUE and FALSE will be coerced into 1 and 0, respectively.

acs %>% is.na() %>% sum()
[1] 6157
acs %>% is.na() %>% mean()
[1] 0.007285011

A total of 6157 values are missing, 0.7% of our dataset.

To calculate missingness by individual columns, first turn the dataframe into logical values with is.na() as above, and then take column sums or means.

acs %>% is.na() %>% colSums()
           year         datanum       household            hhwt              gq        serialno 
              0               0               0               0               0               0 
         person           perwt            pnum             sex             age          hispan 
              0               0               0               0               0               0 
          race1         marstat            educ          inctot     educ_months       nonsense1 
              0               0               0            6157               0               0 
      nonsense2       nonsense3              id     data_source        odd_even         one_two 
              0               0               0               0               0               0 
     year_again     last_edited          female  households1970           adult educ_categories 
              0               0               0               0               0               0 
acs %>% is.na() %>% colMeans()
           year         datanum       household            hhwt              gq        serialno 
      0.0000000       0.0000000       0.0000000       0.0000000       0.0000000       0.0000000 
         person           perwt            pnum             sex             age          hispan 
      0.0000000       0.0000000       0.0000000       0.0000000       0.0000000       0.0000000 
          race1         marstat            educ          inctot     educ_months       nonsense1 
      0.0000000       0.0000000       0.0000000       0.2185503       0.0000000       0.0000000 
      nonsense2       nonsense3              id     data_source        odd_even         one_two 
      0.0000000       0.0000000       0.0000000       0.0000000       0.0000000       0.0000000 
     year_again     last_edited          female  households1970           adult educ_categories 
      0.0000000       0.0000000       0.0000000       0.0000000       0.0000000       0.0000000 

We now see that all of the missing values are in inctot, and 21.9% of this variable is missing.

If you work with missing data and imputation, be sure to check out this excellent resource by Stef van Buren on using the mice package.

11.5.2 Change Values to Other Values

Earlier we saw that age had a maximum value of 933. If we assume this variable is in years, this value seems way too high. Use table() to see the values age takes.

table(acs$age)

  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24 
371 391 364 423 390 388 376 392 426 450 444 403 461 437 441 396 400 367 375 407 415 362 363 412 345 
 25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49 
352 376 378 374 397 441 397 373 402 387 440 466 454 462 435 466 473 454 421 431 410 440 425 395 363 
 50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74 
361 365 357 351 285 279 275 288 266 225 234 226 227 230 236 185 195 187 182 203 175 187 188 186 183 
 75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  92  93 933 
155 150 138 141 121 118 109 109  98  90  74  71  67  54  39   1 149   1 

Only one observation has a value of 933, and the next highest value is 93. We could take at least three approaches to deal with this number. We might think it should be 93 and whoever entered the data made a typo, we could change this value to missing because we are not certain what the value should have been, or we could drop this case altogether.

(Note: The original ACS dataset did not have this value of 933 for age. The value was intentionally edited from 93 to 933 for this exercise.)

If we take the third approach, we can drop this particular row by taking a subset, either with subset(acs, subset = !age %in% 933) or filter(acs, !age %in% 933) (see chapters on Subsetting for more). If we want to make it missing, we can use the na_if() approach above. If we think, rather, if we know, it should be 93, we can use mutate() in conjunction with ifelse().

The arguments of ifelse() are condition, value if TRUE, and value if FALSE. The code below checks if a value of age is 933. If it is, it changes it to 93. If not, it uses the value of that observation from age.

acs <- 
  acs %>% 
  mutate(age = ifelse(age %in% 933, 93, age))

11.6 Save Your Dataframe and Script

Now that we have cleaned up the ACS data set, it is a good idea to end the script by saving the cleaned data set.

write.csv(acs, "2000c_acs_cleaned.csv", row.names = F)

By saving the resulting data set, you can now begin the next script (“02_…”) with read.csv("2000c_acs_cleaned.csv"). This first script is your record of how you made changes to the raw data. It serves as a record to future-you, to remind you of what you did, and to colleagues and journal reviewers who have questions.

11.7 Exercises

  1. Start a script that loads dplyr and the sleep dataset.

  2. Read the documentation at help(sleep).

  3. Examine the data. What type is each column? How are the data distributed? Is any data missing?

  4. Add a new column that says “One” if group is 1, and “Two” if group is 2.

  5. Replace extra with NA if it is below zero.

  6. Multiply extra by 60 so that it is minutes rather than hours.

  7. Change the name of extra to extra_minutes.

  8. Save the dataset as a csv file, and save your script.