2.6 Column operations

SSCC - Social Science Computing Cooperative

Supporting Statistical Analysis for Research

The two most common column operations are renaming columns with rename() and selecting columns with select(). The select() function has a number of helper functions that make it easier to select a set of columns, such as starts_with(), ends_with(), contains(), and everything(), as well as the slice operator, :, which selects a range of adjacent columns.
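As a rough illustration of the helpers named above, the following sketch applies each one to a small made-up tibble. The tibble toy and its values are hypothetical and are not the cps data; only the column names imitate the renamed cps variables.

```r
library(dplyr)

# Hypothetical toy data (not the cps data); values are invented for illustration.
toy <- tibble(
  id = 1:3,
  age = c(30, 40, 50),
  educ = c(10, 12, 16),
  real_earn_74 = c(100, 200, 300),
  real_earn_75 = c(110, 210, 310),
  real_earn_78 = c(120, 220, 320)
)

# Columns whose names begin with a prefix
by_prefix <- toy %>% select(starts_with("real_earn"))

# Columns whose names end with a suffix
by_suffix <- toy %>% select(ends_with("78"))

# Columns whose names contain a substring anywhere
by_substring <- toy %>% select(contains("earn"))

# A range of adjacent columns, from id through educ
by_range <- toy %>% select(id:educ)

names(by_prefix)
names(by_suffix)
names(by_substring)
names(by_range)
```

The helpers match on column names only, so they keep working when rows are added or values change.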

2.6.1 Examples

Renaming the variables of the cps data. Each argument to rename() has the form new_name = old_name.

cps_in <- 
  cps_in %>%
  rename(
    id = X1,
    no_deg = nodeg,
    real_earn_74 = re74,
    real_earn_75 = re75,
    real_earn_78 = re78
    )
cps <-
  cps_in

head(cps, 3)

# A tibble: 3 x 11
     id   trt   age  educ black  hisp  marr no_deg real_earn_74
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>        <dbl>
1     1     0    45    11     0     0     1      1       21517.
2     2     0    21    14     0     0     0      0        3176.
3     3     0    38    12     0     0     1      0       23039.
# ... with 2 more variables: real_earn_75 <dbl>, real_earn_78 <dbl>

The variable names above are in snake case: the words in each name are separated by an underscore, _.

Reordering the variables of the cps data.

We will make the first two columns of the tibble the id and age variables.

cps <-
  cps %>%
  select(id, age, everything())

head(cps, 3)

# A tibble: 3 x 11
     id   age   trt  educ black  hisp  marr no_deg real_earn_74
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>        <dbl>
1     1    45     0    11     0     0     1      1       21517.
2     2    21     0    14     0     0     0      0        3176.
3     3    38     0    12     0     0     1      0       23039.
# ... with 2 more variables: real_earn_75 <dbl>, real_earn_78 <dbl>

The everything() function fills in the names of the other variables that were not listed.
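The reordering idiom can be checked on a small made-up tibble. This is only a sketch: toy is hypothetical, and the comparison with relocate() assumes dplyr 1.0.0 or later, where that function is available.

```r
library(dplyr)

# Hypothetical toy data (not the cps data).
toy <- tibble(
  id = 1:3,
  trt = c(0, 0, 1),
  age = c(30, 40, 50),
  educ = c(10, 12, 16)
)

# List the leading columns, then let everything() supply the rest.
reordered <- toy %>% select(id, age, everything())
names(reordered)

# Assumption: dplyr >= 1.0.0. relocate() moves the named columns
# to the front by default and leaves the others in place.
moved <- toy %>% relocate(id, age)
identical(names(reordered), names(moved))
```

Columns already named before everything() are not duplicated; they simply keep the position where they were first listed.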

Selecting variables (subsetting)

We will select all the variables except the real_earn_78 variable using inclusion.

cps_part1 <-
  cps %>%
  select(id, age, trt, educ, black, hisp, marr, no_deg, real_earn_74, real_earn_75)

head(cps_part1, 3)

# A tibble: 3 x 10
     id   age   trt  educ black  hisp  marr no_deg real_earn_74
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>        <dbl>
1     1    45     0    11     0     0     1      1       21517.
2     2    21     0    14     0     0     0      0        3176.
3     3    38     0    12     0     0     1      0       23039.
# ... with 1 more variable: real_earn_75 <dbl>

We will select all the variables except the real_earn_74 and real_earn_75 variables using exclusion. A minus sign in front of a column name drops that column and keeps the rest.

cps_78 <-
  cps %>%
  select(-real_earn_74, -real_earn_75)

head(cps_78, 3)

# A tibble: 3 x 9
     id   age   trt  educ black  hisp  marr no_deg real_earn_78
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>        <dbl>
1     1    45     0    11     0     0     1      1       25565.
2     2    21     0    14     0     0     0      0       13496.
3     3    38     0    12     0     0     1      0       25565.
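Exclusion can also be written more compactly. The sketch below, on a hypothetical tibble toy rather than the cps data, shows two alternatives to listing each column with its own minus sign: negating a c() of names, and negating a selection helper.

```r
library(dplyr)

# Hypothetical toy data (not the cps data).
toy <- tibble(
  id = 1:3,
  age = c(30, 40, 50),
  real_earn_74 = c(100, 200, 300),
  real_earn_75 = c(110, 210, 310),
  real_earn_78 = c(120, 220, 320)
)

# Drop several columns at once by negating a vector of names.
drop_listed <- toy %>% select(-c(real_earn_74, real_earn_75))

# Drop every column matched by a helper.
drop_prefix <- toy %>% select(-starts_with("real_earn"))

names(drop_listed)
names(drop_prefix)
```

Negating a helper is convenient when the columns to drop share a naming pattern, but it removes every match, so it would not reproduce the cps_78 result, which keeps real_earn_78.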

Extracting a variable from a tibble.

Subsetting to a single column of a tibble with select() results in a one-column tibble, not a vector.

cps %>%
  select(age) %>%
  str()

Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame':    15992 obs. of  1 variable:
 $ age: num  45 21 38 48 18 22 48 18 48 45 ...
 - attr(*, "spec")=
  .. cols(
  ..   X1 = col_double(),
  ..   trt = col_double(),
  ..   age = col_double(),
  ..   educ = col_double(),
  ..   black = col_double(),
  ..   hisp = col_double(),
  ..   marr = col_double(),
  ..   nodeg = col_double(),
  ..   re74 = col_double(),
  ..   re75 = col_double(),
  ..   re78 = col_double()
  .. )

The pull() function is used to get a column from a tibble as a vector.

cps %>%
  pull(age) %>%
  str()

 num [1:15992] 45 21 38 48 18 22 48 18 48 45 ...
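The difference between select() and pull() can be confirmed directly. The following is a minimal sketch on a hypothetical tibble toy, not the cps data.

```r
library(dplyr)

# Hypothetical toy data (not the cps data).
toy <- tibble(
  id = 1:3,
  age = c(30, 40, 50)
)

# select() keeps the tibble structure, even for a single column.
age_tbl <- toy %>% select(age)
is.data.frame(age_tbl)

# pull() returns the column itself as a plain vector.
age_vec <- toy %>% pull(age)
is.data.frame(age_vec)
is.numeric(age_vec)

# pull() gives the same result as base R extraction with [[ ]].
identical(age_vec, toy[["age"]])
```

A vector is what functions such as mean() expect, so pull() is the natural last step of a pipeline that feeds a single column to such a function.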