2 Tidyverse

2.1 Overview

This chapter provides an introduction to the tidyverse, a set of R packages. The functions of the tidyverse work together to create a structured approach to wrangling data.

2.2 Tidyverse packages

The core set of tidyverse packages is attached when the tidyverse package is loaded. Other tidyverse packages, such as lubridate, readxl, and broom, need to be loaded and attached individually. (As of tidyverse 2.0.0, lubridate is part of the core set.)

2.2.1 Example

  1. The tidyverse is loaded and attached as follows.

    library(tidyverse)

2.3 The pipe operator

The pipe operator, %>%, passes an object to a function as the function's first argument. The function call,

    function_name(data_object, other_parameters)

becomes,

    data_object %>% function_name(other_parameters)

with the pipe operator.

The pipe operator reduces the coding load of saving intermediate results that are only referenced in the next line of code. Having fewer intermediate results to manage can make your code easier to read.
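As a minimal sketch of this equivalence, the two calls below compute the same value. The vector x is made up for illustration, and magrittr is attached only so the snippet runs on its own; library(tidyverse) also makes %>% available.

```r
library(magrittr)  # provides %>%; library(tidyverse) also makes it available

x <- c(3.14159, 2.71828, 1.41421)  # made-up values for illustration

nested <- round(x, digits = 2)      # ordinary function call
piped  <- x %>% round(digits = 2)   # the same call written with the pipe

identical(nested, piped)            # both forms give the same result
```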

2.3.1 Examples

  1. Base R

    The following code creates a vector of 15 numeric values. This vector is then rounded to two decimal places, sorted in descending order, and then head() displays a few of the largest values.

    set.seed(749875)
    number_data <- runif(n = 15, min = 0, max = 1000)
    
    head(sort(round(number_data, digits = 2), decreasing = TRUE))
    [1] 997.62 813.26 797.96 733.98 732.67 675.45

    To read the above base R code, one reads from the innermost parentheses to the outermost. This nesting of functions can make reading base R code challenging.

    Another base R approach that avoids deeply nesting functions is to save the intermediate results. The intermediate results are then used in the next function as a separate command.

    number_round <- round(number_data, digits = 2)
    number_sort <- sort(number_round, decreasing = TRUE)
    head(number_sort)
    [1] 997.62 813.26 797.96 733.98 732.67 675.45

    This is also a more natural order of the functions. It does require the intermediate results to be saved. These intermediate results are only used by the function on the next line.

  2. Using the pipe operator

    The pipe operator allows the order of the data and functions to more closely match the order in which they are evaluated, without needing to save the intermediate results.

    number_data %>%
      round(digits = 2) %>%
      sort(decreasing = TRUE) %>%
      head()
    [1] 997.62 813.26 797.96 733.98 732.67 675.45

    This coding style places the most important information, what is being operated on and which operations are being done, on the left side of the page. The details of each operation are found further to the right. Code written this way is generally considered easier to read.

2.4 Importing csv files and parsers

The tidyverse function to read a csv file is read_csv(). The following are a few important parameters of read_csv().

  • file, the path to the file to be imported.

  • col_names, setting this to FALSE indicates the first row does not contain variable names.

  • col_types, setting this to cols() uses guessed types for the columns. Alternatively, the parameters of cols() can be used to define the type of each column.

  • na, a character vector of strings that indicate missing data.

  • guess_max, specifies the number of rows to consider before making a guess of what type the columns are. The default value of 1000 works well on most csv files.

  • skip, number of lines at the front of the file to be ignored. This is used when a csv file contains metadata at the beginning of the file.

The read_*() functions of the tidyverse use a common set of parsers. These parsers are used to format data such as numbers, factors, dates, and times. The parsers can also be called directly to parse a column. The parse_factor() function will be demonstrated in the Modifying variables section below.
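The parameters above can be sketched on a small file. The file contents, the column names, and the "unknown" missing-data marker below are all made up for illustration; they are not part of the cps data used in this chapter.

```r
library(readr)

# Write a small made-up csv file with two metadata lines at the top.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "made-up survey extract",
    "values are for illustration only",
    "id,age,group",
    "1,45,a",
    "2,unknown,b",
    "3,38,a"
  ),
  csv_path
)

survey <- read_csv(
  csv_path,
  skip = 2,                      # ignore the two metadata lines
  na = c("", "NA", "unknown"),   # strings treated as missing data
  col_types = cols(              # set the column types instead of guessing
    id = col_integer(),
    age = col_double(),
    group = col_character()
  )
)

# parse_factor() is one of the parsers that can be called directly.
survey$group <- parse_factor(survey$group)

survey
```

Setting col_types explicitly also suppresses the message about guessed column types, which keeps scripts quiet when the file layout is known in advance.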

2.4.1 Examples

  1. Importing a csv file

    cps_in <- read_csv(file.path("datasets", "cps1.csv"), col_types = cols())
    Warning: Missing column names filled in: 'X1' [1]

    Note, one of the columns did not have a name, so the read_csv() function gave it one.

  2. The head() function returns the beginning values of an object

    head(cps_in, 3)    
    # A tibble: 3 x 11
         X1   trt   age  educ black  hisp  marr nodeg   re74   re75   re78
      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>
    1     1     0    45    11     0     0     1     1 21517. 25244. 25565.
    2     2     0    21    14     0     0     0     0  3176.  5853. 13496.
    3     3     0    38    12     0     0     1     0 23039. 25131. 25565.
  3. The glimpse() function displays the column types and the first few values of each column of a data frame.

    glimpse(cps_in)
    Rows: 15,992
    Columns: 11
    $ X1    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, ~
    $ trt   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
    $ age   <dbl> 45, 21, 38, 48, 18, 22, 48, 18, 48, 45, 34, 16, 53, 19, 27, 32, 27, 46, 24, 22,~
    $ educ  <dbl> 11, 14, 12, 6, 8, 11, 10, 11, 9, 12, 14, 10, 10, 12, 12, 10, 12, 7, 13, 13, 10,~
    $ black <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0~
    $ hisp  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
    $ marr  <dbl> 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0~
    $ nodeg <dbl> 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0~
    $ re74  <dbl> 21516.6700, 3175.9710, 23039.0200, 24994.3700, 1669.2950, 16365.7600, 16804.630~
    $ re75  <dbl> 25243.550, 5852.565, 25130.760, 25243.550, 10727.610, 18449.270, 16354.600, 362~
    $ re78  <dbl> 25564.670, 13496.080, 25564.670, 25564.670, 9860.869, 25564.670, 18059.300, 157~

2.5 Tibble structure

The tidyverse object type for a data frame is a tibble. A tibble contains a set of columns (variables) that hold the values of the data. The variables of a tibble are identified by column names.

The variables that make up a data frame can be thought of as a list. See the figure below. This list of variables is the column index of the data frame. The names of the columns in the figure are name 1, name 2, and name k. The names for the columns that may exist between name 2 and name k are not given in the figure. Using the column names is the most common, and preferred, method for indexing variables. The named index values of the column index are also called the column names.

Figure 2.1: Column index of a data frame

The figure above shows that the variables of a data frame are only connected by the index. An observation (row) is made up of the values in the same row of each of the variables that make up the tibble.
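The list-of-columns view can be checked directly in R. This is a small sketch on a made-up tibble; the names people, name, age, and smoker are illustrative only.

```r
library(tibble)

# A made-up tibble with three variables.
people <- tibble(
  name   = c("Ana", "Ben", "Cy"),
  age    = c(31, 45, 28),
  smoker = c(FALSE, TRUE, FALSE)
)

names(people)     # the column names, that is, the named column index
is.list(people)   # a tibble is built on a list of columns
people[["age"]]   # indexing by name returns the variable as a vector
people[2, ]       # an observation is the same row of every variable
```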

2.6 Column operations

The two common column operations are renaming columns, rename(), and selecting columns, select(). The select() function has a number of helpers that make it easier to select a set of columns, such as starts_with(), ends_with(), contains(), everything(), and the : operator, which selects a range of adjacent columns.
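A short sketch of these helpers follows. The earnings tibble is made up for illustration, with column names that imitate the style used later in this chapter.

```r
library(dplyr)
library(tibble)

# Made-up tibble for illustration.
earnings <- tibble(
  id = 1:3,
  age = c(45, 21, 38),
  real_earn_74 = c(21517, 3176, 23039),
  real_earn_75 = c(25244, 5853, 25131),
  real_earn_78 = c(25565, 13496, 25565)
)

by_prefix <- earnings %>% select(starts_with("real_earn"))      # names that begin with a prefix
by_suffix <- earnings %>% select(ends_with("78"))               # names that end with a suffix
by_string <- earnings %>% select(contains("earn"))              # names that contain a string
by_range  <- earnings %>% select(id, real_earn_74:real_earn_78) # id plus a range of adjacent columns
renamed   <- earnings %>% rename(earn_1978 = real_earn_78)      # new_name = old_name

names(by_range)
```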

2.6.1 Examples

  1. Renaming the variables of the cps data.

    cps_in <- 
      cps_in %>%
      rename(
        id = X1,
        no_deg = nodeg,
        real_earn_74 = re74,
        real_earn_75 = re75,
        real_earn_78 = re78
        )
    cps <-
      cps_in
    
    head(cps, 3)    
    # A tibble: 3 x 11
         id   trt   age  educ black  hisp  marr no_deg real_earn_74 real_earn_75 real_earn_78
      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>        <dbl>        <dbl>        <dbl>
    1     1     0    45    11     0     0     1      1       21517.       25244.       25565.
    2     2     0    21    14     0     0     0      0        3176.        5853.       13496.
    3     3     0    38    12     0     0     1      0       23039.       25131.       25565.

    The variable names above use snake case: the words are separated by underscores, _.

  2. Reordering the variables of the cps data.

    We will make the first two columns of the tibble the id and age variables.

    cps <-
      cps %>%
      select(id, age, everything())
    
    head(cps, 3)    
    # A tibble: 3 x 11
         id   age   trt  educ black  hisp  marr no_deg real_earn_74 real_earn_75 real_earn_78
      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>        <dbl>        <dbl>        <dbl>
    1     1    45     0    11     0     0     1      1       21517.       25244.       25565.
    2     2    21     0    14     0     0     0      0        3176.        5853.       13496.
    3     3    38     0    12     0     0     1      0       23039.       25131.       25565.

    The everything() function fills in the names of the other variables that were not listed.

  3. Selecting variables (subsetting)

    We will select all the variables except the real_earn_78 variable using inclusion.

    cps_part1 <-
      cps %>%
      select(id, age, trt, educ, black, hisp, marr, no_deg, real_earn_74, real_earn_75)
    
    head(cps_part1, 3)
    # A tibble: 3 x 10
         id   age   trt  educ black  hisp  marr no_deg real_earn_74 real_earn_75
      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>        <dbl>        <dbl>
    1     1    45     0    11     0     0     1      1       21517.       25244.
    2     2    21     0    14     0     0     0      0        3176.        5853.
    3     3    38     0    12     0     0     1      0       23039.       25131.

    We will select all the variables except the real_earn_74 and real_earn_75 variables using exclusion.

    cps_78 <-
      cps %>%
      select(-real_earn_74, -real_earn_75)
    
    head(cps_78, 3)    
    # A tibble: 3 x 9
         id   age   trt  educ black  hisp  marr no_deg real_earn_78
      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>        <dbl>
    1     1    45     0    11     0     0     1      1       25565.
    2     2    21     0    14     0     0     0      0       13496.
    3     3    38     0    12     0     0     1      0       25565.
  4. Extracting a variable from a tibble.

    Subsetting to a single column of a tibble results in a one-column tibble.

    cps %>%
      select(age) %>%
      str()
    tibble[,1] [15,992 x 1] (S3: tbl_df/tbl/data.frame)
     $ age: num [1:15992] 45 21 38 48 18 22 48 18 48 45 ...

    The pull() function is used to get a column from a tibble as a vector.

    cps %>%
      pull(age) %>%
      str()
     num [1:15992] 45 21 38 48 18 22 48 18 48 45 ...

2.7 Modifying variables

The mutate() function is used to modify an existing variable and to add new variables to a tibble. The name of the variable to be changed or created is given as the parameter name, and the parameter value is a column of values.

There are also mutate_at() and mutate_all() functions that can modify multiple variables. These are handy when the same function needs to be applied to more than one variable. (In dplyr 1.0.0 and later these are superseded by across(), although they still work.)
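As a hedged sketch of applying one function to several variables, the code below divides two made-up earnings columns by 1000, first with mutate_at() and then with across(), which requires dplyr 1.0.0 or later. The earnings tibble is illustrative only.

```r
library(dplyr)
library(tibble)

# Made-up tibble for illustration.
earnings <- tibble(
  id = 1:3,
  real_earn_74 = c(21517, 3176, 23039),
  real_earn_75 = c(25244, 5853, 25131)
)

# Apply the same function to several variables with mutate_at().
in_thousands <-
  earnings %>%
  mutate_at(vars(real_earn_74, real_earn_75), ~ .x / 1000)

# The same change written with mutate() and across().
in_thousands_across <-
  earnings %>%
  mutate(across(c(real_earn_74, real_earn_75), ~ .x / 1000))

all.equal(in_thousands, in_thousands_across)
```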

2.7.1 Example

  1. Create a single variable for ethnicity.

    This example assumes that anyone who was not identified as black or hisp is of white non-Hispanic ethnicity. This is done to demonstrate tidyverse functionality and may not be a good assumption for a researcher.

    We create an ethnicity variable of type character. Then we parse it to type factor.

    cps <-
      cps %>%
      mutate(
        ethnicity = ifelse(black == 1, "black",
                      ifelse(hisp == 1, "hisp",
                             "white_non_hisp"
                              )
                    ),
        ethnicity = parse_factor(ethnicity)
      ) %>%
      select(-black, -hisp) %>%
      select(id, age, ethnicity, everything())
    
    head(cps, 3)    
    # A tibble: 3 x 10
         id   age ethnicity        trt  educ  marr no_deg real_earn_74 real_earn_75 real_earn_78
      <dbl> <dbl> <fct>          <dbl> <dbl> <dbl>  <dbl>        <dbl>        <dbl>        <dbl>
    1     1    45 white_non_hisp     0    11     1      1       21517.       25244.       25565.
    2     2    21 white_non_hisp     0    14     0      0        3176.        5853.       13496.
    3     3    38 white_non_hisp     0    12     1      0       23039.       25131.       25565.

    Note, variables created or changed in mutate() can be used in the same mutate() function, as was demonstrated here.

2.8 Subsetting rows

The tidyverse provides several functions to select rows from a tibble.

  • filter() selects rows using a Boolean condition.

  • sample_n() and sample_frac() take a random sample of the rows.

  • slice() selects rows by numeric position.
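A small sketch of these row-selection functions on a made-up tibble; the scores data and the seed are arbitrary choices for illustration.

```r
library(dplyr)
library(tibble)

# Made-up tibble with ten rows.
scores <- tibble(id = 1:10, score = c(5, 8, 2, 9, 7, 3, 6, 1, 4, 10))

high   <- scores %>% filter(score > 6)   # rows where the condition is TRUE
first3 <- scores %>% slice(1:3)          # rows 1 to 3 by position

set.seed(1)                              # arbitrary seed, for repeatable samples
five   <- scores %>% sample_n(5)         # 5 randomly chosen rows
fifth  <- scores %>% sample_frac(0.2)    # a random 20% of the rows

c(nrow(high), nrow(first3), nrow(five), nrow(fifth))
```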

2.8.1 Examples

  1. Create test and training data frames using filter().

    set.seed(145705)
    
    cps <-
      cps %>%
      mutate(
        split = ifelse(runif(n()) > .75, "test", "train")
      )
    cps_train <-
      cps %>%
      filter(split == "train")
    cps_test <-
      cps %>%
      filter(split == "test")
    
    dim(cps)    
    [1] 15992    11
    dim(cps_train)    
    [1] 11902    11
    dim(cps_test)    
    [1] 4090   11
  2. Create test and training data frames using slice().

    set.seed(145705)
    test_indx <- which(runif(nrow(cps)) > .75)
    train_ind <- setdiff(1:nrow(cps), test_indx)
    
    cps_train <-
      cps %>%
      slice(train_ind)
    cps_test <-
      cps %>%
      slice(test_indx)
    
    dim(cps)    
    [1] 15992    11
    dim(cps_train)    
    [1] 11902    11
    dim(cps_test)    
    [1] 4090   11

2.9 Grouping

Categorical variables identify groups of observations that have some shared property. The group_by() function creates a grouped tibble. Tidyverse functions can be applied at the group level on this grouped tibble. The ungroup() function returns a grouped tibble to an ungrouped state. The group_modify() function is used to alter the tibble one group at a time. This can be done by collapsing each group to a row of summary statistics or by using functions like head(). The ungroup() function would still be used after group_modify().
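The collapsing use of group_modify() can be sketched as follows. The pay tibble and its dept and salary columns are made up for illustration.

```r
library(dplyr)
library(tibble)

# Made-up tibble with one grouping variable.
pay <- tibble(
  dept   = c("a", "a", "b", "b", "b"),
  salary = c(10, 20, 30, 40, 50)
)

dept_summary <-
  pay %>%
  group_by(dept) %>%
  # .x holds the rows of one group; each group is collapsed to a single row
  group_modify(~ tibble(n_rows = nrow(.x), mean_salary = mean(.x$salary))) %>%
  ungroup()

dept_summary
```

The grouping variable is added back to the result by group_modify(), so the function supplied only needs to return the new columns.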

2.9.1 Examples

  1. Modify a tibble by groups

    We will create a new variable that is the standardized 1978 earnings within each ethnicity.

    cps <-
      cps %>%
      group_by(ethnicity) %>%
      mutate(std_eth_earn_78 = scale(real_earn_78)) %>%
      ungroup()
    
    cps %>%
      select(id, age, ethnicity, std_eth_earn_78, real_earn_78) %>%
      head(3)
    # A tibble: 3 x 5
         id   age ethnicity      std_eth_earn_78[,1] real_earn_78
      <dbl> <dbl> <fct>                        <dbl>        <dbl>
    1     1    45 white_non_hisp               1.07        25565.
    2     2    21 white_non_hisp              -0.178       13496.
    3     3    38 white_non_hisp               1.07        25565.
  2. Display the head() of each group

    We will use group_modify() and head() to see a few rows of each ethnicity group.

    cps %>%
      group_by(ethnicity) %>%
      select(id, age, ethnicity, std_eth_earn_78, real_earn_78) %>%
      group_modify(~ head(.x, 3)) %>%
      ungroup() %>%
      print(n = 10)
    # A tibble: 9 x 5
      ethnicity         id   age std_eth_earn_78[,1] real_earn_78
      <fct>          <dbl> <dbl>               <dbl>        <dbl>
    1 white_non_hisp     1    45               1.07        25565.
    2 white_non_hisp     2    21              -0.178       13496.
    3 white_non_hisp     3    38               1.07        25565.
    4 black             26    26              -0.776        4754.
    5 black             39    17              -1.28            0 
    6 black             53    52               0.688       18438.
    7 hisp              38    39               0.340       16462.
    8 hisp              40    22               0.116       14439.
    9 hisp              41    21              -1.48            0 
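The column header std_eth_earn_78[,1] in the output above appears because scale() returns a one-column matrix rather than a plain vector. If a plain numeric column is preferred, one option is to wrap the call in as.vector(). The sketch below shows this on a made-up tibble; pay, dept, and salary are illustrative names.

```r
library(dplyr)
library(tibble)

# Made-up tibble with two groups of three rows.
pay <- tibble(
  dept   = c("a", "a", "a", "b", "b", "b"),
  salary = c(10, 20, 30, 40, 50, 60)
)

pay_std <-
  pay %>%
  group_by(dept) %>%
  mutate(std_salary = as.vector(scale(salary))) %>%  # drop the matrix structure
  ungroup()

is.matrix(scale(pay$salary))    # scale() on its own returns a matrix
is.matrix(pay_std$std_salary)   # the new column is a plain numeric vector
pay_std$std_salary
```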

2.10 Summarizing data

The summarise() function transforms a tibble by applying functions that produce statistics of the variables.

2.10.1 Examples

  1. Summarize one variable of the cps tibble.

    We will find the mean and standard deviation of age.

    cps %>%
      summarise(
        mean_age = mean(age),
        std_dev_age = sd(age)
      )
    # A tibble: 1 x 2
      mean_age std_dev_age
         <dbl>       <dbl>
    1     33.2        11.0
  2. Summarizing multiple columns

    We will find the mean earnings in years 74, 75, and 78 for each ethnicity.

    cps_eth_earn <-
      cps %>%
      group_by(ethnicity) %>%
      summarise_at(
        vars(real_earn_74:real_earn_78),
        mean
      )
    
    cps_eth_earn
    # A tibble: 3 x 4
      ethnicity      real_earn_74 real_earn_75 real_earn_78
      <fct>                 <dbl>        <dbl>        <dbl>
    1 white_non_hisp       14376.       13999.       15213.
    2 black                11427.       10941.       12007.
    3 hisp                 12402.       12290.       13397.
  3. Summarizing with multiple grouping variables

    We will find the mean earnings in year 78 for each ethnicity and marital status.

    cps_eth_marr_earn <-
      cps %>%
      group_by(ethnicity, marr) %>%
      summarise(
        mean_earn_78 = mean(real_earn_78)
      )
    `summarise()` has grouped output by 'ethnicity'. You can override using the `.groups` argument.
    cps_eth_marr_earn
    # A tibble: 6 x 3
    # Groups:   ethnicity [3]
      ethnicity       marr mean_earn_78
      <fct>          <dbl>        <dbl>
    1 white_non_hisp     0       11319.
    2 white_non_hisp     1       16742.
    3 black              0        9199.
    4 black              1       13728.
    5 hisp               0       10138.
    6 hisp               1       14607.
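The message printed by summarise() in the last example refers to the .groups argument. As a sketch, setting .groups = "drop" returns an ungrouped tibble and silences the message. The pay tibble below is made up for illustration.

```r
library(dplyr)
library(tibble)

# Made-up tibble with two grouping variables.
pay <- tibble(
  dept    = c("a", "a", "b", "b"),
  married = c(0, 1, 0, 1),
  salary  = c(10, 20, 30, 50)
)

pay_means <-
  pay %>%
  group_by(dept, married) %>%
  summarise(mean_salary = mean(salary), .groups = "drop")  # drop all grouping

is_grouped_df(pay_means)
```

Without .groups = "drop", the result stays grouped by the first grouping variable, which is why the output above still shows a Groups line.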

2.11 Reshaping data

The gather() function is used to transform a set of columns into two columns. (This is also called going from wide to long.) One of the new columns is the value column. This is a column that contains all of the values that were in the gathered set of columns. The other is the key, a column that contains the name of the gathered column the value came from. The key column is a categorical variable, with a level for each of the gathered variables.

The spread() function is the opposite of gather(). (This is also called going from long to wide.) It takes two columns and transforms them to a set of variables. The names of the new variables will be taken from the levels of the key variable. The values of the new variables come from the value variable. In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(); the older functions still work and are used in this chapter.
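The round trip between wide and long form can be sketched on a tiny made-up tibble. The pivot_longer() and pivot_wider() calls are shown as an alternative spelling and assume tidyr 1.0.0 or later; the names wide, earn_74, and earn_75 are illustrative only.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Made-up wide tibble: one earnings column per year.
wide <- tibble(
  id      = 1:2,
  earn_74 = c(100, 200),
  earn_75 = c(110, 210)
)

# Wide to long: the key column holds the old column names.
long <- wide %>% gather(key = year, value = earn, earn_74, earn_75)

# Long to wide: the levels of the key become column names again.
wide_again <- long %>% spread(key = year, value = earn)

# The same reshaping with the newer tidyr functions.
# The rows of long_pivot are ordered differently from those of long.
long_pivot <- wide %>%
  pivot_longer(c(earn_74, earn_75), names_to = "year", values_to = "earn")
wide_pivot <- long_pivot %>%
  pivot_wider(names_from = year, values_from = earn)

long
```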

2.11.1 Examples

  1. Converting data to long form

    In this example we will gather the three earnings columns in cps into a year and earnings column.

    The separate() function is used to separate the numeric year part of the string from real and earn in the year variable.

    cps2 <- 
      cps %>%
      gather(
        key = year, 
        value = real_earn,
        real_earn_74,
        real_earn_75,
        real_earn_78
        ) %>%
      separate(year, into = c("X1", "X2", "year"), sep = "_") %>%
      select(-X1, -X2) %>%
      arrange(id, year)
    
    cps2 %>%
      select(id, year,  age, educ, marr, real_earn) %>%
      head()
    # A tibble: 6 x 6
         id year    age  educ  marr real_earn
      <dbl> <chr> <dbl> <dbl> <dbl>     <dbl>
    1     1 74       45    11     1    21517.
    2     1 75       45    11     1    25244.
    3     1 78       45    11     1    25565.
    4     2 74       21    14     0     3176.
    5     2 75       21    14     0     5853.
    6     2 78       21    14     0    13496.

    Note, the arrange() function sorts a tibble on the variables listed as parameters.

  2. Converting data to wide form

    In this example we will spread the mean_earn_78 column in cps_eth_marr_earn into columns for marital status.

    cps_eth_marr_earn <- 
      cps_eth_marr_earn %>%
      spread(
        key = marr, 
        value = mean_earn_78
        )
    
    cps_eth_marr_earn
    # A tibble: 3 x 3
    # Groups:   ethnicity [3]
      ethnicity         `0`    `1`
      <fct>           <dbl>  <dbl>
    1 white_non_hisp 11319. 16742.
    2 black           9199. 13728.
    3 hisp           10138. 14607.

2.12 Combining data sets

The join functions create a new tibble by matching rows from two tibbles. The tibbles are identified as the left side and the right side, also referred to as x and y respectively. The left side tibble is the tibble that is listed first in the parameter list. The left side may be piped into the join function.

The by parameter controls which columns in the two tibbles are used to match the rows of the two tibbles.

The left_join() function adds columns from the right side to the left side. The added columns will be filled with NAs for rows on the left side that are not matched to the right side. Rows in the right side that do not match the left side are not included.

2.12.1 Examples

  1. Using left_join() with all common variables.

    In this example the left join is used with no by parameter. This results in a natural join, a join that is done using all columns that have the same name in the two tibbles.

    The cps_part1 tibble is the left side and cps_78 is the right side.

    cps2 <-
      cps_part1 %>%
      left_join(cps_78)
    Joining, by = c("id", "age", "trt", "educ", "black", "hisp", "marr", "no_deg")
    glimpse(cps2)    
    Rows: 15,992
    Columns: 11
    $ id           <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 2~
    $ age          <dbl> 45, 21, 38, 48, 18, 22, 48, 18, 48, 45, 34, 16, 53, 19, 27, 32, 27, 46, ~
    $ trt          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
    $ educ         <dbl> 11, 14, 12, 6, 8, 11, 10, 11, 9, 12, 14, 10, 10, 12, 12, 10, 12, 7, 13, ~
    $ black        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
    $ hisp         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
    $ marr         <dbl> 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, ~
    $ no_deg       <dbl> 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, ~
    $ real_earn_74 <dbl> 21516.6700, 3175.9710, 23039.0200, 24994.3700, 1669.2950, 16365.7600, 16~
    $ real_earn_75 <dbl> 25243.550, 5852.565, 25130.760, 25243.550, 10727.610, 18449.270, 16354.6~
    $ real_earn_78 <dbl> 25564.670, 13496.080, 25564.670, 25564.670, 9860.869, 25564.670, 18059.3~
  2. Using left_join() specifying the common variables to use for matching rows.

    In this example the by parameter is used to identify the column to be joined on.

    cps_78 <- select(cps_78, id, real_earn_78)
    
    
    cps3 <-
      cps_part1 %>%
      left_join(cps_78, by = c("id"))
    
    glimpse(cps3)    
    Rows: 15,992
    Columns: 11
    $ id           <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 2~
    $ age          <dbl> 45, 21, 38, 48, 18, 22, 48, 18, 48, 45, 34, 16, 53, 19, 27, 32, 27, 46, ~
    $ trt          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
    $ educ         <dbl> 11, 14, 12, 6, 8, 11, 10, 11, 9, 12, 14, 10, 10, 12, 12, 10, 12, 7, 13, ~
    $ black        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
    $ hisp         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
    $ marr         <dbl> 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, ~
    $ no_deg       <dbl> 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, ~
    $ real_earn_74 <dbl> 21516.6700, 3175.9710, 23039.0200, 24994.3700, 1669.2950, 16365.7600, 16~
    $ real_earn_75 <dbl> 25243.550, 5852.565, 25130.760, 25243.550, 10727.610, 18449.270, 16354.6~
    $ real_earn_78 <dbl> 25564.670, 13496.080, 25564.670, 25564.670, 9860.869, 25564.670, 18059.3~
  3. Using left_join() specifying the matching variables that have different names.

    In this example the by parameter is a named vector that identifies differently named columns in the two tibbles.

    cps_78 <- rename(cps_78, patient_id = id)
    head(cps_78)
    # A tibble: 6 x 2
      patient_id real_earn_78
           <dbl>        <dbl>
    1          1       25565.
    2          2       13496.
    3          3       25565.
    4          4       25565.
    5          5        9861.
    6          6       25565.
    cps4 <-
      cps_part1 %>%
      left_join(cps_78, by = c("id" = "patient_id"))
    
    glimpse(cps4)    
    Rows: 15,992
    Columns: 11
    $ id           <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 2~
    $ age          <dbl> 45, 21, 38, 48, 18, 22, 48, 18, 48, 45, 34, 16, 53, 19, 27, 32, 27, 46, ~
    $ trt          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
    $ educ         <dbl> 11, 14, 12, 6, 8, 11, 10, 11, 9, 12, 14, 10, 10, 12, 12, 10, 12, 7, 13, ~
    $ black        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
    $ hisp         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
    $ marr         <dbl> 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, ~
    $ no_deg       <dbl> 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, ~
    $ real_earn_74 <dbl> 21516.6700, 3175.9710, 23039.0200, 24994.3700, 1669.2950, 16365.7600, 16~
    $ real_earn_75 <dbl> 25243.550, 5852.565, 25130.760, 25243.550, 10727.610, 18449.270, 16354.6~
    $ real_earn_78 <dbl> 25564.670, 13496.080, 25564.670, 25564.670, 9860.869, 25564.670, 18059.3~
  4. Appending tibbles.

    We will append the cps training and testing tibbles that were created in earlier examples.

    cps_all_rows <- 
      cps_train %>%
      bind_rows(cps_test)
    
    dim(cps_all_rows)
    [1] 15992    11

Some other joins

  • right_join() - keeps every row of the right side and matches rows of the left side to it. Rows in the left side that do not match the right side are not included.

  • inner_join() - includes only rows that are in both data frames.

  • full_join() - includes all rows that are in either data frame.

  • semi_join() - keeps rows in left side that match right side. Does not add columns to the data frame. Rows in the left side are not duplicated when they match more than one row in the right side.

  • anti_join() - keeps rows in left side that are not matched in the right side.

  • nest_join() - adds a column of tibbles to the left side. Each tibble contains the rows of the right side that match the row on the left side.
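The difference between these joins is easiest to see in the number of rows each one returns. The sketch below uses two made-up tibbles, left and right, that share an id column; id 3 appears twice in right to show how multiple matches are handled.

```r
library(dplyr)
library(tibble)

# Made-up tibbles: ids 2 and 3 are in both, id 3 is repeated on the right.
left  <- tibble(id = c(1, 2, 3), x = c("a", "b", "c"))
right <- tibble(id = c(2, 3, 3, 4), y = c("p", "q", "r", "s"))

n_left  <- nrow(left_join(left, right, by = "id"))   # 4: ids 1, 2, 3, 3
n_right <- nrow(right_join(left, right, by = "id"))  # 4: ids 2, 3, 3, 4
n_inner <- nrow(inner_join(left, right, by = "id"))  # 3: ids 2, 3, 3
n_full  <- nrow(full_join(left, right, by = "id"))   # 5: ids 1, 2, 3, 3, 4
n_semi  <- nrow(semi_join(left, right, by = "id"))   # 2: ids 2, 3 (no duplication)
n_anti  <- nrow(anti_join(left, right, by = "id"))   # 1: id 1

c(n_left, n_right, n_inner, n_full, n_semi, n_anti)
```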

2.13 Getting help

Help is available in a variety of places. What follows is some approaches to looking for help.

You may need help in knowing how to accomplish something. That is, you do not know which functions/methods to use, or perhaps which steps are needed. A good place to start is the cheat sheets for the tidyverse. These can be scanned quickly to see what a package provides. What you need to do may already be directly implemented. You may also notice functions/methods that do part of what is needed and leave you with something smaller that you do not know how to do.

If you do not find what you need on the cheat sheets, googling is your best option. We suggest starting the Google search with tidyverse, followed by what you are trying to do. This may not get you the help you need if there is a technical name for something that you do not know. In this case you may have to read through several of the initial results to see how others talk about doing what you are doing. This is a good way to learn more of the lingo of programming and wrangling. Additional Google searches can be based on the new keywords you see in the initial results. You can also ask a question on help sites such as Stack Exchange or Stack Overflow. You will likely have seen these kinds of sites in the results of your initial queries.

When you know how you want to accomplish something but do not remember the function/method to use, the cheat sheets are a good place to start. If you do not find what you need on the cheat sheets, reviewing the table of contents of this book may help you find it. You can also google tidyverse followed by what you want to do.

When you know the name of the function/method you want to use and need help with what the parameters are or details of how the function/method works, the function documentation is a good place to start. This documentation can be found by googling tidyverse followed by the name of the function or method you want help with.

If you are an SSCC member, you can also send a question to the help desk or come in and see one of the consultants for help if you are not able to resolve the issue for yourself. Information on getting help from the SSCC can be found at the SSCC website.
