2.12 Combining data sets
The join functions create a new tibble
by matching rows from two tibbles
.
The tibbles
are identified as the left side and the right side,
also referred to as x
and y
respectively.
The left side tibble
is the tibble that is listed first in the
parameter list.
The left side may be piped into the join function.
The by
parameter controls which columns in the two tibbles
are
used to match the rows of the two tibbles
.
The left_join()
function adds columns from the right side to the
left side.
The added columns will be filled with NA
s for rows on the left
side that are not matched to the right side.
Rows in the right side that do not match the left side are not
included.
Using
left_join()
with all common variables.In this example the left join is used with no
by
parameter. This results in a natural join, a join that is done using all columns that have the same name in the twotibbles
.The
cps_part1
tibble is the left side andcps_78
is the right side.cps2 <- cps_part1 %>% left_join(cps_78)
Joining, by = c("id", "age", "trt", "educ", "black", "hisp", "marr", "no_deg")
glimpse(cps2)
Observations: 15,992 Variables: 11 $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15... $ age <dbl> 45, 21, 38, 48, 18, 22, 48, 18, 48, 45, 34, 16, 5... $ trt <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0... $ educ <dbl> 11, 14, 12, 6, 8, 11, 10, 11, 9, 12, 14, 10, 10, ... $ black <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0... $ hisp <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0... $ marr <dbl> 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1... $ no_deg <dbl> 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0... $ real_earn_74 <dbl> 21516.6700, 3175.9710, 23039.0200, 24994.3700, 16... $ real_earn_75 <dbl> 25243.550, 5852.565, 25130.760, 25243.550, 10727.... $ real_earn_78 <dbl> 25564.670, 13496.080, 25564.670, 25564.670, 9860....
Using
left_join()
specifying the common variables to use for matching rows.In this example the
by
parameter is used to identify the column to joined on.cps_78 <- select(cps_78, id, real_earn_78) cps3 <- cps_part1 %>% left_join(cps_78, by = c("id")) glimpse(cps3)
Observations: 15,992 Variables: 11 $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15... $ age <dbl> 45, 21, 38, 48, 18, 22, 48, 18, 48, 45, 34, 16, 5... $ trt <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0... $ educ <dbl> 11, 14, 12, 6, 8, 11, 10, 11, 9, 12, 14, 10, 10, ... $ black <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0... $ hisp <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0... $ marr <dbl> 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1... $ no_deg <dbl> 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0... $ real_earn_74 <dbl> 21516.6700, 3175.9710, 23039.0200, 24994.3700, 16... $ real_earn_75 <dbl> 25243.550, 5852.565, 25130.760, 25243.550, 10727.... $ real_earn_78 <dbl> 25564.670, 13496.080, 25564.670, 25564.670, 9860....
Using
left_join()
specifying the matching variables that have different names.In this example the
by
parameter is a name vector to identify differently named columns in the twotibbles
.cps_78 <- rename(cps_78, patient_id = id) head(cps_78)
# A tibble: 6 x 2 patient_id real_earn_78 <dbl> <dbl> 1 1 25565. 2 2 13496. 3 3 25565. 4 4 25565. 5 5 9861. 6 6 25565.
cps4 <- cps_part1 %>% left_join(cps_78, by = c("id" = "patient_id")) glimpse(cps4)
Observations: 15,992 Variables: 11 $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15... $ age <dbl> 45, 21, 38, 48, 18, 22, 48, 18, 48, 45, 34, 16, 5... $ trt <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0... $ educ <dbl> 11, 14, 12, 6, 8, 11, 10, 11, 9, 12, 14, 10, 10, ... $ black <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0... $ hisp <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0... $ marr <dbl> 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1... $ no_deg <dbl> 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0... $ real_earn_74 <dbl> 21516.6700, 3175.9710, 23039.0200, 24994.3700, 16... $ real_earn_75 <dbl> 25243.550, 5852.565, 25130.760, 25243.550, 10727.... $ real_earn_78 <dbl> 25564.670, 13496.080, 25564.670, 25564.670, 9860....
Appending
tibble
s.We will append the
cps
training and testingtibble
s that were created in earlier examples.cps_all_rows <- cps_train %>% bind_rows(cps_test) dim(cps_all_rows)
[1] 15992 11
Some other joins
right_join()
- rows in the left side are matched to the right side.inner_join()
- includes only rows that are in both data frames.full_join()
- includes all row that are in either data frames.semi_join()
- keeps rows in left side that match right side. Does not add columns to the data frame. Duplicate rows are dropped.anti_join()
- keeps rows in left side that are not matched in the right side.nest_join()
- adds a column oftibble
s to the left side. Each tibble contains the rows of the right side that match the row on the left side.