13 Merging

We often find we want to combine the data in two separate data sets, in order to do some analysis. This is often referred to as a merge or a join.

There are two very straightforward cases to consider first:

  • Adding the observations in one data set as new observations in a second data set. This is sometimes also called "appending" data sets.
  • Adding the variables in one data set as new variables to the observations in a second data set. The term "merge" is sometimes reserved to mean this, and is sometimes also called a "match merge".

13.1 Appending Data Set Observations

Consider an example. We can start with the mtcars data, and construct two data frames - one with manual transmission cars, the other with automatic transmission cars.

manual    <- subset(mtcars, am==1)
automatic <- subset(mtcars, am==0)

Here there are 13 observations in the data frame manual and 19 observations in the data frame automatic. Both data frames have 11 variables/columns.

To put both types of cars in one data frame, the simplest approach is to use the rbind function.

allcars <- rbind(manual, automatic)

# Check the number of observations
nrow(allcars) == nrow(manual) + nrow(automatic)
[1] TRUE

13.1.1 Some rbind Details

13.1.1.1 Matching Variables

A limitation of rbind is that both data frames must include all the same variables.

manual_disp    <- subset(manual, select=c("mpg", "disp"))
automatic_hp <- subset(automatic, select=c("mpg", "hp"))

diffvars <- rbind(manual_disp, automatic_hp)
Error in match.names(clabs, names(xi)): names do not match previous names

(We will see a solution to this problem later, using merge.)

13.1.1.2 Variable Order

Variables do not need to appear in the same order.

manual    <- subset(manual, select=c("mpg", "disp"))
automatic <- subset(automatic, select=c("disp", "mpg"))

lessvars <- rbind(manual, automatic)
head(lessvars)
                mpg  disp
Mazda RX4      21.0 160.0
Mazda RX4 Wag  21.0 160.0
Datsun 710     22.8 108.0
Fiat 128       32.4  78.7
Honda Civic    30.4  75.7
Toyota Corolla 33.9  71.1
nrow(lessvars)
[1] 32
ncol(lessvars)
[1] 2

13.1.1.3 Variable Types

If variables with the same name are not of the same type, they will be coerced to the most general type. Here, numeric values in the manual data frame are coerced to character values by rbind.

automatic$disp <- as.character(automatic$disp)
str(automatic)
'data.frame':   19 obs. of  2 variables:
 $ disp: chr  "258" "360" "225" "360" ...
 $ mpg : num  21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 ...
difftypes <- rbind(manual, automatic)
str(difftypes)
'data.frame':   32 obs. of  2 variables:
 $ mpg : num  21 21 22.8 32.4 30.4 33.9 27.3 26 30.4 15.8 ...
 $ disp: chr  "160" "160" "108" "78.7" ...

Notice that you do not get any warning that the type has been changed!

13.2 Merging Data Set Variables

A simple merge adds the data values in one data set as new variables to the observations in another data set.

Consider the classic sleep study, where we have observations on ten subjects. In the first data set, we have data on how much extra sleep they got with drug A, while in the second data set we have data on how much extra sleep they got with drug B.

A <- subset(sleep, group==1, select=c("ID","extra"))
B <- subset(sleep, group==2, select=c("ID","extra"))

# for clarity, rename `extra`
names(A) <- c("ID", "extra_A")
names(B) <- c("ID", "extra_B")

This is especially simple to merge:

merge(A,B)
   ID extra_A extra_B
1   1     0.7     1.9
2  10     2.0     3.4
3   2    -1.6     0.8
4   3    -0.2     1.1
5   4    -1.2     0.1
6   5    -0.1    -0.1
7   6     3.4     4.4
8   7     3.7     5.5
9   8     0.8     1.6
10  9     0.0     4.6

Two things make this example simple.

  • Every observation in data set A has a matching observation in data set B, and vice versa.
  • The only variable the two data sets have in common is a unique identifier (a "key") that can be used to match the observations in one data set to the observations in the other.

13.2.1 Some merge Details

We can dig into the details of merges by first examining what happens when the conditions above are not met.

13.2.1.1 Common Variables That Are Not Keys

Suppose we had not renamed the extra variables.

A <- subset(sleep, group==1, select=c("ID","extra"))
B <- subset(sleep, group==2, select=c("ID","extra"))

Now our simple merge specification no longer produces the desired data set.

merge(A,B)
  ID extra
1  5  -0.1

By default, merge seeks to match observations based on all the variables common to both data sets. In the first example, only ID was common (because we renamed extra), but in this example both ID and extra are common.

Because ID appears first in the data set, the merge procedure first looks for a matching ID. If that is found, then the procedure looks for an observation with a matching extra value. Only one observation matches both ID and extra across both data sets!

Notice that unmatched observations from both data sets were dropped.

We can control which variables are used for matching (and their order of use) with the by parameter.

merge(A, B, by="ID")
   ID extra.x extra.y
1   1     0.7     1.9
2  10     2.0     3.4
3   2    -1.6     0.8
4   3    -0.2     1.1
5   4    -1.2     0.1
6   5    -0.1    -0.1
7   6     3.4     4.4
8   7     3.7     5.5
9   8     0.8     1.6
10  9     0.0     4.6

Notice that even though extra appears in both data sets, it is no longer used to match observations. Instead, both versions of extra are kept, but now as extra.x (from A) and extra.y (from B). Any variables not used for matching are kept - an attempt is made to rename them if necessary.

13.2.1.2 Unmatched Observations

Now suppose that not every observation has a match.

A <- subset(sleep, group==1, select=c("ID","extra"))
B <- subset(sleep, group==2, select=c("ID","extra"))

# just some observations from B
B <- B[1:5, ]

We only want to match on ID, so we use the by parameter. But recall that only observations with matches are kept by default. We often additionally want to keep all observations from both data sets.

merge(A, B, by="ID", all=TRUE)
   ID extra.x extra.y
1   1     0.7     1.9
2   2    -1.6     0.8
3   3    -0.2     1.1
4   4    -1.2     0.1
5   5    -0.1    -0.1
6   6     3.4      NA
7   7     3.7      NA
8   8     0.8      NA
9   9     0.0      NA
10 10     2.0      NA

Notice that the observations from B have a value for extra.y, but the observations from A have no extra.y where there was no matching observation in B (ID's 6 through 10). Instead these have been filled in with NA, the missing value.

13.3 Merge to Append

We can use merge to get around some of the difficulty using rbind to append data sets when they don't includ all the same variables.

manual_disp    <- subset(manual, select=c("mpg", "disp"))
automatic_hp <- subset(automatic, select=c("mpg", "hp"))
Error in `[.data.frame`(x, r, vars, drop = drop): undefined columns selected

The crux here is to find a unique identifier in both data sets so that none of the observations actually match. The (inobvious) identifier here is the row names in each data set.

manual_disp  <- subset(mtcars, am==1, select=c("mpg", "disp"))
automatic_hp <- subset(mtcars, am==0, select=c("mpg", "hp"))

head(manual_disp) # to show row.names
                mpg  disp
Mazda RX4      21.0 160.0
Mazda RX4 Wag  21.0 160.0
Datsun 710     22.8 108.0
Fiat 128       32.4  78.7
Honda Civic    30.4  75.7
Toyota Corolla 33.9  71.1
morecars <- merge(manual_disp, automatic_hp, by="row.names", all=TRUE)
head(morecars, 10)
            Row.names mpg.x  disp mpg.y  hp
1         AMC Javelin    NA    NA  15.2 150
2  Cadillac Fleetwood    NA    NA  10.4 205
3          Camaro Z28    NA    NA  13.3 245
4   Chrysler Imperial    NA    NA  14.7 230
5          Datsun 710  22.8 108.0    NA  NA
6    Dodge Challenger    NA    NA  15.5 150
7          Duster 360    NA    NA  14.3 245
8        Ferrari Dino  19.7 145.0    NA  NA
9            Fiat 128  32.4  78.7    NA  NA
10          Fiat X1-9  27.3  79.0    NA  NA

This include all our observations and all our variables. However, it leaves mpg divided into two columns. If we realize that mpg is unique within car type (row.names) because car type is itself a unique identifier, then we can include mpg in the by specification.

morecars <- merge(manual_disp, automatic_hp, by=c("row.names","mpg"), all=TRUE)
head(morecars, 10)
            Row.names  mpg  disp  hp
1         AMC Javelin 15.2    NA 150
2  Cadillac Fleetwood 10.4    NA 205
3          Camaro Z28 13.3    NA 245
4   Chrysler Imperial 14.7    NA 230
5          Datsun 710 22.8 108.0  NA
6    Dodge Challenger 15.5    NA 150
7          Duster 360 14.3    NA 245
8        Ferrari Dino 19.7 145.0  NA
9            Fiat 128 32.4  78.7  NA
10          Fiat X1-9 27.3  79.0  NA

13.4 A Final Note

I have left a number of merge options undiscussed. Look at the documentation for more nuances

There are other functions available for these tasks. In particular, many people prefer to use the various join functions from the dplyr package.