Stata for Researchers: Combining Data Sets

This is part eight of the Stata for Researchers series. For a list of topics covered by this series, see the Introduction. If you're new to Stata we highly recommend reading the articles in order.

Combining two data sets is a common data management task, and one that's very easy to carry out. However, it's also very easy to get wrong. Before combining data sets be sure you understand the structure of both data sets and the logic of the way you're combining them. Otherwise you can end up with a data set that you think is ready for analysis, but is really utter nonsense. Stata tries to make sure you've thought through what you're doing, but can't tell you what makes sense and what doesn't.

Stata always works with one data set at a time, so you will always be combining the data set in memory (the master data set) with another data set on disk (called the using data set, for reasons that will be clear when you see the syntax).

Appending Data Sets

When appending, data sets are stacked.

Stata calls it appending when you add the observations from the using data set to the master data set. Appending makes sense when the observations in both data sets represent the same kind of thing, but not the same things. For example, you might append a data set of people from Wisconsin to a data set of people from Illinois. The data sets should have the same or mostly the same variables, with the same names. If a variable only appears in one data set, observations from the other data set will be given missing values for that variable.

The syntax is to carry out an append is simple: load the the master data set and then type:

append using dataset

where dataset is the name of the data set you want to append.

Merging Data Sets

When merging, data sets are placed side by side.

Stata calls it merging when observations from the two data sets are combined. There are, in theory, four kinds of merges:

In a one-to-one merge, one observation from the master data set is combined with one observation from the using data set. A one-to-one merge makes sense when the observations in both data sets describe the same things, but have different information about them. For example, you might merge the answers people gave in wave one of a survey with the answers the same people gave in wave two of the survey.

In a one-to-many or many-to-one merge, one observation from one data set is combined with many observations from the other (the difference between one-to-many and many-to-one being whether the master data set has the "many" or the using data set). These merges make sense when you have hierarchical data, and one data set contains information about the level one units while the other contains information about the level two units. For example, you might merge information about households with information about the individuals who live in those households.

In principle there are also many-to-many merges. In practice they are rarely if ever useful. If you find yourself wanting to do a many-to-many merge, you should rethink what you're doing. Often what you really need to do is identify and correct duplicate identifiers, and then your merge will work as one-to-one or one-to-many.

In all the merges we'll discuss, Stata combines observations that have the same value of a key variable, typically an ID. You can also match based on multiple variables (e.g. combine data for the same state and year). In a one-to-many or many-to-one merge, it is the identifier for the level two units that is the key variable (e.g. household ID, not individual ID). It's very important that the key variable have the same format in both data sets.

If an observation in one data set does not match with an observation in the other, it will be given missing values for the variables from the other data set. Since the viability of a research project often depends on how many observations actually merge (e.g. how many people from wave one of the survey could be found in wave two?) Stata gives you tools for figuring out how many observations actually merged and for examining those that didn't.

If a variable exists in both data sets, the values from the master data set will be kept and the values from the using data set will be discarded. Occasionally this is what you want, but it's more likely to be an error. In general you should set up your data such that the only variables the files to be merged have in common are the key variables.

The syntax for a merge is:

merge type keyvars using dataset

The type must be 1:1 (one-to-one), 1:m (one-to many), m:1 (many-to-one) or m:m (many to many); keyvars is the key variable or variables; and dataset is the name of the data set you want to merge.

An Example of Combining Data Sets

The examples include several files containing fictional student information from 2007. scores.dta contains the students' scores on a standardized test, demographics.dta contains demographic information about them, and teachers.dta contains information on their teachers. Take a moment to look at each file, then start a do file that loads scores.

In this data set, each observation represents a student. browse and you'll see that you have a student ID (id), a teacher ID (teacher) and a score for each.

Your first task is to add in the demographic information. In demographics.dta each observation also represents a student, with the variables being id and race. Thus this is a job for a one-to-one merge and the key variable is id.

merge 1:1 id using demographics

Stata will report that all 60 observations matched. It will also create a variable called _merge. A one in _merge means an observation only came from the master data set; a two means it only came from the using data set; and a three means an observation successfully matched and thus came from both. In this case we see that all observations matched and thus have _merge equal to three, so there's no need to keep the variable. In fact we need to drop it (or rename it) before doing any further merges:

drop _merge

Next add information about teachers. In teachers.dta each observation represents a teacher, and each teacher has many students. That makes this a many-to-one merge (since the many students are currently in memory and the one teacher is in the using data set). The key variable is not id, since that refers to the students, but teacher:

merge m:1 teacher using teachers

Again, all 60 observations merged properly, so you can drop _merge.

drop _merge

Combining Panel Data

Now suppose you were tracking these students for multiple years. The data set panel2007.dta contains a simplified version of this data set: just id and score. The data set panel2008.dta has the same variables for a different year. How would you combine them?

The proper way to combine them depends on what data structure you want. This is hierarchical data where a level two unit is a student and a level one unit is a student's data for a particular year. Thus it can be represented in wide form (one observation per student), or in long form (one observation per student per year).

To put the data in long form simply stack the two data sets using append. However, you'll need to know which year each observation represents. To do that, add a year variable to both data set, with the value 2007 for the 2007 data and the value 2008 for the 2008 data. You can do so with the following do file:

clear all
set more off
capture log close
log using combine1.log, replace

use panel2007
gen year=2007
save panel2007_append

use panel2008
gen year=2008

append using panel2007_append
save appendedData,replace

log close

To put the data in wide form, do a one-to-one merge with id as the key variable. But first you need to change the variable names. Recall that in wide form, it is the variable names that tell you which level one unit you're talking about. So instead of score, you need score2007 and score2008.

clear all
set more off
capture log close
log using combine2.log, replace

use panel2007
rename score score2007
save panel2007_merge

use panel2008
rename score score2008

merge 1:1 id using panel2007_merge
save mergedData,replace

log close

This time you'll see that one observation does not match. You can see which one by typing:

l if _merge==2

Student number 55 was not in panel2008 and thus couldn't be matched. As a result we have no idea what his or her test score was in 2008. Unfortunately this is very common—students move out of school districts between tests all the time.

If your entire research agenda depends on having both test scores, you may need to drop observations that don't exist in both data sets. You can do so at this point by adding:

drop if _merge!=3

You can also specify which observations should be kept directly in the merge command:

merge 1:1 id using panel2007_merge, keep(match)

keep(match) means only keep observations which match. The alternatives are master and using, and you can list more than one. For example, to keep observations which match and observations that only come from the master data set, while throwing away observations that only come from the using data set, you'd say keep(master match).

Common Problems with Merges

Merges will uncover all sorts of problems with your data set (and if they're not fixed merging will introduce new ones). Here are a two common ones and how to fix them:

Key Variables Stored in Different Formats

While Stata will happily match different kinds of numbers (ints and floats, for example) it can't match numbers and strings. IDs can be stored as either (as long as you choose a numeric type that has enough precision—see Working with Data) and it's not uncommon to find that your data sets store the ID in different ways. In that case it's usually best to convert the numbers to strings:

gen idString=string(id)
drop id
rename idString id

The string() function takes a number and converts it to a string. You can give it a second argument containing the format in which the number should be "written" if needed.

Duplicate IDs

If you try to do a merge and you get an error message like "variable id does not uniquely identify observations in the master data" this means you have duplicate IDs in the data set mentioned. This problem needs to be fixed before you proceed. Do not simply change the type of merge for the relevant data set from "one" to "many" hoping to make the error message go away—the resulting data set will not make sense for the affected observations.

One possible source of duplicates is round-off error due to saving the IDs in an inappropriate variable type. In this case you'll need to go back to the original data and ensure the identifiers are stored as a type that won't round them, like long, double, or string.

Note that for purposes of merging, missing values are treated just like any other value. If you have observations with missing IDs Stata will count them as duplicates. You'll probably have to drop them before merging.

But most of the time duplicate IDs result from errors in the data. You'll need to do something about them before you can perform your merge.

Start a do file that loads the data set merge_error. This data set contains students, but some of them have duplicate IDs that need to be fixed.

You can see how many problems you have with duplicates report:

duplicates report id

This tells you how many observations have the same value of id. The output tells you that no ID appears more than twice, but twelve observations have duplicate IDs.

For further examination, create a variable that tells you how many copies there are of each ID:

bysort id: gen copies=_N

Then you can look at just the problem observations with:

browse if copies>1

For the observations with id equal to 9, 26, and 33, the two observations with the same ID are identical. This suggests the same student was entered twice and you can fix the problem by simply dropping the extra observations. Do so with:

duplicates drop

Note that unlike the duplicates report command you ran earlier, this command does not have a varlist. This means it only drops duplicate observations if they have the same value for all variables, not just id. You can give it a varlist so it drops observations if only those variables have the same values, but be very careful doing so.

At this point in complex problems you may want to drop the copies variable you created earlier and recreate it so browse if copies>1 will no longer show the observations you have already fixed, but there's no need in this data set.

For the observations with id equal to 64, 74, and 94, the other variables show that the observations with the same ID are not the same person. This is a major problem for merging: you can't be sure which "student #64" in this data set goes with the "student #64" in the hypothetical other data set.

Adding more variables may help. For example, in this case, merging by both id and race would allow you to correctly match the two subjects with id equal to 94. Merging by id, race, and grade would allow you to correctly match the subjects with id equal to 74. Just keep in mind that if the data set you wanted to merge with this one were for the subsequent school year like in our previous example you'd have to subtract one from its grade variable first. If you choose to go this route, recreate the copies variable using the complete set of variables you plan to match by in the by prefix and see how many duplicates are left.

Nothing will help with the two observations with id equal to 64, and in a larger data set it's less likely that matching by a few more variables will allow you to uniquely identify subjects. In such cases you'll probably need to drop all the observations you can't uniquely identify, since you reliably can't match any them. You can do so with:

drop if copies>1


  1. An alternative way to combine data sets into wide form is to first append them and then reshape. Combine panel2007 and panel2008 into wide form using this method. (Solution)
  2. If you combine the two years of data into long form using append, how can you find the observation which doesn't appear in 2008? (Solution)
  3. Combine error2007.dta and error2008.dta by merging it into the wide form, fixing whatever problems you find. Then append the two data sets and try to locate the same problems. (Solution)

Next: Graphics

Previous: Hierarchical Data

Last Revised: 12/29/2015