Data Wrangling in Stata: Project Management

This is part eight of Data Wrangling in Stata.

Most Stata work takes place in the context of a research project. In a typical project you have a research question you want to answer and some data that you think will answer it, but the data isn't in a form that can actually answer the question—yet. Good project management will help you get from raw data to completed analysis efficiently and reproducibly.

Simple Best Practices

Books have been written about how to manage research projects properly. While we won't go into that level of detail here, we will suggest a few simple best practices that can save a tremendous amount of time and reduce the probability of making serious mistakes.

Don't Skip the First Steps

In First Steps with Your Data, we laid out a process for both cleaning up your data set and gaining a deep understanding of it. Do not skip these steps! They will save you time down the road. A particularly painful scenario they can help you avoid is discovering after months of work that your data set cannot actually answer your research question.

Begin with the End in Mind

Before you write any code, think through what form the data needs to be in so you can analyze it. What should an observation represent? What variables will each observation need to contain? The answers to these questions will most likely be determined by the statistical techniques you plan to use.

Don't Try to Do Everything at Once

Once the goal is clear in your mind, don't try to write one massive do file that gets you there in one step, only trying to run it once it's "done." If you do, the do file will most likely have a large number of bugs. Then you may find that in order to make one part work, you need to do something in a different way than you originally planned. You'll then have to change everything that follows.

It's far better to write a bit of code, test and debug it, then write a little more, test and debug it, and so forth. How much to write depends on how difficult what you're writing is for you. If you're very confident in what you're doing, go ahead and write ten or twenty lines at a time. If what you're doing is brand new to you, write one and then test it, or even write it interactively in the command window and only copy it to your do file once it works.

Split Your Code into Multiple Do Files

If a do file gets too long, as you go through the write-test-debug cycle you'll find yourself spending a lot of time waiting for code you know is good to run before you can move on to the code you just added and need to test. More generally, you want to write do files that are short enough that while you're working on one you can remember everything it does.

To break up a long do file into smaller pieces, just pick a logical stopping point, have the do file save the data set at that point, then create a new do file that uses that data set as its starting point. Just remember: never save your output data set over your input data set.
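As a sketch of this structure (all file names here are hypothetical examples, not files from this course):

```stata
* first_wrangling.do -- reads the raw data and saves an intermediate file
clear all
use raw_data
* ...cleaning and recoding steps...
save wrangled_data1, replace   // note: not saved over raw_data

* second_wrangling.do -- starts from the first do file's output
clear all
use wrangled_data1
* ...further wrangling...
save wrangled_data2, replace
```

Each do file reads one data set and saves a different one, so any do file can be rerun at any time without damaging its own input.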

Avoid the practice of running parts of a do file as a substitute for breaking the do file into multiple pieces. Running part of a do file can be useful, but it's inherently not reproducible because it depends on clicking on the right thing. It can also introduce errors, such as code that crashes because it depends on prior code that was not run or because it was run more than once.

Put Code for Different Purposes in Different Do Files

While data wrangling is a linear process with each step depending on what came before, exploratory analysis often has multiple independent branches as you try various things. Then when you've identified the results you want to report or publish, you want the code that produces them to be as clean, clear, and concise as possible. Thus it's best to have separate do files for each of these purposes.

For most projects there should be a "final" data set that's used for all analysis. That way you can open it up interactively and try things, write do files that analyze it in different ways, and generally experiment at will without running the risk of forgetting that, for example, the do file that ran the linear regressions also did a bit more recoding.

Checking your Work

Programming errors can be subtle and very difficult to catch by just staring at your code. Generally it's more effective to spend your time comparing your results to what they should be. Of course this depends on having some sense of what they should be: be constantly on the lookout for information you can use to check your work.

Regularly examine summary statistics and frequencies as you carry out data preparation, especially when you create new variables or change the structure of your data. See if what you get is plausible. If the results change, be sure you can explain why.
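For example, after creating a new variable you might run checks like the following (the variable names are hypothetical):

```stata
* Check that a newly created variable is plausible
gen log_income = log(income)
summarize income log_income      // do the min, max, and mean make sense?
tab single_parent, missing       // are the frequencies and missing values as expected?
```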

Spend even more time looking at individual cases. Use the browse command, often specifying a subset of the data so you can focus on what's currently relevant, and compare what your do file did to individual cases with what you meant it to do. If you have different types of cases, be sure to look at samples of each.
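A couple of hypothetical examples of browsing a relevant subset (again, the variable names are illustrative):

```stata
* Focus on the cases most likely to reveal problems
browse id income log_income if income < 1000
browse if missing(log_income)    // inspect cases where the new variable was not set
```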

If you do find problems, looking at cases is the best way to solve them. What kinds of cases get the wrong answers? How exactly are they wrong? Figuring out those details will point you to the particular commands that need to be corrected.

Make your Project Reproducible

With proper organization you should be able to reproduce your entire project at will.

Start with the data as you obtained it. Your first do file will read it in, make some changes, and save the results in a different file. Your second do file will read in the output from the first do file, make further changes, and then save its results in another separate file. Repeat until your data wrangling is complete. Then all your analysis do files will read the same final data set and analyze it in various ways.

If you discover errors or need to make changes, having a well-organized and reproducible project will save you significant amounts of time. To track down an error, run your do files one-by-one, checking the results after each, until the error appears. Then you'll know which do file needs to be fixed. Once the error is corrected or the change is made, consider whether subsequent do files also need to be changed. Once all the needed changes are made, simply rerun all your do files.

Consider writing a master do file that runs all the do files required by the project, in the proper order (recall that one do file can run another simply by running the command do otherDoFile). Also write a "readme" document to keep with the project files, containing other relevant information. This will be very valuable to anyone else who has to work with your code, but also to the future you who has to try to remember how it all worked months or years later.
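A master do file can be very simple; a sketch (with hypothetical do file names) might look like:

```stata
* master.do -- reproduces the entire project from raw data to final results
clear all
do first_wrangling
do second_wrangling
do make_final_data
do analysis
```

Running master.do then reruns the whole project in the proper order with a single command.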

Case Studies

Two stories that illustrate the importance of proper project management:

One day a professor and her research assistant came to the SSCC's statistical consultants. They were working with census data from multiple countries over many years, so a lot of data wrangling was required to make the various data sets compatible and then combine them. The RA had been working on this data wrangling for about six months.

Then the professor decided to run some basic frequencies on the data they had. The results were clearly wrong. The RA must have made a mistake at some point, and they came to us hoping we'd be able to fix the problem. After some discussion, we found that the RA had been doing all his work interactively. He would open a data set, do things to it, and then save it. He had only a general recollection of what he had done, and had no do files, logs or intermediate data sets to fall back on. Since everything he had created was useless, the project had to be started again from the original data.

The next time we saw her, the professor had a new RA, and she made sure the new RA did her work reproducibly.

On a happier note, a grad student once came to the SSCC's statistical consultants because in preparing to present her research she discovered that the values of one variable for three observations had somehow been corrupted (we have never seen that happen before or since). We had no way of knowing how that affected her results.

Fortunately she had done everything using do files. We got the data from the source again, checked that it was intact this time, and then she re-ran all her do files. Months of work were replicated in less than 15 minutes, and she was able to proceed with her presentation.

Far more could be said about project management (we haven't even mentioned collaborating with others). You might find J. Scott Long's Workflow of Data Analysis Using Stata helpful.

Exercise: the files final_demo.dta and final_scores.dta contain fictional demographic information about students and their families, and the student's scores on a standardized test. Your research question is "What is the impact of living in a single-parent family on test scores?" You are also interested in the effect of household income and the education of the parents (which must be aggregated in some way to create a household-level variable). Use everything you've learned to wrangle the data into an appropriate form, and then use whatever analysis tools you're familiar with to answer the research question. Use the project management best practices described in this section, but consider this a review of everything you've learned.

Next: Learning More

Last Revised: 8/24/2019