Stata for Researchers: Do Files

This is part three of the Stata for Researchers series. For a list of topics covered by this series, see the Introduction. If you're new to Stata we highly recommend reading the articles in order.

Up to this point we've used Stata interactively: we've typed commands in the command window, hit Enter, and observed the results. But now that we've covered the basics of Stata syntax, the next step is learning how to create and change variables. You should never change your data interactively, so we'll first talk about how to write do files.

Writing a Do File

Do files are simply text files whose names end with .do and which contain Stata commands exactly the way you'd type them into the command window. Sometimes people call them programs, though Stata uses this term for something else (see Stata Programming Tools). You can write do files using any text editor, but the Do File Editor built into Stata has tools and features designed to help programmers so we recommend using it. Do not write Stata code using Word—it will automatically insert things like "smart quotes" and other formatting that Stata cannot understand.

Start the Do File Editor by clicking on the button that looks like a pencil writing in a notebook or by typing doedit.

Setting Up

Almost every do file should start with the following commands (or something very much like them):

clear all
capture log close
set more off

The first command clears the memory so you don't have to worry about what might have happened before your do file was run. The second closes any open log files. The third tells Stata not to pause whenever the screen fills and wait for you to press a key (while saying --more-- at the bottom).

Starting a Log

A research do file should have a corresponding log file that records all the commands the do file ran and their results. To start logging, use the command:

log using filename.log, replace

where filename is the name of the file you want Stata to use as a log. Give the log file the same name as the do file it records, so it's obvious which log file goes with which do file. The replace option tells Stata that if a log file with that name already exists, usually from a previous attempt to run the do file, it should be replaced by the current log.

If you do not specify the .log at the end of the filename, Stata will save the log in its Stata Markup and Control Language (SMCL), in a file whose name ends with .smcl. SMCL has its uses, but it is only easy to read in Stata's Viewer. If your filename ends with .log, Stata will save the log as plain text which you can read in any text editor.
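As a small illustration of the difference, here are the two forms side by side. The name mylog is a hypothetical placeholder, and a do file would use one form or the other, not both:

// saves mylog.smcl, which is best viewed in Stata's Viewer
log using mylog, replace

// saves mylog.log as plain text
log using mylog.log, replace

If you already have an SMCL log and want a plain text copy, Stata's translate command should be able to convert it, for example translate mylog.smcl mylog.log.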

Loading Data

Next you will usually load a data set:

use dataset

If the data set is in the current working directory, you don't need to specify its location.
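If the data set is somewhere else, you can either give its location or change the working directory first. The sketch below assumes a hypothetical project folder called U:/myProject; both the path and the file name are placeholders, and the two approaches are alternatives:

// give the full location of the data set
use "U:/myProject/dataset"

// or make the project folder the working directory, then load the data set
cd "U:/myProject"
use dataset

The pwd command displays the current working directory if you're not sure what it is.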

Do Your Work

At this point you'll be ready to do your work. Generally this means data preparation, exploratory analysis, or analysis you intend to report or publish. We recommend you have separate do files for each of these, as they are very different processes and have different requirements. We'll talk more about this in Project Management.

Save Your Data

If this do file is for data preparation, you'll need to save your work at the end:

save newDataset, replace

The replace option again allows Stata to overwrite the output from previous attempts to run the do file.

Never, ever save your output data set over your input data set. (In other words, the starting use command and the ending save command should never act on the same file.) If you do, the data set your do file was written to work with will no longer exist. If it turns out you made a mistake, you can't easily recover. If the data set was stored on the SSCC network, you can call the Help Desk and ask to have the file restored from backup but this is definitely not ideal.

Clearing everything from memory, loading the data set you want to use, and then saving any changes you make to a different file makes your do file reproducible: you can run it again any time you want and get the exact same results. If the input data set changes, you'll be applying the exact same procedures to the new data. If it turns out you made a mistake, all you need to do is correct the error in your code and run the do file again. If you need to make changes you can do so without starting over. It may take a bit of effort at first to get into the habit of writing reproducible code, but the effort will pay off very quickly.

Close Your Log

The last line of the do file will normally be:

log close

If you don't close the do file's log, any commands you run after the do file finishes will be logged as if they were part of the do file. If your do file crashes before reaching the log close command it will leave the log file open. That's why you need capture log close at the beginning. (The capture prefix basically says "If the following command generates any errors I don't care. Please don't crash my do file." We use it here because log close will generate an error if there is no open log. At this point in your Stata career you should not use capture for anything else—fix the errors instead.)
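Putting the pieces together, a data preparation do file might look something like the following sketch. The names prepCars, cars, and carsPrepared are hypothetical placeholders, and the lines starting with // are comments that Stata ignores:

// set up
clear all
capture log close
set more off

// start a log with the same name as this do file (prepCars.do)
log using prepCars.log, replace

// load the input data set
use cars

// data preparation commands go here

// save the result under a different name than the input
save carsPrepared, replace

log close

Because the do file clears memory, loads its own input, and saves to a different file, it can be run again at any time from a clean start.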

Running a Do File

The easiest way to run a do file is to press Ctrl-d in the Do File Editor, or click the icon on the far right that looks like a "play" button over some code. If you first select just part of the do file then only that part will be run.

Running parts of your code rather than the entire do file can save a lot of time, but code taken out of context won't always work. For example, if you run a command that creates a variable x, realize you made a mistake, and then fix it, you can't simply select that command and run it again unless you first drop the existing version of x. If you find yourself getting confused by these kinds of issues, run the entire do file rather than a selection so everything is run in its proper context.
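For instance, suppose the corrected command is the gen command below (gen creates a new variable and is covered in the next article; the formula here is only an illustration using variables from the auto data set). Selecting and running both lines together, rather than the gen command alone, avoids the error about x already existing:

// remove the version of x created by the earlier, mistaken command
drop x
gen x = price/mpg

The drop x line is only needed while rerunning a selection. When the whole do file is run from the top, x does not exist yet and drop x would itself cause an error, so remove it once the fix is in place.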

You can also tell Stata to run a do file with the do command:

do myDoFile

This means do files can run other do files. For complicated projects it can be very helpful to have a master do file that runs all the other do files in the proper sequence.

How Long Should a Do File Be?

For data preparation work, it's easy to "daisy-chain" do files: dofile1 loads dataset1, modifies it, and saves it as dataset2; dofile2 loads dataset2, modifies it, and saves it as dataset3, etc. When you're done, a master do file can run them all. Thus there's very little downside to breaking up one long do file into two or more short do files. Our suggestion is that you keep your do files short enough that when you're working on one of them you can easily wrap your head around it. You also want to keep do files short so they run as quickly as possible: working on a do file usually requires running it repeatedly, so moving any code that you consider "done" to a different do file will save time.
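A master do file for such a chain can be very short. This is a minimal sketch, assuming the do files are named dofile1.do and dofile2.do as in the description above and are stored in the current working directory:

// master.do: run each step of the project in the proper sequence
do dofile1
do dofile2

Note that this sketch does not start a log of its own. If each do file begins with capture log close, as recommended above, it would close any log the master do file had opened.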


Comments

Comments are text included in a do file for the benefit of human readers, not for Stata. Comments can explain what the do file does and why, and if anyone else ever needs to read and understand your do file they'll be very grateful for good comments. But you are the most likely beneficiary of your comments, when you have to figure out how your do file works months or years after writing it.

You don't need to comment every command—most Stata code is fairly easy to read. But be sure to comment any code that required particular cleverness to write, or you'll need to be just as clever to figure out what it does later.

Comments need to be marked as such so that Stata will not try to execute them. /* means Stata should ignore everything until it sees */, while // means Stata should ignore the rest of that line. Here's how one might comment the solution to one of the exercises in the previous section:

// make a list of cars I might be interested in buying
list make price mpg rep78 if price<4000 | (price<5000 & rep78>3 & rep78<.)
/*
Some cars will appear on the list even though they have
a missing value for rep78.
This is not an error.
If their price is less than $4,000 I don't care about their
repair record.
*/

A useful programmer's trick is to "comment out" code you don't want to run right now but don't want to delete entirely. For example, if you temporarily wanted to focus on just the cars that meet the price<4000 condition, you could change that command to:

list make price mpg rep78 if price<4000 // | (price<5000 & rep78>3 & rep78<.)

When you're ready to return to the original command, just remove the comment markers.

Three forward slashes (///) mean that the current command is continued on the next line. Think of it as commenting out the 'end-of-line' that tells Stata the command is complete. This allows you to break up commands over multiple lines for readability:

list make price mpg rep78 ///
if price<4000 | (price<5000 & rep78>3 & rep78<.)

From now on we'll do everything using do files.

Last Revised: 1/7/2016