Data Wrangling in Stata: Introduction and Review

Most data sets need to be transformed in some way before they can be analyzed, a process that's come to be known as "data wrangling." Data Wrangling in Stata will introduce you to the key concepts, tools, and skills of data wrangling, implementing them in Stata. You'll learn a lot about Stata from this workshop, but the primary focus is on the tasks you'll need to carry out.

If you're new to Stata, we recommend working through our Introduction to Stata before proceeding. We'll start by very briefly reviewing some basic Stata concepts that should be familiar to you, but if they're not, Introduction to Stata will do a much better job of teaching them to you.

To get the most out of Data Wrangling in Stata you need to be an active participant. Open Stata, and type in and run the example code yourself. This will help you retain more, and ensure you get all the details right—Stata is always happy to tell you when you're wrong. Do the exercises (some of them are straightforward applications of what you just learned; others will require more creativity). Data wrangling is not something you read and understand—it's a skill you must practice.

The example files for this class can be obtained within Stata by running:

net get dws, from(https://ssc.wisc.edu/sscc/stata/)

If this fails on your computer, try the http version of the address instead:

net get dws, from(http://ssc.wisc.edu/sscc/stata/)

This will put the example files in your current working directory. If you are comfortable doing so, create a folder for the example files, make that Stata's working directory, and run the command above. If not, we'll talk about how to do all this in Reading in Data and you can get the example files then.
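
If you'd like to set up the folder from within Stata, the steps above might look something like the sketch below. The folder name dws_examples is a hypothetical choice, not part of the workshop materials, and the folder is created inside whatever Stata's working directory currently is.

```stata
* Create a folder for the example files (dws_examples is a hypothetical name).
* capture keeps the do file going if the folder already exists.
capture mkdir dws_examples

* Make that folder Stata's working directory and confirm where we are
cd dws_examples
pwd

* Download the example files into the working directory
net get dws, from(https://ssc.wisc.edu/sscc/stata/)

* List the files that are now in the folder
dir
```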

Stata Syntax Fundamentals

The key to using Stata effectively is understanding its fundamental syntax, which applies to the vast majority of Stata commands.

Stata Commands

Stata is a command-based language. A Stata command is usually a verb, like use, generate, replace, summarize, or regress. Most commands can be abbreviated; the exceptions are those that can destroy data. The commands above are thus usually typed as use, gen, replace, sum, and reg. Some commands have subcommands, like label variable and label values. Stata normally has exactly one data set in memory, and commands act on that data set.
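
As a quick illustration, the sketch below runs a command in its full and abbreviated forms and then uses a subcommand. It assumes the auto example data set that ships with Stata (loaded with sysuse), which is not one of this workshop's example files; the label text is made up for the example.

```stata
* Load the auto example data set that ships with Stata (an assumption of this sketch)
sysuse auto, clear

* The full command name and its abbreviation produce the same output
summarize mpg
sum mpg

* A command with a subcommand: label variable attaches a description to a variable
label variable mpg "Mileage in miles per gallon"
describe mpg
```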

Subsetting by Variables

If a command is followed by a variable list or varlist (i.e. the names of one or more variables) the command will only act on those variables.

Subsetting by Observations

If a command is followed by the word if and a logical condition, the command will only act on those observations where the condition is true.
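
Both kinds of subsetting can be tried out with a short sketch like the following. It assumes the auto example data set that ships with Stata rather than the workshop's own files; price, mpg, make, and foreign are variables in that data set.

```stata
* Load the auto example data set that ships with Stata (an assumption of this sketch)
sysuse auto, clear

* No varlist: summarize acts on every variable
summarize

* Varlist: summarize acts only on price and mpg
summarize price mpg

* Varlist plus if: only observations where foreign is 1 are used
summarize price mpg if foreign == 1

* Conditions can be combined with & (and) and | (or)
list make price if price > 10000 & foreign == 0
```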

Options

If a command is followed by a comma, then anything that comes after the comma is interpreted as one or more options that change how the command runs. If an option needs additional information, like a number or the name of a variable, that information goes in parentheses after the option.

These syntax elements always go in the same order:

command [varlist] [if condition] [, options]

The square brackets mark elements that can usually be left out. The command:

sum x if y==1

tells Stata to summarize (i.e. produce summary statistics for) the variable x, but only for those observations where y is 1.

The command:

tab y, sum(x)

tells Stata to tabulate (i.e. produce a frequency table for) the variable y. The sum() option tells Stata to also calculate basic summary statistics for the variable x for the observations in each cell of the table. Because sum() needs to know what variable you want it to act on, the name of the variable goes in parentheses after the option name.
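
The x and y above are placeholders. For runnable versions of the same commands, one option is the auto example data set that ships with Stata (an assumption of this sketch, not part of the workshop files), with mpg playing the role of x and foreign the role of y. The last command uses all the syntax elements in their required order.

```stata
* Load the auto example data set that ships with Stata (an assumption of this sketch)
sysuse auto, clear

* command varlist if condition
sum mpg if foreign == 1

* command varlist, option(with additional information in parentheses)
tab foreign, sum(mpg)

* command varlist if condition, option
* detail is an option of summarize that needs no additional information
summarize price mpg if foreign == 1, detail
```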

Creating Variables

The Stata command to create a variable is generate, usually abbreviated gen. The syntax is:

gen name = expression

where name is the name of the variable to be created and expression is some mathematical expression you write. The expression will typically include some combination of numbers, variables, and mathematical functions.

The command to change the value of an existing variable is replace, and has the same syntax:

replace name = expression

These commands act on all the observations for a given variable (unless you limit them with an if condition) but one observation at a time: the new value for a given observation depends on the expression as evaluated for that same observation.
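
A minimal sketch of gen and replace working together, assuming the auto example data set that ships with Stata. The variable names price_thousands and expensive, and the 10,000 cutoff, are hypothetical choices made for illustration.

```stata
* Load the auto example data set that ships with Stata (an assumption of this sketch)
sysuse auto, clear

* Create a new variable from an expression, evaluated observation by observation
gen price_thousands = price / 1000

* Create an indicator that starts at 0 for every observation...
gen expensive = 0

* ...then change it to 1 only for observations that meet the if condition
replace expensive = 1 if price > 10000

* Check the result
list make price price_thousands expensive in 1/5
```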

The egen command, short for extended generate, allows you to use a variety of useful functions like mean(), max(), and total(). Most egen functions are aggregate functions: they take multiple values as input and give back a single value as output. Unlike gen, many egen functions work across observations.
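
The difference from gen can be seen in a sketch like this one, which again assumes the auto example data set that ships with Stata. The new variable names are hypothetical, and the bysort prefix, which runs the command separately for each group, is used here without further explanation.

```stata
* Load the auto example data set that ships with Stata (an assumption of this sketch)
sysuse auto, clear

* mean() reads price from all observations and stores
* the same single value in every observation
egen mean_price = mean(price)

* With bysort, the mean is calculated separately within each value of foreign
bysort foreign: egen mean_price_origin = mean(price)

* gen can then combine the aggregate with each observation's own value
gen price_dev = price - mean_price_origin

list make foreign price mean_price mean_price_origin price_dev in 1/5
```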

Do Files

A Stata do file is a text file containing Stata commands. You can write them using Stata's built-in do file editor. While typing (or clicking on) commands interactively is a great way to explore your data and check your work, actual data wrangling should always be done using do files.

Working in Stata involves three separate things: commands, data, and results. A proper do file manages all three. It contains the commands, it loads and saves the data, and it records the results in a log file.

A good template to follow is:

capture log close
log using mylogfile.log, replace

clear all
use mydata

//do work

save mynewdata, replace
log close

Let's discuss each line in turn:

capture log close tells Stata to close any open log files. If your do file crashes before it reaches the log close command at the end, its log will stay open, getting in the way of future attempts to run your do file. The capture prefix means Stata can ignore any errors the command generates. We use it here because running log close when no log file is open generates an error and our goal is for this to work whether a log is open or not.
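
To see what capture does, you can look at _rc, the return code Stata stores after a captured command: it is 0 if the command succeeded and nonzero if it failed. The sketch below assumes the auto example data set that ships with Stata, and no_such_variable is a made-up name chosen because it does not exist in that data set.

```stata
* Load the auto example data set that ships with Stata (an assumption of this sketch)
sysuse auto, clear

* Without capture, this command would stop a do file with an error.
* With capture, the error is suppressed and its code is stored in _rc.
capture confirm variable no_such_variable
display _rc

* A captured command that succeeds sets _rc to 0
capture confirm variable price
display _rc

* The same logic applies to the first line of the template:
* it runs without complaint whether or not a log is open
capture log close
```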

log using mylogfile.log, replace tells Stata to store all the results generated by this do file (i.e. all the text that shows up in the Results window) in mylogfile.log. Of course you'll want to give real log files clearer names. Our suggestion is that you give log files the same name as the do files that create them, so it's clear which ones go together. Giving the file the extension .log also tells Stata that you want the log to be in plain text format rather than the Stata Markup and Control Language (SMCL); a plain text log can be read by other programs. The replace option tells Stata it can replace old versions of the log. Without it, the do file fails the second time you run it because the log file already exists.

clear all tells Stata to clear out any data sets in memory, along with stored results, matrices, programs, etc., so the do file starts with a blank slate every time.

use mydata loads your data set into memory (we'll talk about alternative methods shortly). Having your do file clear whatever is in memory and then load your data fresh from disk every time goes a long way toward giving you reproducibility.

Now your do file is ready to do work. When it's done, you'll need to do one or two steps to wrap up.

save mynewdata, replace saves the data set in memory as mynewdata.dta. The replace option again tells Stata it can overwrite old versions of the data set. You only need to do this if your do file makes changes to the data set that you want to save (data wrangling do files will; analysis do files generally will not).

Note that we're saving the new data set with a new name, so the original data set is not changed. Never save your output data set over your input data set. If you do, you can never run your do file again, as the data set it was written to work with is gone.

log close tells Stata to stop recording results in your log file. Without this command, anything you type after running the do file will be recorded in the do file's log.
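
Putting the pieces together, a filled-in version of the template might look like the sketch below. All the file names (wrangle_auto.log, auto_wrangled) and the new variable are hypothetical, and sysuse auto stands in for use mydata so that the example runs without any files of your own.

```stata
* wrangle_auto.do (hypothetical name; the log file is named to match)
capture log close
log using wrangle_auto.log, replace

clear all
sysuse auto    // stands in for "use mydata" so the sketch needs no input file

// do work
gen price_thousands = price / 1000
label variable price_thousands "Price in thousands of dollars"

// save takes the file name directly; the new name leaves the input data untouched
save auto_wrangled, replace
log close
```

Running this do file twice in a row should give the same log and the same output data set both times.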

Reproducibility

You're probably familiar in the abstract with the importance of reproducibility to science: anyone with the proper training and equipment should be able to reproduce your experiment and get the same results. In practice we often use reproducibility as a rough proxy for truthfulness: if a result can be reproduced we treat it as true, and if it can't be we don't.

If your experiment consists of analyzing a publicly available data set, then it may seem like allowing people to reproduce it is very simple: just tell people what data you analyzed and how, and they can reproduce what you did. But if you've ever tried to reproduce a published paper you probably discovered it's not that easy. There are many decisions to make in the process of preparing data for analysis and then analyzing it, and those decisions can have a big impact on the results. It's very hard to describe those decisions precisely in human languages like English (or we wouldn't bother with programming languages like Stata).

The solution is to publish your do files and your data as well as your results.

To eliminate possible sources of confusion and error, your code should be self-sufficient, meaning it contains everything needed to start with the raw data and end with your results. The ideal—and this is easily achievable—is that someone can click a single button and reproduce your entire research project.

But this isn't just about helping others or some lofty ideal of how science should work. With reproducible do files, it's easy to make changes and corrections. They reduce the probability of error (you'll make mistakes, but when you fix them they'll stay fixed). They help you keep your work organized. They allow you to reuse code when you run into a similar problem in the future. Even if you never shared your work on a project with anyone, you'd still benefit greatly from doing everything in reproducible do files. The goal of reproducibility underlies everything we do in Data Wrangling in Stata.

Next: Reading in Data

Last Revised: 9/11/2020