Data Wrangling in Stata: First Steps With Your Data

This is part three of Data Wrangling in Stata.

When you start work with a data set, your first goals are to understand the data set and to clean it up. Logically these are two separate processes, but in practice they are intertwined: you can't clean your data set until you understand it at some level, but you won't fully understand it until it's clean. Thus this section will cover both.

We'll also introduce a lot of data science concepts in this section. This makes for lengthy discussions of tasks that in practice you can complete very quickly.

In this section, we'll primarily use the file 2000_acs_sample.dta. This file started as the 1% unweighted sample of the 2000 American Community Survey available from IPUMS, but then we took a 1% random sample of the households in that dataset just to make it easier to work with. This data set uses the "source" variables directly from the Census Bureau rather than the "harmonized" variables created by IPUMS, which are much cleaner. For real work you can usually use the harmonized variables, but we're here to learn how to do the kinds of things IPUMS does to create them.

By using this data set in the examples, you'll also gain some basic understanding of the U.S. population and some experience with a real and important data set, but you would not want to use this particular data file for research.

Start a do file that loads 2000_acs_sample.dta, and make it a proper do file (i.e. it keeps a log, starts with a blank slate, etc.):

capture log close
log using first_steps.log, replace

clear all
use 2000_acs_sample

Use this do file for the example code in this section, and those of the exercises that involve working with the ACS sample. Some exercises ask you to work with other data sets, so make a separate do file for those.

Read the Documentation

When you download a data set, you'll be tempted to open it up and go to work right away. Resist! Time spent reading the data set's documentation (assuming there is some) can save you much more time down the road. Data providers may give you files containing documentation along with the data itself, or it may be on their web site. Feel free to skim what's not relevant to you—this section will give you a better sense of what information is most important.

Unfortunately, not all data sets have good documentation, or any documentation at all, so figuring out the nature of a data set by looking at the data set itself is a vital skill. You also can't assume that the documentation is completely accurate, so you need to check what it says.

The ACS has lots of good documentation, but for practice we'll make minimal use of it (just the codebook) and figure out everything we can for ourselves. We'd still do all the same things if we were using the documentation; we'd just understand what we were looking at much more quickly.

Identify the Variables

The describe command will give you basic but useful information about your data set:

describe

The primary goal of running describe is to see what variables you have and what they're called. The variable labels will help you understand what they mean. But it will frequently let you start a "to do list" of issues you need to address before analyzing the data. Here are some issues brought out by running describe on this data set:

  • The data set seems to have an excess of identifiers: serial and us2000c_serialno both appear to identify households, and pernum and us2000c_pnum both appear to identify individual persons.
  • Since you're using a single data set from a single year, you don't need year and datanum to tell you where each observation comes from.
  • For the same reason, you also don't need us2000c_ (US 2000 Census) in your variable names to tell you where those variables come from.
  • You have both household weight (hhwt) and person weight (perwt) variables even though this is supposed to be an unweighted sample.
  • us2000c_serialno, us2000c_pnum, and the rest of the us2000c_ variables are stored as strings of various lengths, even though some of them are clearly intended to be numeric variables.

Another issue to watch out for is the storage type of identifier variables. The default variable type, float, has seven digits of accuracy. Identifiers that are more than seven digits long must be stored as double (numbers with 16 digits of accuracy), as long (integers with up to 10 digits), or as strings (up to a billion characters, but stored as text). In this data set serial is stored as double and us2000c_serialno and us2000c_pnum as string, so those are fine. The pernum variable, on the other hand, is an int, so it can only have five digits. (This is our first hint that pernum by itself is not a unique identifier.)
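
You can see the problem for yourself with a quick sketch (run it in a throwaway session, not in your do file, since it clears your data):

clear
set obs 1
gen float badid = 123456789
format badid %12.0f
list badid

The float cannot hold all nine digits, so list shows 123456792 rather than the number you stored.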

Note which variables have value labels: with those variables, what you see in output will be the labels and not the underlying values. You need to use the values in your code. You can see what all the underlying values are with label list. This data set makes the interesting choice of applying value labels to year which are identical to the values themselves. (Most likely there are some circumstances where they apply a different label.) You can also list a single label using the label names you get from describe:

label list gq_lbl

This is a more typical set of value labels. Looking at this definition tells you that if you wanted to limit your analysis to "Households under 1970 definition" the code for that restriction would be if gq==1.

Exercise: Carry out the same steps with the file 2000_acs_sample_harm.dta (create a new do file that loads the data, then runs describe and label list). This is a similar sample but with the IPUMS "harmonized" variables. What issues did IPUMS resolve? What issues remain?

Look at the Data

Unless your data set is very small, you can't possibly read all of it. But just looking at a few rows may allow you to immediately spot patterns that would be difficult to detect using code. In this case, opening the data browser (type browse or click the button that looks like a magnifying glass over a spreadsheet) will show the following:

  • year and datanum seem to always have the same value, suggesting that we don't need them.
  • hhwt and perwt (household and person weights) seem to always be 100.00, which makes sense given that this is supposed to be an unweighted sample.
  • pernum and us2000c_pnum appear to be identical other than us2000c_pnum being a string and having leading zeros.
  • pernum seems to count observations, starting over from 1 every time serial changes.
  • All of the string variables contain mostly numbers.
  • us2000c_sex, us2000c_hispan, us2000c_race1, and us2000c_marstat are clearly describing categories, but even though they are string variables they only contain numbers. We will have to refer to the codebook to find out what the numbers mean. (This also applies to us2000c_educ; it's just not as obvious at this point.)
  • us2000c_inctot sometimes has "BBBBBBB". This is a code used by the Census to indicate missing values.

You can't be sure that these patterns hold for the entire data set until you check, but now you know some things to check for. For example, you can run tab year and tab datanum to confirm that they always contain the same value:
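
tab year
tab datanum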

Exercise: Open 2000_acs_sample_harm.dta in the data browser. What issues do you notice? Be sure to look closely at inctot.

Find the Identifiers and Figure Out the Data Structure

Imagine you wanted to call someone's attention to a particular value in this data set. How could you tell them exactly which row and column to look at?

Observation Numbers

One way to tell them which row to look at is to simply tell them the observation number. Observation numbers are tracked by the system variable _n, which you can use in commands as if it were any other variable. If you are using by, the variable _n will take that into account, meaning it will start over from 1 in every by group. We'll use that heavily later.

The trouble with observation numbers is that many common tasks change them: sorting, for example, or dropping or adding observations. Thus it's far better to use one or more identifier variables. If the data set does not contain identifier variables, you can create one based on the current observation numbers with gen id = _n. The variable id will not change even if the observation numbers do.
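
A minimal sketch of the idea (hypothetical; this data set already has identifiers, so you don't need to run it):

gen id = _n
sort us2000c_age
list id in 1/5

After the sort, _n reflects the new order, but id still identifies the original rows.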

Identifier Variables

Identifier variables, also known as primary keys or indexes, are variables whose purpose is to identify observations. They normally do not contain information. If you had a data set where each row was a UW-Madison student, their UW-Madison ID number would be a unique identifier: specifying a UW-Madison ID number allows you to identify a single row.

Now imagine a data set describing UW-Madison students, but there is one row for each class the student is currently taking, with each class being identified by a number. In order to identify a specific row you now need to specify both a UW-Madison ID number and a class number. The combination of ID number and class number would be a compound identifier for the data set, and the combination of the two is a unique identifier.

Finding the identifier variables in a data set will help you understand the structure of the data. If student ID is a unique identifier, you know you have one row per student and an observation represents a student. If student ID and class number are a unique identifier, you know you have one row per student per class, and each observation represents a class/student combination. In practice you might refer to an observation as a class, as long as everyone understands that two students taking the same class will have two rows, not one.

The duplicates report command can easily identify whether a variable or set of variables is a unique identifier: if it is, there will be no duplicates. The ACS data has a variable called pernum (Person number in sample unit): is it a unique identifier?

duplicates report pernum

Clearly not: only one observation can be uniquely identified using pernum. Running tab on an identifier wouldn't normally be useful because there will be too many values, but the large number of duplicates suggests it might be fruitful here:

tab pernum

Seeing the values, it's clear what this variable is: the person's identifier within their household. So pernum is not a unique identifier by itself, but it is part of a compound identifier that is. Try:

duplicates report serial pernum

There are no duplicates, so we now know that serial and pernum are a compound identifier for this data set. The serial variable identifies a household and pernum identifies a person within that household. We also know that each observation represents a person, and the people are grouped into households.

Column Identifiers

In Stata, columns are identified by variable names. Variable names are always unique (Stata won't allow you to create two variables with the same name) but they often have multiple parts. In this data set, some of the variable names are in the form source_variable, like us2000c_sex and us2000c_age. It can be very useful to think of such variable names as a compound identifier with two separate parts (e.g. source and variable). It's even possible to convert row identifiers to part of a compound column identifier and vice versa—we'll learn how later.

Using Identifiers

Stata does not give identifiers any special status, so you use them like any other variable. If you want to see observation 5, run:

list if _n==5

If you want to see person 1 in the household where serial is 242, run:

list if pernum==1 & serial==242

To identify a specific value in the data set you specify both the row identifier and column identifier (i.e. variable name):

list us2000c_age if pernum==1 & serial==242

Another way to identify the value of a specific variable for a specific observation number is to put the observation number in square brackets after the variable name.

display us2000c_age[5]

This square bracket syntax can be used in mathematical expressions, which is very useful.
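
For example, since pernum is numeric, you can do arithmetic with individual values:

display pernum[5] - pernum[4]

The subscript can itself be an expression: pernum[_n-1] refers to the value of pernum in the previous observation.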

Note: while the list command lists your data, the display command prints out a single thing. This can be the result of evaluating a mathematical expression (display 2+2) or a message (display "Completed discussion of identifiers"), both of which are quite useful.

Exercise: Open the data set atus.dta (use your separate do file for exercises). This is a selection from the American Time Use Survey, which measures how much time people spend on various activities. Find the identifiers in this data set. What does an observation represent? What was the first activity recorded for person 20170101170012?

Get Rid of Data You Won't Use

Understanding data takes time. Even skipping past data you don't care about to get to what you do care about takes time. So if you won't use parts of a data set, get rid of those parts sooner rather than later. Doing so will also reduce the amount of memory needed to analyze your data and the amount of disk space needed to store it, and make anything you do with it run that much faster. If you change your mind about what you need, you can always change your do file later.

The drop command can drop either variables or observations, depending on whether it is followed by a variable list or an if condition. To drop year and datanum, run:

drop year datanum

Let's declare that you don't want to include individuals living in "group quarters" (prisons, for example) in your analysis. Drop them with:

drop if gq==3 | gq==4

The keep command works in the same way, but in the opposite sense: running keep year datanum would drop all the variables except year and datanum. Running keep if gq==3 | gq==4 would drop everyone except those in group quarters. If you start with a data set that contains thousands of variables but only intend to use a dozen (very common), a keep command that cuts the data set down to just that dozen is an excellent starting point. On the other hand, if those variables have names that are codes rather than anything meaningful (also very typical in large data sets), consider renaming them first so you only have to use the codes once.
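
For example, if a data set contained thousands of variables but your analysis only needed a handful, the top of your do file might contain something like this (a hypothetical sketch with made-up variable names):

keep hhid pnum income edu race

Everything after that command runs against a much smaller, easier-to-understand data set.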

Change Variable Names

A good variable name tells you clearly what the variable contains. Good variable names make code easier to understand, easier to debug, and easier to write. If you have to choose between making a variable name short and making it clear, go with clear.

Variable names cannot contain spaces, but many variable names should contain multiple words. There are two competing conventions for making multi-word variable names readable. Camel case capitalizes the first letter of each word after the first: householdIncome, mothersEducation, etc. Snake case uses underscores instead of spaces: household_income, mothers_education, etc. Which one you use is less important than that you choose one and stick with it: don't force yourself to remember whether you called your variable householdIncome or household_income this time!

When you use abbreviations use the same abbreviation every time, even across projects. This data set abbreviates education as educ, and there's nothing wrong with that, but if you use edu in your other projects that's sufficient reason to change it.

The syntax to change a single variable name is just rename oldname newname. Change pernum to person with:

rename pernum person

The rename command can rename groups of variables as well. We're not planning to combine our example data with others, so we don't need the us2000c_ at the beginning of many of the variable names to tell us which data set the variable came from. You can remove it with:

rename us2000c_* *

This can be read as "rename all the variables that match the pattern us2000c_ followed by something to just the something." Type help rename group to see the various ways you can rename groups of variables.
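
Two other patterns from help rename group that often come in handy (sketches for illustration; don't run them here):

rename (us2000c_sex us2000c_age) (sex age)
rename *, lower

The first renames a list of variables in one command; the second converts all variable names to lowercase.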

Many of these variable names are easy to remember once you know what they are but cryptic on a first reading. We'll make them clearer in the exercise.

Exercise: Rename race1 to race, serial to household, educ to edu, hispan to hispanic, marstat to maritalStatus, and inctot to income.

The data set first_steps1.dta contains the data as it should be when you complete this exercise, including all the other things we've done with it thus far. If you haven't followed all the steps described, you can catch up now by loading that data set.

Convert String Variables that Contain Numbers to Numeric Variables

Unfortunately, it's very common for numeric variables to be imported into Stata as strings. Before you can do much work with them they need to be converted into numeric variables.

The destring command is the easy way to convert strings to numbers. Just give it a varlist to act on, and the replace option to tell it it can replace the existing string variables with the new numeric versions. You have some of these in the ACS sample; destring the ones you're interested in with:

destring sex age hispanic race maritalStatus edu income, replace

Before carrying out a conversion, Stata checks that all the string variable's values can be successfully converted to numbers. In the case of income, Stata found that some values could not be. Technically this is not an error (your do file did not crash) but income was not converted. To explore why, use the gen() option to create a new variable containing the numeric values, and the force option to tell Stata to proceed with the conversion despite the values that cannot be converted.

destring income, gen(income2) force

The force option is so named because you're forcing Stata to do something it doesn't think is a good idea. This should make you nervous! Only use a force option if you're very confident that you understand both why Stata thinks what you're doing might be a bad idea and why it's okay in your case.

With both income (the original string variable) and income2 (the new numeric variable) available to you, you can compare them and figure out what's going on. Observations with values of income that could not be converted have missing values for income2, so get a list of the values of income that could not be converted with tab:

tab income if income2==.

There's only one value that couldn't be converted, "BBBBBBB". This is a code for missing, so it should be converted to a missing value: destring did exactly what you want it to do. Now that you know that, you can go back and just convert income using the force option rather than creating a separate income2 variable. But you want to keep a record of how you know that that's okay to do.

Go back in your do file and comment out (i.e. turn into comments) both the destring command that created income2 and the tab command that checked it against income. Add a comment that explains what you found and why that means you don't have to worry about the values that cannot be converted. Then add a destring command that converts income to a numeric variable using the replace and force options. Rerun your do file so that the data set no longer contains income2 and income is numeric.
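
When you're done, the relevant part of your do file might look something like this:

// destring income, gen(income2) force
// tab income if income2==.
// The only value that can't be converted is "BBBBBBB", the Census
// code for missing, so letting force convert it to missing is safe.
destring income, replace force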

There's also a function that converts strings to numbers, called real() (as in real numbers). You could have created the income2 variable by running gen income2=real(income). The advantage of real() is that you can use it as part of an expression.

The assert command verifies that a condition you give it is true for all observations, making it very useful for checking all sorts of things. We've been reasonably confident that person and pnum are exactly the same (other than pnum being a string) for a long time, but now you can find out for sure:

assert real(pnum)==person

This command asserts that pnum, once converted to a number, is always the same as person. Stata does not complain, which means it agrees. If the assertion were not true, Stata would tell you how often the assertion is false—and then crash your do file. This is a good thing, because if you write a do file that assumes that some condition is true, it's far better for that do file to crash and tell you your assumption is wrong than for it to keep running and give you incorrect results.

Now that you know you don't need pnum, drop it:

drop pnum

Identify the Type of Each Variable

The most common variable types are continuous variables, categorical variables, string variables, and identifier variables. Categorical variables can be further divided into unordered categorical variables, ordered categorical variables, and indicator variables. (There are other variable types, such as date/time variables, but we'll focus on these for now.) Often it's obvious what type a variable is, but it's worth taking a moment to consider each variable and make sure you know its type.

Continuous variables can, in principle, take on an infinite number of values, and those values can differ by arbitrarily small amounts. In practice, all continuous variables must be rounded, as part of the data collection process or just because computers do not have infinite precision. As long as the underlying quantity is continuous, it doesn't matter how granular the available measurements of that quantity are. You may have a data set where the income variable is measured in thousands of dollars and all the values are integers, but it's still a continuous variable.

Continuous variables are sometimes referred to as quantitative variables, emphasizing that the numbers they contain actually correspond to some quantity in the real world. Thus it makes sense to do math with them.

Categorical variables, also called factor variables, take on a finite set of values, often called levels. The levels are typically stored as numbers (1=white, 2=black, 3=hispanic, for example), but it's important to remember that the numbers don't actually represent quantities. Categorical variables can also be stored as strings.

With unordered categorical variables, the numbers assigned are completely arbitrary. Nothing would change if you assigned different numbers to each level (1=black, 2=hispanic, 3=white). Thus it makes no sense to do any math with them, like finding the mean.

With ordered categorical variables, the levels have some natural order. Likert scales are examples of ordered categorical variables (e.g. 1=Very Dissatisfied, 2=Dissatisfied, 3=Neither Satisfied nor Dissatisfied, 4=Satisfied, 5=Very Satisfied). The numbers assigned to the levels should reflect their ordering, but beyond that they are still arbitrary: you could add 5 to all of them, or multiply them all by 2, and nothing would change. You will see people report means for ordered categorical variables and do other math with them, but you should be aware that doing so imposes assumptions that may or may not be true. Moving one person from Satisfied to Very Satisfied and moving one person from Very Dissatisfied to Dissatisfied have exactly the same effect on the mean, but are you really willing to assume that they're equivalent changes?

Indicator variables, also called binary variables or dummy variables, are just categorical variables with two levels. In principle they can be ordered or unordered but with only two levels it rarely matters. Often they answer the question "Is some condition true for this observation?" Occasionally indicator variables are referred to as flags, and more commonly flagging observations where a condition is true means to create an indicator variable for that condition.

String variables contain text. Sometimes the text is just labels for categories, and they can be treated like categorical variables. Other times they contain actual information.

Identifier variables allow you to find observations rather than containing information about them, though some compound identifiers blur the line between identifier variables and categorical variables. They may look like continuous variables because they have so many unique values, but you'll probably find the code you use for categorical variables to be more useful with them.

A useful tool for identifying variable types is the codebook command:

codebook

This produces a lot of output (there's a reason we covered dropping unneeded variables first) and you can skim much of it.

Stata will try to guess the variable type. For continuous variables, it will give you the mean, standard deviation, and percentiles. For categorical variables (including strings), it will give frequencies. For string variables it thinks are not categorical, it will give you some examples.

Some things to note in the codebook output for this data set:

  • It gave summary statistics for household (the household identifier formerly known as serial) and person (the person identifier formerly known as pernum) because it guessed they were continuous variables. This is of course nonsense and can be ignored. The codebook command gives useful output, but not all codebook output is useful.
  • hhwt and perwt really are always 100, so now you know for sure you can drop them.
  • gq is now down to two values, making it an indicator variable. But we don't care about the distinction between the 1970 and 1990 definitions of household so we'll just drop it.
  • sex is an indicator variable coded 1 and 2. We'll have to look up what those numbers mean.
  • age has 92 unique values and a range that looks like actual years, so we can be confident it's a continuous variable.
  • You might think hispanic would be an indicator variable, but with 22 unique values it must be a categorical variable.
  • race and maritalStatus are categorical. Again, we'll have to look up what the numbers mean.
  • With 17 unique values and examples that are plausible numbers of years in school, edu could be a quantitative variable. But the mean and median are low. We'll consider this variable more closely.
  • With 2,614 unique values and values that look like plausible incomes we can be confident income is a continuous variable.

Add the following to your do file:

drop hhwt perwt gq

Exercise: Load atus.dta (use your exercises do file) and run codebook. Identify the variable type of famincome, hispan, asian, and several other variables. How are hispan and asian different from each other?

Recode Indicator Variables

It is highly convenient to frame indicator variables as telling you whether something is true, with "true" coded as 1 and "false" coded as 0. In this data set the variable sex has the levels 1 and 2. Which are the males and which are the females? We'll have to refer to the codebook to find out. Now consider a variable called female coded with 1 and 0. To anyone familiar with the convention that 1 means true and 0 means false, no further documentation is required. This also allows you to write intuitive code like if female and if !female (but only if there are no missing values).

IPUMS provided a codebook file for this dataset, 2000_acs_codebook.txt (note that this is not the same as the output of Stata's codebook command). On line 112, you'll see the coding for sex: 1 is Male and 2 is Female. Create a variable called female instead with:

gen female = (sex==2)

Recall that if you set a variable equal to a condition, it will get a 1 if the condition is true and a 0 if the condition is false.

This command relies on the fact that sex is never missing in this data set (as you saw by running codebook). If you had missing values for sex, you'd make sure female was also missing for those observations with:

gen female = (sex==2) if sex<.

It also relies on the fact that sex is binary in this data set: anyone who is not female is male. That is likely to become less common over time.

A good way to check your work for these kinds of tasks is to run a crosstab of the new and old variables. All the observations should be in table cells that make sense (e.g. 1 for sex and 0 for female) and none should be in the table cells that don't make sense (e.g. 2 for sex and 0 for female). Be sure to include missing values:

tab sex female, miss

Exercise: Create an indicator variable for "this person is hispanic." Use the IPUMS codebook to see which values of the existing hispanic variable identify someone as hispanic. Call the new variable hisp at first. Check your work by running a crosstab of hisp and hispanic, then drop hispanic and rename hisp to hispanic. Also drop sex.

Set Labels

Labels make a data set easier to use and understand—you've seen the benefits of variable labels as you've worked to understand this data set. Value labels can be even more useful by telling you what categorical variables mean: right now variables like race are useless without referring to the codebook.

Labels are set using the label command. The label command has many subcommands, like the label list you've used already.

Set Variable Labels

You can set variables labels with the label variable command: just specify the variable to be labeled and the text to label it with.

Variables like female are arguably clear enough that they don't need labels, but set one for it anyway just for practice:

label variable female "Person is female"

Exercise: Set a similar variable label for hispanic.

Set Value Labels

Value labels are a mapping from a set of numbers, the levels of your categorical variables, to a set of text labels that tell you what each level means. The first step in using them is to define the mapping using label define:

label define maritalStatusLabel ///
1 "Now married" ///
2 "Widowed" ///
3 "Divorced" ///
4 "Separated" ///
5 "Never married"

These meanings come from the codebook file, line 246. Note how the use of multiple lines and indentation makes this command easier to read.

The next step is to label the values of one or more variables with the mapping defined. This is done with the label values command:

label values maritalStatus maritalStatusLabel

If you wanted to apply this mapping to more than one variable you'd just list them all before ending with the name of the label.

See the results by running:

tab maritalStatus

If a mapping will only apply to one variable, it may be convenient to give the mapping the same name as the variable. This makes for a label values command that looks confusing to new Stata users (e.g. label values maritalStatus maritalStatus), which is why we don't do it here, but it makes it very easy to remember the name of the label mapping for each variable.

You also need to remember the underlying values so you can use them in your code. In most cases this will come naturally as you work with the data. But if you don't mind somewhat uglier output you can put the numbers directly in the labels (e.g. "1: Now married" or "Now married (1)").
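
If you took that approach, the definition might look like this (a sketch; the replace option overwrites the mapping defined earlier):

label define maritalStatusLabel ///
1 "1: Now married" ///
2 "2: Widowed" ///
3 "3: Divorced" ///
4 "4: Separated" ///
5 "5: Never married", replace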

In the process of setting value labels you'll discover something important about edu: it is a categorical variable, not a quantitative variable meaning "years of school." Of course reading the documentation before getting started would have told you this long ago.

Exercise: Set value labels for race, using the codebook to find the meaning of each level. You may want to shorten the descriptions. Hint: 5 really means "They checked the box for American Indian or Alaska Native, but they didn't specify a tribe so we don't know if they're an American Indian or an Alaska Native." Optional: set value labels for edu.

The data set first_steps2.dta contains the data as it should be when you complete this exercise, including value labels for edu and all the other things we've done with it thus far. If you haven't followed all the steps described, you can catch up now by loading that data set.

Examine Variable Distributions

Understanding the distributions of your variables is important for both data cleaning and analysis.

Continuous Variables

For a continuous variable, like income, summarize (sum) is the place to start for understanding its distribution:

sum income

Things to note:

  • These statistics were calculated across 21,266 observations, while the data set has 27,410. This reflects 6,144 missing values (the former "BBBBBBB" codes).
  • The mean is $27,724.1, which seems low, but keep in mind it includes children, retirees, people who are unemployed, etc.
  • The minimum value is -10000. Negative values of income need further thought.

This leads to further questions:

Who has a missing value for income?

sum age if income==.
sum income if age<15

This tells us income is only missing for people under the age of 15, and is always missing for them.
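
If later code will rely on this pattern, consider having your do file verify it with assert:

assert income==. if age<15
assert age<15 if income==.

If either assertion ever fails, your do file will stop and tell you rather than quietly relying on a false assumption.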

Who has an income less than 0? Run:

sum age if income<0
tab edu if income<0

People with incomes less than zero are all over the age and education distribution, making it hard to understand what it might mean. Also note that it's quite rare:

count if income<0

For analysis purposes, an important question is whether the variable is normally distributed, or even close to it. Add the detail option for some hints:

sum income, detail

With a mean that's much higher than the median (50th percentile) and a skewness of 4.95, income is clearly not normally distributed. You'll find that almost all income distributions are strongly right-skewed.

A histogram is a great tool for understanding the distribution of continuous variables. Use the freq option to have the y-axis labeled in terms of frequency rather than density:

hist income, freq

Unfortunately, the outliers drive the scale, making it hard to say much about the distribution of the bulk of the observations. Consider looking at a subset of the incomes:

hist income if income<100000, freq

Exercise: Examine the distribution of age. How close is it to normally distributed? Do you see any concerns? Since age is an integer, run another histogram with the discrete option, which gives you one bin for each value. Any concerns now?

The age variable has been top-coded, meaning that anyone older than 93 has been assigned a 93. This was done to protect the privacy of these individuals since there are not very many of them.
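
You can see the top-coding directly by tabulating the top of the age distribution:

tab age if age>=90

Expect a spike at 93, since that single value now represents everyone aged 93 and older.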

Categorical Variables

For a categorical variable, like edu, tabulate (tab) is the place to start for understanding its distribution:

tab edu, miss

Things to note:

  • There are no missing values, or at least none coded as such.
  • The category "Not in universe" needs some investigation.
  • There are more people in lower education categories than you might expect (41.7% did not graduate from high school).
  • There may be more categories here than are useful, so consider combining them.

Further questions:

Who are the people with "Not in universe" for edu?

tab age if edu==0
tab edu if age<3

People with "Not in universe" are all under the age of 3, and people under the age of 3 are always "Not in universe." The Census Bureau uses "Not in universe" to denote "legitimate skips": questions that were intentionally not asked or not answered because they don't apply.

But note that children aged three and older are included in the other categories of edu. Does this explain why we have more people in lower education categories than expected?

tab edu if age>=18

If you limit the sample to adults, the percent who did not graduate from high school falls to 20.8%.

A bar graph is a great tool for understanding the distribution of categorical variables. Unlike a frequency table, a reader can absorb the information in a bar graph instantly. It may take some work to make them presentable; Bar Graphs in Stata discusses some of the tricks needed. You can see the problem if you try to create the default bar graph for edu (assuming you set value labels for it):

graph bar, over(edu)

There's not enough space on the x-axis for all the labels. But this problem has a very simple solution: switch to a horizontal bar graph.

graph hbar, over(edu)

This makes horizontal the format of choice for bar graphs.

The syntax for bar graphs may seem confusing: why the over() option rather than just graph hbar edu? The reason is that bar graphs are also used to examine relationships between variables. A variable list is used to specify the variable that defines the lengths of the bars (by default the mean of that variable), while the over() option specifies the variable that defines the bars themselves, as in:

graph hbar income, over(edu)

We hope the relative lengths of the last two bars do not surprise or disappoint any PhD students reading this.

Exercise: examine the distributions of race, maritalStatus, female, and hispanic. Investigate who the people who marked "Other" for race are. The 2010 Census (i.e. the census after the one we're examining) added a clarifying note to the questions about hispanic origin: "For this census, Hispanic Origins are not races." Can you see why? What problems did not having that note create in the 2000 data?

Recode Values that Mean Missing to Missing Values

Stata uses special codes to indicate missing values. For numeric values there's the generic missing value, ., plus the extended missing values .a, .b, .c, through .z. (Recall that as far as greater than and less than are concerned, .<.a, .a<.b, etc. so if x<. excludes all missing values of x.) For strings, the empty string, "" (i.e. quotes with absolutely nothing inside, not even a space) means missing. These values receive special treatment. For example, statistical commands exclude missing values from their calculations.

Data sets often use different codes to indicate missing values. We saw that with income: "BBBBBBB" meant missing. This was automatically converted to a missing value when you converted income from a string variable to a numeric variable, but it won't always be that simple. It's very common to use negative numbers to mean missing, especially -9. Stata will not recognize that -9 means missing and will include it in statistical calculations, giving incorrect results.
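
For example, a survey that used -9 for "refused" and -8 for "don't know" (hypothetical codes, not from this data set) might be cleaned up with extended missing values so you can still tell the two reasons apart:

replace x = .a if x==-9
replace x = .b if x==-8

Statistical commands will then exclude both, but the distinction is preserved for when you need it.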

The solution is to identify values that really mean missing, and then change them to missing values. For edu, the value 0 ("Not in universe") means missing. Change it with:

replace edu = . if edu==0

What about negative values for income? Is that really a code for missing? Take a look with:

tab income if income<0

The variety of values suggest these are actual quantities, not codes. Most likely these are people who lost money on investments—which means they probably have substantial wealth to invest. This leads to the paradox that while a low value of income is associated with low socio-economic status, a negative value of income might be associated with high socio-economic status. Ideally you'd consider wealth as well as income as proxies for socio-economic status, but for today "solve" the problem by changing all negative values of income to missing:

replace income = . if income<0

The same code could be used for a variable that has multiple codes for missing if they are all negative numbers.

Exercise: The income and maritalStatus variables treat children differently. Children under 15 do not have a zero for income; they have a missing value. But children under 15 all have "Never married" for maritalStatus. Recode maritalStatus so they have a missing value instead. How will that change subsequent analysis of the maritalStatus variable?

Examine Missing Data

Once all missing values are coded in a way Stata recognizes, the misstable sum command will give you a very useful summary of the missing data in your data set.

misstable sum

Any variable that is not listed does not have any missing values. Thus you now know you can write conditions like if age>=65 without worrying about missing values of age being included. You also see how many missing values you have for each variable, and thus how big an issue missing data is in your data set.

What misstable cannot tell you is why the data are missing, and in particular whether they are missing completely at random. If they are not, their presence is likely to bias your analysis.

Often you can answer this question by examining the relationships between missing values and other variables, as we did when we looked at the distributions of the variables. For example, edu is clearly not missing at random: it is missing if and only if the person is less than three years old. The distribution of edu for these children is presumably made up mostly of zeros ("None") and perhaps some ones ("Nursery school-4th grade"), which is very different from the distribution of edu for the observations where edu is not missing.

You should also consider relationships between missing values. In this data set, the apparent relationship between the missingness of income and the missingness of maritalStatus is driven by their individual relationships with age. But in many data sets there are direct relationships. For example, if a subject could not be located for one wave of a survey then they may have missing values for all the variables for that survey wave.

Exercise: What is the likely distribution of income for people with a missing value for income? How is that different from the distribution of income for those where income is known? How would the mean of income change if all its missing values became known?

At this point the data set is reasonably clean, and what you do with it next will depend on how you plan to use it. For example, if you wanted to use education as a predictor in a regression model it would probably be wise to combine some of the categories, but if you were doing a descriptive study you might leave it as is. Wrap up this do file by saving the cleaned-up data set (never saving it over the original file) and closing the log:

save 2000_acs_clean, replace
log close

Next: Variable Transformations

Last Revised: 10/16/2020