SAS implements value labels (e.g. 0 is male, 1 is female) by allowing you to define custom formats. The difficulty is that these formats are not saved as part of the data they label. This article will discuss ways of storing SAS formats such that they can be used in subsequent programs.
As an example, consider a fabricated data set of individuals. For each individual you have their gender, their age, and their income. You want to do three things with this data: read it in and prepare it for analysis, get basic summary statistics, and regress income on the other variables. Because you don't want to re-run all the previous steps as you debug the one you're working on, you put each step in a separate SAS program. However, you want to apply the same value labels to gender in all three programs, so you need a way to store the custom format between programs.
The Task
First let's go through what you want to accomplish, ignoring the issue of formats for the moment.
Start by reading in the fabricated data. If this were actual data you'd probably use a combination of infile and input, but we'll use datalines and put the data right in the data step.
data 'incomedata';
input gender age income;
datalines;
0 50 60000
1 45 80000
1 30 25000
0 25 18000
1 72 40000
;
run;
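Note that because the data set name is in quotes, SAS treats it as a file name and stores the data as a permanent data set (the file incomedata.sas7bdat in the current directory), so you can use it in later programs.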
Next you want some basic summary statistics. So run proc freq and proc means:
proc freq data='incomedata';
run;
proc means data='incomedata';
run;
Of course in the real world you'd probably do something more sophisticated.
Finally we'll run a regression.
proc reg data='incomedata';
model income=gender age;
run;
This is very simple as well--in reality something this easy could all go in one program, but we'll keep them separate for pedagogical purposes.
Review: Defining and Using Formats
Formats in SAS are defined using proc format, and are applied to variables using the format statement. So to apply a label to the gender variable, the first step is to define a format that associates 0 with male and 1 with female. We'll call it genderformat.
proc format;
value genderformat
0= 'male'
1= 'female'
;
run;
Next you need to associate that format with the gender variable:
format gender genderformat.;
This statement must of course be part of a data or proc step. This could be a separate data step just to apply the format, or it could be added to an existing data or proc step.
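As a minimal sketch of the second possibility, the format statement can go directly in a proc step, so no extra data step is needed. This assumes incomedata has already been created by the data step shown earlier; the tables statement is added here only to limit the output to gender.

```sas
/* Define the format (it must exist before it is used) */
proc format;
  value genderformat
    0 = 'male'
    1 = 'female'
  ;
run;

/* Apply the format for the duration of this proc only */
proc freq data='incomedata';
  tables gender;
  format gender genderformat.;
run;
```

A format applied this way affects only the output of that proc. The data set itself is left unchanged.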
The difficulty is that genderformat is stored in a temporary catalog in the WORK library, so it goes away as soon as the SAS session that defines it ends. So how can you use it in all three programs?
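If you want to see where a format defined this way ends up, the following sketch may help. By default proc format writes to the catalog WORK.FORMATS, and the fmtlib option prints the contents of a format catalog. The format is redefined here only so the example runs on its own.

```sas
/* Define the format without a library= option */
proc format;
  value genderformat
    0 = 'male'
    1 = 'female'
  ;
run;

/* List the formats in the temporary catalog WORK.FORMATS */
proc format library=work fmtlib;
run;
```

Everything in the WORK library is deleted when the SAS session ends, and the format goes with it.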
Including the Format Definitions in All Your Programs
One option is to simply include all the formatting code in every SAS program that uses genderformat. The first program, the one that just reads in the data, produces no output (other than the data set) and does not need to know about the format. In fact, including the format statement in that data step would complicate things, as we'll see later. But the other two do need the format, and thus need the code that handles it. Here is the complete code of those programs, including the formatting.
Summary statistics:
proc format;
value genderformat
0= 'male'
1= 'female'
;
run;
data formatteddata;
set 'incomedata';
format gender genderformat.;
run;
proc freq data=formatteddata;
run;
proc means data=formatteddata;
run;
Regression:
proc format;
value genderformat
0= 'male'
1= 'female'
;
run;
data formatteddata;
set 'incomedata';
format gender genderformat.;
run;
proc reg data=formatteddata;
model income=gender age;
run;
The disadvantage of this approach is obvious: the programs are now about three times as long. In reality it could be much worse--many data sets include pages and pages of value labels, making your programs extremely long and somewhat cumbersome to work with. And if you wanted to make any changes to a format, you'd have to change the copy in each program. On the other hand, this method is straightforward to implement.
Saving Formats in a Catalog File
The alternative is to save the format in a separate file SAS calls a catalog. Then subsequent programs can refer to this catalog when they need the format.
In order to save the format, you'll add the library= option to the proc format statement, telling SAS where to put it. You'll need to define the library first with a libname statement. The program below will create a file called gender.sas7bcat in your current directory. You'll also need to tell SAS to look for formats in that catalog file. This is done with the fmtsearch option. Since genderformat is now permanent, you can make the association between gender and genderformat permanent by including the format statement in the data step that reads in the data. Here's the complete code for the data preparation program:
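/* Point the libref dir at the current directory */
libname dir ".";
/* Store the format in the catalog dir.gender (gender.sas7bcat) */
proc format library=dir.gender;
value genderformat
0= 'male'
1= 'female'
;
run;
/* Tell SAS to search dir.gender when it looks up a format */
options fmtsearch=(dir.gender);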
data 'incomedata';
format gender genderformat.;
input gender age income;
datalines;
0 50 60000
1 45 80000
1 30 25000
0 25 18000
1 72 40000
;
run;
The payoff comes in the next two programs. When they load incomedata, the file will tell them that gender should be formatted using genderformat. They'll then look for genderformat, so you'll need to tell them where to look using the fmtsearch option. But that's it!
Summary statistics:
libname dir ".";
options fmtsearch=(dir.gender);
proc freq data='incomedata';
run;
proc means data='incomedata';
run;
Regression:
libname dir ".";
options fmtsearch=(dir.gender);
proc reg data='incomedata';
model income=gender age;
run;
Obviously this is much shorter than redefining the format in each program, especially if you've got a lot of formats. But there is a catch. The data set now records that gender should be formatted using genderformat, and if SAS cannot find genderformat it will, by default, stop with an error rather than use the data set. You'll need to make sure the catalog file stays with the data set, and that everyone who uses it knows how to set the fmtsearch option.
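One way to check that the catalog and the data set still belong together is sketched below. It assumes the libref dir and the catalog gender from the programs above. The first step prints the formats stored in the catalog, and proc contents lists the format attached to each variable in the data set.

```sas
libname dir ".";

/* Print the definitions stored in the catalog dir.gender */
proc format library=dir.gender fmtlib;
run;

/* List the variables in incomedata along with their formats */
proc contents data='incomedata';
run;
```

If a format name appears in the proc contents listing but not in the catalog, programs that use the data set will not be able to find it.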
If you ever do get stuck with a data set that has been formatted using formats you don't have access to, the following trick can be useful: tell SAS to change the format to nothing in the data step that first loads the data. The following format statement will clear all formats from a data set. The result may not be pretty, but it will be usable:
format _ALL_;
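A sketch of how this might look in practice follows. The data set name unformatted is made up for the example. The second part shows a different technique from the one described above: the nofmterr system option, which tells SAS to carry on with a default format when the requested one cannot be found. If the first step still stops with a missing-format error in your installation, setting nofmterr before it is worth trying.

```sas
/* Option 1: copy the data set, removing all format associations */
data unformatted;
  set 'incomedata';
  format _all_;
run;

/* Option 2: keep the data set as it is, but let SAS continue
   when a format cannot be found */
options nofmterr;
proc freq data='incomedata';
  tables gender;
run;
```

Either way gender will be displayed as 0 and 1 rather than male and female.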
So which method should you use? Most likely as you read about the two techniques one or the other seemed easier to you. Whichever that was, go with it. It really is just a matter of personal preference.
Last Revised: 9/13/2005