Running Large SAS Jobs on Linstat

Linstat can run the vast majority of SAS jobs without any difficulty. However, if your SAS programs use extremely large data sets or do a lot of reading and writing of data sets, you need to understand where SAS stores your data sets, and how and where to run your programs, so that they get the resources they need without slowing down the server for others.

Where SAS Stores Files

SAS can put files in three types of storage:

/ramdisk

Linstat allows programs to use up to 24GB of RAM as if it were disk space in a directory called /ramdisk. This space is much faster than real disk space. SAS on Linstat is configured to use /ramdisk as its default WORK library, where all temporary data sets are stored. Files in /ramdisk that are not in use will be deleted after three hours.
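If you want to confirm where WORK points in a particular session, a quick check like the following may help. It is a minimal sketch that uses the standard PATHNAME function and the WORK system option. The exact subdirectory SAS creates for a session will vary, but with the default configuration described above the path should begin with /ramdisk.

```sas
/* Write the physical location of the WORK library to the log. */
%put NOTE: WORK library is at %sysfunc(pathname(work));

/* Write the value of the WORK system option to the log. */
%put NOTE: WORK option is %sysfunc(getoption(work));
```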

/tmp

The /tmp directory is a directory on each server's local hard drive. It is much faster than network disk space, but much slower than /ramdisk. Linstat has about 200GB of space in /tmp.

The trouble with /tmp (and one reason /ramdisk is much preferred whenever it can be used) is that disk I/O is difficult for servers to manage. While today's servers have many CPUs and very large amounts of RAM, they still have just one disk and one I/O bus. Switching between disk I/O tasks is slow, so if a server is busy with disk I/O, even a trivial command like cd or ls can take several seconds to execute. This makes the server almost unusable.

Network Disk Space

Permanent data sets are generally stored in network disk space, such as home and /project directories. /project directories can be very large but they are much slower than either /tmp or /ramdisk.

While intensive network I/O has much less effect on a server than disk I/O, it could in principle overwhelm the SSCC's network storage and slow down access to files stored on the network for everyone. We do not believe a single SAS program can generate enough network I/O to significantly affect our current network storage devices, but we're not eager to have you test that theory.

Making Programs Run Faster

Since permanent data sets are written to the (relatively slow) network and temporary data sets are written (by default on Linstat) to /ramdisk, using temporary data sets whenever possible can dramatically improve the performance of programs that use large data sets.

Consider the following SAS program:

data '/project/example/intermediate_data_set';
  set '/project/example/input_data_set';
  /* various commands... */
run;

data '/project/example/output_data_set';
  set '/project/example/intermediate_data_set';
  /* more commands... */
run;

Since all of the data sets used are permanent data sets, all of the disk I/O generated by this program goes to the network. But consider this variation:

data intermediate_data_set;
  set '/project/example/input_data_set';
  /* various commands... */
run;

data '/project/example/output_data_set';
  set intermediate_data_set;
  /* more commands... */
run;

intermediate_data_set is now temporary and stored in /ramdisk, which makes reading and writing it much faster. The more steps you have between reading your initial input and writing your final output, the more time you will save by making the intermediate data sets temporary.

Never use a data step simply to reset the default data set. Consider this construction:

data '/project/example/output_data_set';
  set '/project/example/output_data_set';
run;

proc freq;
run;

From the user's perspective, all the data step does is make output_data_set the most recently created data set, which is the one proc freq acts on by default. However, SAS treats it as a real data step, reading the entire file and then writing it again unchanged. Use the following instead:

proc freq data='/project/example/output_data_set';
run;

Your programs will also be easier to read if each proc step states explicitly which data set it acts on.

Preventing Programs from Running Out of Space

The vast majority of SAS programs require much less than the 24GB of space available in /ramdisk on Linstat. If your program requires more, here are your options:

WORK Space Required   WORK Location         Command                                Notes
<24GB                 /ramdisk              sas program
24GB-200GB            /tmp                  sas -work /tmp program                 (1)
>200GB                /project/myproject    sas -work /project/myproject program   (2)

(1) These jobs may slow down the server they run on. If they can be run at night, please do so.

(2) Any job that needs to write this much data to the network will be extremely slow. You would be well advised to spend time making sure your code is as efficient as possible before running such programs.
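To judge which row of the table applies, it may help to measure how much space your temporary data sets actually occupy. The following is a rough sketch that queries the standard DICTIONARY.TABLES view from proc sql. It assumes that filesize is populated for your data sets, and it does not count the utility files SAS creates during steps such as sorting, so treat the result as a lower bound. Run it at the point in your program where WORK is fullest.

```sas
/* List each data set currently in WORK with its size on disk, largest first. */
/* filesize is reported in bytes.                                             */
proc sql;
  select memname,
         nobs,
         filesize format=comma20.
    from dictionary.tables
    where libname = 'WORK' and memtype = 'DATA'
    order by filesize desc;
quit;
```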

Putting "Permanent" Data Sets in Temporary Space

Both /ramdisk and /tmp can also be used to store "permanent" data sets. This can reduce the amount of WORK space your program requires, or optimize your disk usage. For example, if your program can almost run using /ramdisk as WORK, storing a few files as permanent data sets in /tmp instead of as temporary data sets may be enough to let the others stay in /ramdisk, and access to those few files will be slowed less than if you put them on the network. If your program's temporary space needs are large enough that you must set WORK to /tmp, you could still put up to 24GB of "permanent" files in /ramdisk so that they can be accessed more quickly. Ideally these would be your most heavily used files.
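One way to do this is to assign a libref to a directory in /tmp or /ramdisk and write data sets there. The sketch below assumes a hypothetical directory /tmp/myproject that already exists and a hypothetical libref named scratch. Neither is part of the SSCC configuration, so substitute your own names.

```sas
/* Hypothetical libref pointing at a directory on the server's local disk. */
/* The directory must already exist before the libname statement runs.     */
libname scratch '/tmp/myproject';

/* This data set is written to /tmp rather than WORK, so it does not */
/* count against the space available in /ramdisk.                    */
data scratch.intermediate_data_set;
  set '/project/example/input_data_set';
  /* various commands... */
run;
```

Later steps can read the data set as scratch.intermediate_data_set, just as they would read a data set in any other library.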

Just be sure to delete any permanent data sets you create in /ramdisk or /tmp at the end of your program using proc datasets.
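A cleanup step along these lines could go at the end of the program. It is a minimal sketch: the libref scratch, the directory /tmp/myproject, and the data set name are hypothetical placeholders for whatever your program actually created.

```sas
/* Hypothetical libref for the directory where the program wrote its scratch data sets. */
libname scratch '/tmp/myproject';

/* Delete the named data set; nolist suppresses the directory listing in the log. */
proc datasets library=scratch nolist;
  delete intermediate_data_set;
quit;

/* Release the libref once the files are gone. */
libname scratch clear;
```

If the library contains nothing you want to keep, the kill option on the proc datasets statement deletes every member of the library in one step, so use it with care.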

The SSCC Help Desk can help you implement these suggestions and generally assist you with running large SAS programs. We'd also appreciate hearing about your experience running such programs on our servers: our server configuration and even purchasing decisions are often driven by making sure such large, ambitious projects can be run successfully.

Last Revised: 7/9/2013