Using compressed data files in SAS

Native SAS datasets (.sas7bdat files) generally use disk space inefficiently, since they reserve space for the maximum size of every variable, even where values are missing. They compress very well with any of the standard lossless file compression tools (zip, gzip, bzip2 etc.), but SAS cannot use them in that form. External data files to be read by SAS (e.g. CSV or formatted text files) may be very large and also compress well, but SAS must be told how to read them. Using compressed data files may give higher performance, especially when network file storage is in use, since the reduced I/O can more than compensate for the CPU effort needed to decompress.

These notes describe four ways to reduce the size of SAS datasets and deal with compressed input files.

Eliminating variables

You can save memory, I/O and disk space in SAS datasets by eliminating variables that you won't use as early as possible in the processing (e.g. before any sorting).
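As a minimal sketch of dropping variables early, the KEEP= dataset option can be placed on the input dataset so that unwanted variables are never read into the step. The example uses SASHELP.CARS, a sample dataset shipped with SAS, purely as a stand-in for your own data; the output dataset names are made up for illustration.

```sas
/* KEEP= on the input dataset: only the listed variables are read. */
data work.cars_small;
    set sashelp.cars (keep=Make Model MSRP);
run;

/* The same option works directly on a procedure's input, so the
   sort never has to carry the unwanted variables. */
proc sort data=sashelp.cars (keep=Make Model MSRP)
          out=work.cars_sorted;
    by Make Model;
run;
```

DROP= works the same way when it is easier to list the variables you do not want.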

This paper sugi27/p023-27.pdf has good general advice on reducing the memory usage and processing time for large datasets by removing unwanted variables.  It also has examples of the COMPRESS option.

Compress option for native datasets

The COMPRESS= option in DATA statements can be used to reduce the size of datasets (.sas7bdat files). The effectiveness of this option depends greatly on the data. There are two versions: "COMPRESS=char", which compresses character variables with run-length encoding (generally not very useful), and "COMPRESS=binary", which applies a more advanced compression scheme to repeated groups of characters. Both operate at the level of individual records and so work best on datasets with large records. The reference above suggests that compression can make a big difference, and that its efficiency has improved with newer SAS versions. The SAS log reports the space saved by using COMPRESS.
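The two settings can be compared with a sketch like the following, which writes the same data twice with COMPRESS= given as a dataset option on the output dataset. SASHELP.CARS is only a convenient stand-in; it is small, so whatever saving is reported for it says little about what you would see on a large dataset with long records.

```sas
/* Run-length encoding. */
data work.cars_char (compress=char);
    set sashelp.cars;
run;

/* The more advanced scheme. */
data work.cars_binary (compress=binary);
    set sashelp.cars;
run;

/* Alternatively, make compression the default for every dataset
   created later in the session. */
options compress=binary;
```

After each step, check the notes SAS prints about the change in dataset size before settling on one setting.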

SAS reference: support.sas.com/documentation/

The PIPE input mechanism

This is applicable only to SAS on Unix/Linux systems.

External tools can be used to decompress data files into a pipe (an in-memory transfer between processes), and SAS can then read from the pipe. The FILENAME statement can assign a fileref to any Unix pipe with the syntax:

FILENAME fileref PIPE 'UNIX-command' <options>;

where UNIX-command is any command that generates a stream of data that SAS can read as an input file. For a gzip-compressed text file it would look like, e.g.

'gunzip -c /path/to/my/datafile.txt.gz'

"gunzip -c" decompresses the file to standard output, in this case the in-memory pipeline to the SAS process. The uncompressed file never hits the disk. This syntax works for any external data file that needs to be processed in some way before SAS can ingest it.
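Put together, a complete read through a pipe might look like the sketch below. The fileref name, the output dataset name and the file layout (a comma-separated file with one header line and the columns id, region and income) are assumptions made for illustration; adjust the INFILE and INPUT statements to match your own file. Depending on how SAS is configured at your site, running external commands may need to be enabled (the XCMD system option).

```sas
/* Fileref that runs gunzip and exposes its standard output to SAS. */
filename gzin pipe 'gunzip -c /path/to/my/datafile.txt.gz';

data work.from_pipe;
    /* Assumed layout: CSV with a header line, hence firstobs=2. */
    infile gzin dlm=',' dsd truncover firstobs=2;
    input id region :$20. income;
run;

/* Release the fileref when finished. */
filename gzin clear;
```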

SAS reference: documentation.sas.com

Examples from other places: ucla.edu/sas/faq/how-do-i-read-raw-data-files-compressed-with-gzip-gz-files-in-sas/

Built-in compression libraries

Certain compression formats are natively understood by SAS because of libraries linked with it. This is supported on all SAS platforms. Zip and gzip work, and possibly others. The syntax is e.g.

FILENAME my_gz ZIP "path-to-file/compressedfile.txt.gz" GZIP;

GZIP support was introduced in SAS version 9.4 TS1M5. Zip support has been present for a long time.
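Once assigned, the fileref is used like any other input file. The sketch below reads the gzip file from the syntax example and then a single member of a zip archive. The dataset names, the archive path, the member name and the column layout (CSV with a header line and the columns id, region and income) are all hypothetical.

```sas
/* gzip file: the GZIP keyword tells the ZIP access method that
   the file is a gzip stream rather than a zip archive. */
filename my_gz zip "path-to-file/compressedfile.txt.gz" gzip;

data work.from_gz;
    infile my_gz dlm=',' dsd truncover firstobs=2;
    input id region :$20. income;
run;

/* zip archive: MEMBER= selects one file inside the archive
   (archive and member names are placeholders). */
filename my_zip zip "path-to-file/archive.zip" member="datafile.csv";

data work.from_zip;
    infile my_zip dlm=',' dsd truncover firstobs=2;
    input id region :$20. income;
run;
```

Unlike the PIPE approach, this needs no external command, so it is not tied to Unix/Linux.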

SAS reference: documentation.sas.com

Discussion and examples from other places:

using-filename-zip-to-unzip-and-read-data-files-in-sas

sugi31/155-31.pdf

 

Details

Article ID: 84283
Created: Fri 8/2/19 2:28 PM
Modified: Fri 8/2/19 3:59 PM
