Compression

1. Introduction

HDF5 supports compression of raw data by compression methods built into the library or defined by an application. A compression method is associated with a dataset when the dataset is created and is applied independently to each storage chunk of the dataset. The dataset must use the H5D_CHUNKED storage layout. The library doesn't support compression for contiguous datasets because of the difficulty of implementing random access for partial I/O, and compact dataset compression is not supported because it wouldn't produce significant results.

2. Supported Compression Methods

The library identifies compression methods with small integers, with values less than 16 reserved for use by NCSA and values between 16 and 255 (inclusive) available for general use. This range may be extended in the future if it proves to be too small.

Method Name Description
H5Z_NONE The default is to not use compression. Specifying H5Z_NONE as the compression method results in better performance than writing a function that just copies data because the library's I/O pipeline recognizes this method and is able to short-circuit parts of the pipeline.
H5Z_DEFLATE The deflate method is the algorithm used by the GNU gzip program. It's a combination of a 1977 Lempel-Ziv (LZ77) dictionary encoding followed by a Huffman encoding. The aggressiveness of the compression can be controlled by passing an integer value to the compressor with H5Pset_deflate() (see below). In order for this compression method to be used, the HDF5 library must be configured and compiled in the presence of zlib version 1.1.2 or later.
H5Z_RES_N These compression methods (where N is in the range two through 15, inclusive) are reserved by NCSA for future use.
Values of N between 16 and 255, inclusive These values can be used to represent application-defined compression methods. We recommend that methods under testing should be in the high range and when a method is about to be published it should be given a number near the low end of the range (or even below 16). Publishing the compression method and its numeric ID will make a file sharable.
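
The numbering scheme in the table above can be summarized in a few lines of C. This is only an illustrative sketch: the names method_class_t and classify_method are hypothetical and are not part of the HDF5 API.

```c
/* Hypothetical classification of a method number, following the ranges
 * given in the text: below 16 is reserved, 16 through 255 is available
 * to applications, anything else is outside the current range. */
typedef enum {
    METHOD_INVALID,      /* negative or greater than 255 */
    METHOD_RESERVED,     /* 0 through 15: predefined or reserved */
    METHOD_APPLICATION   /* 16 through 255: application-defined */
} method_class_t;

static method_class_t
classify_method(int id)
{
    if (id < 0 || id > 255)
        return METHOD_INVALID;
    if (id < 16)
        return METHOD_RESERVED;
    return METHOD_APPLICATION;
}
```

An application could use such a check to reject a method number before trying to register it.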

Setting the compression for a dataset to a method which was not compiled into the library and/or not registered by the application is allowed, but writing to such a dataset will silently not compress the data. Reading a dataset that was compressed with a method which is not available will result in errors (specifically, H5Dread() will return a negative value). The errors will be displayed in the compression statistics if the library was compiled with debugging turned on for the "z" package. See the section on diagnostics below for more details.

3. Application-Defined Methods

Compression methods 16 through 255 can be defined by an application. As mentioned above, methods that have not been released should use high numbers in that range while methods that have been published will be assigned an official number in the low region of the range (possibly less than 16). Users should be aware that using unpublished compression methods results in unsharable files.

A compression method has two halves: one half handles compression and the other half handles uncompression. The halves are implemented as functions method_c and method_u respectively. One should not use the names compress or uncompress since they are likely to conflict with other compression libraries (like zlib).

Both the method_c and method_u functions take the same arguments and return the same values. They are defined with the type:

typedef size_t (*H5Z_func_t)(unsigned int flags, size_t cd_size, const void *client_data, size_t src_nbytes, const void *src, size_t dst_nbytes, void *dst/*out*/)
The flags are an 8-bit vector which is stored in the file and which is defined when the compression method is defined. The client_data is a pointer to cd_size bytes of configuration data which is also stored in the file. The function compresses or uncompresses src_nbytes from the source buffer src into at most dst_nbytes of the result buffer dst. The function returns the number of bytes written to the result buffer or zero if an error occurs. But if a result buffer overrun occurs the function should return a value at least as large as dst_nbytes (the uncompressor will see an overrun only for corrupt data).
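
To make the calling convention concrete, here is a minimal sketch of a method pair that follows the contract described above, using a simple run-length encoding. The names rle_c and rle_u are hypothetical, the encoding is chosen only for illustration, and the sketch assumes that returning dst_nbytes + 1 is an acceptable way to report an overrun (the text only requires a value at least as large as dst_nbytes).

```c
#include <stddef.h>

/* Function type as given in the text above. */
typedef size_t (*H5Z_func_t)(unsigned int flags, size_t cd_size,
                             const void *client_data,
                             size_t src_nbytes, const void *src,
                             size_t dst_nbytes, void *dst);

/* Hypothetical compression half: emits (run length, byte value) pairs. */
static size_t
rle_c(unsigned int flags, size_t cd_size, const void *client_data,
      size_t src_nbytes, const void *src, size_t dst_nbytes, void *dst)
{
    const unsigned char *in = src;
    unsigned char *out = dst;
    size_t i = 0, n = 0;

    (void)flags; (void)cd_size; (void)client_data;  /* unused here */
    while (i < src_nbytes) {
        size_t run = 1;
        while (i + run < src_nbytes && run < 255 && in[i + run] == in[i])
            run++;
        if (n + 2 > dst_nbytes)
            return dst_nbytes + 1;  /* overrun: the pair would not fit */
        out[n++] = (unsigned char)run;
        out[n++] = in[i];
        i += run;
    }
    return n;  /* zero (an error) when the input is empty */
}

/* Hypothetical uncompression half: expands the pairs again. */
static size_t
rle_u(unsigned int flags, size_t cd_size, const void *client_data,
      size_t src_nbytes, const void *src, size_t dst_nbytes, void *dst)
{
    const unsigned char *in = src;
    unsigned char *out = dst;
    size_t i, j, n = 0;

    (void)flags; (void)cd_size; (void)client_data;  /* unused here */
    if (src_nbytes == 0 || src_nbytes % 2 != 0)
        return 0;                   /* malformed input is an error */
    for (i = 0; i < src_nbytes; i += 2) {
        size_t run = in[i];
        if (run == 0)
            return 0;               /* a zero-length run is corrupt */
        if (run > dst_nbytes - n)
            return dst_nbytes + 1;  /* overrun: corrupt or oversized data */
        for (j = 0; j < run; j++)
            out[n++] = in[i + 1];
    }
    return n;
}
```

Both functions ignore flags and client_data; a real method might use them to carry tuning parameters that are stored in the file.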

The application associates the pair of functions with a name and a method number by calling H5Zregister(). This function can also be used to remove a compression method from the library by supplying null pointers for the functions.

herr_t H5Zregister (H5Z_method_t method, const char *name, H5Z_func_t method_c, H5Z_func_t method_u)
The pair of functions to be used for compression (method_c) and uncompression (method_u) are associated with a short name used for debugging and a method number in the range 16 through 255. This function can be called as often as desired for a particular compression method, with each call replacing the information stored by the previous call. Sometimes it's convenient to supply only one half of the compression, for instance in an application that only opens files read-only. Compression statistics for the method are accumulated across calls to this function.

Example: Registering an Application-Defined Compression Method

Here's a simple-minded "compression" method that just copies the input value to the output. It's similar to the H5Z_NONE method but slower. Compression and uncompression are performed by the same function.

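#include <string.h>  /* memcpy */

/* Copies the input to the output unchanged. The same function serves
 * as both the compression and the uncompression half. */
size_t
bogus (unsigned int flags,
       size_t cd_size, const void *client_data,
       size_t src_nbytes, const void *src,
       size_t dst_nbytes, void *dst/*out*/)
{
    /* Report an overrun instead of writing past the end of dst. */
    if (src_nbytes > dst_nbytes) return src_nbytes;
    memcpy (dst, src, src_nbytes);
    return src_nbytes;
}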
	      

The function could be registered as method 250 as follows:

#define H5Z_BOGUS 250
H5Zregister (H5Z_BOGUS, "bogus", bogus, bogus);
	      

The function can be unregistered by saying:

H5Zregister (H5Z_BOGUS, "bogus", NULL, NULL);
	      

Notice that we kept the name "bogus" even though we unregistered the functions that perform the compression and uncompression. This makes compression statistics more understandable when they're printed.
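
The registration behavior described in this section (each call replaces the previous one, either half may be null, and the name and statistics outlive the functions) can be pictured as a small table indexed by method number. The following is a sketch under that assumption only; method_entry_t, method_table, and register_method are hypothetical names and do not describe how the library stores this information internally.

```c
#include <stddef.h>
#include <string.h>

/* Function type as given earlier in this section. */
typedef size_t (*H5Z_func_t)(unsigned int flags, size_t cd_size,
                             const void *client_data,
                             size_t src_nbytes, const void *src,
                             size_t dst_nbytes, void *dst);

/* Hypothetical table entry; the field names are illustrative only. */
typedef struct {
    const char *name;         /* short name used for debugging output */
    H5Z_func_t  method_c;     /* compression half, may be NULL */
    H5Z_func_t  method_u;     /* uncompression half, may be NULL */
    size_t      total_bytes;  /* stand-in for accumulated statistics */
} method_entry_t;

static method_entry_t method_table[256];

/* Replaces the stored name and functions but leaves the statistics
 * alone, so they accumulate across calls. Returns -1 for a method
 * number outside the application-defined range 16 through 255. */
static int
register_method(int method, const char *name,
                H5Z_func_t method_c, H5Z_func_t method_u)
{
    if (method < 16 || method > 255)
        return -1;
    method_table[method].name = name;
    method_table[method].method_c = method_c;
    method_table[method].method_u = method_u;
    return 0;
}

/* Copy-only method in the spirit of the "bogus" example above. */
static size_t
bogus(unsigned int flags, size_t cd_size, const void *client_data,
      size_t src_nbytes, const void *src, size_t dst_nbytes, void *dst)
{
    (void)flags; (void)cd_size; (void)client_data;
    if (src_nbytes > dst_nbytes)
        return src_nbytes;    /* would overrun dst */
    memcpy(dst, src, src_nbytes);
    return src_nbytes;
}
```

Registering with a null compression half and a valid uncompression half models the read-only case mentioned above, and passing two null pointers models unregistering while keeping the name available for the statistics.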

4. Enabling Compression for a Dataset

If a dataset is to be compressed, the compression information must be specified when the dataset is created; the compression parameters cannot be adjusted afterward. The compression is specified through the dataset creation property list (see H5Pcreate()).

herr_t H5Pset_deflate (hid_t plist, int level)
The compression method for dataset creation property list plist is set to H5Z_DEFLATE and the aggression level is set to level. The level must be a value between one and nine, inclusive, where one gives the fastest but least effective compression and nine gives the most aggressive compression.

int H5Pget_deflate (hid_t plist)
If dataset creation property list plist is set to use H5Z_DEFLATE compression then this function will return the aggression level, an integer between one and nine inclusive. If plist isn't a valid dataset creation property list or it isn't set to use the deflate method then a negative value is returned.

herr_t H5Pset_compression (hid_t plist, H5Z_method_t method, unsigned int flags, size_t cd_size, const void *client_data)
This is a catch-all function for defining compression methods and is intended to be called from a wrapper such as H5Pset_deflate(). The dataset creation property list plist is adjusted to use the specified compression method. The flags argument is an 8-bit vector which is stored in the file as part of the compression message and passed to the compress and uncompress functions. The client_data is a byte array of length cd_size which is copied to the file and passed to the compress and uncompress methods.

H5Z_method_t H5Pget_compression (hid_t plist, unsigned int *flags, size_t *cd_size, void *client_data)
This is a catch-all function for querying the compression method associated with dataset creation property list plist and is intended to be called from a wrapper function such as H5Pget_deflate(). The compression method (or a negative value on error) is returned by value, and the compression flags and client data are returned by argument. The application should allocate the client_data buffer and pass its size as cd_size. On return, cd_size will contain the actual size of the client data. If client_data is not large enough to hold the entire client data then cd_size bytes are copied into client_data and cd_size is set to the total size of the client data, a value larger than the original.
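
The cd_size convention described for H5Pget_compression() is an in/out size argument with truncation. The sketch below reproduces just that convention in isolation; query_client_data is a hypothetical helper that takes the stored bytes as an explicit argument, and it is not the library's implementation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper showing the cd_size convention. On entry
 * *cd_size is the capacity of client_data; on return it is the total
 * size of the stored client data. At most the original capacity is
 * copied, so a returned size larger than the capacity means the copy
 * was truncated. */
static void
query_client_data(const void *stored, size_t stored_size,
                  size_t *cd_size, void *client_data)
{
    size_t ncopy = *cd_size < stored_size ? *cd_size : stored_size;

    if (ncopy > 0 && client_data != NULL)
        memcpy(client_data, stored, ncopy);
    *cd_size = stored_size;
}
```

A caller can compare the returned size with the capacity it passed in and, if the returned size is larger, allocate a buffer of that size and query again.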

It is possible to set the compression to a method which hasn't been defined with H5Zregister() and which isn't supported as a predefined method (for instance, setting the method to H5Z_DEFLATE when zlib isn't available). If that happens then data will be written to the file in its uncompressed form and the compression statistics will show failures for the compression.

Example: Statistics for an Unsupported Compression Method

If an application attempts to use an unsupported method then the compression statistics will show large numbers of compression errors and no data uncompressed.

H5Z: compression statistics accumulated over life of library:
   Method      Total  Overrun  Errors  User  System  Elapsed Bandwidth
   ------      -----  -------  ------  ----  ------  ------- ---------
   deflate-c  160000        0  160000  0.00    0.01     0.01 1.884e+07
   deflate-u       0        0       0  0.00    0.00     0.00       NaN
	      

This example is from a program that wrote a dataset using H5Z_DEFLATE and then read it back on a system that didn't have zlib. The read and write both succeeded but the data was not compressed.

5. Compression Diagnostics

If the library is compiled with debugging turned on for the H5Z layer (usually as a result of configure --enable-debug=z) then statistics about data compression are printed when the application exits normally or the library is closed. The statistics are written to the standard error stream and include two lines for each compression method that was used: the first line shows compression statistics while the second shows uncompression statistics. The following fields are displayed:

Field Name Description
Method This is the name of the method as defined with H5Zregister() with the letters "-c" or "-u" appended to indicate compression or uncompression.
Total The total number of bytes compressed or decompressed including buffer overruns and errors. Bytes of non-compressed data are counted.
Overrun During compression, if the algorithm causes the result to be at least as large as the input then a buffer overrun error occurs. This field shows the total number of bytes from the Total column which can be attributed to overruns. Overruns for decompression can only happen if the data has been corrupted in some way and will result in failure of H5Dread().
Errors If an error occurs during compression the data is stored in its uncompressed form; an error during uncompression causes H5Dread() to return failure. This field shows the number of bytes of the Total column which can be attributed to errors.
User, System, Elapsed These are the amount of user time, system time, and elapsed time in seconds spent by the library to perform compression. Elapsed time is sensitive to system load. These times may be zero on operating systems that don't support the required operations.
Bandwidth This is the compression bandwidth, which is the total number of bytes divided by elapsed time. Since elapsed time is subject to system load the bandwidth numbers cannot always be trusted. Furthermore, the bandwidth includes overrun and error bytes which may significantly taint the value.
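
The relationship between the Total, Overrun, and Errors fields during compression can be sketched as a single decision: a zero return is an error, a result at least as large as the input is an overrun, and in both cases the data is kept in its uncompressed form. The code below is an illustration of that rule only. The names zstats_t, compress_or_store, and toy_constant are hypothetical, and treating an unavailable method as an error is an assumption based on the unsupported-method example earlier in this document.

```c
#include <stddef.h>
#include <string.h>

typedef size_t (*H5Z_func_t)(unsigned int flags, size_t cd_size,
                             const void *client_data,
                             size_t src_nbytes, const void *src,
                             size_t dst_nbytes, void *dst);

/* Hypothetical byte counters matching three of the fields above. */
typedef struct { size_t total, overrun, errors; } zstats_t;

/* Tries to compress src into out (which must hold src_nbytes bytes).
 * Falls back to a plain copy on error or overrun. Returns the number
 * of bytes placed in out and sets *compressed accordingly. */
static size_t
compress_or_store(H5Z_func_t method_c, const void *src, size_t src_nbytes,
                  void *out, zstats_t *stats, int *compressed)
{
    /* A missing method is counted as an error (assumption). */
    size_t n = method_c
        ? method_c(0, 0, NULL, src_nbytes, src, src_nbytes, out) : 0;

    stats->total += src_nbytes;
    if (n == 0) {
        stats->errors += src_nbytes;
    } else if (n >= src_nbytes) {
        stats->overrun += src_nbytes;   /* no gain from compression */
    } else {
        *compressed = 1;
        return n;
    }
    memcpy(out, src, src_nbytes);       /* keep the uncompressed bytes */
    *compressed = 0;
    return src_nbytes;
}

/* Toy method for demonstration: a buffer whose bytes are all equal is
 * reduced to one byte; anything else is reported as an overrun. */
static size_t
toy_constant(unsigned int flags, size_t cd_size, const void *client_data,
             size_t src_nbytes, const void *src,
             size_t dst_nbytes, void *dst)
{
    const unsigned char *in = src;
    size_t i;

    (void)flags; (void)cd_size; (void)client_data;
    if (src_nbytes == 0 || dst_nbytes == 0)
        return 0;
    for (i = 1; i < src_nbytes; i++)
        if (in[i] != in[0])
            return dst_nbytes;          /* value >= dst_nbytes: overrun */
    ((unsigned char *)dst)[0] = in[0];
    return 1;
}
```

With these definitions every byte handed to compress_or_store() is counted in total, and the overrun and errors counters are subsets of it, which matches how the fields are described above.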

Example: Compression Statistics

H5Z: compression statistics accumulated over life of library:
   Method      Total  Overrun  Errors  User  System  Elapsed Bandwidth
   ------      -----  -------  ------  ----  ------  ------- ---------
   deflate-c  160000      200       0  0.62    0.74     1.33 1.204e+05
   deflate-u  120000        0       0  0.11    0.00     0.12 9.885e+05
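
The Bandwidth column follows from the Total and Elapsed columns as described above. A minimal sketch of that calculation is shown here; the function name bandwidth is hypothetical, and returning NaN whenever no elapsed time was measured is an assumption modeled on the NaN shown for the zero-byte row in the earlier example.

```c
#include <math.h>

/* Bytes per second: total bytes divided by elapsed seconds. Returns
 * NaN when no elapsed time was recorded (assumption, see lead-in). */
static double
bandwidth(double total_bytes, double elapsed_seconds)
{
    if (elapsed_seconds <= 0.0)
        return NAN;
    return total_bytes / elapsed_seconds;
}
```

Applying this to the printed deflate-c row (160000 bytes, 1.33 seconds) gives a value close to, but not exactly, the printed bandwidth, presumably because the Elapsed column is rounded to two decimal places for display.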
	      

Robb Matzke
Last modified: Fri Apr 17 16:15:21 EDT 1998