HDF5 supports compression of raw data by compression methods
built into the library or defined by an application. A
compression method is associated with a dataset when the dataset
is created and is applied independently to each storage chunk of
the dataset.
The dataset must use the H5D_CHUNKED storage layout. The library
doesn't support compression for contiguous datasets because of the
difficulty of implementing random access for partial I/O, and
compact dataset compression is not supported because it wouldn't
produce significant results.
The library identifies compression methods with small integers, with values less than 16 reserved for use by NCSA and values between 16 and 255 (inclusive) available for general use. This range may be extended in the future if it proves to be too small.
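The numbering scheme can be made concrete with a small helper. This is only a sketch: `method_class_t` and `classify_method()` are hypothetical names that are not part of the HDF5 API, and the ranges are taken directly from the paragraph above.

```c
/* Hypothetical helper, not part of HDF5: classifies a compression
 * method number according to the ranges described in the text. */
typedef enum {
    METHOD_RESERVED,     /* 0..15: reserved for use by NCSA     */
    METHOD_APPLICATION,  /* 16..255: available for general use  */
    METHOD_OUT_OF_RANGE  /* anything else is not a valid number */
} method_class_t;

method_class_t classify_method(int id)
{
    if (id < 0 || id > 255)
        return METHOD_OUT_OF_RANGE;
    return (id < 16) ? METHOD_RESERVED : METHOD_APPLICATION;
}
```

An application could run such a check on a method number before trying to register its own method.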
Method Name | Description |
---|---|
H5Z_NONE | The default is to not use compression. Specifying H5Z_NONE as the compression method results in better performance than writing a function that just copies data because the library's I/O pipeline recognizes this method and is able to short-circuit parts of the pipeline. |
H5Z_DEFLATE | The deflate method is the algorithm used by the GNU gzip program. It's a combination of a 1977 Lempel-Ziv (LZ77) dictionary encoding followed by a Huffman encoding. The aggressiveness of the compression can be controlled by passing an integer value to the compressor with H5Pset_deflate() (see below). In order for this compression method to be used, the HDF5 library must be configured and compiled in the presence of the GNU zlib version 1.1.2 or later. |
H5Z_RES_N | These compression methods (where N is in the range two through 15, inclusive) are reserved by NCSA for future use. |
Values of N between 16 and 255, inclusive | These values can be used to represent application-defined compression methods. We recommend that methods under testing should be in the high range and when a method is about to be published it should be given a number near the low end of the range (or even below 16). Publishing the compression method and its numeric ID will make a file sharable. |
Setting the compression for a dataset to a method which was
not compiled into the library and/or not registered by the
application is allowed, but writing to such a dataset will
silently not compress the data. Reading a compressed
dataset for a method which is not available will result in
errors (specifically, H5Dread() will return a negative value).
The errors will be displayed in the compression statistics if
the library was compiled with debugging turned on for the "z"
package. See the section on diagnostics below for more details.
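The behavior just described can be illustrated with a rough sketch of what a chunk write and read might do when a method is missing. Everything here is hypothetical and is not the library's actual pipeline: `demo_write_chunk()`, `demo_read_chunk()` and the `compressed` flag are invented for illustration, the function type is copied from the definition later in this section, and treating a return value of zero as failure is an assumption.

```c
#include <stddef.h>
#include <string.h>

/* Function type as defined later in this section, reproduced so the
 * sketch compiles without the HDF5 headers. */
typedef size_t (*H5Z_func_t)(unsigned int flags, size_t cd_size,
                             const void *client_data, size_t src_nbytes,
                             const void *src, size_t dst_nbytes,
                             void *dst /*out*/);

/* Hypothetical write step: tries the compressor and silently falls back
 * to storing the raw bytes when the method is missing, fails, or does
 * not shrink the data. Returns the number of bytes placed in dst. */
size_t demo_write_chunk(H5Z_func_t method_c, const void *src,
                        size_t src_nbytes, void *dst, size_t dst_nbytes,
                        int *compressed)
{
    size_t n = method_c
        ? method_c(0, 0, NULL, src_nbytes, src, dst_nbytes, dst)
        : 0;                       /* ASSUMPTION: 0 means failure */
    if (n > 0 && n < src_nbytes) {
        *compressed = 1;
        return n;
    }
    *compressed = 0;               /* store the chunk uncompressed */
    if (src_nbytes > dst_nbytes)
        return 0;
    memcpy(dst, src, src_nbytes);
    return src_nbytes;
}

/* Hypothetical read step: returns 0 on success and -1 on failure.
 * A compressed chunk cannot be read without its uncompress function. */
int demo_read_chunk(H5Z_func_t method_u, int compressed, const void *src,
                    size_t src_nbytes, void *dst, size_t dst_nbytes)
{
    if (!compressed) {
        if (src_nbytes > dst_nbytes)
            return -1;
        memcpy(dst, src, src_nbytes);
        return 0;
    }
    if (!method_u)
        return -1;                 /* method not available */
    return method_u(0, 0, NULL, src_nbytes, src, dst_nbytes, dst) > 0
        ? 0 : -1;
}
```

The asymmetry is the point of the sketch: a missing method is harmless when writing, because the raw bytes can always be stored, but fatal when reading a chunk that really was compressed.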
Compression methods 16 through 255 can be defined by an application. As mentioned above, methods that have not been released should use high numbers in that range while methods that have been published will be assigned an official number in the low region of the range (possibly less than 16). Users should be aware that using unpublished compression methods results in unsharable files.
A compression method has two halves: one half handles
compression and the other half handles uncompression. The
halves are implemented as functions method_c and method_u
respectively. One should not use the names compress or
uncompress since they are likely to conflict with other
compression libraries (like the GNU zlib).
Both the method_c and method_u functions take the same arguments
and return the same values. They are defined with the type:
typedef size_t (*H5Z_func_t)(unsigned int flags, size_t cd_size,
    const void *client_data, size_t src_nbytes, const void *src,
    size_t dst_nbytes, void *dst/*out*/)
The application associates the pair of functions with a name
and a method number by calling H5Zregister(). This
function can also be used to remove a compression method from
the library by supplying null pointers for the functions.
herr_t H5Zregister (H5Z_method_t method, const char *name,
    H5Z_func_t method_c, H5Z_func_t method_u)
Here's a simple-minded "compression" method
that just copies the input value to the output. It's
similar to the
The function could be registered as method 250 as follows:
The function can be unregistered by saying:
Notice that we kept the name "bogus" even though we unregistered the functions that perform the compression and uncompression. This makes compression statistics more understandable when they're printed.
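A minimal sketch of such a copy-only method follows. The function name `bogus` is inferred from the registered name mentioned above, the zero return value on failure is an assumption (the text does not state the failure convention), and `demo_register()` with its `demo_table` is a hypothetical stand-in for H5Zregister() so that the sketch compiles without the HDF5 headers.

```c
#include <stddef.h>
#include <string.h>

/* Function type as given in the text, reproduced for self-containment. */
typedef size_t (*H5Z_func_t)(unsigned int flags, size_t cd_size,
                             const void *client_data, size_t src_nbytes,
                             const void *src, size_t dst_nbytes,
                             void *dst /*out*/);

/* Copy-only "compression", usable for both directions.
 * ASSUMPTION: returning 0 signals failure. */
size_t bogus(unsigned int flags, size_t cd_size, const void *client_data,
             size_t src_nbytes, const void *src, size_t dst_nbytes,
             void *dst /*out*/)
{
    (void)flags; (void)cd_size; (void)client_data;   /* unused here */
    if (src_nbytes > dst_nbytes)
        return 0;                   /* would overrun the output buffer */
    memcpy(dst, src, src_nbytes);
    return src_nbytes;
}

/* Hypothetical stand-in for the library's table of methods. */
#define DEMO_BOGUS 250
typedef struct {
    const char *name;
    H5Z_func_t  method_c;
    H5Z_func_t  method_u;
} demo_entry_t;
static demo_entry_t demo_table[256];

/* Stand-in for H5Zregister(): same argument order, local storage only.
 * Passing null function pointers unregisters the method but keeps the
 * name. Returns 0 on success and -1 on a bad method number. */
int demo_register(int method, const char *name,
                  H5Z_func_t method_c, H5Z_func_t method_u)
{
    if (method < 16 || method > 255)
        return -1;                  /* application-defined range only */
    demo_table[method].name = name;
    demo_table[method].method_c = method_c;
    demo_table[method].method_u = method_u;
    return 0;
}
```

Following the H5Zregister() prototype shown earlier, the real calls would presumably read `H5Zregister(250, "bogus", bogus, bogus)` to register the method and `H5Zregister(250, "bogus", NULL, NULL)` to unregister it.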
If a dataset is to be compressed then the compression
information must be specified when the dataset is created since
once a dataset is created compression parameters cannot be
adjusted. The compression is specified through the dataset
creation property list (see H5Pcreate()).
herr_t H5Pset_deflate (hid_t plist, int level)
The compression method of the dataset creation property list
plist is set to H5Z_DEFLATE and the aggression level is set to
level. The level must be a value between one and nine,
inclusive, where one indicates the least (but fastest)
compression and nine is the most aggressive compression.
int H5Pget_deflate (hid_t plist)
If the dataset creation property list plist is set to use
H5Z_DEFLATE compression then this function will return the
aggression level, an integer between one and nine inclusive.
If plist isn't a valid dataset creation property list or it
isn't set to use the deflate method then a negative value is
returned.
herr_t H5Pset_compression (hid_t plist, H5Z_method_t method,
    unsigned int flags, size_t cd_size, const void *client_data)
This is the general form of H5Pset_deflate(). The dataset
creation property list plist is adjusted to use the specified
compression method. The flags argument is an 8-bit vector
which is stored in the file as part of the compression message
and passed to the compress and uncompress functions. The
client_data is a byte array of length cd_size which is copied
to the file and passed to the compress and uncompress methods.
H5Z_method_t H5Pget_compression (hid_t plist, unsigned int *flags,
    size_t *cd_size, void *client_data)
This is the general form of H5Pget_deflate(). The
compression method (or a negative value on error) is returned
by value, and the compression flags and client data are
returned by argument. The application should allocate the
client_data buffer and pass its size as cd_size. On return,
cd_size will contain the actual size of the client data. If
client_data is not large enough to hold the entire client data
then only as many bytes as were passed in cd_size are copied
into client_data, and cd_size is set to the total size of the
client data, a value larger than the original.
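The in/out behavior of cd_size can be sketched as follows. This is not the library's code: `demo_client_data_t` and `demo_get_client_data()` are hypothetical stand-ins that only mirror the copy-and-report rule described above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the client data kept in a property list. */
typedef struct {
    const unsigned char *data;   /* bytes stored with the method */
    size_t               size;   /* total number of stored bytes */
} demo_client_data_t;

/* On entry *cd_size is the capacity of client_data. At most that many
 * bytes are copied, then *cd_size is set to the total stored size, so
 * a value larger than the capacity tells the caller it was truncated. */
void demo_get_client_data(const demo_client_data_t *stored,
                          size_t *cd_size, void *client_data)
{
    size_t n = (*cd_size < stored->size) ? *cd_size : stored->size;
    if (n > 0)
        memcpy(client_data, stored->data, n);
    *cd_size = stored->size;
}
```

A caller that finds cd_size larger than its buffer can allocate a buffer of the reported size and ask again to obtain the complete client data.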
It is possible to set the compression to a method which hasn't
been defined with H5Zregister() and which isn't
supported as a predefined method (for instance, setting the
method to H5Z_DEFLATE when the GNU zlib isn't
available). If that happens then data will be written to the
file in its uncompressed form and the compression statistics
will show failures for the compression.
If an application attempts to use an unsupported method then the compression statistics will show large numbers of compression errors and no data uncompressed.
This example is from a program that tried to use
If the library is compiled with debugging turned on for the H5Z
layer (usually as a result of configure --enable-debug=z)
then statistics about data compression are printed when the
application exits normally or the library is closed. The
statistics are written to the standard error stream and include
two lines for each compression method that was used: the first
line shows compression statistics while the second shows
uncompression statistics. The following fields are displayed:
Field Name | Description |
---|---|
Method | This is the name of the method as defined with H5Zregister() with the letters "-c" or "-u" appended to indicate compression or uncompression. |
Total | The total number of bytes compressed or decompressed including buffer overruns and errors. Bytes of non-compressed data are counted. |
Overrun | During compression, if the algorithm causes the result to be at least as large as the input then a buffer overrun error occurs. This field shows the total number of bytes from the Total column which can be attributed to overruns. Overruns for decompression can only happen if the data has been corrupted in some way and will result in failure of H5Dread(). |
Errors | If an error occurs during compression the data is stored in its uncompressed form, and an error during uncompression causes H5Dread() to return failure. This field shows the number of bytes of the Total column which can be attributed to errors. |
User, System, Elapsed | These are the amounts of user time, system time, and elapsed time in seconds spent by the library to perform compression. Elapsed time is sensitive to system load. These times may be zero on operating systems that don't support the required operations. |
Bandwidth | This is the compression bandwidth, which is the total number of bytes divided by elapsed time. Since elapsed time is subject to system load the bandwidth numbers cannot always be trusted. Furthermore, the bandwidth includes overrun and error bytes which may significantly taint the value. |
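How the Total, Overrun, Errors and Bandwidth fields relate to one another can be sketched with a small accumulator. This is an illustration only: `demo_stats_t`, `demo_record()` and `demo_bandwidth()` are hypothetical names, and treating a compressor result of zero as an error is an assumption rather than documented behavior.

```c
#include <stddef.h>

/* Hypothetical counters mirroring the fields in the table above. */
typedef struct {
    size_t total;     /* all bytes processed, overruns and errors included */
    size_t overrun;   /* bytes attributed to buffer overruns               */
    size_t errors;    /* bytes attributed to other errors                  */
    double elapsed;   /* elapsed seconds spent in the method               */
} demo_stats_t;

/* Records one compression attempt on a chunk of src_nbytes bytes, where
 * result is the size reported by the compressor.
 * ASSUMPTION: a result of 0 denotes an error. */
void demo_record(demo_stats_t *s, size_t src_nbytes, size_t result,
                 double seconds)
{
    s->total += src_nbytes;
    s->elapsed += seconds;
    if (result == 0)
        s->errors += src_nbytes;    /* chunk is stored uncompressed */
    else if (result >= src_nbytes)
        s->overrun += src_nbytes;   /* output at least as large as input */
}

/* Bandwidth in bytes per second, or 0 if no elapsed time was measured. */
double demo_bandwidth(const demo_stats_t *s)
{
    return (s->elapsed > 0.0) ? (double)s->total / s->elapsed : 0.0;
}
```

Because overrun and error bytes are added to the total before the division, a run with many failed chunks reports a bandwidth that says little about how fast data was actually compressed.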