Diffstat (limited to 'doc/html/Compression.html')
-rw-r--r-- | doc/html/Compression.html | 409 |
1 files changed, 409 insertions, 0 deletions
diff --git a/doc/html/Compression.html b/doc/html/Compression.html new file mode 100644 index 0000000..c3a2a45 --- /dev/null +++ b/doc/html/Compression.html @@ -0,0 +1,409 @@ +<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN"> +<html> + <head> + <title>Compression</title> + </head> + + <body> + <h1>Compression</h1> + + <h2>1. Introduction</h2> + + <p>HDF5 supports compression of raw data by compression methods + built into the library or defined by an application. A + compression method is associated with a dataset when the dataset + is created and is applied independently to each storage chunk of + the dataset. + + The dataset must use the <code>H5D_CHUNKED</code> storage + layout. The library doesn't support compression for contiguous + datasets because of the difficulty of implementing random access + for partial I/O, and compact dataset compression is not + supported because it wouldn't produce significant results. + + <h2>2. Supported Compression Methods</h2> + + <p>The library identifies compression methods with small + integers, with values less than 16 reserved for use by NCSA and + values between 16 and 255 (inclusive) available for general + use. This range may be extended in the future if it proves to + be too small. + + <p> + <center> + <table align=center width="80%"> + <tr> + <th width="30%">Method Name</th> + <th width="70%">Description</th> + </tr> + + <tr valign=top> + <td><code>H5Z_NONE</code></td> + <td>The default is to not use compression. Specifying + <code>H5Z_NONE</code> as the compression method results + in better performance than writing a function that just + copies data because the library's I/O pipeline + recognizes this method and is able to short circuit + parts of the pipeline.</td> + </tr> + + <tr valign=top> + <td><code>H5Z_DEFLATE</code></td> + <td>The <em>deflate</em> method is the algorithm used by + the GNU <code>gzip</code> program.
It's a combination of + a 1977 Lempel-Ziv (LZ77) dictionary encoding followed by + a Huffman encoding. The aggressiveness of the + compression can be controlled by passing an integer value + to the compressor with <code>H5Pset_deflate()</code> + (see below). In order for this compression method to be + used, the HDF5 library must be configured and compiled + in the presence of the GNU zlib version 1.1.2 or + later.</td> + </tr> + + <tr valign=top> + <td><code>H5Z_RES_<em>N</em></code></td> + <td>These compression methods (where <em>N</em> is in the + range two through 15, inclusive) are reserved by NCSA + for future use.</td> + </tr> + + <tr valign=top> + <td>Values of <em>N</em> between 16 and 255, inclusive</td> + <td>These values can be used to represent application-defined + compression methods. We recommend that methods under + testing should be in the high range and when a method is + about to be published it should be given a number near + the low end of the range (or even below 16). Publishing + the compression method and its numeric ID will make a + file sharable.</td> + </tr> + </table> + </center> + + <p>Setting the compression for a dataset to a method which was + not compiled into the library and/or not registered by the + application is allowed, but writing to such a dataset will + silently <em>not</em> compress the data. Reading a compressed + dataset for a method which is not available will result in + errors (specifically, <code>H5Dread()</code> will return a + negative value). The errors will be displayed in the + compression statistics if the library was compiled with + debugging turned on for the "z" package. See the + section on diagnostics below for more details. + + <h2>3. Application-Defined Methods</h2> + + <p>Compression methods 16 through 255 can be defined by an + application.
As mentioned above, methods that have not been + released should use high numbers in that range while methods + that have been published will be assigned an official number in + the low region of the range (possibly less than 16). Users + should be aware that using unpublished compression methods + results in unsharable files. + + <p>A compression method has two halves: one half handles + compression and the other half handles uncompression. The + halves are implemented as functions + <code><em>method</em>_c</code> and + <code><em>method</em>_u</code> respectively. One should not use + the names <code>compress</code> or <code>uncompress</code> since + they are likely to conflict with other compression libraries + (like the GNU zlib). + + <p>Both the <code><em>method</em>_c</code> and + <code><em>method</em>_u</code> functions take the same arguments + and return the same values. They are defined with the type: + + <dl> + <dt><code>typedef size_t (*H5Z_func_t)(unsigned int + <em>flags</em>, size_t <em>cd_size</em>, const void + *<em>client_data</em>, size_t <em>src_nbytes</em>, const + void *<em>src</em>, size_t <em>dst_nbytes</em>, void + *<em>dst</em>/*out*/)</code> + <dd>The <em>flags</em> are an 8-bit vector which is stored in + the file and which is defined when the compression method is + defined. The <em>client_data</em> is a pointer to + <em>cd_size</em> bytes of configuration data which is also + stored in the file. The function compresses or uncompresses + <em>src_nbytes</em> from the source buffer <em>src</em> into + at most <em>dst_nbytes</em> of the result buffer <em>dst</em>. + The function returns the number of bytes written to the result + buffer or zero if an error occurs. But if a result buffer + overrun occurs the function should return a value at least as + large as <em>dst_nbytes</em> (the uncompressor will see an + overrun only for corrupt data).
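<p>As a rough illustration of the contract described above, the following sketch shows a compression/uncompression pair written against the <code>H5Z_func_t</code> signature. It is not taken from the library: the names <code>rle_c</code> and <code>rle_u</code> and the simple run-length scheme are hypothetical, and the typedef is repeated locally only so that the fragment compiles without the HDF5 headers. On overrun the functions return <code>dst_nbytes + 1</code>, which is one way to return a value at least as large as the result buffer size.

```c
#include <stddef.h>
#include <string.h>

/* Local copy of the callback type described above, repeated here only so
 * that this sketch compiles without the HDF5 headers. */
typedef size_t (*H5Z_func_t)(unsigned int flags, size_t cd_size,
                             const void *client_data, size_t src_nbytes,
                             const void *src, size_t dst_nbytes,
                             void *dst /*out*/);

/* Hypothetical compressor: encodes the input as (count, byte) pairs. */
size_t
rle_c(unsigned int flags, size_t cd_size, const void *client_data,
      size_t src_nbytes, const void *src, size_t dst_nbytes, void *dst)
{
    const unsigned char *in = src;
    unsigned char *out = dst;
    size_t i = 0, n = 0;

    (void)flags; (void)cd_size; (void)client_data;   /* unused here */

    while (i < src_nbytes) {
        size_t run = 1;
        while (i + run < src_nbytes && run < 255 && in[i + run] == in[i])
            run++;
        if (n + 2 > dst_nbytes)
            return dst_nbytes + 1;   /* overrun: at least dst_nbytes */
        out[n++] = (unsigned char)run;
        out[n++] = in[i];
        i += run;
    }
    return n;                        /* number of bytes written */
}

/* Hypothetical uncompressor: expands (count, byte) pairs. */
size_t
rle_u(unsigned int flags, size_t cd_size, const void *client_data,
      size_t src_nbytes, const void *src, size_t dst_nbytes, void *dst)
{
    const unsigned char *in = src;
    unsigned char *out = dst;
    size_t i, n = 0;

    (void)flags; (void)cd_size; (void)client_data;   /* unused here */

    if (src_nbytes % 2 != 0)
        return 0;                    /* error: input is not whole pairs */
    for (i = 0; i < src_nbytes; i += 2) {
        size_t run = in[i];
        if (run == 0)
            return 0;                /* error: a zero-length run is invalid */
        if (n + run > dst_nbytes)
            return dst_nbytes + 1;   /* overrun: corrupt data */
        memset(out + n, in[i + 1], run);
        n += run;
    }
    return n;
}
```

<p>A pair like this could then be passed to <code>H5Zregister()</code> under an application-defined method number in the range 16 through 255. Note that this sketch returns zero for empty input, which a caller cannot distinguish from an error.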
+ </dl> + + <p>The application associates the pair of functions with a name + and a method number by calling <code>H5Zregister()</code>. This + function can also be used to remove a compression method from + the library by supplying null pointers for the functions. + + <dl> + <dt><code>herr_t H5Zregister (H5Z_method_t <em>method</em>, + const char *<em>name</em>, H5Z_func_t <em>method_c</em>, + H5Z_func_t <em>method_u</em>)</code> + <dd>The pair of functions to be used for compression + (<em>method_c</em>) and uncompression (<em>method_u</em>) are + associated with a short <em>name</em> used for debugging and a + <em>method</em> number in the range 16 through 255. This + function can be called as often as desired for a particular + compression method with each call replacing the information + stored by the previous call. Sometimes it's convenient to + supply only one half of the compression, for instance in an + application that opens files for read-only. Compression + statistics for the method are accumulated across calls to this + function. + </dl> + + <p> + <center> + <table border align=center width="100%"> + <caption align=bottom><h4>Example: Registering an + Application-Defined Compression Method</h4></caption> + <tr> + <td> + <p>Here's a simple-minded "compression" method + that just copies the input value to the output. It's + similar to the <code>H5Z_NONE</code> method but + slower. Compression and uncompression are performed + by the same function. 
+ + <p><code><pre> +size_t +bogus (unsigned int flags, + size_t cd_size, const void *client_data, + size_t src_nbytes, const void *src, + size_t dst_nbytes, void *dst/*out*/) +{ + memcpy (dst, src, src_nbytes); + return src_nbytes; +} + </pre></code> + + <p>The function could be registered as method 250 as + follows: + + <p><code><pre> +#define H5Z_BOGUS 250 +H5Zregister (H5Z_BOGUS, "bogus", bogus, bogus); + </pre></code> + + <p>The function can be unregistered by saying: + + <p><code><pre> +H5Zregister (H5Z_BOGUS, "bogus", NULL, NULL); + </pre></code> + + <p>Notice that we kept the name "bogus" even + though we unregistered the functions that perform the + compression and uncompression. This makes compression + statistics more understandable when they're printed. + </td> + </tr> + </table> + </center> + + <h2>4. Enabling Compression for a Dataset</h2> + + <p>If a dataset is to be compressed then the compression + information must be specified when the dataset is created since + once a dataset is created compression parameters cannot be + adjusted. The compression is specified through the dataset + creation property list (see <code>H5Pcreate()</code>). + + <dl> + <dt><code>herr_t H5Pset_deflate (hid_t <em>plist</em>, int + <em>level</em>)</code> + <dd>The compression method for dataset creation property list + <em>plist</em> is set to <code>H5Z_DEFLATE</code> and the + aggression level is set to <em>level</em>. The <em>level</em> + must be a value between one and nine, inclusive, where one + indicates minimal (but fast) compression and nine is aggressive + compression. + + <br><br> + <dt><code>int H5Pget_deflate (hid_t <em>plist</em>)</code> + <dd>If dataset creation property list <em>plist</em> is set to + use <code>H5Z_DEFLATE</code> compression then this function + will return the aggression level, an integer between one and + nine inclusive.
If <em>plist</em> isn't a valid dataset + creation property list or it isn't set to use the deflate + method then a negative value is returned. + + <br><br> + <dt><code>herr_t H5Pset_compression (hid_t <em>plist</em>, + H5Z_method_t <em>method</em>, unsigned int <em>flags</em>, + size_t <em>cd_size</em>, const void *<em>client_data</em>)</code> + <dd>This is a catch-all function for defining compression methods + and is intended to be called from a wrapper such as + <code>H5Pset_deflate()</code>. The dataset creation property + list <em>plist</em> is adjusted to use the specified + compression method. The <em>flags</em> is an 8-bit vector + which is stored in the file as part of the compression message + and passed to the compress and uncompress functions. The + <em>client_data</em> is a byte array of length + <em>cd_size</em> which is copied to the file and passed to the + compress and uncompress methods. + + <br><br> + <dt><code>H5Z_method_t H5Pget_compression (hid_t <em>plist</em>, + unsigned int *<em>flags</em>, size_t *<em>cd_size</em>, void + *<em>client_data</em>)</code> + <dd>This is a catch-all function for querying the compression + method associated with dataset creation property list + <em>plist</em> and is intended to be called from a wrapper + function such as <code>H5Pget_deflate()</code>. The + compression method (or a negative value on error) is returned + by value, and compression flags and client data are returned by + argument. The application should allocate the + <em>client_data</em> and pass its size as the + <em>cd_size</em>. On return, <em>cd_size</em> will contain + the actual size of the client data. If <em>client_data</em> + is not large enough to hold the entire client data then + <em>cd_size</em> bytes are copied into <em>client_data</em> + and <em>cd_size</em> is set to the total size of the client + data, a value larger than the original.
+ </dl> + + <p>It is possible to set the compression to a method which hasn't + been defined with <code>H5Zregister()</code> and which isn't + supported as a predefined method (for instance, setting the + method to <code>H5Z_DEFLATE</code> when the GNU zlib isn't + available). If that happens then data will be written to the + file in its uncompressed form and the compression statistics + will show failures for the compression. + + <p> + <center> + <table border align=center width="100%"> + <caption align=bottom><h4>Example: Statistics for an + Unsupported Compression Method</h4></caption> + <tr> + <td> + <p>If an application attempts to use an unsupported + method then the compression statistics will show large + numbers of compression errors and no data + uncompressed. + + <p><code><pre> +H5Z: compression statistics accumulated over life of library: + Method Total Overrun Errors User System Elapsed Bandwidth + ------ ----- ------- ------ ---- ------ ------- --------- + deflate-c 160000 0 160000 0.00 0.01 0.01 1.884e+07 + deflate-u 0 0 0 0.00 0.00 0.00 NaN + </pre></code> + + <p>This example is from a program that tried to use + <code>H5Z_DEFLATE</code> on a system that didn't have + the GNU zlib to write to a dataset and then read the + result. The read and write both succeeded but the + data was not compressed. + </td> + </tr> + </table> + </center> + + <h2>5. Compression Diagnostics</h2> + + <p>If the library is compiled with debugging turned on for the H5Z + layer (usually as a result of <code>configure --enable-debug=z</code>) + then statistics about data compression are printed when the + application exits normally or the library is closed. The + statistics are written to the standard error stream and include + two lines for each compression method that was used: the first + line shows compression statistics while the second shows + uncompression statistics. 
The following fields are displayed: + + <p> + <center> + <table align=center width="80%"> + <tr> + <th width="30%">Field Name</th> + <th width="70%">Description</th> + </tr> + + <tr valign=top> + <td>Method</td> + <td>This is the name of the method as defined with + <code>H5Zregister()</code> with the letters + "-c" or "-u" appended to indicate + compression or uncompression.</td> + </tr> + + <tr valign=top> + <td>Total</td> + <td>The total number of bytes compressed or decompressed + including buffer overruns and errors. Bytes of + non-compressed data are counted.</td> + </tr> + + <tr valign=top> + <td>Overrun</td> + <td>During compression, if the algorithm causes the result + to be at least as large as the input then a buffer + overrun error occurs. This field shows the total number + of bytes from the Total column which can be attributed to + overruns. Overruns for decompression can only happen if + the data has been corrupted in some way and will result + in failure of <code>H5Dread()</code>.</td> + </tr> + + <tr valign=top> + <td>Errors</td> + <td>If an error occurs during compression the data is + stored in its uncompressed form; and an error during + uncompression causes <code>H5Dread()</code> to return + failure. This field shows the number of bytes of the + Total column which can be attributed to errors.</td> + </tr> + + <tr valign=top> + <td>User, System, Elapsed</td> + <td>These are the amount of user time, system time, and + elapsed time in seconds spent by the library to perform + compression. Elapsed time is sensitive to system + load. These times may be zero on operating systems that + don't support the required operations.</td> + </tr> + + <tr valign=top> + <td>Bandwidth</td> + <td>This is the compression bandwidth which is the total + number of bytes divided by elapsed time. Since elapsed + time is subject to system load the bandwidth numbers + cannot always be trusted.
Furthermore, the bandwidth + includes overrun and error bytes which may significantly + taint the value.</td> + </tr> + </table> + </center> + + <p> + <center> + <table border align=center width="100%"> + <caption align=bottom><h4>Example: Compression + Statistics</h4></caption> + <tr> + <td> + <p><code><pre> +H5Z: compression statistics accumulated over life of library: + Method Total Overrun Errors User System Elapsed Bandwidth + ------ ----- ------- ------ ---- ------ ------- --------- + deflate-c 160000 200 0 0.62 0.74 1.33 1.204e+05 + deflate-u 120000 0 0 0.11 0.00 0.12 9.885e+05 + </pre></code> + </td> + </tr> + </table> + </center> + + <hr> + <address><a href="mailto:matzke@llnl.gov">Robb Matzke</a></address> +<!-- Created: Fri Apr 17 13:39:35 EDT 1998 --> +<!-- hhmts start --> +Last modified: Fri Apr 17 16:15:21 EDT 1998 +<!-- hhmts end --> + </body> +</html>