Raw Data Storage in HDF5

This document describes the various ways that raw data is stored in an HDF5 file and the object header messages which contain the parameters for the storage.

Raw data storage has three components: the mapping from some logical multi-dimensional element space to the linear address space of a file, compression of the raw data on disk, and striping of raw data across multiple files. These components are orthogonal.

Some goals of the storage mechanism are to be able to efficently store data which is:

Small
Small pieces of raw data can be treated as meta data and stored in the object header. This will be achieved by storing the raw data in the object header with message 0x0006. Compression and striping are not supported in this case.
Complete Large
The library should be able to store large arrays contiguously in the file provided the user knows the final array size a priori. The array can then be read/written in a single I/O request. This is accomplished by describing the storage with object header message 0x0005. Compression and striping are not supported in this case.
Sparse Large
A large sparse raw data array should be stored in a manner that is space-efficient but one in which any element can still be accessed in a reasonable amount of time. Implementation details are below.
Dynamic Size
One often doesn't have prior knowledge of the size of an array. It would be nice to allow arrays to grow dynamically in any dimension. It might also be nice to allow the array to grow in the negative dimension directions if convenient to implement. Implementation details are below.
Subslab Access
Some multi-dimensional arrays are almost always accessed by subslabs. For instance, a 2-d array of pixels might always be accessed as smaller 1k-by-1k 2-d arrays always aligned on 1k index values. We should be able to store the array in such a way that striding though the entire array is not necessary. Subslab access might also be useful with compression algorithms where each storage slab can be compressed independently of the others. Implementation details are below.
Compressed
Various compression algorithms can be applied to the entire array. We're not planning to support separate algorithms (or a single algorithm with separate parameters) for each chunk although it would be possible to implement that in a manner similar to the way striping across files is implemented.
Striped Across Files
The array access functions should support arrays stored discontiguously across a set of files.

Implementation of Indexed Storage

The Sparse Large, Dynamic Size, and Subslab Access methods share so much code that they can be described with a single message. The new Indexed Storage Message (0x0008) will replace the old Chunked Object (0x0009) and Sparse Object (0x000A) Messages.

The Format of the Indexed Storage Message
byte byte byte byte

Address of B-tree

Number of Dimensions Reserved Reserved Reserved
Reserved (4 bytes)
Alignment for Dimension 0 (4 bytes)
Alignment for Dimension 1 (4 bytes)
...
Alignment for Dimension N (4 bytes)

The alignment fields indicate the alignment in logical space to use when allocating new storage areas on disk. For instance, writing every other element of a 100-element one-dimensional array (using one HDF5 I/O partial write operation per element) that has unit storage alignment would result in 50 single-element, discontiguous storage segments. However, using an alignment of 25 would result in only four discontiguous segments. The size of the message varies with the number of dimensions.

A B-tree is used to point to the discontiguous portions of storage which has been allocated for the object. All keys of a particular B-tree are the same size and are a function of the number of dimensions. It is therefore not possible to change the dimensionality of an indexed storage array after its B-tree is created.

The Format of a B-Tree Key
byte byte byte byte
External File Number or Zero (4 bytes)
Chunk Offset in Dimension 0 (4 bytes)
Chunk Offset in Dimension 1 (4 bytes)
...
Chunk Offset in Dimension N (4 bytes)

The keys within a B-tree obey an ordering based on the chunk offsets. If the offsets in dimension-0 are equal, then dimension-1 is used, etc. The External File Number field contains a 1-origin offset into the External File List message which contains the name of the external file in which that chunk is stored.

Implementation of Striping

The indexed storage will support arbitrary striping at the chunk level; each chunk can be stored in any file. This is accomplished by using the External File Number field of an indexed storage B-tree key as a 1-origin offset into an External File List Message (0x0009) which takes the form:

The Format of the External File List Message
byte byte byte byte

Name Heap Address

Number of Slots Allocated (4 bytes)
Number of File Names (4 bytes)
Byte Offset of Name 1 in Heap (4 bytes)
Byte Offset of Name 2 in Heap (4 bytes)
...

Unused Slot(s)

Each indexed storage array that has all or part of its data stored in external files will contain a single external file list message. The size of the messages is determined when the message is created, but it may be possible to enlarge the message on demand by moving it. At this time, it's not possible for multiple arrays to share a single external file list message.

H5O_efl_t *H5O_efl_new (H5G_entry_t *object, intn nslots_hint, intn heap_size_hint)
Adds a new, empty external file list message to an object header and returns a pointer to that message. The message acts as a cache for file descriptors of external files that are open.

intn H5O_efl_index (H5O_efl_t *efl, const char *filename)
Gets the external file index number for a particular file name. If the name isn't in the external file list then it's added to the H5O_efl_t struct and immediately written to the object header to which the external file list message belongs. Name comparison is textual. Each name should be relative to the directory which contains the HDF5 file.

H5F_low_t *H5O_efl_open (H5O_efl_t *efl, intn index, uintn mode)
Gets a low-level file descriptor for an external file. The external file list caches file descriptors because we might have many more external files than there are file descriptors available to this process. The caller should not close this file.

herr_t H5O_efl_release (H5O_efl_t *efl)
Releases an external file list, closes all files associated with that list, and if the list has been modified since the call to H5O_efl_new flushes the message to disk.

Robb Matzke
Last modified: Tue Nov 25 12:36:50 EST 1997