External Files in HDF5

Overview of Layers

This table shows some of the layers of HDF5. Each layer calls functions at the same or lower layers and never functions at higher layers. An object identifier (OID) takes various forms at the various layers: at layer 0 an OID is an absolute physical file address; at layers 1 and 2 it's an absolute virtual file address. At layers 3 through 6 it's a relative address, and at layers 7 and above it's an object handle.

Layer-7	Groups	Datasets
Layer-6	Indirect Storage	Symbol Tables
Layer-5	B-trees	Object Hdrs	Heaps
Layer-4	Caching
Layer-3	H5F chunk I/O
Layer-2	H5F low
Layer-1	File Family	Split Meta/Raw
Layer-0	Section-2 I/O	Standard I/O	Malloc/Free

Single Address Space

The simplest form of hdf5 file is a single file containing only hdf5 data. The file begins with the super block, which is followed until the end of the file by hdf5 data. The next most complicated file allows non-hdf5 data (user defined data or internal wrappers) to appear before the super block and after the end of the hdf5 data. The hdf5 data is treated as a single linear address space in both cases.

The next level of complexity comes when non-hdf5 data is interspersed with the hdf5 data. We handle that by including the non-hdf5 interspersed data in the hdf5 address space and simply not referencing it (eventually we might add those addresses to a "do-not-disturb" list using the same mechanism as the hdf5 free list, but it's not absolutely necessary). This is implemented except for the "do-not-disturb" list.

The most complicated single address space hdf5 file is when we allow the address space to be split among multiple physical files. For instance, a >2GB file can be split into smaller chunks and transferred to a 32-bit machine, then accessed as a single logical hdf5 file. The library already supports >32-bit addresses, so at layer 1 we split a 64-bit address into a 32-bit file number and a 32-bit offset (the 64 and 32 are arbitrary). The rest of the library still operates with a linear address space.
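A minimal sketch of the layer-1 address split described above, assuming the 32/32 division mentioned in the text. The names (member_addr_t, split_addr, join_addr, OFFSET_BITS) are hypothetical and are not taken from the library.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed width of the per-file offset; the text notes the widths are arbitrary. */
#define OFFSET_BITS 32u

/* Hypothetical layer-1 view of an address: which file, and where in it. */
typedef struct {
    uint32_t member;   /* file number within the family */
    uint32_t offset;   /* byte offset within that file */
} member_addr_t;

/* Split a 64-bit linear address into a file number and an offset. */
static member_addr_t split_addr(uint64_t addr)
{
    member_addr_t out;
    assert(OFFSET_BITS < 64u);
    out.member = (uint32_t)(addr >> OFFSET_BITS);
    out.offset = (uint32_t)(addr & ((UINT64_C(1) << OFFSET_BITS) - 1));
    return out;
}

/* Recombine a file number and offset into the linear address. */
static uint64_t join_addr(member_addr_t a)
{
    return ((uint64_t)a.member << OFFSET_BITS) | (uint64_t)a.offset;
}
```

Only code at layer 1 would see member_addr_t; everything above keeps passing the plain 64-bit address around.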

Another variation might be a family of two files where all the meta data is stored in one file and all the raw data is stored in another file to allow the HDF5 wrapper to be easily replaced with some other wrapper.

The H5Fcreate and H5Fopen functions would need to be modified to pass file-type info down to layer 2 so the correct drivers can be called and parameters passed to the drivers to initialize them.

Implementation

I've implemented fixed-size family members. The entire hdf5 file is partitioned into members where each member is the same size. The family scheme is used if one passes a name to H5F_open (which is called by H5Fopen() and H5Fcreate()) that contains a printf(3c)-style integer format specifier. Currently, the default low-level file driver is used for all family members (H5F_LOW_DFLT, usually set to be Section 2 I/O or Section 3 stdio), but we'll probably eventually want to pass that as a parameter of the file access property list, which hasn't been implemented yet. When creating a family, a default family member size is used (defined at the top of H5Ffamily.c, currently 64MB) but that also should be settable in the file access property list. When opening an existing family, the size of the first member is used to determine the member size (flushing/closing a family ensures that the first member is the correct size) but the other family members don't have to be that large (the local address space, however, is logically the same size for all members).
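To illustrate the fixed-size family scheme, here is a sketch of how a linear address might be mapped to a member file name and a local offset. It assumes the mapping is a plain divide/modulo by the member size and that the name contains exactly one integer conversion such as %d or %05d. The function name locate_member and the MEMBER_SIZE macro are hypothetical; this is not the code in H5Ffamily.c.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* 64MB, the default member size cited in the text. */
#define MEMBER_SIZE (UINT64_C(64) * 1024 * 1024)

/*
 * Write the name of the member file that holds `addr` into `buf` and
 * return the offset of `addr` within that member.  `name_template` is
 * assumed to contain exactly one printf-style integer conversion.
 */
static uint64_t locate_member(const char *name_template, uint64_t addr,
                              char *buf, size_t buf_size)
{
    int member = (int)(addr / MEMBER_SIZE);
    int n = snprintf(buf, buf_size, name_template, member);

    assert(n >= 0 && (size_t)n < buf_size);   /* name must fit in buf */
    (void)n;
    return addr % MEMBER_SIZE;
}
```

For example, with the template "data%05d.h5", an address seven bytes into the third member resolves to the file data00002.h5 at offset 7.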

I haven't implemented a split meta/raw family yet but am rather curious to see how it would perform. I was planning to use the `.h5' extension for the meta data file and `.raw' for the raw data file. The high-order bit in the address would determine whether the address refers to meta data or raw data. If the user passes a name that ends with `.raw' to H5F_open then we'll choose the split family and use the default low-level driver for each of the two family members. Eventually we'll want to pass these kinds of things through the file access property list instead of relying on naming convention.
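Since the split family isn't implemented, the following is only a sketch of the idea. It assumes a 64-bit address whose top bit, when set, selects the raw data file; the text says only that the high-order bit decides, not which value means which. All names here (RAW_BIT, split_member, split_offset, has_raw_suffix) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: bit 63 set means "raw data file", clear means "meta data file". */
#define RAW_BIT (UINT64_C(1) << 63)

typedef enum { MEMBER_META = 0, MEMBER_RAW = 1 } split_member_t;

/* Decide which of the two files an address refers to. */
static split_member_t split_member(uint64_t addr)
{
    return (addr & RAW_BIT) ? MEMBER_RAW : MEMBER_META;
}

/* Strip the selector bit to get the offset within the chosen file. */
static uint64_t split_offset(uint64_t addr)
{
    return addr & ~RAW_BIT;
}

/* Naming-convention check: does the file name end with ".raw"? */
static int has_raw_suffix(const char *name)
{
    size_t n;

    assert(name != NULL);
    n = strlen(name);
    return n >= 4 && strcmp(name + n - 4, ".raw") == 0;
}
```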

External Raw Data

We also need the ability to point to raw data that isn't in the HDF5 linear address space. For instance, a dataset might be striped across several raw data files.

Fortunately, the only two packages that need to be aware of this are the packages for reading/writing contiguous raw data and discontiguous raw data. Since contiguous raw data is a special case, I'll discuss how to implement external raw data in the discontiguous case.

Discontiguous data is stored as a B-tree whose keys are the chunk indices and whose leaf nodes point to the raw data by storing a file address. So what we need is some way to name the external files, and a way to efficiently store the external file name for each chunk.

I propose adding to the object header an External File List message that is a 1-origin array of file names. Then, in the B-tree, each key has an index into the External File List (or zero for the HDF5 file) for the file where the chunk can be found. The external file index is only used at the leaf nodes to get to the raw data (the entire B-tree is in the HDF5 file) but because of the way keys are copied among the B-tree nodes, it's much easier to store the index with every key.
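A sketch of how the proposed 1-origin index could be resolved to a file name. The struct layouts and the function chunk_file_name are hypothetical illustrations of the proposal, not the actual object header message or B-tree key formats.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory form of the External File List message. */
typedef struct {
    size_t       nfiles;   /* number of external files */
    const char **names;    /* names[0] corresponds to file index 1 */
} ext_file_list_t;

/* Hypothetical B-tree key fields relevant to locating a chunk. */
typedef struct {
    uint32_t file_index;   /* 0 = the HDF5 file itself, else 1-origin index */
    uint64_t addr;         /* address of the chunk within that file */
} chunk_key_t;

/* Return the name of the file that holds the chunk described by `key`. */
static const char *chunk_file_name(const ext_file_list_t *efl,
                                   const chunk_key_t *key,
                                   const char *hdf5_name)
{
    if (key->file_index == 0)
        return hdf5_name;                      /* chunk lives in the HDF5 file */
    assert(key->file_index <= efl->nfiles);    /* index must be in range */
    return efl->names[key->file_index - 1];    /* convert 1-origin to 0-origin */
}
```

Reserving zero for the HDF5 file means existing keys with no external data need no special case beyond the first branch.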

Multiple HDF5 Files

One might also want to combine two or more HDF5 files in a manner similar to mounting file systems in Unix. That is, the group structure and meta data from the second file appear as though they exist in the first file. One opens File-A, and then mounts File-B at some point in File-A, the mount point, so that traversing into the mount point actually causes one to enter the root object of File-B. File-A and File-B are each complete HDF5 files and can be accessed individually without mounting them.

We need a couple of additional pieces of machinery to make this work. First, an haddr_t type (a file address) doesn't contain any info about which HDF5 file's address space the address belongs to. But since haddr_t is an opaque type except at layers 2 and below, it should be quite easy to add a pointer to the HDF5 file. This would also remove the H5F_t argument from most of the low-level functions since it would be part of the OID.

The other thing we need is a table of mount points and some functions that understand them. We would add the following table to each H5F_t struct:

struct H5F_mount_t {
   H5F_t *parent;         /* Parent HDF5 file if any */
   struct {
      H5F_t *f;           /* File which is mounted */
      haddr_t where;      /* Address of mount point */
   } *mount;              /* Array sorted by mount point */
   intn nmounts;          /* Number of mounted files */
   intn alloc;            /* Size of mount table */
};

The H5Fmount function takes the ID of an open file or group, the name of a to-be-mounted file, the name of the mount point, and a file access property list (like H5Fopen). It opens the new file and adds a record to the parent's mount table. The H5Funmount function takes the parent file or group ID and the name of the mount point and disassociates the mounted file from the mount point. It does not close the mounted file. The H5Fclose function closes/unmounts files recursively.

The H5G_iname function which translates a name to a file address (haddr_t) looks at the mount table at each step in the translation and switches files where appropriate. All name-to-address translations occur through this function.
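Because the mount array is sorted by mount point, the per-step check during name translation could be a binary search. The sketch below uses stand-in types (a plain integer for the opaque haddr_t and a dummy struct for H5F_t), and the function find_mount is hypothetical. It shows only the lookup, not how H5G_iname would call it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t addr_t;                     /* stand-in for the opaque haddr_t */
typedef struct file_t { const char *name; } file_t;   /* stand-in for H5F_t */

typedef struct {
    file_t *f;        /* file which is mounted */
    addr_t  where;    /* address of mount point */
} mount_entry_t;

typedef struct {
    file_t        *parent;    /* parent file, if any */
    mount_entry_t *mount;     /* array sorted by `where` */
    int            nmounts;   /* number of mounted files */
    int            alloc;     /* allocated size of the table */
} mount_table_t;

/* Return the file mounted at `where`, or NULL if nothing is mounted there. */
static file_t *find_mount(const mount_table_t *t, addr_t where)
{
    int lo = 0, hi;

    assert(t != NULL);
    hi = t->nmounts - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (t->mount[mid].where == where)
            return t->mount[mid].f;
        if (t->mount[mid].where < where)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return NULL;
}
```

A non-NULL result would tell the translation step to continue from the root object of the returned file instead of the object at the mount point.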

How Long?

I'm expecting to be able to implement the two new flavors of single linear address space in about two days. It took two hours to implement the malloc/free file driver at layer zero and I don't expect this to be much more work.

I'm expecting three days to implement the external raw data for discontiguous arrays. Adding the file index to the B-tree is quite trivial; adding the external file list message shouldn't be too hard since the object header message class from which this message derives is fully implemented; and changing H5F_istore_read should be trivial. Most of the time will be spent designing a way to cache Unix file descriptors efficiently since the total number of open files allowed per process could be much smaller than the total number of HDF5 files and external raw data files.

I'm expecting four days to implement being able to mount one HDF5 file on another. I was originally planning a lot more, but making haddr_t opaque turned out to be much easier than I planned (I did it last Fri). Most of the work will probably be removing the redundant H5F_t arguments for lots of functions.

Conclusion

The external raw data could be implemented as a single linear address space, but doing so would require one to allocate large enough file addresses throughout the file (>32 bits) before the file was created. It would also make mixing an HDF5 file family with external raw data, or an external HDF5 wrapper around an HDF4 file, a more difficult process. So I consider the implementation of external raw data files as a single HDF5 linear address space a kludge.

The ability to mount one HDF5 file on another might not be a very important feature especially since each HDF5 file must be a complete file by itself. It's not possible to stripe an array over multiple HDF5 files because the B-tree wouldn't be complete in any one file, so the only choice is to stripe the array across multiple raw data files and store the B-tree in the HDF5 file. On the other hand, it might be useful if one file contains some public data which can be mounted by other files (e.g., a mesh topology shared among collaborators and mounted by files that contain other fields defined on the mesh). Of course the applications can open the two files separately, but it might be more portable if we support it in the library.

So we're looking at about two weeks to implement all three versions. I didn't get a chance to do any of them in AIO although we had long-term plans for the first two with a possibility of the third. They'll be much easier to implement in HDF5 than AIO since I've been keeping these in mind from the start.

Robb Matzke

Last modified: Tue Sep 8 14:43:32 EDT 1998