This table shows some of the layers of HDF5. Each layer calls functions at the same or lower layers and never functions at higher layers. An object identifier (OID) takes various forms at the various layers: at layer 0 an OID is an absolute physical file address; at layers 1 and 2 it's an absolute virtual file address. At layers 3 through 6 it's a relative address, and at layers 7 and above it's an object handle.
Layer-7 | Groups | Datasets | |
Layer-6 | Indirect Storage | Symbol Tables | |
Layer-5 | B-trees | Object Hdrs | Heaps |
Layer-4 | Caching | ||
Layer-3 | H5F chunk I/O | ||
Layer-2 | H5F low | ||
Layer-1 | File Family | Split Meta/Raw | |
Layer-0 | Section-2 I/O | Standard I/O | Malloc/Free |
The simplest form of hdf5 file is a single file containing only hdf5 data. The file begins with the boot block, which is followed until the end of the file by hdf5 data. The next most complicated file allows non-hdf5 data (user defined data or internal wrappers) to appear before the boot block and after the end of the hdf5 data. The hdf5 data is treated as a single linear address space in both cases.
The next level of complexity comes when non-hdf5 data is interspersed with the hdf5 data. We handle that by including the non-hdf5 interspersed data in the hdf5 address space and simply not referencing it (eventually we might add those addresses to a "do-not-disturb" list using the same mechanism as the hdf5 free list, but it's not absolutely necessary). This is implemented except for the "do-not-disturb" list.
The most complicated single address space hdf5 file is when we allow the address space to be split among multiple physical files. For instance, a >2GB file can be split into smaller chunks and transfered to a 32 bit machine, then accessed as a single logical hdf5 file. The library already supports >32 bit addresses, so at layer 1 we split a 64-bit address into a 32-bit file number and a 32-bit offset (the 64 and 32 are arbitrary). The rest of the library still operates with a linear address space.
Another variation might be a family of two files where all the meta data is stored in one file and all the raw data is stored in another file to allow the HDF5 wrapper to be easily replaced with some other wrapper.
The H5Fcreate
and H5Fopen
functions
would need to be modified to pass file-type info down to layer 2
so the correct drivers can be called and parameters passed to
the drivers to initialize them.
I've implemented fixed-size family members. The entire hdf5
file is partitioned into members where each member is the same
size. The family scheme is used if one passes a name to
H5F_open
(which is called by H5Fopen()
and H5Fcreate
) that contains a
printf(3c)
-style integer format specifier.
Currently, the default low-level file driver is used for all
family members (H5F_LOW_DFLT, usually set to be Section 2 I/O or
Section 3 stdio), but we'll probably eventually want to pass
that as a parameter of the file access property list, which
hasn't been implemented yet. When creating a family, a default
family member size is used (defined at the top H5Ffamily.c,
currently 64MB) but that also should be settable in the file
access property list. When opening an existing family, the size
of the first member is used to determine the member size
(flushing/closing a family ensures that the first member is the
correct size) but the other family members don't have to be that
large (the local address space, however, is logically the same
size for all members).
I haven't implemented a split meta/raw family yet but am rather
curious to see how it would perform. I was planning to use the
`.h5' extension for the meta data file and `.raw' for the raw
data file. The high-order bit in the address would determine
whether the address refers to meta data or raw data. If the user
passes a name that ends with `.raw' to H5F_open
then we'll chose the split family and use the default low level
driver for each of the two family members. Eventually we'll
want to pass these kinds of things through the file access
property list instead of relying on naming convention.
We also need the ability to point to raw data that isn't in the HDF5 linear address space. For instance, a dataset might be striped across several raw data files.
Fortunately, the only two packages that need to be aware of this are the packages for reading/writing contiguous raw data and discontiguous raw data. Since contiguous raw data is a special case, I'll discuss how to implement external raw data in the discontiguous case.
Discontiguous data is stored as a B-tree whose keys are the chunk indices and whose leaf nodes point to the raw data by storing a file address. So what we need is some way to name the external files, and a way to efficiently store the external file name for each chunk.
I propose adding to the object header an External File List message that is a 1-origin array of file names. Then, in the B-tree, each key has an index into the External File List (or zero for the HDF5 file) for the file where the chunk can be found. The external file index is only used at the leaf nodes to get to the raw data (the entire B-tree is in the HDF5 file) but because of the way keys are copied among the B-tree nodes, it's much easier to store the index with every key.
One might also want to combine two or more HDF5 files in a manner similar to mounting file systems in Unix. That is, the group structure and meta data from one file appear as though they exist in the first file. One opens File-A, and then mounts File-B at some point in File-A, the mount point, so that traversing into the mount point actually causes one to enter the root object of File-B. File-A and File-B are each complete HDF5 files and can be accessed individually without mounting them.
We need a couple additional pieces of machinery to make this work. First, an haddr_t type (a file address) doesn't contain any info about which HDF5 file's address space the address belongs to. But since haddr_t is an opaque type except at layers 2 and below, it should be quite easy to add a pointer to the HDF5 file. This would also remove the H5F_t argument from most of the low-level functions since it would be part of the OID.
The other thing we need is a table of mount points and some functions that understand them. We would add the following table to each H5F_t struct:
struct H5F_mount_t {
H5F_t *parent; /* Parent HDF5 file if any */
struct {
H5F_t *f; /* File which is mounted */
haddr_t where; /* Address of mount point */
} *mount; /* Array sorted by mount point */
intn nmounts; /* Number of mounted files */
intn alloc; /* Size of mount table */
}
The H5Fmount
function takes the ID of an open
file, the name of a to-be-mounted file, the name of the mount
point, and a file access property list (like H5Fopen
).
It opens the new file and adds a record to the parent's mount
table. The H5Funmount
function takes the parent
file ID and the name of the mount point and closes the file
that's mounted at that point. The H5Fclose
function closes/unmounts files recursively.
The H5G_iname
function which translates a name to
a file address (haddr_t
) looks at the mount table
at each step in the translation and switches files where
appropriate. All name-to-address translations occur through
this function.
I'm expecting to be able to implement the two new flavors of single linear address space in about two days. It took two hours to implement the malloc/free file driver at level zero and I don't expect this to be much more work.
I'm expecting three days to implement the external raw data for
discontiguous arrays. Adding the file index to the B-tree is
quite trivial; adding the external file list message shouldn't
be too hard since the object header message class from wich this
message derives is fully implemented; and changing
H5F_istore_read
should be trivial. Most of the
time will be spent designing a way to cache Unix file
descriptors efficiently since the total number open files
allowed per process could be much smaller than the total number
of HDF5 files and external raw data files.
I'm expecting four days to implement being able to mount one
HDF5 file on another. I was originally planning a lot more, but
making haddr_t
opaque turned out to be much easier
than I planned (I did it last Fri). Most of the work will
probably be removing the redundant H5F_t arguments for lots of
functions.
The external raw data could be implemented as a single linear address space, but doing so would require one to allocate large enough file addresses throughout the file (>32bits) before the file was created. It would make mixing an HDF5 file family with external raw data, or external HDF5 wrapper around an HDF4 file a more difficult process. So I consider the implementation of external raw data files as a single HDF5 linear address space a kludge.
The ability to mount one HDF5 file on another might not be a very important feature especially since each HDF5 file must be a complete file by itself. It's not possible to stripe an array over multiple HDF5 files because the B-tree wouldn't be complete in any one file, so the only choice is to stripe the array across multiple raw data files and store the B-tree in the HDF5 file. On the other hand, it might be useful if one file contains some public data which can be mounted by other files (e.g., a mesh topology shared among collaborators and mounted by files that contain other fields defined on the mesh). Of course the applications can open the two files separately, but it might be more portable if we support it in the library.
So we're looking at about two weeks to implement all three versions. I didn't get a chance to do any of them in AIO although we had long-term plans for the first two with a possibility of the third. They'll be much easier to implement in HDF5 than AIO since I've been keeping these in mind from the start.