Backward/Forward Compatibility

The HDF5 development must proceed in such a manner as to satisfy the following conditions:

  A. HDF5 applications can produce data that HDF5 applications can read and write and HDF4 applications can produce data that HDF4 applications can read and write. The situation that demands this condition is obvious.
  B. HDF5 applications are able to produce data that HDF4 applications can read and HDF4 applications can subsequently modify the file subject to certain constraints depending on the implementation. This condition is for the temporary situation where a consumer has neither been relinked with a new HDF4 API built on top of the HDF5 API nor recompiled with the HDF5 API.
  C. HDF5 applications can read existing HDF4 files and subsequently modify the file subject to certain constraints depending on the implementation. This condition is for the temporary situation in which the producer has neither been relinked with a new HDF4 API built on top of the HDF5 API nor recompiled with the HDF5 API, or the permanent situation of HDF5 consumers reading archived HDF4 files.
  D. There's at least one invariant: new object features introduced in the HDF5 file format (like 2-d arrays of structs) might be impossible to "translate" to a format that an old HDF4 application can understand, because either the HDF4 file format or the HDF4 API has no mechanism to describe the object.

    What follows is one possible implementation based on how Condition B was solved in the AIO/PDB world. It also attempts to satisfy these goals:

    1. The main HDF5 library contains as little extra baggage as possible by either relying on external programs to take care of compatibility issues or by incorporating the logic of such programs as optional modules in the HDF5 library. Conditions B and C are separate programs/modules.
    2. No extra baggage not only means the library proper is small, but also means it can be implemented from the ground up (rather than migrated from HDF4 source) with minimal regard for HDF4, thus keeping the logic straightforward.
    3. Compatibility issues are handled behind the scenes when necessary (and possible) but can be carried out explicitly during things like data migration.

    Wrappers

    The proposed implementation uses wrappers to handle compatibility issues. A Format-X file is wrapped in a Format-Y file by creating a Format-Y skeleton that replicates the Format-X meta data. The Format-Y skeleton points to the raw data stored in Format-X without moving the raw data. The restriction is that the raw data storage methods in Format-Y are a superset of the raw data storage methods in Format-X (otherwise the raw data must be copied to Format-Y). We're assuming that meta data is small with respect to the entire file.

    The wrapper can be a separate file that has pointers into the first file or it can be contained within the first file. If contained in a single file, the file can appear as a Format-Y file or simultaneously a Format-Y and Format-X file.

    The Format-X meta data can be thought of as the original wrapper around raw data and Format-Y is a second wrapper around the same data. The wrappers are independent of one another; modifying the meta data in one wrapper causes the other to become out of date. Modification of raw data doesn't invalidate either view as long as the meta data that describes its storage isn't modified. For instance, an array element can change values if storage is already allocated for the element, but if storage isn't allocated then the meta data describing the storage must change, invalidating all wrappers but one.

    It's perfectly legal to modify the meta data of one wrapper without modifying the meta data in the other wrapper(s). The illegal part is accessing the raw data through a wrapper which is out of date.
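The validity rule above can be modeled with a generation counter. The following is a minimal sketch, not part of any HDF4 or HDF5 API: every name in it (wrapped_file, wf_change_storage, and so on) is hypothetical, and it assumes that a wrapper is "current" exactly when it describes the latest storage meta data.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; none of these names come from HDF4 or HDF5. */
enum { MAX_WRAPPERS = 4 };

typedef struct {
    unsigned long storage_gen;            /* bumped when storage meta data changes */
    unsigned long seen_gen[MAX_WRAPPERS]; /* generation each wrapper describes */
    int nwrappers;
} wrapped_file;

void wf_init(wrapped_file *f, int nwrappers)
{
    assert(nwrappers > 0 && nwrappers <= MAX_WRAPPERS);
    f->storage_gen = 0;
    f->nwrappers = nwrappers;
    for (int i = 0; i < nwrappers; i++)
        f->seen_gen[i] = 0;
}

/* Overwriting an element whose storage is already allocated:
 * no meta data changes, so every wrapper stays current. */
void wf_write_raw(wrapped_file *f)
{
    (void)f;
}

/* Allocating new storage through wrapper w changes the storage meta
 * data, so only wrapper w remains current afterwards. */
void wf_change_storage(wrapped_file *f, int w)
{
    assert(w >= 0 && w < f->nwrappers);
    f->storage_gen++;
    f->seen_gen[w] = f->storage_gen;
}

/* Raw data may only be accessed through a wrapper that is current. */
bool wf_is_current(const wrapped_file *f, int w)
{
    assert(w >= 0 && w < f->nwrappers);
    return f->seen_gen[w] == f->storage_gen;
}
```

With two wrappers, a call to wf_write_raw leaves both current, while wf_change_storage(&f, 1) leaves only wrapper 1 usable for raw data access.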

    If raw data is wrapped by more than one internal wrapper (internal means that the wrapper is in the same file as the raw data) then access to that file must assume that unreferenced parts of that file contain meta data for another wrapper and cannot be reclaimed as free memory.


    Implementation of Condition B

    Since this is a temporary situation which can't be automatically detected by the HDF5 library, we must rely on the application to notify the HDF5 library whether or not it must satisfy Condition B. (Even if we don't rely on the application, at some point someone is going to remove the Condition B constraint from the library.) So the module that handles Condition B is conditionally compiled and then enabled on a per-file basis.

    If the application desires to produce an HDF4 file (determined by arguments to H5Fopen), and the Condition B module is compiled into the library, then H5Fclose calls the module to traverse the HDF5 wrapper and generate an additional internal or external HDF4 wrapper (wrapper specifics are described below). If Condition B is implemented as a module then it can benefit from the metadata already cached by the main library.
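One way to picture "conditionally compiled and then enabled on a per-file basis" is a build-time switch combined with a per-file flag that the close routine checks. This is only a sketch: HAVE_CONDITION_B, OPEN_HDF4_VIEW, demo_file and demo_close are invented for illustration and do not correspond to actual H5Fopen arguments or library internals.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical build switch: set to 0 to compile the module out. */
#define HAVE_CONDITION_B 1

/* Hypothetical per-file flag requested by the application at open time. */
enum { OPEN_HDF4_VIEW = 0x1 };

typedef struct {
    unsigned flags;           /* flags given when the file was opened */
    int hdf4_wrapper_written; /* set when the Condition B module has run */
} demo_file;

/* Stand-in for the close routine: it calls the Condition B module only
 * when the module is compiled in AND the file was opened with the flag. */
int demo_close(demo_file *f)
{
    assert(f != NULL);
#if HAVE_CONDITION_B
    if (f->flags & OPEN_HDF4_VIEW) {
        /* A real module would traverse the HDF5 wrapper here and
         * emit an internal or external HDF4 wrapper. */
        f->hdf4_wrapper_written = 1;
    }
#endif
    return 0;
}
```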

    An internal HDF4 wrapper would be used if the HDF5 file is writable and the user doesn't mind that the HDF5 file is modified. An external wrapper would be used if the file isn't writable or if the user wants the data file to be primarily HDF5 but a few applications need an HDF4 view of the data.

    Modifying through the HDF5 library an HDF5 file that has an internal HDF4 wrapper should invalidate the HDF4 wrapper (and optionally regenerate it when H5Fclose is called). The HDF5 library must understand how wrappers work, but not necessarily anything about the HDF4 file format.

    Modifying through the HDF5 library an HDF5 file that has an external HDF4 wrapper will cause the HDF4 wrapper to become out of date (though it may be regenerated during H5Fclose). Note: Perhaps the next release of the HDF4 library should ensure that the HDF4 wrapper file has a more recent modification time than the raw data file (the HDF5 file) to which it points(?)
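If the modification-time idea in the note were adopted, the check might look like the sketch below. It uses the POSIX stat() call; the function names are hypothetical, and treating equal timestamps as stale is a conservative assumption made here, not something the note specifies.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/stat.h>
#include <time.h>

/* Pure comparison: the wrapper counts as stale unless it is strictly
 * newer than the data file (equal timestamps are treated as stale,
 * which is an assumption of this sketch). */
int mtime_says_stale(time_t wrapper_mtime, time_t data_mtime)
{
    return wrapper_mtime <= data_mtime;
}

/* Returns 1 if the external wrapper looks stale, 0 if it looks
 * current, and -1 if either file cannot be examined. */
int external_wrapper_is_stale(const char *wrapper_path, const char *data_path)
{
    struct stat w, d;

    assert(wrapper_path != NULL && data_path != NULL);
    if (stat(wrapper_path, &w) != 0 || stat(data_path, &d) != 0)
        return -1;
    return mtime_says_stale(w.st_mtime, d.st_mtime);
}
```

A timestamp test of this kind only has one-second resolution with st_mtime, so two modifications within the same second cannot be ordered by it.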

    Modifying through the HDF4 library an HDF5 file that has an internal or external HDF4 wrapper will cause the HDF5 wrapper to become out of date. However, there is no way for the old HDF4 library to notify the HDF5 wrapper that it's out of date. Therefore the HDF5 library must be able to detect when the HDF5 wrapper is out of date and be able to fix it. If the HDF4 wrapper is complete then the easy way is to ignore the original HDF5 wrapper and generate a new one from the HDF4 wrapper. The other approach is to compare the HDF4 and HDF5 wrappers and apply three rules: if they differ, assume HDF4 is the right one; if HDF4 omits data, assume it is because HDF4 is a partial wrapper (rather than because HDF4 deleted the data); and if HDF4 has new data, copy the new meta data to the HDF5 wrapper. On the other hand, perhaps we don't need to allow these situations (modifying an HDF5 file with the old HDF4 library and then accessing it with the HDF5 library is either disallowed or causes HDF5 objects that can't be described by HDF4 to be lost).
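The compare-and-merge approach can be sketched over a toy meta data table. The entry type below (an object id plus a storage address) is a hypothetical simplification; real wrappers carry far richer meta data, and translating between the two formats is ignored here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical meta data record: one object and where its data lives. */
typedef struct {
    int  id;
    long addr;
} entry;

/* Bring the HDF5 table up to date from the HDF4 table, in place.
 *  - same id in both tables: the HDF4 entry wins
 *  - id only in HDF5: kept, since HDF4 may be a partial wrapper
 *  - id only in HDF4: copied into the HDF5 table
 * cap is the capacity of h5; the new entry count is returned. */
size_t reconcile(entry *h5, size_t n5, size_t cap,
                 const entry *h4, size_t n4)
{
    for (size_t i = 0; i < n4; i++) {
        size_t j = 0;
        while (j < n5 && h5[j].id != h4[i].id)
            j++;
        if (j < n5) {
            h5[j] = h4[i];        /* present in both: HDF4 is right */
        } else {
            assert(n5 < cap);     /* the sketch does not grow the table */
            h5[n5++] = h4[i];     /* new in HDF4: copy it over */
        }
    }
    return n5;                    /* HDF5-only entries were not touched */
}
```

For example, merging an HDF4 table {2 -> 250, 3 -> 300} into an HDF5 table {1 -> 100, 2 -> 200} yields {1 -> 100, 2 -> 250, 3 -> 300}.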

    To convert an HDF5 file to an HDF4 file on demand, one simply opens the file with the HDF4 flag and closes it. This is also how AIO implemented backward compatibility with PDB in its file format.


    Implementation of Condition C

    This condition must be satisfied for all time because there will always be archived HDF4 files. If a pure HDF4 file (that is, one without HDF5 meta data) is opened with an HDF5 library, H5Fopen builds an internal or external HDF5 wrapper and then accesses the raw data through that wrapper. If the HDF5 library modifies the file then the HDF4 wrapper becomes out of date. However, since the HDF5 library hasn't been released, we can at least implement it to disable and/or reclaim the HDF4 wrapper.

    If an external and temporary HDF5 wrapper is desired, the wrapper is created through the cache like all other HDF5 files. The data appears on disk only if a particular cached datum is preempted. Instead of calling H5Fclose on the HDF5 wrapper file we call H5Fabort which immediately releases all file resources without updating the file, and then we unlink the file from Unix.
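The lifetime of such a temporary wrapper file can be sketched with plain POSIX calls. In this sketch mkstemp() creates the scratch file, close() merely stands in for the abort step (no attempt is made to model the cache), and unlink() removes the name. The function name and the /tmp template are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Create a scratch wrapper file, then discard it without keeping
 * anything. The name that was used is stored in path_out so the
 * caller can confirm it is gone. Returns 0 on success, -1 on error. */
int temp_wrapper_roundtrip(char *path_out, size_t len)
{
    const char tmpl[] = "/tmp/h5wrapXXXXXX"; /* hypothetical location */
    int fd;

    assert(path_out != NULL);
    if (len < sizeof tmpl)
        return -1;
    memcpy(path_out, tmpl, sizeof tmpl);

    fd = mkstemp(path_out);
    if (fd < 0)
        return -1;

    /* The wrapper meta data would live in the cache at this point;
     * only preempted cache entries would ever reach the file. */

    if (close(fd) != 0) {       /* stands in for the abort step */
        unlink(path_out);
        return -1;
    }
    return unlink(path_out);    /* remove the scratch file */
}
```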


    What do wrappers look like?

    External wrappers are quite obvious: they contain only things from the format specs for the wrapper and nothing from the format specs of the format which they wrap.

    An internal HDF4 wrapper is added to an HDF5 file in such a way that the file appears to be both an HDF4 file and an HDF5 file. HDF4 requires an HDF4 file header at file offset zero. If a user block is present then we just move the user block down a bit (and truncate it) and insert the minimum HDF4 signature. The HDF4 dd list and any other data it needs are appended to the end of the file and the HDF5 signature uses the logical file length field to determine the beginning of the trailing part of the wrapper.

    The resulting file consists of four parts, in order of increasing file address:

    1. HDF4 minimal file header. Its main job is to point to the dd list at the end of the file.
    2. User-defined block which is truncated by the size of the HDF4 file header so that the HDF5 super block file address doesn't change.
    3. The HDF5 super block and data, unmodified by adding the HDF4 wrapper.
    4. The main part of the HDF4 wrapper. The dd list will have entries for all parts of the file so hdpack(?) doesn't (re)move anything.
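The offsets of these four parts follow from three sizes, as the sketch below works out. All names are hypothetical, the header size passed in is a placeholder rather than the real HDF4 header size, and hdf5_end is assumed to be an absolute file offset (the text does not say whether the logical file length is measured from the super block or from the start of the file).

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    unsigned long off; /* first byte of the region */
    unsigned long len; /* number of bytes in the region */
} region;

typedef struct {
    region hdf4_header;  /* minimal HDF4 header at offset zero */
    region user_block;   /* truncated user block */
    region hdf5_body;    /* super block and data, not moved */
    region hdf4_trailer; /* dd list and other HDF4 wrapper data */
} layout;

/* Plan an internal HDF4 wrapper for an HDF5 file whose super block is
 * at address user_block_size and whose HDF5 contents end at hdf5_end.
 * Returns -1 when the user block cannot hold the HDF4 header. */
int plan_layout(unsigned long user_block_size,
                unsigned long hdf4_header_size,
                unsigned long hdf5_end,
                unsigned long trailer_size,
                layout *out)
{
    assert(out != NULL);
    if (user_block_size < hdf4_header_size || hdf5_end < user_block_size)
        return -1;

    out->hdf4_header.off  = 0;
    out->hdf4_header.len  = hdf4_header_size;
    out->user_block.off   = hdf4_header_size;
    out->user_block.len   = user_block_size - hdf4_header_size;
    out->hdf5_body.off    = user_block_size; /* super block stays put */
    out->hdf5_body.len    = hdf5_end - user_block_size;
    out->hdf4_trailer.off = hdf5_end;
    out->hdf4_trailer.len = trailer_size;
    return 0;
}
```

The failure return for a user block smaller than the header covers the case of a file with no user block at all, where no room exists at offset zero without moving the super block.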

    When such a file is opened by the HDF5 library for modification it shifts the user block back down to address zero and fills with zeros, then truncates the file at the end of the HDF5 data or adds the trailing HDF4 wrapper to the free list. This prevents HDF4 applications from reading the file with an out of date wrapper.

    If there is no user block then we have a problem. The HDF5 super block must be moved to make room for the HDF4 file header. But moving just the super block causes problems because all file addresses stored in the file are relative to the super block address. The only option is to shift the entire file contents by 512 bytes to open up a user block (too bad we don't have hooks into the Unix i-node stuff so we could shift the entire file contents by the size of a file system page without ever performing I/O on the file :-)
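Shifting the whole file by performing I/O might be done as in the sketch below, which copies from the tail toward the head so that no byte is overwritten before it has been read. The function name is hypothetical, only standard C stdio is used, and zero-filling the opened hole is an assumption of this sketch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum { CHUNK = 4096 };

/* Move bytes [0, size) of fp to [shift, size + shift), then zero-fill
 * [0, shift). fp must be open for both reading and writing in binary
 * mode. Returns 0 on success, -1 on any I/O error. */
int shift_file_down(FILE *fp, long size, long shift)
{
    unsigned char buf[CHUNK];
    long remaining = size;

    assert(fp != NULL && size >= 0 && shift > 0);

    /* Copy the last chunk first, then work toward the file start. */
    while (remaining > 0) {
        long n = remaining < CHUNK ? remaining : CHUNK;
        long src = remaining - n;

        if (fseek(fp, src, SEEK_SET) != 0) return -1;
        if (fread(buf, 1, (size_t)n, fp) != (size_t)n) return -1;
        if (fseek(fp, src + shift, SEEK_SET) != 0) return -1;
        if (fwrite(buf, 1, (size_t)n, fp) != (size_t)n) return -1;
        remaining = src;
    }

    /* Clear the hole that now sits in front of the shifted contents. */
    memset(buf, 0, sizeof buf);
    if (fseek(fp, 0, SEEK_SET) != 0) return -1;
    for (long done = 0; done < shift; ) {
        long n = shift - done < CHUNK ? shift - done : CHUNK;

        if (fwrite(buf, 1, (size_t)n, fp) != (size_t)n) return -1;
        done += n;
    }
    return fflush(fp) == 0 ? 0 : -1;
}
```

The cost is one read and one write of every byte in the file, which is exactly the I/O the paragraph above wishes could be avoided.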

    Is it possible to place an HDF5 wrapper in an HDF4 file? I don't know enough about the HDF4 format, but I would suspect it might be possible to open a hole at file address 512 (and possibly before) by moving some things to the end of the file to make room for the HDF5 signature. The remainder of the HDF5 wrapper goes at the end of the file and entries are added to the HDF4 dd list to mark the location(s) of the HDF5 wrapper.


    Other Thoughts

    Conversion programs that copy an entire HDF4 file to a separate, self-contained HDF5 file and vice versa might be useful.


    Robb Matzke
    Last modified: Wed Oct 8 12:34:42 EST 1997