<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html>
  <head>
    <title>External Files in HDF5</title>
  </head>

  <body>
    <center><h1>External Files in HDF5</h1></center>

    <h3>Overview of Layers</h3>

    <p>This table shows some of the layers of HDF5.  Each layer calls
      functions at the same or lower layers and never functions at
      higher layers.  An object identifier (OID) takes various forms
      at the various layers: at layer 0 an OID is an absolute physical
      file address; at layers 1 and 2 it's an absolute virtual file
      address. At layers 3 through 6 it's a relative address, and at
      layers 7 and above it's an object handle.

    <p><center>
	<table border cellpadding=4 width="60%">
	  <tr align=center>
	    <td>Layer-7</td>
	    <td>Groups</td>
	    <td>Datasets</td>
	  </tr>
	  <tr align=center>
	    <td>Layer-6</td>
	    <td>Indirect Storage</td>
	    <td>Symbol Tables</td>
	  </tr>
	  <tr align=center>
	    <td>Layer-5</td>
	    <td>B-trees</td>
	    <td>Object Hdrs</td>
	    <td>Heaps</td>
	  </tr>
	  <tr align=center>
	    <td>Layer-4</td>
	    <td>Caching</td>
	  </tr>
	  <tr align=center>
	    <td>Layer-3</td>
	    <td>H5F chunk I/O</td>
	  </tr>
	  <tr align=center>
	    <td>Layer-2</td>
	    <td>H5F low</td>
	  </tr>
	  <tr align=center>
	    <td>Layer-1</td>
	    <td>File Family</td>
	    <td>Split Meta/Raw</td>
	  </tr>
	  <tr align=center>
	    <td>Layer-0</td>
	    <td>Section-2 I/O</td>
	    <td>Standard I/O</td>
	    <td>Malloc/Free</td>
	  </tr>
	</table>
      </center>

    <h3>Single Address Space</h3>

    <p>The simplest form of hdf5 file is a single file containing only
      hdf5 data. The file begins with the super block, which is
      followed until the end of the file by hdf5 data.  The next most
      complicated file allows non-hdf5 data (user defined data or
      internal wrappers) to appear before the super block and after the
      end of the hdf5 data.  The hdf5 data is treated as a single
      linear address space in both cases.

    <p>The next level of complexity comes when non-hdf5 data is
      interspersed with the hdf5 data.  We handle that by including
      the non-hdf5 interspersed data in the hdf5 address space and
      simply not referencing it (eventually we might add those
      addresses to a "do-not-disturb" list using the same mechanism as
      the hdf5 free list, but it's not absolutely necessary).  This is
      implemented except for the "do-not-disturb" list.

    <p>The most complicated single address space hdf5 file is when we
      allow the address space to be split among multiple physical
      files. For instance, a file larger than 2GB can be split into
      smaller chunks and transferred to a 32-bit machine, then accessed
      as a single logical hdf5 file.  The library already supports
      addresses wider than 32 bits, so at layer 1 we split a 64-bit address into a 32-bit
      file number and a 32-bit offset (the 64 and 32 are
      arbitrary). The rest of the library still operates with a linear
      address space.
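
    <p>As a rough sketch of that layer-1 split (not the library's
      actual code), the following assumes the arbitrary 32/32
      division mentioned above.  The names <code>split_addr_t</code>,
      <code>split_addr</code>, and <code>join_addr</code> are
      hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the 32/32 split is the arbitrary choice from the text. */
#define OFFSET_BITS 32u
#define OFFSET_MASK ((UINT64_C(1) << OFFSET_BITS) - 1)

typedef struct {
    uint32_t member;   /* which physical file holds the address */
    uint32_t offset;   /* byte offset within that physical file */
} split_addr_t;

/* Split a linear 64-bit address into (file number, offset). */
static split_addr_t split_addr(uint64_t addr)
{
    split_addr_t s;
    s.member = (uint32_t)(addr >> OFFSET_BITS);
    s.offset = (uint32_t)(addr & OFFSET_MASK);
    return s;
}

/* Rebuild the linear address; layers above layer 1 only ever see this form. */
static uint64_t join_addr(split_addr_t s)
{
    uint64_t addr = ((uint64_t)s.member << OFFSET_BITS) | s.offset;
    assert(split_addr(addr).member == s.member);  /* round-trip sanity check */
    return addr;
}
```

    <p>Only the driver at layer 1 would call something like
      <code>split_addr</code>; everything above it keeps passing
      plain linear addresses.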

    <p>Another variation might be a family of two files where all the
      meta data is stored in one file and all the raw data is stored
      in another file to allow the HDF5 wrapper to be easily replaced
      with some other wrapper.

    <p>The <code>H5Fcreate</code> and <code>H5Fopen</code> functions
      would need to be modified to pass file-type info down to layer 2
      so the correct drivers can be called and parameters passed to
      the drivers to initialize them.
      
    <h4>Implementation</h4>

    <p>I've implemented fixed-size family members.  The entire hdf5
      file is partitioned into members where each member is the same
      size.  The family scheme is used if one passes a name to
      <code>H5F_open</code> (which is called by <code>H5Fopen</code>
      and <code>H5Fcreate</code>) that contains a
      <code>printf(3c)</code>-style integer format specifier.
      Currently, the default low-level file driver is used for all
      family members (<code>H5F_LOW_DFLT</code>, usually set to be Section 2 I/O or
      Section 3 stdio), but we'll probably eventually want to pass
      that as a parameter of the file access property list, which
      hasn't been implemented yet.  When creating a family, a default
      family member size is used (defined at the top of H5Ffamily.c,
      currently 64MB) but that also should be settable in the file
      access property list. When opening an existing family, the size
      of the first member is used to determine the member size
      (flushing/closing a family ensures that the first member is the
      correct size) but the other family members don't have to be that
      large (the local address space, however, is logically the same
      size for all members).
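
    <p>To make the fixed-size scheme concrete, here is a small sketch
      under stated assumptions.  The helpers
      <code>is_family_name</code>, <code>member_name</code>, and
      <code>locate</code> are hypothetical and are not functions in
      H5Ffamily.c.  The check for a format specifier is deliberately
      simplistic (it only looks for a <code>%</code> character), and
      the 64MB constant is the default noted above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* 64MB, the default member size mentioned in the text. */
#define MEMBER_SIZE (UINT64_C(64) * 1024 * 1024)

/* Simplistic test: treat any name containing '%' as a family template. */
static int is_family_name(const char *name)
{
    return strchr(name, '%') != NULL;
}

/* Expand a template such as "data-%05d.h5" for member number `membno`.
 * Returns 0 on success, -1 if the result did not fit in `buf`. */
static int member_name(char *buf, size_t size, const char *tmpl, int membno)
{
    int n = snprintf(buf, size, tmpl, membno);
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}

/* Map a linear address to (member number, offset within that member). */
static void locate(uint64_t addr, uint64_t member_size,
                   int *membno, uint64_t *offset)
{
    assert(member_size > 0);
    *membno = (int)(addr / member_size);
    *offset = addr % member_size;
}
```

    <p>With a template of <code>data-%05d.h5</code>, an address just
      past twice the member size falls in member 2, that is, in the
      third file of the family.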

    <p>I haven't implemented a split meta/raw family yet but am rather
      curious to see how it would perform. I was planning to use the
      <code>.h5</code> extension for the meta data file and
      <code>.raw</code> for the raw
      data file.  The high-order bit in the address would determine
      whether the address refers to meta data or raw data. If the user
      passes a name that ends with <code>.raw</code> to <code>H5F_open</code>
      then we'll choose the split family and use the default low-level
      driver for each of the two family members.  Eventually we'll
      want to pass these kinds of things through the file access
      property list instead of relying on naming convention.
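
    <p>Since the split family isn't implemented, the following is only
      a sketch of how the high-order bit and the naming convention
      might be handled.  It assumes a 64-bit address, and the names
      <code>RAW_BIT</code>, <code>addr_is_raw</code>,
      <code>addr_local</code>, and <code>name_selects_split</code>
      are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: addresses are 64 bits wide and bit 63 marks raw data. */
#define RAW_BIT (UINT64_C(1) << 63)

/* Nonzero if the address refers to the raw data file. */
static int addr_is_raw(uint64_t addr)
{
    return (addr & RAW_BIT) != 0;
}

/* Offset within whichever member (meta or raw) the address selects. */
static uint64_t addr_local(uint64_t addr)
{
    return addr & ~RAW_BIT;
}

/* Naming convention from the text: a name ending in ".raw" picks the
 * split family. */
static int name_selects_split(const char *name)
{
    size_t n;
    assert(name != NULL);
    n = strlen(name);
    return n >= 4 && strcmp(name + n - 4, ".raw") == 0;
}
```

    <p>The cost of this encoding is one bit of address range per
      member, which is why the bit position would have to follow the
      address width chosen for the file.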

    <h3>External Raw Data</h3>

    <p>We also need the ability to point to raw data that isn't in the
      HDF5 linear address space.  For instance, a dataset might be
      striped across several raw data files.

    <p>Fortunately, the only two packages that need to be aware of
      this are the packages for reading/writing contiguous raw data
      and discontiguous raw data.  Since contiguous raw data is a
      special case, I'll discuss how to implement external raw data in
      the discontiguous case.

    <p>Discontiguous data is stored as a B-tree whose keys are the
      chunk indices and whose leaf nodes point to the raw data by
      storing a file address. So what we need is some way to name the
      external files, and a way to efficiently store the external file
      name for each chunk.

    <p>I propose adding to the object header an <em>External File
	List</em> message that is a 1-origin array of file names.
      Then, in the B-tree, each key has an index into the External
      File List (or zero for the HDF5 file) for the file where the
      chunk can be found. The external file index is only used at
      the leaf nodes to get to the raw data (the entire B-tree is in
      the HDF5 file) but because of the way keys are copied among
      the B-tree nodes, it's much easier to store the index with
      every key.
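
    <p>A minimal sketch of the proposed lookup follows.  The types
      <code>efl_t</code> and <code>chunk_key_t</code> and the function
      <code>chunk_file_name</code> are hypothetical stand-ins, not
      the message or key layouts the library would actually use; a
      real key would also carry the chunk indices.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical External File List: file number i lives at names[i - 1]. */
typedef struct {
    size_t       nfiles;
    const char **names;
} efl_t;

/* Hypothetical B-tree key, reduced to the one field of interest. */
typedef struct {
    unsigned file_idx;   /* 0 = the HDF5 file, 1..nfiles = external file */
} chunk_key_t;

/* Return the name of the file holding the chunk, or NULL if the index
 * is out of range for the list. */
static const char *chunk_file_name(const efl_t *efl, const chunk_key_t *key,
                                   const char *hdf5_name)
{
    assert(efl != NULL && key != NULL);
    if (key->file_idx == 0)
        return hdf5_name;                    /* chunk is in the HDF5 file */
    if (key->file_idx > efl->nfiles)
        return NULL;                         /* corrupt or stale index */
    return efl->names[key->file_idx - 1];    /* 1-origin list */
}
```

    <p>Reserving zero for the HDF5 file is what makes the list
      1-origin: an existing key with no external file needs no
      special encoding.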

    <h3>Multiple HDF5 Files</h3>

    <p>One might also want to combine two or more HDF5 files in a
      manner similar to mounting file systems in Unix.  That is, the
      group structure and meta data from one file appear as though
      they exist in the first file.  One opens File-A, and then
      <em>mounts</em> File-B at some point in File-A, the <em>mount
      point</em>, so that traversing into the mount point actually
      causes one to enter the root object of File-B.  File-A and
      File-B are each complete HDF5 files and can be accessed
      individually without mounting them.

    <p>We need a couple of additional pieces of machinery to make this
      work.  First, the <code>haddr_t</code> type (a file address) doesn't contain
      any info about which HDF5 file's address space the address
      belongs to.  But since <code>haddr_t</code> is an opaque type except at
      layers 2 and below, it should be quite easy to add a pointer to
      the HDF5 file.  This would also remove the <code>H5F_t</code> argument from
      most of the low-level functions since it would be part of the
      OID.

    <p>The other thing we need is a table of mount points and some
      functions that understand them.  We would add the following
      table to each H5F_t struct:

    <p><code><pre>
struct H5F_mount_t {
   H5F_t *parent;         /* Parent HDF5 file if any */
   struct {
      H5F_t *f;           /* File which is mounted */
      haddr_t where;      /* Address of mount point */
   } *mount;              /* Array sorted by mount point */
   intn nmounts;          /* Number of mounted files */
   intn alloc;            /* Size of mount table */
};
    </pre></code>

    <p>The <code>H5Fmount</code> function takes the ID of an open
      file or group, the name of a to-be-mounted file, the name of the mount
      point, and a file access property list (like <code>H5Fopen</code>).
      It opens the new file and adds a record to the parent's mount
      table.  The <code>H5Funmount</code> function takes the parent
      file or group ID and the name of the mount point and disassociates
      the mounted file from the mount point.  It does not close the 
      mounted file.  The <code>H5Fclose</code>
      function closes/unmounts files recursively.

    <p>The <code>H5G_iname</code> function, which translates a name to
      a file address (<code>haddr_t</code>), looks at the mount table
      at each step in the translation and switches files where
      appropriate.  All name-to-address translations occur through
      this function.
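
    <p>Because the mount array is sorted by mount point, the check
      made at each step of the translation could be a binary search.
      The sketch below is only an illustration: <code>addr_t</code>
      and <code>file_t</code> are simplified stand-ins for
      <code>haddr_t</code> and <code>H5F_t</code> (the real
      <code>haddr_t</code> is opaque and can't be compared with
      <code>&lt;</code>), and <code>mount_lookup</code> is a
      hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t addr_t;          /* stand-in for haddr_t */
typedef struct file_t file_t;     /* stand-in for H5F_t */

typedef struct {
    file_t *f;                    /* file which is mounted */
    addr_t  where;                /* address of mount point */
} mount_ent_t;

struct file_t {
    const char  *name;
    mount_ent_t *mount;           /* array sorted by `where` */
    int          nmounts;         /* number of mounted files */
};

/* Return the file mounted at `where` in `parent`, or NULL if that
 * address is not a mount point. */
static file_t *mount_lookup(const file_t *parent, addr_t where)
{
    int lo = 0, hi;
    assert(parent != NULL);
    hi = parent->nmounts - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (parent->mount[mid].where == where)
            return parent->mount[mid].f;
        if (parent->mount[mid].where < where)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return NULL;
}
```

    <p>When the lookup returns a file, the translation would continue
      from that file's root object; when it returns
      <code>NULL</code>, it stays in the current file.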

    <h3>How Long?</h3>

    <p>I'm expecting to be able to implement the two new flavors of
      single linear address space in about two days. It took two hours
      to implement the malloc/free file driver at layer 0 and I
      don't expect this to be much more work.

    <p>I'm expecting three days to implement the external raw data for
      discontiguous arrays.  Adding the file index to the B-tree is
      quite trivial; adding the external file list message shouldn't
      be too hard since the object header message class from which this
      message derives is fully implemented; and changing
      <code>H5F_istore_read</code> should be trivial.  Most of the
      time will be spent designing a way to cache Unix file
      descriptors efficiently since the total number of open files
      allowed per process could be much smaller than the total number
      of HDF5 files and external raw data files.

    <p>I'm expecting four days to implement being able to mount one
      HDF5 file on another.  I was originally planning a lot more, but
      making <code>haddr_t</code> opaque turned out to be much easier
      than I planned (I did it last Fri).  Most of the work will
      probably be removing the redundant H5F_t arguments for lots of
      functions.

    <h3>Conclusion</h3>

    <p>The external raw data could be implemented as a single linear
      address space, but doing so would require one to allocate large
      enough file addresses throughout the file (more than 32 bits)
      before the file was created.  It would also make it harder to
      mix an HDF5 file family with external raw data or with an
      external HDF5 wrapper around an HDF4 file. So I consider the
      implementation of external raw data files as a single HDF5
      linear address space a kludge.

    <p>The ability to mount one HDF5 file on another might not be a
      very important feature especially since each HDF5 file must be a
      complete file by itself.  It's not possible to stripe an array
      over multiple HDF5 files because the B-tree wouldn't be complete
      in any one file, so the only choice is to stripe the array
      across multiple raw data files and store the B-tree in the HDF5
      file.  On the other hand, it might be useful if one file
      contains some public data which can be mounted by other files
      (e.g., a mesh topology shared among collaborators and mounted by
      files that contain other fields defined on the mesh).  Of course
      the applications can open the two files separately, but it might
      be more portable if we support it in the library.

    <p>So we're looking at about two weeks to implement all three
      versions.  I didn't get a chance to do any of them in AIO
      although we had long-term plans for the first two with a
      possibility of the third. They'll be much easier to implement in
      HDF5 than AIO since I've been keeping these in mind from the
      start.

    <hr>
    <address><a href="mailto:matzke@llnl.gov">Robb Matzke</a></address>
<!-- Created: Sat Nov  8 18:08:52 EST 1997 -->
<!-- hhmts start -->
Last modified: Tue Sep  8 14:43:32 EDT 1998
<!-- hhmts end -->
  </body>
</html>