Diffstat (limited to 'doc/html/TechNotes/ExternalFiles.html')
-rw-r--r--  doc/html/TechNotes/ExternalFiles.html  279
1 files changed, 0 insertions, 279 deletions
diff --git a/doc/html/TechNotes/ExternalFiles.html b/doc/html/TechNotes/ExternalFiles.html
deleted file mode 100644
index c3197af..0000000
--- a/doc/html/TechNotes/ExternalFiles.html
+++ /dev/null
@@ -1,279 +0,0 @@
-<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
-<html>
- <head>
- <title>External Files in HDF5</title>
- </head>
-
- <body>
- <center><h1>External Files in HDF5</h1></center>
-
- <h3>Overview of Layers</h3>
-
- <p>This table shows some of the layers of HDF5. Each layer calls
- functions at the same or lower layers and never functions at
- higher layers. An object identifier (OID) takes various forms
- at the various layers: at layer 0 an OID is an absolute physical
- file address; at layers 1 and 2 it's an absolute virtual file
- address. At layers 3 through 6 it's a relative address, and at
- layers 7 and above it's an object handle.
-
- <p><center>
- <table border cellpadding=4 width="60%">
- <tr align=center>
- <td>Layer-7</td>
- <td>Groups</td>
- <td>Datasets</td>
- </tr>
- <tr align=center>
- <td>Layer-6</td>
- <td>Indirect Storage</td>
- <td>Symbol Tables</td>
- </tr>
- <tr align=center>
- <td>Layer-5</td>
- <td>B-trees</td>
- <td>Object Hdrs</td>
- <td>Heaps</td>
- </tr>
- <tr align=center>
- <td>Layer-4</td>
- <td>Caching</td>
- </tr>
- <tr align=center>
- <td>Layer-3</td>
- <td>H5F chunk I/O</td>
- </tr>
- <tr align=center>
- <td>Layer-2</td>
- <td>H5F low</td>
- </tr>
- <tr align=center>
- <td>Layer-1</td>
- <td>File Family</td>
- <td>Split Meta/Raw</td>
- </tr>
- <tr align=center>
- <td>Layer-0</td>
- <td>Section-2 I/O</td>
- <td>Standard I/O</td>
- <td>Malloc/Free</td>
- </tr>
- </table>
- </center>
-
- <h3>Single Address Space</h3>
-
- <p>The simplest form of hdf5 file is a single file containing only
-      hdf5 data. The file begins with the boot block, which is
-      followed by hdf5 data until the end of the file. The next most
- complicated file allows non-hdf5 data (user defined data or
- internal wrappers) to appear before the boot block and after the
- end of the hdf5 data. The hdf5 data is treated as a single
- linear address space in both cases.
-
- <p>The next level of complexity comes when non-hdf5 data is
- interspersed with the hdf5 data. We handle that by including
- the non-hdf5 interspersed data in the hdf5 address space and
- simply not referencing it (eventually we might add those
- addresses to a "do-not-disturb" list using the same mechanism as
- the hdf5 free list, but it's not absolutely necessary). This is
- implemented except for the "do-not-disturb" list.
-
- <p>The most complicated single address space hdf5 file is when we
- allow the address space to be split among multiple physical
- files. For instance, a >2GB file can be split into smaller
-      chunks and transferred to a 32-bit machine, then accessed as a
-      single logical hdf5 file. The library already supports >32-bit
- addresses, so at layer 1 we split a 64-bit address into a 32-bit
- file number and a 32-bit offset (the 64 and 32 are
- arbitrary). The rest of the library still operates with a linear
- address space.
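-
-    <p>For illustration, the layer-1 arithmetic might look something
-      like the sketch below. The function and parameter names are
-      made up for this note, and the 32/32 split is just the example
-      from the previous paragraph.
-
-    <p><code><pre>
-#include &lt;stdint.h&gt;
-
-/* Split a 64-bit logical address into a 32-bit member (file) number
- * and a 32-bit offset within that member.  Sketch only. */
-static void
-split_family_addr(uint64_t addr, uint32_t *memb_no, uint32_t *memb_off)
-{
-    *memb_no  = (uint32_t)(addr >> 32);   /* which member file */
-    *memb_off = (uint32_t)addr;           /* offset within it  */
-}
-    </pre></code>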
-
- <p>Another variation might be a family of two files where all the
- meta data is stored in one file and all the raw data is stored
- in another file to allow the HDF5 wrapper to be easily replaced
- with some other wrapper.
-
- <p>The <code>H5Fcreate</code> and <code>H5Fopen</code> functions
- would need to be modified to pass file-type info down to layer 2
- so the correct drivers can be called and parameters passed to
- the drivers to initialize them.
-
- <h4>Implementation</h4>
-
- <p>I've implemented fixed-size family members. The entire hdf5
- file is partitioned into members where each member is the same
- size. The family scheme is used if one passes a name to
-      <code>H5F_open</code> (which is called by <code>H5Fopen</code>
- and <code>H5Fcreate</code>) that contains a
- <code>printf(3c)</code>-style integer format specifier.
- Currently, the default low-level file driver is used for all
- family members (H5F_LOW_DFLT, usually set to be Section 2 I/O or
- Section 3 stdio), but we'll probably eventually want to pass
- that as a parameter of the file access property list, which
- hasn't been implemented yet. When creating a family, a default
-      family member size is used (defined at the top of H5Ffamily.c,
-      currently 64MB), but that also should be settable in the file
- access property list. When opening an existing family, the size
- of the first member is used to determine the member size
- (flushing/closing a family ensures that the first member is the
- correct size) but the other family members don't have to be that
- large (the local address space, however, is logically the same
- size for all members).
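-
-    <p>Just to illustrate the naming convention, a member name could
-      be generated along the lines of the sketch below; the real code
-      obviously does more (error checking, remembering the member
-      size, etc.) and the function name here is made up.
-
-    <p><code><pre>
-#include &lt;stdio.h&gt;
-
-/* Expand a family name containing a printf(3c)-style integer format,
- * e.g. "data%05d.h5", into the name of member `memb_no'. */
-static void
-family_member_name(char *buf, size_t size, const char *family, int memb_no)
-{
-    snprintf(buf, size, family, memb_no); /* "data%05d.h5", 3 => "data00003.h5" */
-}
-    </pre></code>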
-
- <p>I haven't implemented a split meta/raw family yet but am rather
- curious to see how it would perform. I was planning to use the
- `.h5' extension for the meta data file and `.raw' for the raw
- data file. The high-order bit in the address would determine
- whether the address refers to meta data or raw data. If the user
- passes a name that ends with `.raw' to <code>H5F_open</code>
-      then we'll choose the split family and use the default low-level
- driver for each of the two family members. Eventually we'll
- want to pass these kinds of things through the file access
- property list instead of relying on naming convention.
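-
-    <p>The dispatch itself is trivial; something like the sketch
-      below, where the bit position is just the convention described
-      above and the function name is made up.
-
-    <p><code><pre>
-#include &lt;stdint.h&gt;
-
-/* The high-order bit of the 64-bit address selects the `.raw' member;
- * everything else refers to the `.h5' (meta data) member. */
-static int
-split_is_raw(uint64_t addr)
-{
-    return (int)(addr >> 63);   /* 1: raw data file, 0: meta data file */
-}
-    </pre></code>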
-
- <h3>External Raw Data</h3>
-
- <p>We also need the ability to point to raw data that isn't in the
- HDF5 linear address space. For instance, a dataset might be
- striped across several raw data files.
-
- <p>Fortunately, the only two packages that need to be aware of
- this are the packages for reading/writing contiguous raw data
- and discontiguous raw data. Since contiguous raw data is a
- special case, I'll discuss how to implement external raw data in
- the discontiguous case.
-
- <p>Discontiguous data is stored as a B-tree whose keys are the
- chunk indices and whose leaf nodes point to the raw data by
- storing a file address. So what we need is some way to name the
- external files, and a way to efficiently store the external file
- name for each chunk.
-
- <p>I propose adding to the object header an <em>External File
- List</em> message that is a 1-origin array of file names.
- Then, in the B-tree, each key has an index into the External
- File List (or zero for the HDF5 file) for the file where the
- chunk can be found. The external file index is only used at
- the leaf nodes to get to the raw data (the entire B-tree is in
- the HDF5 file) but because of the way keys are copied among
- the B-tree nodes, it's much easier to store the index with
- every key.
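-
-    <p>Roughly speaking, the two new pieces would look like the
-      declarations below. The names, field types, and the fixed
-      dimensionality are placeholders for this note, not the actual
-      message or key formats.
-
-    <p><code><pre>
-#include &lt;stdint.h&gt;
-
-/* One entry of the proposed External File List message (1-origin). */
-typedef struct {
-    char       *name;        /* name of the external raw data file     */
-} efl_entry_t;
-
-/* B-tree key for a chunk of indirect (discontiguous) storage. */
-typedef struct {
-    uint64_t    offset[4];   /* chunk indices, one per dimension       */
-    unsigned    file_idx;    /* index into the External File List, or  */
-                             /* zero if the chunk is in the HDF5 file  */
-} istore_key_t;
-    </pre></code>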
-
- <h3>Multiple HDF5 Files</h3>
-
- <p>One might also want to combine two or more HDF5 files in a
- manner similar to mounting file systems in Unix. That is, the
-      group structure and meta data of the mounted file appear as though
-      they exist in the file it is mounted on. One opens File-A, and then
- <em>mounts</em> File-B at some point in File-A, the <em>mount
- point</em>, so that traversing into the mount point actually
- causes one to enter the root object of File-B. File-A and
- File-B are each complete HDF5 files and can be accessed
- individually without mounting them.
-
-    <p>We need a couple of additional pieces of machinery to make this
-      work. First, an <code>haddr_t</code> (a file address) doesn't contain
-      any information about which HDF5 file's address space the address
-      belongs to. But since <code>haddr_t</code> is an opaque type except at
- layers 2 and below, it should be quite easy to add a pointer to
- the HDF5 file. This would also remove the H5F_t argument from
- most of the low-level functions since it would be part of the
- OID.
-
- <p>The other thing we need is a table of mount points and some
- functions that understand them. We would add the following
- table to each H5F_t struct:
-
- <p><code><pre>
-struct H5F_mount_t {
-    H5F_t       *parent;        /* Parent HDF5 file if any      */
-    struct {
-        H5F_t   *f;             /* File which is mounted        */
-        haddr_t  where;         /* Address of mount point       */
-    } *mount;                   /* Array sorted by mount point  */
-    intn        nmounts;        /* Number of mounted files      */
-    intn        alloc;          /* Size of mount table          */
-};
- </pre></code>
-
- <p>The <code>H5Fmount</code> function takes the ID of an open
- file or group, the name of a to-be-mounted file, the name of the mount
- point, and a file access property list (like <code>H5Fopen</code>).
- It opens the new file and adds a record to the parent's mount
- table. The <code>H5Funmount</code> function takes the parent
- file or group ID and the name of the mount point and disassociates
- the mounted file from the mount point. It does not close the
- mounted file. The <code>H5Fclose</code>
- function closes/unmounts files recursively.
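-
-    <p>A call sequence might look like the sketch below. These
-      functions don't exist yet, so the signatures are just my reading
-      of the description above, and the file names and mount point are
-      made up.
-
-    <p><code><pre>
-hid_t file_a = H5Fopen("file_a.h5", H5F_ACC_RDWR, H5P_DEFAULT);
-
-/* Mount File-B at the group /shared inside File-A... */
-H5Fmount(file_a, "file_b.h5", "/shared", H5P_DEFAULT);
-
-/* ...names under /shared now resolve into File-B's root object... */
-
-/* ...then detach File-B without closing it. */
-H5Funmount(file_a, "/shared");
-
-H5Fclose(file_a);     /* closes/unmounts recursively */
-    </pre></code>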
-
-    <p>The <code>H5G_iname</code> function, which translates a name to
-      a file address (<code>haddr_t</code>), looks at the mount table
- at each step in the translation and switches files where
- appropriate. All name-to-address translations occur through
- this function.
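-
-    <p>The check that <code>H5G_iname</code> makes at each step is
-      just a search of the mount table above, which is kept sorted by
-      mount point. A sketch, using the struct declared earlier and a
-      made-up helper name:
-
-    <p><code><pre>
-/* Return the file mounted at address `where', or NULL if nothing is
- * mounted there.  The table is sorted by mount point, so a binary
- * search suffices.  (Addresses are compared directly here for
- * brevity; since haddr_t is opaque above layer 2, the real code
- * would use the address comparison helpers.) */
-static H5F_t *
-find_mount(const struct H5F_mount_t *table, haddr_t where)
-{
-    intn lo = 0, hi = table->nmounts - 1;
-
-    while (lo <= hi) {
-        intn mid = (lo + hi) / 2;
-        if (table->mount[mid].where == where)
-            return table->mount[mid].f;    /* switch into the mounted file */
-        else if (table->mount[mid].where < where)
-            lo = mid + 1;
-        else
-            hi = mid - 1;
-    }
-    return NULL;                           /* not a mount point */
-}
-    </pre></code>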
-
- <h3>How Long?</h3>
-
- <p>I'm expecting to be able to implement the two new flavors of
- single linear address space in about two days. It took two hours
- to implement the malloc/free file driver at level zero and I
- don't expect this to be much more work.
-
- <p>I'm expecting three days to implement the external raw data for
- discontiguous arrays. Adding the file index to the B-tree is
- quite trivial; adding the external file list message shouldn't
-      be too hard since the object header message class from which this
- message derives is fully implemented; and changing
- <code>H5F_istore_read</code> should be trivial. Most of the
- time will be spent designing a way to cache Unix file
-      descriptors efficiently since the total number of open files
- allowed per process could be much smaller than the total number
- of HDF5 files and external raw data files.
-
- <p>I'm expecting four days to implement being able to mount one
- HDF5 file on another. I was originally planning a lot more, but
- making <code>haddr_t</code> opaque turned out to be much easier
- than I planned (I did it last Fri). Most of the work will
- probably be removing the redundant H5F_t arguments for lots of
- functions.
-
- <h3>Conclusion</h3>
-
- <p>The external raw data could be implemented as a single linear
- address space, but doing so would require one to allocate large
-      enough file addresses throughout the file (>32 bits) before the
-      file was created. It would also make it more difficult to mix an
-      HDF5 file family with external raw data, or to wrap an HDF4 file
-      in an external HDF5 wrapper. So I consider the implementation of
- external raw data files as a single HDF5 linear address space a
- kludge.
-
- <p>The ability to mount one HDF5 file on another might not be a
-      very important feature, especially since each HDF5 file must be a
- complete file by itself. It's not possible to stripe an array
- over multiple HDF5 files because the B-tree wouldn't be complete
- in any one file, so the only choice is to stripe the array
- across multiple raw data files and store the B-tree in the HDF5
- file. On the other hand, it might be useful if one file
- contains some public data which can be mounted by other files
- (e.g., a mesh topology shared among collaborators and mounted by
- files that contain other fields defined on the mesh). Of course
- the applications can open the two files separately, but it might
- be more portable if we support it in the library.
-
- <p>So we're looking at about two weeks to implement all three
- versions. I didn't get a chance to do any of them in AIO
- although we had long-term plans for the first two with a
- possibility of the third. They'll be much easier to implement in
-      HDF5 than in AIO since I've been keeping these in mind from the
- start.
-
- <hr>
- <address><a href="mailto:matzke@llnl.gov">Robb Matzke</a></address>
-<!-- Created: Sat Nov 8 18:08:52 EST 1997 -->
-<!-- hhmts start -->
-Last modified: Tue Sep 8 14:43:32 EDT 1998
-<!-- hhmts end -->
- </body>
-</html>