diff options
Diffstat (limited to 'doc/html/TechNotes/ExternalFiles.html')
-rw-r--r-- | doc/html/TechNotes/ExternalFiles.html | 279 |
1 files changed, 0 insertions, 279 deletions
diff --git a/doc/html/TechNotes/ExternalFiles.html b/doc/html/TechNotes/ExternalFiles.html deleted file mode 100644 index c3197af..0000000 --- a/doc/html/TechNotes/ExternalFiles.html +++ /dev/null @@ -1,279 +0,0 @@ -<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN"> -<html> - <head> - <title>External Files in HDF5</title> - </head> - - <body> - <center><h1>External Files in HDF5</h1></center> - - <h3>Overview of Layers</h3> - - <p>This table shows some of the layers of HDF5. Each layer calls - functions at the same or lower layers and never functions at - higher layers. An object identifier (OID) takes various forms - at the various layers: at layer 0 an OID is an absolute physical - file address; at layers 1 and 2 it's an absolute virtual file - address. At layers 3 through 6 it's a relative address, and at - layers 7 and above it's an object handle. - - <p><center> - <table border cellpadding=4 width="60%"> - <tr align=center> - <td>Layer-7</td> - <td>Groups</td> - <td>Datasets</td> - </tr> - <tr align=center> - <td>Layer-6</td> - <td>Indirect Storage</td> - <td>Symbol Tables</td> - </tr> - <tr align=center> - <td>Layer-5</td> - <td>B-trees</td> - <td>Object Hdrs</td> - <td>Heaps</td> - </tr> - <tr align=center> - <td>Layer-4</td> - <td>Caching</td> - </tr> - <tr align=center> - <td>Layer-3</td> - <td>H5F chunk I/O</td> - </tr> - <tr align=center> - <td>Layer-2</td> - <td>H5F low</td> - </tr> - <tr align=center> - <td>Layer-1</td> - <td>File Family</td> - <td>Split Meta/Raw</td> - </tr> - <tr align=center> - <td>Layer-0</td> - <td>Section-2 I/O</td> - <td>Standard I/O</td> - <td>Malloc/Free</td> - </tr> - </table> - </center> - - <h3>Single Address Space</h3> - - <p>The simplest form of hdf5 file is a single file containing only - hdf5 data. The file begins with the boot block, which is - followed until the end of the file by hdf5 data. The next most - complicated file allows non-hdf5 data (user defined data or - internal wrappers) to appear before the boot block and after the - end of the hdf5 data. The hdf5 data is treated as a single - linear address space in both cases. - - <p>The next level of complexity comes when non-hdf5 data is - interspersed with the hdf5 data. We handle that by including - the non-hdf5 interspersed data in the hdf5 address space and - simply not referencing it (eventually we might add those - addresses to a "do-not-disturb" list using the same mechanism as - the hdf5 free list, but it's not absolutely necessary). This is - implemented except for the "do-not-disturb" list. - - <p>The most complicated single address space hdf5 file is when we - allow the address space to be split among multiple physical - files. For instance, a >2GB file can be split into smaller - chunks and transfered to a 32 bit machine, then accessed as a - single logical hdf5 file. The library already supports >32 bit - addresses, so at layer 1 we split a 64-bit address into a 32-bit - file number and a 32-bit offset (the 64 and 32 are - arbitrary). The rest of the library still operates with a linear - address space. - - <p>Another variation might be a family of two files where all the - meta data is stored in one file and all the raw data is stored - in another file to allow the HDF5 wrapper to be easily replaced - with some other wrapper. - - <p>The <code>H5Fcreate</code> and <code>H5Fopen</code> functions - would need to be modified to pass file-type info down to layer 2 - so the correct drivers can be called and parameters passed to - the drivers to initialize them. - - <h4>Implementation</h4> - - <p>I've implemented fixed-size family members. The entire hdf5 - file is partitioned into members where each member is the same - size. The family scheme is used if one passes a name to - <code>H5F_open</code> (which is called by <code>H5Fopen()</code> - and <code>H5Fcreate</code>) that contains a - <code>printf(3c)</code>-style integer format specifier. - Currently, the default low-level file driver is used for all - family members (H5F_LOW_DFLT, usually set to be Section 2 I/O or - Section 3 stdio), but we'll probably eventually want to pass - that as a parameter of the file access property list, which - hasn't been implemented yet. When creating a family, a default - family member size is used (defined at the top H5Ffamily.c, - currently 64MB) but that also should be settable in the file - access property list. When opening an existing family, the size - of the first member is used to determine the member size - (flushing/closing a family ensures that the first member is the - correct size) but the other family members don't have to be that - large (the local address space, however, is logically the same - size for all members). - - <p>I haven't implemented a split meta/raw family yet but am rather - curious to see how it would perform. I was planning to use the - `.h5' extension for the meta data file and `.raw' for the raw - data file. The high-order bit in the address would determine - whether the address refers to meta data or raw data. If the user - passes a name that ends with `.raw' to <code>H5F_open</code> - then we'll chose the split family and use the default low level - driver for each of the two family members. Eventually we'll - want to pass these kinds of things through the file access - property list instead of relying on naming convention. - - <h3>External Raw Data</h3> - - <p>We also need the ability to point to raw data that isn't in the - HDF5 linear address space. For instance, a dataset might be - striped across several raw data files. - - <p>Fortunately, the only two packages that need to be aware of - this are the packages for reading/writing contiguous raw data - and discontiguous raw data. Since contiguous raw data is a - special case, I'll discuss how to implement external raw data in - the discontiguous case. - - <p>Discontiguous data is stored as a B-tree whose keys are the - chunk indices and whose leaf nodes point to the raw data by - storing a file address. So what we need is some way to name the - external files, and a way to efficiently store the external file - name for each chunk. - - <p>I propose adding to the object header an <em>External File - List</em> message that is a 1-origin array of file names. - Then, in the B-tree, each key has an index into the External - File List (or zero for the HDF5 file) for the file where the - chunk can be found. The external file index is only used at - the leaf nodes to get to the raw data (the entire B-tree is in - the HDF5 file) but because of the way keys are copied among - the B-tree nodes, it's much easier to store the index with - every key. - - <h3>Multiple HDF5 Files</h3> - - <p>One might also want to combine two or more HDF5 files in a - manner similar to mounting file systems in Unix. That is, the - group structure and meta data from one file appear as though - they exist in the first file. One opens File-A, and then - <em>mounts</em> File-B at some point in File-A, the <em>mount - point</em>, so that traversing into the mount point actually - causes one to enter the root object of File-B. File-A and - File-B are each complete HDF5 files and can be accessed - individually without mounting them. - - <p>We need a couple additional pieces of machinery to make this - work. First, an haddr_t type (a file address) doesn't contain - any info about which HDF5 file's address space the address - belongs to. But since haddr_t is an opaque type except at - layers 2 and below, it should be quite easy to add a pointer to - the HDF5 file. This would also remove the H5F_t argument from - most of the low-level functions since it would be part of the - OID. - - <p>The other thing we need is a table of mount points and some - functions that understand them. We would add the following - table to each H5F_t struct: - - <p><code><pre> -struct H5F_mount_t { - H5F_t *parent; /* Parent HDF5 file if any */ - struct { - H5F_t *f; /* File which is mounted */ - haddr_t where; /* Address of mount point */ - } *mount; /* Array sorted by mount point */ - intn nmounts; /* Number of mounted files */ - intn alloc; /* Size of mount table */ -} - </pre></code> - - <p>The <code>H5Fmount</code> function takes the ID of an open - file or group, the name of a to-be-mounted file, the name of the mount - point, and a file access property list (like <code>H5Fopen</code>). - It opens the new file and adds a record to the parent's mount - table. The <code>H5Funmount</code> function takes the parent - file or group ID and the name of the mount point and disassociates - the mounted file from the mount point. It does not close the - mounted file. The <code>H5Fclose</code> - function closes/unmounts files recursively. - - <p>The <code>H5G_iname</code> function which translates a name to - a file address (<code>haddr_t</code>) looks at the mount table - at each step in the translation and switches files where - appropriate. All name-to-address translations occur through - this function. - - <h3>How Long?</h3> - - <p>I'm expecting to be able to implement the two new flavors of - single linear address space in about two days. It took two hours - to implement the malloc/free file driver at level zero and I - don't expect this to be much more work. - - <p>I'm expecting three days to implement the external raw data for - discontiguous arrays. Adding the file index to the B-tree is - quite trivial; adding the external file list message shouldn't - be too hard since the object header message class from wich this - message derives is fully implemented; and changing - <code>H5F_istore_read</code> should be trivial. Most of the - time will be spent designing a way to cache Unix file - descriptors efficiently since the total number open files - allowed per process could be much smaller than the total number - of HDF5 files and external raw data files. - - <p>I'm expecting four days to implement being able to mount one - HDF5 file on another. I was originally planning a lot more, but - making <code>haddr_t</code> opaque turned out to be much easier - than I planned (I did it last Fri). Most of the work will - probably be removing the redundant H5F_t arguments for lots of - functions. - - <h3>Conclusion</h3> - - <p>The external raw data could be implemented as a single linear - address space, but doing so would require one to allocate large - enough file addresses throughout the file (>32bits) before the - file was created. It would make mixing an HDF5 file family with - external raw data, or external HDF5 wrapper around an HDF4 file - a more difficult process. So I consider the implementation of - external raw data files as a single HDF5 linear address space a - kludge. - - <p>The ability to mount one HDF5 file on another might not be a - very important feature especially since each HDF5 file must be a - complete file by itself. It's not possible to stripe an array - over multiple HDF5 files because the B-tree wouldn't be complete - in any one file, so the only choice is to stripe the array - across multiple raw data files and store the B-tree in the HDF5 - file. On the other hand, it might be useful if one file - contains some public data which can be mounted by other files - (e.g., a mesh topology shared among collaborators and mounted by - files that contain other fields defined on the mesh). Of course - the applications can open the two files separately, but it might - be more portable if we support it in the library. - - <p>So we're looking at about two weeks to implement all three - versions. I didn't get a chance to do any of them in AIO - although we had long-term plans for the first two with a - possibility of the third. They'll be much easier to implement in - HDF5 than AIO since I've been keeping these in mind from the - start. - - <hr> - <address><a href="mailto:matzke@llnl.gov">Robb Matzke</a></address> -<!-- Created: Sat Nov 8 18:08:52 EST 1997 --> -<!-- hhmts start --> -Last modified: Tue Sep 8 14:43:32 EDT 1998 -<!-- hhmts end --> - </body> -</html> |