From c3a7c9341e1e483c7714ff58ebb3bd7299f45e11 Mon Sep 17 00:00:00 2001
From: Quincey Koziol
Date: Tue, 9 Sep 2003 20:30:16 -0500
Subject: [svn-r7450] Purpose: Add document describing issues relating to variable-length datatypes.

---
 doc/html/TechNotes.html         |   5 ++
 doc/html/TechNotes/VLTypes.html | 151 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 156 insertions(+)
 create mode 100644 doc/html/TechNotes/VLTypes.html

diff --git a/doc/html/TechNotes.html b/doc/html/TechNotes.html
index fe509c0..542d282 100644
--- a/doc/html/TechNotes.html
+++ b/doc/html/TechNotes.html
@@ -237,6 +237,11 @@ HDF5 Technical Notes
 Results of reviewing tests for API functions.
+Variable-Length Datatype Info
+ Description of various aspects of using variable-length datatypes in HDF5.

diff --git a/doc/html/TechNotes/VLTypes.html b/doc/html/TechNotes/VLTypes.html
new file mode 100644
index 0000000..d4ea905
--- /dev/null
+++ b/doc/html/TechNotes/VLTypes.html
@@ -0,0 +1,151 @@

Variable-Length Datatypes in HDF5

Introduction

+

Variable-length (VL) datatypes offer a great deal of flexibility, but they can be overused or misused. VL datatypes are ideal for capturing the notion that elements in an HDF5 dataset (or attribute) can hold different amounts of information (VL strings are the canonical example), but they have some drawbacks that this document attempts to address.

+ +

Background

+

Because fast random access to dataset elements requires that each element be a fixed size, the information stored for VL datatype elements is actually information to locate the VL information, not the information itself.
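
The indirection described above can be sketched in C as follows. This is only an illustrative model, not the HDF5 library's actual file layout: the `vl_ref` struct, its field names, the `heap` buffer, and `vl_read` are all hypothetical. It shows why keeping a fixed-size locator per element preserves direct indexing while the payload lives elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-size locator stored for each dataset element.
 * It records where the variable-length payload lives, not the payload. */
typedef struct {
    size_t offset; /* start of the payload inside the heap buffer */
    size_t length; /* number of payload bytes */
} vl_ref;

/* Payloads "a", "bcd" and "efghij" packed back to back (illustrative data). */
const char heap[] = "abcdefghij";
const vl_ref elements[] = { {0, 1}, {1, 3}, {4, 6} };

/* Read element `index`: first index straight to its fixed-size locator,
 * then follow the locator to the payload. Copies the payload into `out`
 * as a NUL-terminated string and returns the payload length. */
size_t vl_read(const vl_ref *refs, const char *heap_base,
               size_t index, char *out, size_t cap)
{
    const vl_ref *r = &refs[index];          /* access 1: the element   */
    assert(r->length < cap);                 /* room for payload + NUL  */
    memcpy(out, heap_base + r->offset, r->length); /* access 2: payload */
    out[r->length] = '\0';
    return r->length;
}
```

Calling `vl_read(elements, heap, 1, buf, sizeof buf)` with a 16-byte `buf` copies the three-byte payload of the second element; the two commented accesses correspond to the two reads discussed under "Access Time Penalty".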

+ +

When to use VL datatypes

+

VL datatypes are designed to allow the amount of data stored in each element of a dataset to vary. This variation could be over time, as new values with different lengths are written to the element. Or the variation can be over "space" (the dataset's space), with each element in the dataset having the same fundamental type but a different length. "Ragged arrays" are the classic example of elements that vary over the "space" of the dataset. If the elements of a dataset are not going to vary over "space" or time, a VL datatype should probably not be used.

+ +

Access Time Penalty

+

Accessing VL information requires reading the element in the file, then using that element's location information to retrieve the VL information itself. In the worst case, this doubles the number of disk accesses required to access the VL information.

+

However, to avoid this extra disk access overhead, the HDF5 library groups VL information together into larger blocks on disk and performs I/O only on those larger blocks. Additionally, these blocks of information are cached in memory as long as possible. For most access patterns, this amortizes the extra disk accesses over enough pieces of VL information to hide the extra overhead involved.
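
The amortization argument can be made concrete with a toy simulation. The sketch below is not how the HDF5 library implements its cache: the block size of 4 items, the single-block cache, and every name in it are assumptions chosen for illustration. It counts how many simulated block reads are needed when VL items are grouped into blocks and the most recently used block stays in memory.

```c
#include <assert.h>
#include <stddef.h>

#define ITEMS_PER_BLOCK 4 /* assumed block capacity, for illustration only */

/* Hypothetical one-block cache for grouped VL information. */
typedef struct {
    long   cached_block; /* index of the block held in memory, -1 if none */
    size_t block_loads;  /* simulated disk reads of VL blocks so far */
} vl_cache;

void vl_cache_init(vl_cache *c)
{
    assert(c != NULL);
    c->cached_block = -1;
    c->block_loads = 0;
}

/* Simulate fetching the VL payload of `item`.
 * A disk read is counted only when the containing block is not cached. */
void vl_cache_access(vl_cache *c, size_t item)
{
    assert(c != NULL);
    long block = (long)(item / ITEMS_PER_BLOCK);
    if (block != c->cached_block) {
        c->cached_block = block;
        c->block_loads++;
    }
}

/* Number of block reads needed to visit items 0..n_items-1 in order. */
size_t vl_count_loads_sequential(size_t n_items)
{
    vl_cache c;
    vl_cache_init(&c);
    for (size_t i = 0; i < n_items; i++)
        vl_cache_access(&c, i);
    return c.block_loads;
}
```

Under these assumptions a sequential pass over 16 items costs 4 block reads rather than 16 individual reads. An access pattern that alternates between distant items would miss on nearly every access in this one-block model, which is the kind of pattern the "for most access patterns" qualifier excludes.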

+ +

Storage Space Penalty

+

Because VL information must be located and retrieved from another location in the file, extra information must be stored in the file to locate each item of VL information (i.e., each element in a dataset, each VL field in a compound datatype, etc.). Currently, that extra information amounts to 32 bytes per VL item.

+

With some judicious re-architecting of the library and file format, this could be reduced to 18 bytes per VL item with no loss in functionality or additional time penalties. With some additional effort, the space could perhaps be pushed down as low as 8-10 bytes per VL item with no loss in functionality, but potentially with a small time penalty.
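
The cost of the per-item overhead depends heavily on how large each payload is. The following sketch works through that arithmetic; the 32-byte figure comes from the text above, while the function names and the simplification that storage is just payload plus locator overhead are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define VL_OVERHEAD_BYTES 32u /* current per-item figure quoted in the text */

/* Total bytes needed for `n_items` VL items, counting only the payload
 * and the per-item locator overhead (a deliberate simplification). */
size_t vl_total_bytes(size_t n_items, size_t avg_payload_bytes,
                      size_t overhead_per_item)
{
    return n_items * (avg_payload_bytes + overhead_per_item);
}

/* Fraction of the total storage that is locator overhead, not payload. */
double vl_overhead_fraction(size_t avg_payload_bytes,
                            size_t overhead_per_item)
{
    assert(avg_payload_bytes + overhead_per_item > 0);
    return (double)overhead_per_item /
           (double)(avg_payload_bytes + overhead_per_item);
}
```

Under this model, 1000 items averaging 8 bytes of payload occupy 40000 bytes at 32 bytes of overhead each, so 80% of the space is overhead. With the 18-byte figure the fraction drops to 18/26, roughly 69%. For payloads of several kilobytes the overhead becomes negligible under either figure.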

+ +

Chunking and Filters

+

Storing data as VL information has some effects on chunked storage and the filters that can be applied to chunked data. Because the data stored in each chunk is the information needed to locate the VL information, the actual VL information is not broken up into chunks in the same way as other data stored in chunks. Additionally, because the actual VL information is not stored in the chunk, any filters that operate on a chunk will operate on the information to locate the VL information, not the VL information itself.

+ +

File Drivers

+

Because the parallel I/O file drivers (MPI-I/O and MPI-posix) don't allow objects with varying sizes to be created in the file, attempting to create a dataset or attribute with a VL datatype in a file managed by those drivers will cause the creation call to fail.

+

Additionally, using VL datatypes with the 'multi' and 'split' file drivers may not behave in the manner desired. The HDF5 library currently categorizes the "blocks of VL information" stored in the file as a type of metadata, which means that they may not be stored with the other raw data for the file.

+ +

Rewriting

+

When VL information in the file is rewritten, the old VL information must be released, space for the new VL information must be allocated, and the new VL information must be written to the file. This may cause additional I/O accesses.
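
The three steps above can be sketched in C using ordinary heap memory as a stand-in for space in the file. This is a hypothetical model, not the HDF5 library's implementation: `vl_slot` and `vl_rewrite` are invented names, and `malloc`/`free` merely play the role of the file's space allocation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical holder for one item of VL information. */
typedef struct {
    char  *data;   /* stands in for the item's storage in the file */
    size_t length; /* current payload length in bytes */
} vl_slot;

/* Replace the slot's contents with `length` bytes from `src`.
 * Returns 0 on success, or -1 if allocation fails (the slot is then empty). */
int vl_rewrite(vl_slot *slot, const char *src, size_t length)
{
    assert(slot != NULL);

    /* 1. Release the old VL information. */
    free(slot->data);
    slot->data = NULL;
    slot->length = 0;
    if (length == 0)
        return 0;

    /* 2. Allocate space for the new VL information. */
    slot->data = (char *)malloc(length);
    if (slot->data == NULL)
        return -1;

    /* 3. Write the new VL information. */
    memcpy(slot->data, src, length);
    slot->length = length;
    return 0;
}
```

Even when the new value is no longer than the old one, this model still performs a release and an allocation for every rewrite, which mirrors the additional work the text warns about.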
