author | Quincey Koziol <koziol@hdfgroup.org> | 2003-09-10 01:30:16 (GMT) |
---|---|---|
committer | Quincey Koziol <koziol@hdfgroup.org> | 2003-09-10 01:30:16 (GMT) |
commit | c3a7c9341e1e483c7714ff58ebb3bd7299f45e11 (patch) | |
tree | 06ae19d16e956f0eb11b96c09bfa750ff1b661e2 /doc/html/TechNotes | |
parent | dc5aa55448efe27dc2fb635ed1f16527081a86bd (diff) | |
[svn-r7450] Purpose:
Add document describing issues relating to variable-length datatypes.
Diffstat (limited to 'doc/html/TechNotes')
-rw-r--r-- | doc/html/TechNotes/VLTypes.html | 151 |
1 files changed, 151 insertions, 0 deletions
diff --git a/doc/html/TechNotes/VLTypes.html b/doc/html/TechNotes/VLTypes.html new file mode 100644 index 0000000..d4ea905 --- /dev/null +++ b/doc/html/TechNotes/VLTypes.html @@ -0,0 +1,151 @@ +<html> + <head> + <title> + Variable-Length Datatypes in HDF5 + </title> + + <STYLE TYPE="text/css"> + + P { text-indent: 2em} + P.item { margin-left: 2em; text-indent: -2em} + P.item2 { margin-left: 2em; text-indent: 2em} + + TABLE.format { border:solid; border-collapse:collapse; caption-side:top; text-align:center; width:80%;} + TABLE.format TH { border:ridge; padding:4px; width:25%;} + TABLE.format TD { border:ridge; padding:4px; } + TABLE.format CAPTION { font-weight:bold; font-size:larger;} + + TABLE.note {border:none; text-align:right; width:80%;} + + TABLE.desc { border:solid; border-collapse:collapse; caption-side:top; text-align:left; width:80%;} + TABLE.desc TR { vertical-align:top;} + TABLE.desc TH { border-style:ridge; font-size:larger; padding:4px; text-decoration:underline;} + TABLE.desc TD { border-style:ridge; padding:4px; } + TABLE.desc CAPTION { font-weight:bold; font-size:larger;} + + TABLE.list { border:none; } + TABLE.list TR { vertical-align:top;} + TABLE.list TH { border:none; text-decoration:underline;} + TABLE.list TD { border:none; } + + </STYLE> + + <!-- #BeginLibraryItem "/ed_libs/styles_Format.lbi" --> + <!-- + * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * + * Copyright by the Board of Trustees of the University of Illinois. * + * All rights reserved. * + * * + * This file is part of HDF5. The full HDF5 copyright notice, including * + * terms governing use, modification, and redistribution, is contained in * + * the files COPYING and Copyright.html. COPYING can be found at the root * + * of the source code distribution tree; Copyright.html can be found at the * + * root level of an installed copy of the electronic HDF5 document set and * + * is linked from the top-level documents page. 
It can also be found at * + * http://hdf.ncsa.uiuc.edu/HDF5/doc/Copyright.html. If you do not have * + * access to either file, you may request a copy from hdfhelp@ncsa.uiuc.edu. * + * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * + --> + + <link href="ed_styles/FormatElect.css" rel="stylesheet" type="text/css"> + <!-- #EndLibraryItem --> + </head> + + <body bgcolor="#FFFFFF"> + <H3>Introduction</H3> + <P>Variable-length (VL) datatypes have a great deal of flexibility, but can + be over- or mis-used. VL datatypes are ideal for capturing the notion + that elements in an HDF5 dataset (or attribute) can have different + amounts of information (VL strings are the canonical example), + but they have some drawbacks that this document attempts + to address. + </P> + + <H3>Background</H3> + <P>Because fast random access to dataset elements requires that each + element be a fixed size, the information stored for VL datatype elements + is actually information to locate the VL information, not + the information itself. + </P> + + <H3>When to use VL datatypes</H3> + <P>VL datatypes are designed to allow the amount of data stored in each + element of a dataset to vary. This variation can be + over time, as new values with different lengths are written to the + element. Or, the variation can be over "space" - the dataset's space, + with each element in the dataset having the same fundamental type, but + different lengths. "Ragged arrays" are the classic example of elements + that vary over the "space" of the dataset. If the elements of a + dataset are not going to vary over "space" or time, a VL datatype + should probably not be used. + </P> + + <H3>Access Time Penalty</H3> + <P>Accessing VL information requires reading the element in the file, then + using that element's location information to retrieve the VL + information itself. + In the worst case, this doubles the number of disk accesses + required to access the VL information. 
+ </P> + <P>However, in order to avoid this extra disk access overhead, the HDF5 + library groups VL information together into larger blocks on disk and + performs I/O only on those larger blocks. Additionally, these blocks of + information are cached in memory as long as possible. For most access + patterns, this amortizes the extra disk accesses over enough pieces of + VL information to hide the extra overhead involved. + </P> + + <H3>Storage Space Penalty</H3> + <P>Because VL information must be located and retrieved from another + location in the file, extra information must be stored in the file to + locate + each item of VL information (i.e. each element in a dataset or each + VL field in a compound datatype, etc.). + Currently, that extra information amounts to 32 bytes per VL item. + </P> + <P> + With some judicious re-architecting of the library and file format, + this could be reduced to 18 bytes per VL item with no loss in + functionality or additional time penalties. With some additional + effort, the space could perhaps be pushed down as low as 8-10 + bytes per VL item with no loss in functionality, but potentially a + small time penalty. + </P> + + <H3>Chunking and Filters</H3> + <P>Storing data as VL information has some effects on chunked storage and + the filters that can be applied to chunked data. Because the data that + is stored in each chunk is the location to access the VL information, + the actual VL information is not broken up into chunks in the same way + as other data stored in chunks. Additionally, because the + actual VL information is not stored in the chunk, any filters which + operate on a chunk will operate on the information to + locate the VL information, not the VL information itself. 
+ </P> + + <H3>File Drivers</H3> + <P>Because the parallel I/O file drivers (MPI-I/O and MPI-posix) don't + allow objects with varying sizes to be created in the file, attempting + to create + a dataset or attribute with a VL datatype in a file managed by those + drivers will cause the creation call to fail. + </P> + <P>Additionally, using + VL datatypes and the 'multi' and 'split' file drivers may not operate + in the manner desired. The HDF5 library currently categorizes the + "blocks of VL information" stored in the file as a type of metadata, + which means that they may not be stored with the other raw data for + the file. + </P> + + <H3>Rewriting</H3> + <P>When VL information in the file is re-written, the old VL information + must be released, space for the new VL information must be allocated, and + the new VL information must be written to the file. This may cause + additional I/O accesses. + </P> + + </body> + +</html> + |
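The "Storage Space Penalty" section of the added note quotes per-item locator costs of 32 bytes (current) and 18 bytes (after a format redesign). As a rough illustration of what those figures mean for a dataset, here is a minimal arithmetic sketch. The function names are hypothetical and not part of the HDF5 API, and the sketch counts only the per-item locator cost stated in the note, not any other file-format overhead.

```python
# Per-item locator overhead figures quoted in the note (bytes per VL item).
CURRENT_OVERHEAD = 32     # stated current cost
REDESIGNED_OVERHEAD = 18  # stated cost after re-architecting the format


def vl_locator_overhead(num_items, bytes_per_item=CURRENT_OVERHEAD):
    """Bytes spent only on locating VL items, excluding the VL data itself.

    Hypothetical helper for illustration; not an HDF5 library call.
    """
    if num_items < 0:
        raise ValueError("num_items must be non-negative")
    return num_items * bytes_per_item


def overhead_fraction(num_items, avg_payload_bytes, bytes_per_item=CURRENT_OVERHEAD):
    """Share of (locator + payload) bytes that is locator overhead.

    Assumes every item carries avg_payload_bytes of VL data on average.
    """
    overhead = vl_locator_overhead(num_items, bytes_per_item)
    total = overhead + num_items * avg_payload_bytes
    return overhead / total if total else 0.0


if __name__ == "__main__":
    items = 1_000_000
    print(vl_locator_overhead(items))                       # current format
    print(vl_locator_overhead(items, REDESIGNED_OVERHEAD))  # redesigned format
    print(overhead_fraction(items, avg_payload_bytes=32))   # short payloads
```

Under these assumptions, short VL items (for example strings of a few dozen bytes) spend a large share of their storage on locator information, which is the case the note warns about.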