author: Quincey Koziol <koziol@hdfgroup.org> 2003-09-10 01:30:16 (GMT)
committer: Quincey Koziol <koziol@hdfgroup.org> 2003-09-10 01:30:16 (GMT)
commit: c3a7c9341e1e483c7714ff58ebb3bd7299f45e11
tree: 06ae19d16e956f0eb11b96c09bfa750ff1b661e2 /doc/html/TechNotes
parent: dc5aa55448efe27dc2fb635ed1f16527081a86bd
[svn-r7450] Purpose:
Add document describing issues relating to variable-length datatypes.
Diffstat (limited to 'doc/html/TechNotes')
-rw-r--r-- doc/html/TechNotes/VLTypes.html | 151
1 files changed, 151 insertions, 0 deletions
diff --git a/doc/html/TechNotes/VLTypes.html b/doc/html/TechNotes/VLTypes.html
new file mode 100644
index 0000000..d4ea905
--- /dev/null
+++ b/doc/html/TechNotes/VLTypes.html
@@ -0,0 +1,151 @@
+<html>
+ <head>
+ <title>
+ Variable-Length Datatypes in HDF5
+ </title>
+
+ <STYLE TYPE="text/css">
+
+ P { text-indent: 2em}
+ P.item { margin-left: 2em; text-indent: -2em}
+ P.item2 { margin-left: 2em; text-indent: 2em}
+
+ TABLE.format { border:solid; border-collapse:collapse; caption-side:top; text-align:center; width:80%;}
+ TABLE.format TH { border:ridge; padding:4px; width:25%;}
+ TABLE.format TD { border:ridge; padding:4px; }
+ TABLE.format CAPTION { font-weight:bold; font-size:larger;}
+
+ TABLE.note {border:none; text-align:right; width:80%;}
+
+ TABLE.desc { border:solid; border-collapse:collapse; caption-side:top; text-align:left; width:80%;}
+ TABLE.desc TR { vertical-align:top;}
+ TABLE.desc TH { border-style:ridge; font-size:larger; padding:4px; text-decoration:underline;}
+ TABLE.desc TD { border-style:ridge; padding:4px; }
+ TABLE.desc CAPTION { font-weight:bold; font-size:larger;}
+
+ TABLE.list { border:none; }
+ TABLE.list TR { vertical-align:top;}
+ TABLE.list TH { border:none; text-decoration:underline;}
+ TABLE.list TD { border:none; }
+
+ </STYLE>
+
+ <!-- #BeginLibraryItem "/ed_libs/styles_Format.lbi" -->
+ <!--
+ * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
+ * Copyright by the Board of Trustees of the University of Illinois. *
+ * All rights reserved. *
+ * *
+ * This file is part of HDF5. The full HDF5 copyright notice, including *
+ * terms governing use, modification, and redistribution, is contained in *
+ * the files COPYING and Copyright.html. COPYING can be found at the root *
+ * of the source code distribution tree; Copyright.html can be found at the *
+ * root level of an installed copy of the electronic HDF5 document set and *
+ * is linked from the top-level documents page. It can also be found at *
+ * http://hdf.ncsa.uiuc.edu/HDF5/doc/Copyright.html. If you do not have *
+ * access to either file, you may request a copy from hdfhelp@ncsa.uiuc.edu. *
+ * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
+ -->
+
+ <link href="ed_styles/FormatElect.css" rel="stylesheet" type="text/css">
+ <!-- #EndLibraryItem -->
+ </head>
+
+ <body bgcolor="#FFFFFF">
+ <H3>Introduction</H3>
+ <P>Variable-length (VL) datatypes offer a great deal of flexibility, but can
+ be overused or misused. VL datatypes are ideal for capturing the notion
+ that elements in an HDF5 dataset (or attribute) can hold different
+ amounts of information (VL strings are the canonical example),
+ but they have some drawbacks, which this document attempts
+ to address.
+ </P>
+
+ <H3>Background</H3>
+ <P>Because fast random access to dataset elements requires that each
+ element be a fixed size, what is stored for each VL datatype element
+ is the information needed to locate the VL information, not
+ the VL information itself.
+ </P>
+
+ <H3>When to use VL datatypes</H3>
+ <P>VL datatypes are designed to allow the amount of data stored in each
+ element of a dataset to vary. This variation can occur
+ over time, as new values with different lengths are written to the
+ element. Or, the variation can be over "space" - the dataset's space,
+ with each element in the dataset having the same fundamental type, but
+ different lengths. "Ragged arrays" are the classic example of elements
+ that change over the "space" of the dataset. If the elements of a
+ dataset are not going to change over "space" or time, a VL datatype
+ should probably not be used.
+ </P>
+
+ <H3>Access Time Penalty</H3>
+ <P>Accessing VL information requires reading the element in the file, then
+ using that element's location information to retrieve the VL
+ information itself.
+ In the worst case, this doubles the number of disk accesses
+ required to access the VL information.
+ </P>
+ <P>However, in order to avoid this extra disk access overhead, the HDF5
+ library groups VL information together into larger blocks on disk and
+ performs I/O only on those larger blocks. Additionally, these blocks of
+ information are cached in memory as long as possible. For most access
+ patterns, this amortizes the extra disk accesses over enough pieces of
+ VL information to hide the extra overhead involved.
+ </P>
+
+ <H3>Storage Space Penalty</H3>
+ <P>Because VL information must be located and retrieved from another
+ location in the file, extra information must be stored in the file to
+ locate
+ each item of VL information (i.e. each element in a dataset or each
+ VL field in a compound datatype, etc.).
+ Currently, that extra information amounts to 32 bytes per VL item.
+ </P>
+ <P>
+ With some judicious re-architecting of the library and file format,
+ this could be reduced to 18 bytes per VL item with no loss in
+ functionality or additional time penalties. With some additional
+ effort, the space could perhaps be pushed down as low as 8-10
+ bytes per VL item with no loss in functionality, but potentially a
+ small time penalty.
+ </P>
+
+ <H3>Chunking and Filters</H3>
+ <P>Storing data as VL information has some effects on chunked storage and
+ the filters that can be applied to chunked data. Because the data that
+ is stored in each chunk is the location to access the VL information,
+ the actual VL information is not broken up into chunks in the same way
+ as other data stored in chunks. Additionally, because the
+ actual VL information is not stored in the chunk, any filters which
+ operate on a chunk will operate on the information to
+ locate the VL information, not the VL information itself.
+ </P>
+
+ <H3>File Drivers</H3>
+ <P>Because the parallel I/O file drivers (MPI-I/O and MPI-posix) don't
+ allow objects with varying sizes to be created in the file, attempting
+ to create
+ a dataset or attribute with a VL datatype in a file managed by those
+ drivers will cause the creation call to fail.
+ </P>
+ <P>Additionally, using
+ VL datatypes with the 'multi' and 'split' file drivers may not behave
+ in the manner desired. The HDF5 library currently categorizes the
+ "blocks of VL information" stored in the file as a type of metadata,
+ which means that they may not be stored with the other raw data for
+ the file.
+ </P>
+
+ <H3>Rewriting</H3>
+ <P>When VL information in the file is re-written, the old VL information
+ must be released, space for the new VL information must be allocated, and
+ the new VL information must be written to the file. This may cause
+ additional I/O accesses.
+ </P>
+
+ </body>
+
+</html>
+
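
The per-item figures quoted in the Storage Space Penalty section of the note above (32 bytes currently, 18 bytes after re-architecting, perhaps 8-10 bytes with further effort) lend themselves to a quick back-of-the-envelope calculation. The sketch below is not HDF5 code and calls no HDF5 API. The function names `locator_overhead` and `overhead_fraction` are hypothetical, and the model assumes the overhead is exactly the quoted per-item figure, ignoring any block headers, alignment, or caching behaviour the library may add.

```python
# Per-item locator overhead figures taken from the note (bytes per VL item).
CURRENT_OVERHEAD = 32        # "currently ... 32 bytes per VL item"
REARCHITECTED_OVERHEAD = 18  # quoted as achievable after re-architecting


def locator_overhead(num_items, bytes_per_item=CURRENT_OVERHEAD):
    """Bytes spent locating VL items, excluding the VL data itself.

    Simplifying assumption: overhead is a flat per-item cost.
    """
    if num_items < 0:
        raise ValueError("num_items must be non-negative")
    return num_items * bytes_per_item


def overhead_fraction(payload_lengths, bytes_per_item=CURRENT_OVERHEAD):
    """Fraction of (payload + locator) bytes that is locator overhead.

    payload_lengths holds the byte length of each VL item's data.
    """
    payload = sum(payload_lengths)
    overhead = locator_overhead(len(payload_lengths), bytes_per_item)
    total = payload + overhead
    return overhead / total if total else 0.0


if __name__ == "__main__":
    items = 1_000_000
    print(locator_overhead(items))                          # current figure
    print(locator_overhead(items, REARCHITECTED_OVERHEAD))  # reduced figure
    print(overhead_fraction([10] * 4))                      # short items
```

Under these assumptions, one million VL items carry 32,000,000 bytes of locator information at the current figure and 18,000,000 bytes at the 18-byte figure. For items whose data is only 10 bytes long, the locator information would account for 32 of every 42 bytes stored, which illustrates why the note advises against VL datatypes when element sizes do not actually vary.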