1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
|
<html>
<head>
<title>
Variable-Length Datatypes in HDF5
</title>
<STYLE TYPE="text/css">
P { text-indent: 2em}
P.item { margin-left: 2em; text-indent: -2em}
P.item2 { margin-left: 2em; text-indent: 2em}
TABLE.format { border:solid; border-collapse:collapse; caption-side:top; text-align:center; width:80%;}
TABLE.format TH { border:ridge; padding:4px; width:25%;}
TABLE.format TD { border:ridge; padding:4px; }
TABLE.format CAPTION { font-weight:bold; font-size:larger;}
TABLE.note {border:none; text-align:right; width:80%;}
TABLE.desc { border:solid; border-collapse:collapse; caption-size:top; text-align:left; width:80%;}
TABLE.desc TR { vertical-align:top;}
TABLE.desc TH { border-style:ridge; font-size:larger; padding:4px; text-decoration:underline;}
TABLE.desc TD { border-style:ridge; padding:4px; }
TABLE.desc CAPTION { font-weight:bold; font-size:larger;}
TABLE.list { border:none; }
TABLE.list TR { vertical-align:top;}
TABLE.list TH { border:none; text-decoration:underline;}
TABLE.list TD { border:none; }
</STYLE>
<!-- #BeginLibraryItem "/ed_libs/styles_Format.lbi" -->
<!--
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* Copyright by the Board of Trustees of the University of Illinois. *
* All rights reserved. *
* *
* This file is part of HDF5. The full HDF5 copyright notice, including *
* terms governing use, modification, and redistribution, is contained in *
* the files COPYING and Copyright.html. COPYING can be found at the root *
* of the source code distribution tree; Copyright.html can be found at the *
* root level of an installed copy of the electronic HDF5 document set and *
* is linked from the top-level documents page. It can also be found at *
* http://hdf.ncsa.uiuc.edu/HDF5/doc/Copyright.html. If you do not have *
* access to either file, you may request a copy from hdfhelp@ncsa.uiuc.edu. *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
-->
<link href="ed_styles/FormatElect.css" rel="stylesheet" type="text/css">
<!-- #EndLibraryItem -->
</head>
<body bgcolor="#FFFFFF">
<H3>Introduction</H3>
<P>Variable-length (VL) datatypes have a great deal of flexibility, but can
be over- or mis-used. VL datatypes are ideal at capturing the notion
that elements in an HDF5 dataset (or attribute) can have different
amounts of information (VL strings are the canonical example),
but they have some drawbacks that this document attempts
to address.
</P>
<H3>Background</H3>
<P>Because fast random access to dataset elements requires that each
element be a fixed size, the information stored for VL datatype elements
is actually information to locate the VL information, not
the information itself.
</P>
<H3>When to use VL datatypes</H3>
<P>VL datatypes are designed allow the amount of data stored in each
element of a dataset to vary. This change could be
over time as new values, with different lengths, were written to the
element. Or, the change can be over "space" - the dataset's space,
with each element in the dataset having the same fundamental type, but
different lengths. "Ragged arrays" are the classic example of elements
that change over the "space" of the dataset. If the elements of a
dataset are not going to change over "space" or time, a VL datatype
should probably not be used.
</P>
<H3>Access Time Penalty</H3>
<P>Accessing VL information requires reading the element in the file, then
using that element's location information to retrieve the VL
information itself.
In the worst case, this obviously doubles the number of disk accesses
required to access the VL information.
</P>
<P>However, in order to avoid this extra disk access overhead, the HDF5
library groups VL information together into larger blocks on disk and
performs I/O only on those larger blocks. Additionally, these blocks of
information are cached in memory as long as possible. For most access
patterns, this amortizes the extra disk accesses over enough pieces of
VL information to hide the extra overhead involved.
</P>
<H3>Storage Space Penalty</H3>
<P>Because VL information must be located and retrieved from another
location in the file, extra information must be stored in the file to
locate
each item of VL information (i.e. each element in a dataset or each
VL field in a compound datatype, etc.).
Currently, that extra information amounts to 32 bytes per VL item.
</P>
<P>
With some judicious re-architecting of the library and file format,
this could be reduced to 18 bytes per VL item with no loss in
functionality or additional time penalties. With some additional
effort, the space could perhaps could be pushed down as low as 8-10
bytes per VL item with no loss in functionality, but potentially a
small time penalty.
</P>
<H3>Chunking and Filters</H3>
<P>Storing data as VL information has some affects on chunked storage and
the filters that can be applied to chunked data. Because the data that
is stored in each chunk is the location to access the VL information,
the actual VL information is not broken up into chunks in the same way
as other data stored in chunks. Additionally, because the
actual VL information is not stored in the chunk, any filters which
operate on a chunk will operate on the information to
locate the VL information, not the VL information itself.
</P>
<H3>File Drivers</H3>
<P>Because the parallel I/O file drivers (MPI-I/O and MPI-posix) don't
allow objects with varying sizes to be created in the file, attemping
to create
a dataset or attribute with a VL datatype in a file managed by those
drivers will cause the creation call to fail.
</P>
<P>Additionally, using
VL datatypes and the 'multi' and 'split' file drivers may not operate
in the manner desired. The HDF5 library currently categorizes the
"blocks of VL information" stored in the file as a type of metadata,
which means that they may not be stored with the other raw data for
the file.
</P>
<H3>Rewriting</H3>
<P>When VL information in the file is re-written, the old VL information
must be releases, space for the new VL information allocated and
the new VL information must be written to the file. This may cause
additional I/O accesses.
</P>
</body>
</html>
|