<html>
  <head>
    <title>
      Variable-Length Datatypes in HDF5
    </title>

    <STYLE TYPE="text/css">

    P { text-indent: 2em}
    P.item { margin-left: 2em; text-indent: -2em}
    P.item2 { margin-left: 2em; text-indent: 2em}

    TABLE.format { border:solid; border-collapse:collapse; caption-side:top; text-align:center; width:80%;}
    TABLE.format TH { border:ridge; padding:4px; width:25%;}
    TABLE.format TD { border:ridge; padding:4px; }
    TABLE.format CAPTION { font-weight:bold; font-size:larger;}

    TABLE.note {border:none; text-align:right; width:80%;}

    TABLE.desc { border:solid; border-collapse:collapse; caption-side:top; text-align:left; width:80%;}
    TABLE.desc TR { vertical-align:top;}
    TABLE.desc TH { border-style:ridge; font-size:larger; padding:4px; text-decoration:underline;}
    TABLE.desc TD { border-style:ridge; padding:4px; }
    TABLE.desc CAPTION { font-weight:bold; font-size:larger;}

    TABLE.list { border:none; }
    TABLE.list TR { vertical-align:top;}
    TABLE.list TH { border:none; text-decoration:underline;}
    TABLE.list TD { border:none; }

    </STYLE>
           
    <!-- #BeginLibraryItem "/ed_libs/styles_Format.lbi" -->
<!--
  * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
  * Copyright by the Board of Trustees of the University of Illinois.         *
  * All rights reserved.                                                      *
  *                                                                           *
  * This file is part of HDF5.  The full HDF5 copyright notice, including     *
  * terms governing use, modification, and redistribution, is contained in    *
  * the files COPYING and Copyright.html.  COPYING can be found at the root   *
  * of the source code distribution tree; Copyright.html can be found at the  *
  * root level of an installed copy of the electronic HDF5 document set and   *
  * is linked from the top-level documents page.  It can also be found at     *
  * http://hdf.ncsa.uiuc.edu/HDF5/doc/Copyright.html.  If you do not have     *
  * access to either file, you may request a copy from hdfhelp@ncsa.uiuc.edu. *
  * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
 -->

<link href="../ed_styles/FormatElect.css" rel="stylesheet" type="text/css">
<!-- #EndLibraryItem --></head>

  <body bgcolor="#FFFFFF">
    <H3>Introduction</H3>
    <P>Variable-length (VL) datatypes offer a great deal of flexibility, but can
        be overused or misused.  VL datatypes are ideal for capturing the notion
        that elements in an HDF5 dataset (or attribute) can hold different
        amounts of information (VL strings are the canonical example),
        but they have some drawbacks that this document attempts
        to address.
    </P>

    <H3>Background</H3>
    <P>Because fast random access to dataset elements requires that each
        element be a fixed size, what is stored for each VL datatype element
        is the information needed to locate the VL data, not
        the VL data itself.
    </P>
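    <P>The indirection described above can be sketched with a toy model.
        The names <code>vl_locator_t</code>, <code>vl_heap_t</code>,
        <code>vl_store</code> and <code>vl_fetch</code> are hypothetical and
        do not reflect the actual HDF5 file format; the sketch only assumes
        that each dataset element is a fixed-size locator pointing into a
        separate store of variable-length bytes.
    </P>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-size locator kept in the dataset for each element.
 * This is NOT the HDF5 on-disk layout; it only models the idea. */
typedef struct {
    size_t offset;  /* where the variable-length bytes start in the heap */
    size_t length;  /* how many bytes belong to this element */
} vl_locator_t;

/* Hypothetical side storage that holds the variable-length bytes. */
typedef struct {
    char   bytes[256];
    size_t used;
} vl_heap_t;

/* Copy `len` bytes into the heap and fill in a locator for them.
 * Returns 0 on success, -1 if the heap has no room left. */
int vl_store(vl_heap_t *heap, const char *data, size_t len, vl_locator_t *out)
{
    assert(heap != NULL && data != NULL && out != NULL);
    if (len > sizeof heap->bytes - heap->used)
        return -1;
    memcpy(heap->bytes + heap->used, data, len);
    out->offset = heap->used;
    out->length = len;
    heap->used += len;
    return 0;
}

/* Follow a locator to the start of its bytes (second step of an access). */
const char *vl_fetch(const vl_heap_t *heap, vl_locator_t loc)
{
    assert(heap != NULL);
    return heap->bytes + loc.offset;
}
```

    <P>In this model every locator has the same size regardless of how long
        its data is, so an array of locators still supports fixed-stride
        random access, while the data itself lives elsewhere.
    </P>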

    <H3>When to use VL datatypes</H3>
    <P>VL datatypes are designed to allow the amount of data stored in each
        element of a dataset to vary.  This change can be
        over time, as new values with different lengths are written to the
        element.  Or the change can be over "space" (the dataset's space),
        with each element in the dataset having the same fundamental type but
        a different length.  "Ragged arrays" are the classic example of elements
        that change over the "space" of the dataset.  If the elements of a
        dataset are not going to change over "space" or time, a VL datatype
        should probably not be used.
    </P>

    <H3>Access Time Penalty</H3>
    <P>Accessing VL information requires reading the element in the file, then
        using that element's location information to retrieve the VL
        information itself.
        In the worst case, this doubles the number of disk accesses
        required to access the VL information.
    </P>
    <P>To reduce this extra disk access overhead, however, the HDF5
        library groups VL information together into larger blocks on disk and
        performs I/O only on those larger blocks.  Additionally, these blocks of
        information are cached in memory as long as possible.  For most access
        patterns, this amortizes the extra disk accesses over enough pieces of
        VL information to hide the extra overhead involved.
    </P>
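    <P>The amortization argument can be illustrated with a deliberately
        simplified access count.  The model below is an assumption made for
        illustration, not a description of the library's real I/O behavior:
        it supposes that items are read in storage order, that each block
        holds a fixed number of VL items (<code>items_per_block</code>, a
        hypothetical parameter), and that a block stays cached once read.
    </P>

```c
#include <assert.h>
#include <stddef.h>

/* Worst case from the text: one access for the element and one more
 * for its VL information. */
size_t vl_accesses_worst_case(size_t n_items)
{
    return 2 * n_items;
}

/* Simplified model (an assumption, not the library's actual behavior):
 * every element still costs one access, but the VL information is read
 * one whole block at a time and each block is read only once. */
size_t vl_accesses_blocked(size_t n_items, size_t items_per_block)
{
    assert(items_per_block > 0);
    /* Number of blocks needed to hold n_items, rounded up. */
    size_t block_reads = (n_items + items_per_block - 1) / items_per_block;
    return n_items + block_reads;
}
```

    <P>Under these assumptions, reading 1000 items costs 2000 accesses in the
        worst case but 1010 when 100 items share a block, so the extra cost
        per item shrinks as more items are grouped together.
    </P>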

    <H3>Storage Space Penalty</H3>
    <P>Because VL information must be located and retrieved from another
        location in the file, extra information must be stored in the file to
        locate 
        each item of VL information (i.e. each element in a dataset or each
        VL field in a compound datatype, etc.).
        Currently, that extra information amounts to 32 bytes per VL item.
    </P>
    <P>
        With some judicious re-architecting of the library and file format,
        this could be reduced to 18 bytes per VL item with no loss in
        functionality and no additional time penalty.  With some additional
        effort, the space could perhaps be pushed down as low as 8-10
        bytes per VL item with no loss in functionality, but potentially with a
        small time penalty.
    </P>
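    <P>To get a feel for what the current 32 bytes per VL item means in
        practice, the following sketch computes the locating overhead for a
        given number of items.  The function and macro names are hypothetical,
        and the calculation counts only the per-item figure quoted above; it
        ignores any other file-format overhead.
    </P>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-item locating overhead stated in the text for the current format. */
#define VL_OVERHEAD_BYTES_PER_ITEM 32u

/* Total bytes spent on locating information for n_items VL items. */
size_t vl_overhead_bytes(size_t n_items)
{
    /* Guard against overflow of the multiplication. */
    assert(n_items <= SIZE_MAX / VL_OVERHEAD_BYTES_PER_ITEM);
    return n_items * VL_OVERHEAD_BYTES_PER_ITEM;
}

/* Fraction of (payload + overhead) that is overhead; 0.0 if both are zero. */
double vl_overhead_fraction(size_t n_items, size_t payload_bytes)
{
    size_t overhead = vl_overhead_bytes(n_items);
    double total = (double)overhead + (double)payload_bytes;
    return total == 0.0 ? 0.0 : (double)overhead / total;
}
```

    <P>For example, one million VL strings averaging 8 bytes each carry
        8,000,000 bytes of data and 32,000,000 bytes of locating information,
        so 80% of the space goes to overhead.  The penalty matters most
        when the individual VL items are small.
    </P>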

    <H3>Chunking and Filters</H3>
    <P>Storing data as VL information has some effects on chunked storage and
        the filters that can be applied to chunked data.  Because the data
        stored in each chunk is the location of the VL information,
        the actual VL information is not broken up into chunks in the same way
        as other data stored in chunks.  Additionally, because the
        actual VL information is not stored in the chunk, any filter that
        operates on a chunk will operate on the information used to
        locate the VL information, not on the VL information itself.
    </P>

    <H3>File Drivers</H3>
    <P>Because the parallel I/O file drivers (MPI-I/O and MPI-posix) don't
        allow objects with varying sizes to be created in the file, attempting
        to create
        a dataset or attribute with a VL datatype in a file managed by those
        drivers will cause the creation call to fail.
    </P>
    <P>Additionally, VL datatypes used with
        the 'multi' and 'split' file drivers may not behave
        in the manner desired.  The HDF5 library currently categorizes the
        "blocks of VL information" stored in the file as a type of metadata,
        which means that they may not be stored with the other raw data for
        the file.
    </P>

    <H3>Rewriting</H3>
    <P>When VL information in the file is rewritten, the old VL information
        must be released, space for the new VL information must be allocated,
        and the new VL information must be written to the file.  This may cause
        additional I/O accesses.
    </P>

  </body>

</html>