<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html>
  <head>
    <title>Big Datasets on Small Machines</title>
  </head>

  <body>
    <h1>Big Datasets on Small Machines</h1>

    <h2>1. Introduction</h2>

    <p>The HDF5 library is able to handle files larger than the
      operating system's maximum file size, and datasets larger than
      the maximum memory size.  For instance, a machine where
      <code>sizeof(off_t)</code>
      and <code>sizeof(size_t)</code> are both four bytes can handle
      datasets and files as large as 18x10^18 bytes.  However, most
      Unix systems limit the number of concurrently open files, so a
      practical file size limit is closer to 512GB or 1TB.

    <p>Two "tricks" must be employed on these small systems in order
      to store large datasets.  The first trick circumvents the
      <code>off_t</code> file size limit and the second circumvents
      the <code>size_t</code> main memory limit.

    <h2>2. File Size Limits</h2>

    <p>Some 32-bit operating systems have special file systems that
      can support large (&gt;2GB) files and HDF5 will detect these and
      use them automatically.  If this is the case, the output from
      configure will show:

    <p><code><pre>
checking for lseek64... yes
checking for fseek64... yes
    </pre></code>

    <p>Otherwise one must use an HDF5 file family.  Such a family is
      created by setting file family properties in a file access
      property list and then supplying a file name that includes a
      <code>printf</code>-style integer format.  For instance:

    <p><code><pre>
hid_t plist, file;
plist = H5Pcreate (H5P_FILE_ACCESS);
H5Pset_family (plist, 1&lt;&lt;30, H5P_DEFAULT);
file = H5Fcreate ("big%03d.h5", H5F_ACC_TRUNC, H5P_DEFAULT, plist);
    </pre></code>

    <p>The second argument (<code>1&lt;&lt;30</code>) to
      <code>H5Pset_family()</code> indicates that the family members
      are to be 2^30 bytes (1GB) each.  In general, family members
      cannot be 2GB because writes to byte number 2,147,483,647 will
      fail, so the largest safe value for a family member is
      2,147,483,647.  HDF5 will create family members on demand as the
      HDF5 address space increases, but since most Unix systems limit
      the number of concurrently open files the effective maximum size
      of the HDF5 address space will be limited.
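
    <p>As a rough illustration of the paragraph above, the following
      sketch shows how an address in a family's address space could
      map onto a member file and an offset within it, and how an
      open-file limit bounds the usable address space.  The names
      <code>addr64_t</code>, <code>family_locate</code>,
      <code>family_name</code> and <code>family_max_bytes</code> are
      hypothetical and are not part of the HDF5 API; the library's own
      family driver may work differently.

    <p><code><pre>
#include &lt;assert.h&gt;
#include &lt;stdio.h&gt;

/* Stand-in for a 64-bit HDF5 address (an assumption for this sketch). */
typedef unsigned long long addr64_t;

/* Split a family address into a member number and an offset in that member. */
static void family_locate(addr64_t addr, addr64_t memb_size,
                          unsigned *memb, addr64_t *offset)
{
    assert(memb_size &gt; 0);
    *memb   = (unsigned)(addr / memb_size);
    *offset = addr % memb_size;
}

/* Expand a printf-style family name such as "big%03d.h5" for one member. */
static int family_name(char *buf, size_t len, const char *fmt, unsigned memb)
{
    return snprintf(buf, len, fmt, (int)memb);
}

/* Address space reachable when at most max_open members can be open at once. */
static addr64_t family_max_bytes(addr64_t memb_size, unsigned max_open)
{
    return memb_size * max_open;
}
    </pre></code>

    <p>With 1GB members, address 3*2^30+5 falls at offset 5 of
      <code>big003.h5</code>, and an assumed limit of 512 open files
      would cap the address space at 512GB.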

    <p>If the effective HDF5 address space is too small, one may
      instead be able to store datasets as external datasets, each
      spanning multiple files of any length, since HDF5 opens
      external dataset files one at a time.  To arrange storage for a
      5TB dataset one could say:

    <p><code><pre>
hid_t plist = H5Pcreate (H5P_DATASET_CREATE);
char name[32];
int i;
/* 5*1024 external files of 1GB each provide 5TB of raw storage */
for (i=0; i&lt;5*1024; i++) {
   sprintf (name, "velocity-%04d.raw", i);
   H5Pset_external (plist, name, 0, (size_t)1&lt;&lt;30);
}
    </pre></code>
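
    <p>The loop bound in the example above follows from dividing the
      dataset size by the size of each external file.  A minimal
      sketch of that arithmetic is given here, assuming a
      hypothetical helper <code>external_file_count</code> and a
      stand-in type <code>bytes64_t</code>, neither of which is an
      HDF5 name:

    <p><code><pre>
#include &lt;assert.h&gt;

/* 64-bit byte count; a stand-in for hsize_t in this sketch. */
typedef unsigned long long bytes64_t;

/* Number of external files of file_size bytes needed to hold total bytes,
 * rounding up so that a partial final file is counted. */
static bytes64_t external_file_count(bytes64_t total, bytes64_t file_size)
{
    assert(file_size &gt; 0);
    return total / file_size + (total % file_size != 0);
}
    </pre></code>

    <p>For a 5TB dataset (5*2^40 bytes) and 1GB files (2^30 bytes)
      this gives 5*1024 files, which matches the loop bound.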

    <h2>3. Dataset Size Limits</h2>

    <p>The second limit which must be overcome is that of
      <code>sizeof(size_t)</code>.  HDF5 defines a new data type
      called <code>hsize_t</code> which is used for sizes of datasets
      and is, by default, defined as <code>unsigned long long</code>.

    <p>To create a dataset with 8*2^30 4-byte integers for a total of
      32GB one first creates the dataspace.  We give two examples
      here: a 4-dimensional dataset whose dimension sizes are smaller
      than the maximum value of a <code>size_t</code>, and a
      1-dimensional dataset whose dimension size is too large to fit
      in a <code>size_t</code>.

    <p><code><pre>
hsize_t size1[4] = {8, 1024, 1024, 1024};
hid_t space1 = H5Screate_simple (4, size1, size1);

hsize_t size2[1] = {8589934592LL};
hid_t space2 = H5Screate_simple (1, size2, size2);
    </pre></code>

    <p>However, the <code>LL</code> suffix is not portable, so it may
      be better to replace the number with
      <code>(hsize_t)8*1024*1024*1024</code>.
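
    <p>The cast matters because the multiplication would otherwise be
      carried out in <code>int</code>, which cannot hold 2^33 where
      <code>int</code> is 32 bits wide.  The sketch below computes the
      byte size of a dataset with an overflow check.  The names
      <code>hsize_sketch_t</code> and <code>dataset_bytes</code> are
      hypothetical and used only for illustration; the typedef stands
      in for <code>hsize_t</code> defined as
      <code>unsigned long long</code>.

    <p><code><pre>
#include &lt;assert.h&gt;
#include &lt;limits.h&gt;

/* Stand-in for hsize_t (an assumption for this sketch). */
typedef unsigned long long hsize_sketch_t;

/* Multiply the dimension sizes and the element size together.
 * Returns 0 if the product would not fit in unsigned long long
 * (or if any factor is 0). */
static hsize_sketch_t dataset_bytes(const hsize_sketch_t *dims, int rank,
                                    hsize_sketch_t elem_size)
{
    hsize_sketch_t total = elem_size;
    int i;

    assert(rank &gt;= 0);
    for (i = 0; i &lt; rank; i++) {
        if (dims[i] != 0 &amp;&amp; total &gt; ULLONG_MAX / dims[i])
            return 0;               /* product would overflow */
        total *= dims[i];
    }
    return total;
}
    </pre></code>

    <p>Both dataspaces in the example above describe 8*2^30 elements,
      so with 4-byte elements either one yields 34,359,738,368 bytes
      (32GB).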

    <p>For compilers that don't support <code>long long</code>, large
      datasets will not be possible.  The library performs too much
      arithmetic on <code>hsize_t</code> types to make the use of a
      struct feasible.

    <hr>
    <address><a href="mailto:matzke@llnl.gov">Robb Matzke</a></address>
<!-- Created: Fri Apr 10 13:26:04 EDT 1998 -->
<!-- hhmts start -->
Last modified: Wed May 13 12:36:47 EDT 1998
<!-- hhmts end -->
  </body>
</html>