Diffstat (limited to 'doc/html/TechNotes/BigDataSmMach.html')
-rw-r--r-- | doc/html/TechNotes/BigDataSmMach.html | 122 |
1 files changed, 122 insertions, 0 deletions
diff --git a/doc/html/TechNotes/BigDataSmMach.html b/doc/html/TechNotes/BigDataSmMach.html
new file mode 100644
index 0000000..fe00ff8
--- /dev/null
+++ b/doc/html/TechNotes/BigDataSmMach.html
@@ -0,0 +1,122 @@
+<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
+<html>
+  <head>
+    <title>Big Datasets on Small Machines</title>
+  </head>
+
+  <body>
+    <h1>Big Datasets on Small Machines</h1>
+
+    <h2>1. Introduction</h2>
+
+    <p>The HDF5 library is able to handle files larger than the
+      system's maximum file size, and datasets larger than the maximum
+      memory size. For instance, a machine where <code>sizeof(off_t)</code>
+      and <code>sizeof(size_t)</code> are both four bytes can handle
+      datasets and files as large as 18x10^18 bytes. However, most
+      Unix systems limit the number of concurrently open files, so a
+      practical file size limit is closer to 512GB or 1TB.
+
+    <p>Two "tricks" must be employed on these small systems in order
+      to store large datasets. The first trick circumvents the
+      <code>off_t</code> file size limit and the second circumvents
+      the <code>size_t</code> main memory limit.
+
+    <h2>2. File Size Limits</h2>
+
+    <p>Systems that have 64-bit file addresses will be able to access
+      large files automatically. On such systems one should see the
+      following output from configure:
+
+    <p><code><pre>
+checking size of off_t... 8
+    </pre></code>
+
+    <p>Also, some 32-bit operating systems have special file systems
+      that can support large (&gt;2GB) files; HDF5 will detect
+      these and use them automatically. If this is the case, the
+      output from configure will show:
+
+    <p><code><pre>
+checking for lseek64... yes
+checking for fseek64... yes
+    </pre></code>
+
+    <p>Otherwise one must use an HDF5 file family. Such a family is
+      created by setting file family properties in a file access
+      property list and then supplying a file name that includes a
+      <code>printf</code>-style integer format. For instance:
+
+    <p><code><pre>
+hid_t plist, file;
+plist = H5Pcreate (H5P_FILE_ACCESS);
+H5Pset_family (plist, 1&lt;&lt;30, H5P_DEFAULT);
+file = H5Fcreate ("big%03d.h5", H5F_ACC_TRUNC, H5P_DEFAULT, plist);
+    </pre></code>
+
+    <p>The second argument (<code>1&lt;&lt;30</code>) to
+      <code>H5Pset_family()</code> indicates that the family members
+      are to be 2^30 bytes (1GB) each, although we could have used any
+      reasonably large value. In general, family members cannot be
+      2GB because writes to byte number 2,147,483,647 will fail, so
+      the largest safe value for a family member is 2,147,483,647.
+      HDF5 will create family members on demand as the HDF5 address
+      space increases, but since most Unix systems limit the number of
+      concurrently open files, the effective maximum size of the HDF5
+      address space will be limited (the system on which this was
+      developed allows 1024 open files, so if each family member is
+      approximately 2GB then the largest HDF5 file is approximately 2TB).
+
+    <p>If the effective HDF5 address space is limited, then one may be
+      able to store datasets as external datasets, each spanning
+      multiple files of any length, since HDF5 opens external dataset
+      files one at a time. To arrange storage for a 5TB dataset split
+      among 1GB files one could write:
+
+    <p><code><pre>
+hid_t plist = H5Pcreate (H5P_DATASET_CREATE);
+for (i=0; i&lt;5*1024; i++) {
+    sprintf (name, "velocity-%04d.raw", i);
+    H5Pset_external (plist, name, 0, (size_t)1&lt;&lt;30);
+}
+    </pre></code>
+
+    <h2>3. Dataset Size Limits</h2>
+
+    <p>The second limit that must be overcome is that of
+      <code>sizeof(size_t)</code>. HDF5 defines a data type called
+      <code>hsize_t</code>, which is used for sizes of datasets and is,
+      by default, defined as <code>unsigned long long</code>.
+
+    <p>To create a dataset with 8*2^30 4-byte integers for a total of
+      32GB, one first creates the dataspace. We give two examples
+      here: a 4-dimensional dataset whose dimension sizes are smaller
+      than the maximum value of a <code>size_t</code>, and a
+      1-dimensional dataset whose dimension size is too large to fit
+      in a <code>size_t</code>.
+
+    <p><code><pre>
+hsize_t size1[4] = {8, 1024, 1024, 1024};
+hid_t space1 = H5Screate_simple (4, size1, size1);
+
+hsize_t size2[1] = {8589934592LL};
+hid_t space2 = H5Screate_simple (1, size2, size2);
+    </pre></code>
+
+    <p>However, the <code>LL</code> suffix is not portable, so it may
+      be better to replace the number with
+      <code>(hsize_t)8*1024*1024*1024</code>.
+
+    <p>For compilers that don't support <code>long long</code>, large
+      datasets will not be possible. The library performs too much
+      arithmetic on <code>hsize_t</code> types to make the use of a
+      struct feasible.
+
+    <hr>
+    <address><a href="mailto:matzke@llnl.gov">Robb Matzke</a></address>
+<!-- Created: Fri Apr 10 13:26:04 EDT 1998 -->
+<!-- hhmts start -->
+Last modified: Sun Jul 19 11:37:25 EDT 1998
+<!-- hhmts end -->
+  </body>
+</html>