summaryrefslogtreecommitdiffstats
path: root/doc/html/Chunking.html
diff options
context:
space:
mode:
Diffstat (limited to 'doc/html/Chunking.html')
-rw-r--r--doc/html/Chunking.html313
1 files changed, 0 insertions, 313 deletions
diff --git a/doc/html/Chunking.html b/doc/html/Chunking.html
deleted file mode 100644
index 3738d9a..0000000
--- a/doc/html/Chunking.html
+++ /dev/null
@@ -1,313 +0,0 @@
-<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
-<html>
- <head>
- <title>Dataset Chunking</title>
-
-<!-- #BeginLibraryItem "/ed_libs/styles_UG.lbi" -->
-<!--
- * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
- * Copyright by the Board of Trustees of the University of Illinois. *
- * All rights reserved. *
- * *
- * This file is part of HDF5. The full HDF5 copyright notice, including *
- * terms governing use, modification, and redistribution, is contained in *
- * the files COPYING and Copyright.html. COPYING can be found at the root *
- * of the source code distribution tree; Copyright.html can be found at the *
- * root level of an installed copy of the electronic HDF5 document set and *
- * is linked from the top-level documents page. It can also be found at *
- * http://hdf.ncsa.uiuc.edu/HDF5/doc/Copyright.html. If you do not have *
- * access to either file, you may request a copy from hdfhelp@ncsa.uiuc.edu. *
- * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
- -->
-
-<link href="ed_styles/UGelect.css" rel="stylesheet" type="text/css">
-<!-- #EndLibraryItem --></head>
-
- <body bgcolor="#FFFFFF">
-
-
-<!-- #BeginLibraryItem "/ed_libs/NavBar_UG.lbi" --><hr>
-<center>
-<table border=0 width=98%>
-<tr><td valign=top align=left>
- <a href="index.html">HDF5 documents and links</a>&nbsp;<br>
- <a href="H5.intro.html">Introduction to HDF5</a>&nbsp;<br>
- <a href="RM_H5Front.html">HDF5 Reference Manual</a>&nbsp;<br>
- <a href="http://hdf.ncsa.uiuc.edu/HDF5/doc/UG/index.html">HDF5 User's Guide for Release 1.6</a>&nbsp;<br>
- <!--
- <a href="Glossary.html">Glossary</a><br>
- -->
-</td>
-<td valign=top align=right>
- And in this document, the
- <a href="H5.user.html"><strong>HDF5 User's Guide from Release 1.4.5:</strong></a>&nbsp;&nbsp;&nbsp;&nbsp;
- <br>
- <a href="Files.html">Files</a>&nbsp;&nbsp;
- <a href="Datasets.html">Datasets</a>&nbsp;&nbsp;
- <a href="Datatypes.html">Datatypes</a>&nbsp;&nbsp;
- <a href="Dataspaces.html">Dataspaces</a>&nbsp;&nbsp;
- <a href="Groups.html">Groups</a>&nbsp;&nbsp;
- <br>
- <a href="References.html">References</a>&nbsp;&nbsp;
- <a href="Attributes.html">Attributes</a>&nbsp;&nbsp;
- <a href="Properties.html">Property Lists</a>&nbsp;&nbsp;
- <a href="Errors.html">Error Handling</a>&nbsp;&nbsp;
- <br>
- <a href="Filters.html">Filters</a>&nbsp;&nbsp;
- <a href="Caching.html">Caching</a>&nbsp;&nbsp;
- <a href="Chunking.html">Chunking</a>&nbsp;&nbsp;
- <a href="MountingFiles.html">Mounting Files</a>&nbsp;&nbsp;
- <br>
- <a href="Performance.html">Performance</a>&nbsp;&nbsp;
- <a href="Debugging.html">Debugging</a>&nbsp;&nbsp;
- <a href="Environment.html">Environment</a>&nbsp;&nbsp;
- <a href="ddl.html">DDL</a>&nbsp;&nbsp;
-</td></tr>
-</table>
-</center>
-<hr>
-<!-- #EndLibraryItem --><h1>Dataset Chunking Issues</h1>
-
- <h2>Table of Contents</h2>
-
- <ul>
- <li><a href="#S1">1. Introduction</a>
- <li><a href="#S2">2. Raw Data Chunk Cache</a>
- <li><a href="#S3">3. Cache Efficiency</a>
- <li><a href="#S4">4. Fragmentation</a>
- <li><a href="#S5">5. File Storage Overhead</a>
- <li><a href="#S6">6. Chunk Compression</a>
- </ul>
-
- <h2><a name="S1">1. Introduction</a></h2>
-
-
- <p><em>Chunking</em> refers to a storage layout where a dataset is
- partitioned into fixed-size multi-dimensional chunks. The
- chunks cover the dataset but the dataset need not be an integral
- number of chunks. If no data is ever written to a chunk then
- that chunk isn't allocated on disk. Figure 1 shows a 25x48
- element dataset covered by nine 10x20 chunks and 11 data points
- written to the dataset. No data was written to the region of
- the dataset covered by three of the chunks so those chunks were
- never allocated in the file -- the other chunks are allocated at
- independent locations in the file and written in their entirety.
-
- <center><image src="Chunk_f1.gif"><br><b>Figure 1</b></center>
-
- <p>The HDF5 library treats chunks as atomic objects -- disk I/O is
- always in terms of complete chunks<a href="#fn1">(1)</a>. This
- allows data filters to be defined by the application to perform
- tasks such as compression, encryption, checksumming,
- <em>etc</em>. on entire chunks. As shown in Figure 2, if
- <code>H5Dwrite()</code> touches only a few bytes of the chunk,
- the entire chunk is read from the file, the data passes upward
- through the filter pipeline, the few bytes are modified, the
- data passes downward through the filter pipeline, and the entire
- chunk is written back to the file.
-
- <center><image src="Chunk_f2.gif"><br><b>Figure 2</b></center>
-
- <h2><a name="S2">2. The Raw Data Chunk Cache</a></h2>
-
- <p>It's obvious from Figure 2 that calling <code>H5Dwrite()</code>
- many times from the application would result in poor performance
- even if the data being written all falls within a single chunk.
- A raw data chunk cache layer was added between the top of the
- filter stack and the bottom of the byte modification layer<a
- href="#fn2">(2)</a>. By default, the chunk cache will store 521
- chunks or 1MB of data (whichever is less) but these values can
- be modified with <code>H5Pset_cache()</code>.
-
- <p>The preemption policy for the cache favors certain chunks and
- tries not to preempt them.
-
- <ul>
- <li>Chunks that have been accessed frequently in the near past
- are favored.
- <li>A chunk which has just entered the cache is favored.
- <li>A chunk which has been completely read or completely written
- but not partially read or written is penalized according to
- some application specified weighting between zero and one.
- <li>A chunk which is larger than the maximum cache size is not
- eligible for caching.
- </ul>
-
- <h2><a name="S3">3. Cache Efficiency</a></h2>
-
- <p>Now for some real numbers... A 2000x2000 element dataset is
- created and covered by a 20x20 array of chunks (each chunk is 100x100
- elements). The raw data cache is adjusted to hold at most 25 chunks by
- setting the maximum number of bytes to 25 times the chunk size in
- bytes. Then the application creates a square, two-dimensional memory
- buffer and uses it as a window into the dataset, first reading and then
- rewriting in row-major order by moving the window across the dataset
- (the read and write tests both start with a cold cache).
-
- <p>The measure of efficiency in Figure 3 is the number of bytes requested
- by the application divided by the number of bytes transferred from the
- file. There are at least a couple ways to get an estimate of the cache
- performance: one way is to turn on <a href="Debugging.html">cache
- debugging</a> and look at the number of cache misses. A more accurate
- and specific way is to register a data filter whose sole purpose is to
- count the number of bytes that pass through it (that's the method used
- below).
-
- <center><image src="Chunk_f3.gif"><br><b>Figure 3</b></center>
-
- <p>The read efficiency is less than one for two reasons: collisions in the
- cache are handled by preempting one of the colliding chunks, and the
- preemption algorithm occasionally preempts a chunk which hasn't been
- referenced for a long time but is about to be referenced in the near
- future.
-
- <p>The write test results in lower efficiency for most window
- sizes because HDF5 is unaware that the application is about to
- overwrite the entire dataset and must read in most chunks before
- modifying parts of them.
-
- <p>There is a simple way to improve efficiency for this example.
- It turns out that any chunk that has been completely read or
- written is a good candidate for preemption. If we increase the
- penalty for such chunks from the default 0.75 to the maximum
- 1.00 then efficiency improves.
-
- <center><image src="Chunk_f4.gif"><br><b>Figure 4</b></center>
-
- <p>The read efficiency is still less than one because of
- collisions in the cache. The number of collisions can often be
- reduced by increasing the number of slots in the cache. Figure
- 5 shows what happens when the maximum number of slots is
- increased by an order of magnitude from the default (this change
- has no major effect on memory used by the test since the byte
- limit was not increased for the cache).
-
- <center><image src="Chunk_f5.gif"><br><b>Figure 5</b></center>
-
- <p>Although the application eventually overwrites every chunk
- completely the library has know way of knowing this before hand
- since most calls to <code>H5Dwrite()</code> modify only a
- portion of any given chunk. Therefore, the first modification of
- a chunk will cause the chunk to be read from disk into the chunk
- buffer through the filter pipeline. Eventually HDF5 might
- contain a data set transfer property that can turn off this read
- operation resulting in write efficiency which is equal to read
- efficiency.
-
-
- <h2><a name="S4">4. Fragmentation</a></h2>
-
- <p>Even if the application transfers the entire dataset contents with a
- single call to <code>H5Dread()</code> or <code>H5Dwrite()</code> it's
- possible the request will be broken into smaller, more manageable
- pieces by the library. This is almost certainly true if the data
- transfer includes a type conversion.
-
- <center><image src="Chunk_f6.gif"><br><b>Figure 6</b></center>
-
- <p>By default the strip size is 1MB but it can be changed by calling
- <code>H5Pset_buffer()</code>.
-
-
- <h2><a name="S5">5. File Storage Overhead</a></h2>
-
- <p>The chunks of the dataset are allocated at independent
- locations throughout the HDF5 file and a B-tree maps chunk
- N-dimensional addresses to file addresses. The more chunks that
- are allocated for a dataset the larger the B-tree. Large B-trees
- have two disadvantages:
-
- <ul>
- <li>The file storage overhead is higher and more disk I/O is
- required to traverse the tree from root to leaves.
- <li>The increased number of B-tree nodes will result in higher
- contention for the meta data cache.
- </ul>
-
- <p>There are three ways to reduce the number of B-tree nodes. The
- obvious way is to reduce the number of chunks by choosing a larger chunk
- size (doubling the chunk size will cut the number of B-tree nodes in
- half). Another method is to adjust the split ratios for the B-tree by
- calling <code>H5Pset_btree_ratios()</code>, but this method typically
- results in only a slight improvement over the default settings.
- Finally, the out-degree of each node can be increased by calling
- <code>H5Pset_istore_k()</code> (increasing the out degree actually
- increases file overhead while decreasing the number of nodes).
-
-
- <h2><a name="S6">6. Chunk Compression</a></h2>
-
- <p>Dataset chunks can be compressed through the use of filters.
- See the chapter &ldquo;<a href="Filters.html">Filters in HDF5</a>.&rdquo;
-
- <p>Reading and rewriting compressed chunked data can result in holes
- in an HDF5 file. In time, enough such holes can increase the
- file size enough to impair application or library performance
- when working with that file. See
- &ldquo;<a href="Performance.html#Freespace">Freespace Management</a>&rdquo;
- in the chapter
- &ldquo;<a href="Performance.html">Performance Analysis and Issues</a>.&rdquo;
-
-
-<hr>
-
- <p><a name="fn1">Footnote 1:</a> Parallel versions of the library
- can access individual bytes of a chunk when the underlying file
- uses MPI-IO.
-
- <p><a name="fn2">Footnote 2:</a> The raw data chunk cache was
- added before the second alpha release.</p>
-
-
-<!-- #BeginLibraryItem "/ed_libs/NavBar_UG.lbi" --><hr>
-<center>
-<table border=0 width=98%>
-<tr><td valign=top align=left>
- <a href="index.html">HDF5 documents and links</a>&nbsp;<br>
- <a href="H5.intro.html">Introduction to HDF5</a>&nbsp;<br>
- <a href="RM_H5Front.html">HDF5 Reference Manual</a>&nbsp;<br>
- <a href="http://hdf.ncsa.uiuc.edu/HDF5/doc/UG/index.html">HDF5 User's Guide for Release 1.6</a>&nbsp;<br>
- <!--
- <a href="Glossary.html">Glossary</a><br>
- -->
-</td>
-<td valign=top align=right>
- And in this document, the
- <a href="H5.user.html"><strong>HDF5 User's Guide from Release 1.4.5:</strong></a>&nbsp;&nbsp;&nbsp;&nbsp;
- <br>
- <a href="Files.html">Files</a>&nbsp;&nbsp;
- <a href="Datasets.html">Datasets</a>&nbsp;&nbsp;
- <a href="Datatypes.html">Datatypes</a>&nbsp;&nbsp;
- <a href="Dataspaces.html">Dataspaces</a>&nbsp;&nbsp;
- <a href="Groups.html">Groups</a>&nbsp;&nbsp;
- <br>
- <a href="References.html">References</a>&nbsp;&nbsp;
- <a href="Attributes.html">Attributes</a>&nbsp;&nbsp;
- <a href="Properties.html">Property Lists</a>&nbsp;&nbsp;
- <a href="Errors.html">Error Handling</a>&nbsp;&nbsp;
- <br>
- <a href="Filters.html">Filters</a>&nbsp;&nbsp;
- <a href="Caching.html">Caching</a>&nbsp;&nbsp;
- <a href="Chunking.html">Chunking</a>&nbsp;&nbsp;
- <a href="MountingFiles.html">Mounting Files</a>&nbsp;&nbsp;
- <br>
- <a href="Performance.html">Performance</a>&nbsp;&nbsp;
- <a href="Debugging.html">Debugging</a>&nbsp;&nbsp;
- <a href="Environment.html">Environment</a>&nbsp;&nbsp;
- <a href="ddl.html">DDL</a>&nbsp;&nbsp;
-</td></tr>
-</table>
-</center>
-<hr>
-<!-- #EndLibraryItem --><!-- #BeginLibraryItem "/ed_libs/Footer.lbi" --><address>
-<a href="mailto:hdfhelp@ncsa.uiuc.edu">HDF Help Desk</a>
-<br>
-Describes HDF5 Release 1.4.5, February 2003
-</address><!-- #EndLibraryItem --><!-- Created: Tue Oct 20 12:38:40 EDT 1998 -->
-<!-- hhmts start -->
-Last modified: 2 August 2001
-<!-- hhmts end -->
-
-
-</body>
-</html>