diff options
Diffstat (limited to 'doc/html/TechNotes/ChunkingStudy.html')
-rw-r--r-- | doc/html/TechNotes/ChunkingStudy.html | 190 |
1 files changed, 0 insertions, 190 deletions
diff --git a/doc/html/TechNotes/ChunkingStudy.html b/doc/html/TechNotes/ChunkingStudy.html deleted file mode 100644 index 776b8fe..0000000 --- a/doc/html/TechNotes/ChunkingStudy.html +++ /dev/null @@ -1,190 +0,0 @@ -<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN"> -<html> - <head> - <title>Testing the chunked layout of HDF5</title> - </head> - - <body> - -<i> -This document is of interest primarily for its discussion of the -HDF team's motivation for implementing raw data caching. -At a more abstract level, the discussion of the principles of -data chunking is also of interest, but a more recent discussion -of that topic can be found in -<a href="../Chunking.html">Dataset Chunking Issues</a> in the -<cite><a href="../H5.user.html">HDF5 User's Guide</a></cite>. - -The performance study described here predates the current chunking -implementation in the HDF5 library, so the particular performance data -is no longer apropos. - -- the Editor -</i> - - <h1>Testing the chunked layout of HDF5</h1> - - <p>This is the results of studying the chunked layout policy in - HDF5. A 1000 by 1000 array of integers was written to a file - dataset extending the dataset with each write to create, in the - end, a 5000 by 5000 array of 4-byte integers for a total data - storage size of 100 million bytes. - - <p> - <center> - <img alt="Order that data was written" src="ChStudy_p1.gif"> - <br><b>Fig 1: Write-order of Output Blocks</b> - </center> - - <p>After the array was written, it was read back in blocks that - were 500 by 500 bytes in row major order (that is, the top-left - quadrant of output block one, then the top-right quadrant of - output block one, then the top-left quadrant of output block 2, - etc.). - - <p>I tried to answer two questions: - <ul> - <li>How does the storage overhead change as the chunk size - changes? - <li>What does the disk seek pattern look like as the chunk size - changes? - </ul> - - <p>I started with chunk sizes that were multiples of the read - block size or k*(500, 500). - - <p> - <center> - <table border> - <caption align=bottom> - <b>Table 1: Total File Overhead</b> - </caption> - <tr> - <th>Chunk Size (elements)</th> - <th>Meta Data Overhead (ppm)</th> - <th>Raw Data Overhead (ppm)</th> - </tr> - - <tr align=center> - <td>500 by 500</td> - <td>85.84</td> - <td>0.00</td> - </tr> - <tr align=center> - <td>1000 by 1000</td> - <td>23.08</td> - <td>0.00</td> - </tr> - <tr align=center> - <td>5000 by 1000</td> - <td>23.08</td> - <td>0.00</td> - </tr> - <tr align=center> - <td>250 by 250</td> - <td>253.30</td> - <td>0.00</td> - </tr> - <tr align=center> - <td>499 by 499</td> - <td>85.84</td> - <td>205164.84</td> - </tr> - </table> - </center> - - <hr> - <p> - <center> - <img alt="500x500" src="ChStudy_500x500.gif"> - <br><b>Fig 2: Chunk size is 500x500</b> - </center> - - <p>The first half of Figure 2 shows output to the file while the - second half shows input. Each dot represents a file-level I/O - request and the lines that connect the dots are for visual - clarity. The size of the request is not indicated in the - graph. The output block size is four times the chunk size which - results in four file-level write requests per block for a total - of 100 requests. Since file space for the chunks was allocated - in output order, and the input block size is 1/4 the output - block size, the input shows a staircase effect. Each input - request results in one file-level read request. The downward - spike at about the 60-millionth byte is probably the result of a - cache miss for the B-tree and the downward spike at the end is - probably a cache flush or file boot block update. - - <hr> - <p> - <center> - <img alt="1000x1000" src="ChStudy_1000x1000.gif"> - <br><b>Fig 2: Chunk size is 1000x1000</b> - </center> - - <p>In this test I increased the chunk size to match the output - chunk size and one can see from the first half of the graph that - 25 file-level write requests were issued, one for each output - block. The read half of the test shows that four times the - amount of data was read as written. This results from the fact - that HDF5 must read the entire chunk for any request that falls - within that chunk, which is done because (1) if the data is - compressed the entire chunk must be decompressed, and (2) the - library assumes that a chunk size was chosen to optimize disk - performance. - - <hr> - <p> - <center> - <img alt="5000x1000" src="ChStudy_5000x1000.gif"> - <br><b>Fig 3: Chunk size is 5000x1000</b> - </center> - - <p>Increasing the chunk size further results in even worse - performance since both the read and write halves of the test are - re-reading and re-writing vast amounts of data. This proves - that one should be careful that chunk sizes are not much larger - than the typical partial I/O request. - - <hr> - <p> - <center> - <img alt="250x250" src="ChStudy_250x250.gif"> - <br><b>Fig 4: Chunk size is 250x250</b> - </center> - - <p>If the chunk size is decreased then the amount of data - transfered between the disk and library is optimal for no - caching, but the amount of meta data required to describe the - chunk locations increases to 250 parts per million. One can - also see that the final downward spike contains more file-level - write requests as the meta data is flushed to disk just before - the file is closed. - - <hr> - <p> - <center> - <img alt="499x499" src="ChStudy_499x499.gif"> - <br><b>Fig 4: Chunk size is 499x499</b> - </center> - - <p>This test shows the result of choosing a chunk size which is - close to the I/O block size. Because the total size of the - array isn't a multiple of the chunk size, the library allocates - an extra zone of chunks around the top and right edges of the - array which are only partially filled. This results in - 20,516,484 extra bytes of storage, a 20% increase in the total - raw data storage size. But the amount of meta data overhead is - the same as for the 500 by 500 test. In addition, the mismatch - causes entire chunks to be read in order to update a few - elements along the edge or the chunk which results in a 3.6-fold - increase in the amount of data transfered. - - <hr> - <address><a href="mailto:hdfhelp@ncsa.uiuc.edu">HDF Help Desk</a></address> -<!-- Created: Fri Jan 30 21:04:49 EST 1998 --> -<!-- hhmts start --> -Last modified: 30 Jan 1998 (technical content) -<br> -Last modified: 9 May 2000 (editor's note) -<!-- hhmts end --> - </body> -</html> |