diff options
-rw-r--r-- | doc/parallel-compression.md | 313 |
1 files changed, 313 insertions, 0 deletions
diff --git a/doc/parallel-compression.md b/doc/parallel-compression.md new file mode 100644 index 0000000..e4fa822 --- /dev/null +++ b/doc/parallel-compression.md @@ -0,0 +1,313 @@ +# HDF5 Parallel Compression + +## Introduction + +When an HDF5 dataset is created, the application can specify +optional data filters to be applied to the dataset (as long as +the dataset uses a chunked data layout). These filters may +perform compression, shuffling, checksumming/error detection +and more on the dataset data. The filters are added to a filter +pipeline for the dataset and are automatically applied to the +data during dataset writes and reads. + +Prior to the HDF5 1.10.2 release, a parallel HDF5 application +could read datasets with filters applied to them, but could +not write to those datasets in parallel. The datasets would +have to first be written in a serial HDF5 application or from +a single MPI rank in a parallel HDF5 application. This +restriction was in place because: + + - Updating the data in filtered datasets requires management + of file metadata, such as the dataset's chunk index and file + space for data chunks, which must be done collectively in + order for MPI ranks to have a consistent view of the file. + At the time, HDF5 lacked the collective coordination of + this metadata management. + + - When multiple MPI ranks are writing independently to the + same chunk in a dataset (even if their selected portions of + the chunk don't overlap), the whole chunk has to be read, + unfiltered, modified, re-filtered and then written back to + disk. This read-modify-write style of operation would cause + conflicts among the MPI ranks and lead to an inconsistent + view of the file. + +Introduced in the HDF5 1.10.2 release, the parallel compression +feature allows an HDF5 application to write in parallel to +datasets with filters applied to them, as long as collective +I/O is used. The feature introduces new internal infrastructure +that coordinates the collective management of the file metadata +between MPI ranks during dataset writes. It also accounts for +multiple MPI ranks writing to a chunk by assigning ownership to +one of the MPI ranks, at which point the other MPI ranks send +their modifications to the owning MPI rank. + +The parallel compression feature is always enabled when HDF5 +is built with parallel enabled, but the feature may be disabled +if the necessary MPI-3 routines are not available. Therefore, +HDF5 conditionally defines the macro `H5_HAVE_PARALLEL_FILTERED_WRITES` +which an application can check for to see if the feature is +available. + +## Examples + +Using the parallel compression feature is very similar to using +compression in serial HDF5, except that dataset writes **must** +be collective: + +``` +hid_t dxpl_id = H5Pcreate(H5P_DATASET_XFER); +H5Pset_dxpl_mpio(dxpl_id, H5FD_MPIO_COLLECTIVE); +H5Dwrite(..., dxpl_id, ...); +``` + +The following are two simple examples of using the parallel compression +feature: + +[ph5_filtered_writes.c](https://github.com/HDFGroup/hdf5/blob/develop/examples/ph5_filtered_writes.c) + +[ph5_filtered_writes_no_sel.c](https://github.com/HDFGroup/hdf5/blob/develop/examples/ph5_filtered_writes_no_sel.c) + +The former contains simple examples of using the parallel +compression feature to write to compressed datasets, while the +latter contains an example of how to write to compressed datasets +when one or MPI ranks don't have any data to write to a dataset. +Remember that the feature requires these writes to use collective +I/O, so the MPI ranks which have nothing to contribute must still +participate in the collective write call. + +## Incremental file space allocation support + +HDF5's [file space allocation time](https://portal.hdfgroup.org/display/HDF5/H5P_SET_ALLOC_TIME) +is a dataset creation property that can have significant effects +on application performance, especially if the application uses +parallel HDF5. In a serial HDF5 application, the default file space +allocation time for chunked datasets is "incremental". This means +that allocation of space in the HDF5 file for data chunks is +deferred until data is first written to those chunks. In parallel +HDF5, the file space allocation time was previously always forced +to "early", which allocates space in the file for all of a dataset's +data chunks at creation time (or during the first open of a dataset +if it was created serially). This would ensure that all the necessary +file space was allocated so MPI ranks could perform independent I/O +operations on a dataset without needing further coordination of file +metadata as described previously. + +While this strategy has worked in the past, it has some noticeable +drawbacks. For one, the larger the chunked dataset being created, +the more noticeable overhead there will be during dataset creation +as all of the data chunks are being allocated in the HDF5 file. +Further, these data chunks will, by default, be [filled](https://portal.hdfgroup.org/display/HDF5/H5P_SET_FILL_VALUE) +with HDF5's default fill data value, leading to extraordinary +dataset creation overhead and resulting in pre-filling large +portions of a dataset that the application might have been planning +to overwrite anyway. Even worse, there will be more initial overhead +from compressing that fill data before writing it out, only to have +it read back in, unfiltered and modified the first time a chunk is +written to. In the past, it was typically suggested that parallel +HDF5 applications should use [H5Pset_fill_time](https://portal.hdfgroup.org/display/HDF5/H5P_SET_FILL_TIME) +with a value of `H5D_FILL_TIME_NEVER` in order to disable writing of +the fill value to dataset chunks, but this isn't ideal if the +application actually wishes to make use of fill values. + +With [improvements made](https://www.hdfgroup.org/2022/03/parallel-compression-improvements-in-hdf5-1-13-1/) +to the parallel compression feature for the HDF5 1.13.1 release, +"incremental" file space allocation is now the default for datasets +created in parallel *only if they have filters applied to them*. +"Early" file space allocation is still supported for these datasets +if desired and is still forced for datasets created in parallel that +do *not* have filters applied to them. This change should significantly +reduce the overhead of creating filtered datasets in parallel HDF5 +applications and should be helpful to applications that wish to +use a fill value for these datasets. It should also help significantly +reduce the size of the HDF5 file, as file space for the data chunks +is allocated as needed rather than all at once. + +## Performance Considerations + +Since getting good performance out of HDF5's parallel compression +feature involves several factors, the following is a list of +performance considerations (generally from most to least important) +and best practices to take into account when trying to get the +optimal performance out of the parallel compression feature. + +### Begin with a good chunking strategy + +[Starting with a good chunking strategy](https://portal.hdfgroup.org/display/HDF5/Chunking+in+HDF5) +will generally have the largest impact on overall application +performance. The different chunking parameters can be difficult +to fine-tune, but it is essential to start with a well-performing +chunking layout before adding compression and parallel I/O into +the mix. Compression itself adds overhead and may have side +effects that necessitate further adjustment of the chunking +parameters and HDF5 application settings. Consider that the +chosen chunk size becomes a very important factor when compression +is involved, as data chunks have to be completely read and +re-written to perform partial writes to the chunk. + +[Improving I/O performance with HDF5 compressed datasets](http://portal.hdfgroup.org/display/HDF5/Improving+IO+Performance+When+Working+with+HDF5+Compressed+Datasets) +is a useful reference for more information on getting good +performance when using a chunked dataset layout. + +### Avoid chunk sharing + +Since the parallel compression feature has to assign ownership +of data chunks to a single MPI rank in order to avoid the +previously described read-modify-write issue, an HDF5 application +may need to take care when determining how a dataset will be +divided up among the MPI ranks writing to it. Each dataset data +chunk that is written to by more than 1 MPI rank will incur extra +MPI overhead as one of the ranks takes ownership and the other +ranks send it their data and information about where in the chunk +that data belongs. While not always possible to do, an HDF5 +application will get the best performance out of parallel compression +if it can avoid writing in a way that causes more than 1 MPI rank +to write to any given data chunk in a dataset. + +### Collective metadata operations + +The parallel compression feature typically works with a significant +amount of metadata related to the management of the data chunks +in datasets. In initial performance results gathered from various +HPC machines, it was found that the parallel compression feature +did not scale well at around 8k MPI ranks and beyond. On further +investigation, it became obvious that the bottleneck was due to +heavy filesystem pressure from the metadata management for dataset +data chunks as they changed size (as a result of data compression) +and moved around in the HDF5 file. + +Enabling collective metadata operations in the HDF5 application +(as in the below snippet) showed significant improvement in +performance and scalability and is generally always recommended +unless application performance shows negative benefits by doing +so. + +``` +... +hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS); +H5Pset_fapl_mpio(fapl_id, MPI_COMM_WORLD, MPI_INFO_NULL); +H5Pset_all_coll_metadata_ops(fapl_id, 1); +H5Pset_coll_metadata_write(fapl_id, 1); +hid_t file_id = H5Fcreate("file.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl_id); +... +``` + +### Align chunks in the file + +The natural layout of an HDF5 file may cause dataset data +chunks to end up at addresses in the file that do not align +well with the underlying file system, possibly leading to +poor performance. As an example, Lustre performance is generally +good when writes are aligned with the chosen stripe size. +The HDF5 application can use [H5Pset_alignment](https://portal.hdfgroup.org/display/HDF5/H5P_SET_ALIGNMENT) +to have a bit more control over where objects in the HDF5 +file end up. However, do note that setting the alignment +of objects generally wastes space in the file and has the +potential to dramatically increase its resulting size, so +caution should be used when choosing the alignment parameters. + +[H5Pset_alignment](https://portal.hdfgroup.org/display/HDF5/H5P_SET_ALIGNMENT) +has two parameters that control the alignment of objects in +the HDF5 file, the "threshold" value and the alignment +value. The threshold value specifies that any object greater +than or equal in size to that value will be aligned in the +file at addresses which are multiples of the chosen alignment +value. While the value 0 can be specified for the threshold +to make every object in the file be aligned according to +the alignment value, this isn't generally recommended, as it +will likely waste an excessive amount of space in the file. + +In the example below, the chosen dataset chunk size is +provided for the threshold value and 1MiB is specified for +the alignment value. Assuming that 1MiB is an optimal +alignment value (e.g., assuming that it matches well with +the Lustre stripe size), this should cause dataset data +chunks to be well-aligned and generally give good write +performance. + +``` +hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS); +H5Pset_fapl_mpio(fapl_id, MPI_COMM_WORLD, MPI_INFO_NULL); +/* Assuming Lustre stripe size is 1MiB, align data chunks + in the file to address multiples of 1MiB. */ +H5Pset_alignment(fapl_id, dataset_chunk_size, 1048576); +hid_t file_id = H5Fcreate("file.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl_id); +``` + +### File free space managers + +As data chunks in a dataset get written to and compressed, +they can change in size and be relocated in the HDF5 file. +Since parallel compression usually involves many data chunks +in a file, this can create significant amounts of free space +in the file over its lifetime and eventually cause performance +issues. + +An HDF5 application can use [H5Pset_file_space_strategy](http://portal.hdfgroup.org/display/HDF5/H5P_SET_FILE_SPACE_STRATEGY) +with a value of `H5F_FSPACE_STRATEGY_PAGE` to enable the paged +aggregation feature, which can accumulate metadata and raw +data for dataset data chunks into well-aligned, configurably +sized "pages" for better performance. However, note that using +the paged aggregation feature will cause any setting from +[H5Pset_alignment](https://portal.hdfgroup.org/display/HDF5/H5P_SET_ALIGNMENT) +to be ignored. While an application should be able to get +comparable performance effects by [setting the size of these pages](http://portal.hdfgroup.org/display/HDF5/H5P_SET_FILE_SPACE_PAGE_SIZE) to be equal to the value that +would have been set for [H5Pset_alignment](https://portal.hdfgroup.org/display/HDF5/H5P_SET_ALIGNMENT), +this may not necessarily be the case and should be studied. + +Note that [H5Pset_file_space_strategy](http://portal.hdfgroup.org/display/HDF5/H5P_SET_FILE_SPACE_STRATEGY) +has a `persist` parameter. This determines whether or not the +file free space manager should include extra metadata in the +HDF5 file about free space sections in the file. If this +parameter is `false`, any free space in the HDF5 file will +become unusable once the HDF5 file is closed. For parallel +compression, it's generally recommended that `persist` be set +to `true`, as this will keep better track of file free space +for data chunks between accesses to the HDF5 file. + +``` +hid_t fcpl_id = H5Pcreate(H5P_FILE_CREATE); +/* Use persistent free space manager with paged aggregation */ +H5Pset_file_space_strategy(fcpl_id, H5F_FSPACE_STRATEGY_PAGE, 1, 1); +/* Assuming Lustre stripe size is 1MiB, set page size to that */ +H5Pset_file_space_page_size(fcpl_id, 1048576); +... +hid_t file_id = H5Fcreate("file.h5", H5F_ACC_TRUNC, fcpl_id, fapl_id); +``` + +### Low-level collective vs. independent I/O + +While the parallel compression feature requires that the HDF5 +application set and maintain collective I/O at the application +interface level (via [H5Pset_dxpl_mpio](https://portal.hdfgroup.org/display/HDF5/H5P_SET_DXPL_MPIO)), +it does not require that the actual MPI I/O that occurs at +the lowest layers of HDF5 be collective; independent I/O may +perform better depending on the application I/O patterns and +parallel file system performance, among other factors. The +application may use [H5Pset_dxpl_mpio_collective_opt](https://portal.hdfgroup.org/display/HDF5/H5P_SET_DXPL_MPIO_COLLECTIVE_OPT) +to control this setting and see which I/O method provides the +best performance. + +``` +hid_t dxpl_id = H5Pcreate(H5P_DATASET_XFER); +H5Pset_dxpl_mpio(dxpl_id, H5FD_MPIO_COLLECTIVE); +H5Pset_dxpl_mpio_collective_opt(dxpl_id, H5FD_MPIO_INDIVIDUAL_IO); /* Try independent I/O */ +H5Dwrite(..., dxpl_id, ...); +``` + +### Runtime HDF5 Library version + +An HDF5 application can use the [H5Pset_libver_bounds](http://portal.hdfgroup.org/display/HDF5/H5P_SET_LIBVER_BOUNDS) +routine to set the upper and lower bounds on library versions +to use when creating HDF5 objects. For parallel compression +specifically, setting the library version to the latest available +version can allow access to better/more efficient chunk indexing +types and data encoding methods. For example: + +``` +... +hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS); +H5Pset_libver_bounds(fapl_id, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST); +hid_t file_id = H5Fcreate("file.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl_id); +... +``` |