# Welcome to VFD SWMR

Thank you for volunteering to test VFD SWMR.

SWMR, which stands for Single Writer/Multiple Reader, is a feature of the HDF5 library that lets a process write data to an HDF5 file while one or more processes read the file. Use cases range from monitoring data collection and/or steering experiments in progress to financial applications.

The following diagram illustrates the original version of SWMR.

The original version of SWMR functions by ordering metadata writes to the HDF5 file so as to always maintain a consistent view of metadata in the HDF5 file -- which requires SWMR-specific modifications to all code that maintains on-disk metadata.

VFD SWMR is designed to be a more maintainable and more modular replacement for the existing SWMR feature. It functions by taking regular snapshots of HDF5 file metadata on the writer side, and using a specialized virtual file driver (VFD) on the reader side to intercept metadata read requests and satisfy them from the snapshots where appropriate -- thus assuring that the readers see a consistent view of HDF5 file metadata.

This design allowed us to implement VFD SWMR with only minor modifications to the HDF5 library above the metadata cache and page buffer. As a result, not only is VFD SWMR more modular and easier to maintain, it is also almost "full SWMR" -- that is, it allows use of almost all HDF5 capabilities by VFD SWMR writers, with results that become visible to the VFD SWMR readers. In particular, VFD SWMR allows the writer to create and delete both groups and datasets, and to create and delete attributes on both groups and datasets while operating in VFD SWMR mode -- which is not possible using the original SWMR implementation.

We say that VFD SWMR is almost "full SWMR" because there are a few limitations -- most notably:

* The current implementation of variable length data in datasets is fundamentally incompatible with VFD SWMR, as it stores variable length data as metadata.
  This shouldn't be a major issue, as the current implementation of variable length data has very poor performance, and thus is not suitable for most SWMR applications. A new implementation of variable length data is in the works, and should both offer better performance and be compatible with VFD SWMR. However, there is no ETA for delivery. Variable length attributes on datasets and groups should work, but are currently untested.
* VFD SWMR is only tested with, and should only be used with, the latest HDF5 file format. Theoretically, there is no functional reason why it will not work with earlier versions of the file format. However, it is possible to construct very large pieces of metadata in early versions of the HDF5 file format, which has the potential to cause major performance issues.

Due to its regular snapshots of metadata, VFD SWMR provides guarantees on the maximum time from write to visibility to the readers -- with the provisos that the underlying file system is fast enough, that the writer makes HDF5 library API calls with sufficient regularity, and that both reader and writer avoid long running HDF5 API calls.

For further details on VFD SWMR design and implementation, see `VFD_SWMR_RFC_200916.pdf` in the doc directory.

# Quick start

Follow these instructions to download, configure, and build the VFD SWMR project, then install the HDF5 library and utilities built by the VFD SWMR project.

## Download

Clone the HDF5 repository in a new directory, then switch to the `feature/vfd_swmr_beta_2` branch as follows:

```
% git clone https://github.com/HDFGroup/hdf5 swmr
% cd swmr
% git checkout feature/vfd_swmr_beta_2
```

## Build

There are no special instructions for building VFD SWMR. Simply follow the usual build procedure for CMake or the Autotools using the guides in the `release_docs` directory.

Some notes:

- The VFD SWMR tests can take some time to run.
- The VFD SWMR acceptance tests will typically emit some output about "expected errors" that you can ignore. Real errors are clearly flagged.
- If the tests do not pass on your system, please let the developers know via the email address given at the end of this document.
- VFD SWMR is not compatible with parallel HDF5 because page buffering is disabled in parallel HDF5.

# Sample programs

## Extensible datasets

For an example of a program that uses VFD SWMR to write/read many extensible datasets, have a look at `test/vfd_swmr_bigset_writer.c`, the "bigset" test. We compile two binaries from that source file, one that operates in write mode, and a second that operates in read mode.

In write mode, "bigset" creates an HDF5 file containing one or more datasets that are extensible in either one dimension or two. Then it runs for several steps, increasing the size of each dataset in each dimension once every step. The dimensions, number of datasets, the step increase in dataset size, and the number of steps are configurable using command-line options `-d`, `-s`, `-r` and `-c`, and `-n`, respectively. Use the `-h` option to get a usage message. Each dataset is written with a predictable pattern.

In read mode, "bigset" reads each dataset from an HDF5 file created by a "bigset" writer and verifies the patterns. It takes the same command-line parameters as the "bigset" writer. The reader and writer may run concurrently; the reader "polls" the content until it is just shy of complete, given the number of steps expected.

To run a bigset test, open a couple of terminal windows, one for the reader and one for the writer. `cd` to the `test` directory under your build directory, and run the writer in one window:

```
% ./vfd_swmr_bigset_writer -n 50 -d 2
```

and in the other window, run the reader:

```
% ./vfd_swmr_bigset_reader -n 50 -d 2 -W
```

The writer will wait for a signal before it quits. You can tap CTRL-C to make it quit.
The reader and writer programs support several command-line options:

```
usage: vfd_swmr_bigset_writer [-F] [-M] [-S] [-V] [-W] [-a steps] [-b] [-c cols]
    [-d dims]
    [-n iterations] [-r rows] [-s datasets] [-u milliseconds]

-F:                   fixed maximal dimension for the chunked datasets
-M:                   use virtual datasets and many source files
-S:                   do not use VFD SWMR
-V:                   use virtual datasets and a single source file
-W:                   do not wait for a signal before exiting
-a steps:             `steps` between adding attributes
-b:                   write data in big-endian byte order
-c cols:              `cols` columns per chunk
-d 1|one|2|two|both:  select dataset expansion in one or both dimensions
-n iterations:        how many times to expand each dataset
-r rows:              `rows` rows per chunk
-s datasets:          number of datasets to create
-u ms:                milliseconds interval between updates
                      to vfd_swmr_bigset_writer.h5
```

## The VFD SWMR demos

The VFD SWMR demos are located in the `examples` directory of this source tree. Instructions for building the example programs are given in the README file in that directory. These programs are NOT installed via `make install` and have to be built by hand with `h5cc` as described in the README.

Two Gaussian programs are built, `wgaussians` and `rgaussians`. If you start both from the same directory in different terminals, you should see the "bouncing 2-D Gaussian distributions" in the `rgaussians` terminal. This demo uses curses, so you may need to install the curses development library to build it (and it is probably not going to be easy to build on Windows).

The creation-deletion (`credel`) demo is also run in two terminals. The two command lines are given in the README. You need to use the `h5ls` installed from the VFD SWMR branch, since only that version has the `--poll` option. Be careful not to use a non-VFD-SWMR system `h5ls` here.

# Developer tips

## Configuring VFD SWMR

### File-creation properties

To use VFD SWMR, you must create your HDF5 file with a paged allocation strategy.
This call enables the paged allocation strategy:

```
ret = H5Pset_file_space_strategy(fcpl, H5F_FSPACE_STRATEGY_PAGE, false, 1);
```

Allocated storage that is smaller than the page size will not overlap a page boundary, and allocated storage that is one page or greater in size will start on a page boundary. VFD SWMR relies on that allocation strategy.

### File-access properties

In this section we show how to configure your application to use VFD SWMR.

1. Create a file access property list using `H5Pcreate(H5P_FILE_ACCESS)`.
2. Set the latest file format using `H5Pset_libver_bounds(fapl, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST)`.
3. Enable page buffering using `H5Pset_page_buffer_size()`.
4. Set any VFD SWMR configuration properties using `H5Pset_vfd_swmr_config()`.

The configuration struct, `H5F_vfd_swmr_config_t`, is documented in `H5Fpublic.h`, with some additional documentation below. (In the near future, this struct will be documented in the library's Doxygen documentation.)

VFD SWMR relies on metadata reads and writes going through the page buffer. Note that the default page size is 4096 bytes. Finding good values for the `buf_size` argument of `H5Pset_page_buffer_size()` may take some experimentation. We use 4096 (giving a page buffer that holds a single page) for `buf_size` in our test code.

*Note well*: when VFD SWMR is enabled, the meta-/raw-data pages proportion set by `H5Pset_page_buffer_size()` does not actually control the pages reserved for raw data. *All* pages are dedicated to buffering metadata.

### `H5F_vfd_swmr_config_t` fields discussion

Example code:

```
H5F_vfd_swmr_config_t config;

memset(&config, 0, sizeof(config));

config.version = H5F__CURR_VFD_SWMR_CONFIG_VERSION;
config.tick_len = 4;              /* tick length, in units of 100 ms */
config.max_lag = 7;               /* snapshot lifetime, in ticks */
config.writer = true;             /* true for the writer, false for a reader */
config.md_pages_reserved = 128;   /* pages for the header and index */
strcpy(config.md_file_path, "./my_md_file");

H5Pset_vfd_swmr_config(fapl, &config);
```

When VFD SWMR is enabled, changes to the HDF5 metadata accumulate in RAM until a configurable unit of time known as a *tick* has passed.
At the end of each tick, a snapshot of the metadata at the end of the tick is "published", that is, made visible to the readers.

The length of a *tick* is configurable in units of 100 milliseconds using the `tick_len` parameter. Here, `tick_len` is set to `4` to select a tick length of 400ms.

A snapshot does not persist forever: it expires after a number of ticks, given by the *maximum lag*, have passed. Here, `max_lag` is set to `7` to select a maximum lag of 7 ticks. After a snapshot has expired, the writer may overwrite it.

When a reader first enters the API, it starts to use, or "selects," the metadata in the newest snapshot. On every subsequent API entry, if a tick has passed since the last selection and new snapshots are available, the reader selects the latest.

If a reader spends longer than `max_lag - 1` ticks (2400ms with the example configuration) inside the HDF5 API, then its snapshot may expire, resulting in undefined behavior. When a snapshot expires while the reader is using it, we say that the writer has "overrun" the reader. The writer cannot detect overruns. Frequently, the reader will detect an overrun and force the program to exit with a diagnostic assertion failure.

The application tells VFD SWMR whether to configure for reading or writing a file by setting the `writer` parameter to `true` for writing or `false` for reading.

VFD SWMR snapshots are stored in a "metadata file" that is shared between the writer and the readers. On a POSIX system, the metadata file may be placed on any *local* filesystem that the reader and writer share. The `md_file_path` parameter tells where to put the metadata file.

The `md_pages_reserved` parameter tells how many pages to reserve at the beginning of the metadata file for the metadata-file header and the metadata index. The header has an entire page to itself. The remaining `md_pages_reserved - 1` pages are reserved for the metadata index.
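
As a rough illustration of the arithmetic described above, the following C sketch derives the tick length, the reader's time budget, and the reserved space from the configuration values. The helper names are hypothetical and are not part of the HDF5 API. The byte count additionally assumes that the metadata file uses the same page size as the HDF5 file (4096 bytes by default, as noted earlier), which is an assumption rather than something stated in this document.

```c
/* Sketch only: these helper names are hypothetical, not HDF5 API calls. */

/* tick_len is given in units of 100 ms, so convert it to milliseconds. */
static unsigned tick_length_ms(unsigned tick_len)
{
    return tick_len * 100u;
}

/* Longest time a reader can spend inside the HDF5 API before its snapshot
 * may expire: (max_lag - 1) ticks. Assumes max_lag >= 1. */
static unsigned reader_budget_ms(unsigned tick_len, unsigned max_lag)
{
    return (max_lag - 1u) * tick_length_ms(tick_len);
}

/* Pages left for the metadata index once the header has taken one page.
 * Assumes md_pages_reserved >= 1. */
static unsigned index_pages(unsigned md_pages_reserved)
{
    return md_pages_reserved - 1u;
}

/* Bytes reserved at the beginning of the metadata file.
 * ASSUMPTION: page_size is the page size used by the metadata file. */
static unsigned long reserved_bytes(unsigned md_pages_reserved,
                                    unsigned long page_size)
{
    return (unsigned long)md_pages_reserved * page_size;
}
```

With the example configuration (`tick_len = 4`, `max_lag = 7`, `md_pages_reserved = 128`) and a 4096-byte page, these helpers give a 400ms tick, a 2400ms reader budget, 127 index pages, and 524288 reserved bytes.
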
If the index grows larger than its initial allocation, then it will move to a new location in the metadata file, and the initial allocation will be reclaimed. `md_pages_reserved` must be at least 2.

The `version` parameter tells what version of VFD SWMR configuration the parameter struct `config` contains. For now, it should be initialized to `H5F__CURR_VFD_SWMR_CONFIG_VERSION`.

## Pushing HDF5 raw data to reader visibility

If the `flush_raw_data` field of the `H5F_vfd_swmr_config_t` struct is set to `true`, raw dataset data will be flushed as a part of end-of-tick processing and it should not be necessary to call `H5Fflush()`. In fact, when VFD SWMR is active, `H5Fflush()` may require up to `max_lag` ticks to complete due to metadata consistency issues.

A writer can make its last changes to the HDF5 file visible to all readers immediately using the new call, `H5Fvfd_swmr_end_tick()`. Note that this call should be used sparingly, as it terminates the current tick early, thus effectively reducing `max_lag`. Repeated calls in quick succession can force a reader to overrun `max_lag`, and read stale metadata.

When the flush of raw data at end of tick is disabled, the `H5Fvfd_swmr_end_tick()` call will make the writer's current view of metadata visible to the reader -- which may refer to raw data that hasn't been written to the HDF5 file yet.

## Reading up-to-date content

One expected use case for VFD SWMR involves an experiment in which instruments continuously generate 2-dimensional data frames. These data frames are recorded in datasets in an HDF5 file that has been opened in VFD SWMR writer mode. In this use case, the HDF5 file is opened in VFD SWMR reader mode by a second program that generates a real-time display of the data as it is being collected -- thus allowing the experimenters to steer the experiment.

THG developed a demonstration program for this class of application, and we have some advice based on that experience.
The writer typically will increase a dataset's dimensions by a frame, using `H5Dset_extent()`, before it writes the data of that frame with `H5Dwrite()`. It's possible that a snapshot of the HDF5 file will propagate to the reader between the `H5Dset_extent()` call and the `H5Dwrite()`. Values read with `H5Dread()` from the last frame at that juncture will not reflect the actual experimental data. Instead, the reader will see arbitrary values or the fill value. To display those values would be distracting and misleading to the experimenter.

On the reader, a strategy for displaying the most current, bona fide application data is to read the dimensions of the frames dataset, `d`, compute the number `n` of full frames contained in `d`, and read the next-to-last frame, frame `n - 2` (counting from zero). THG uses a variant of this strategy in its `gaussians` demo.

On the writer, a strategy for protecting against snapshots between the `H5Dset_extent()` and `H5Dwrite()` calls is to suspend VFD SWMR's clock across both of the calls. The `H5Fvfd_swmr_disable_end_of_tick()` call takes a file identifier and stops new snapshots from being taken on the given file until `H5Fvfd_swmr_enable_end_of_tick()` is called on the same file. Needless to say, end-of-tick processing should only be disabled briefly.

# Known issues

## Variable-length data

A VFD SWMR reader cannot reliably read back a variable-length dataset written by VFD SWMR. For example, a variable-length string created and written as follows

```
hid_t dset, space, type;
/* A variable-length string buffer holds one `char *` per element. */
const char *data = "content";

type = H5Tcopy(H5T_C_S1);
H5Tset_size(type, H5T_VARIABLE);
space = H5Screate(H5S_SCALAR);
dset = H5Dcreate2(..., "string", type, space, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
H5Dwrite(dset, type, space, space, H5P_DEFAULT, &data);
```

and read back like this,

```
char *data;
herr_t ret;

ret = H5Dread(..., ..., H5S_ALL, H5S_ALL, H5P_DEFAULT, &data);
```

may produce either an error return from `H5Dread` (`ret < 0`) or a `NULL` pointer (`data == NULL`).
As discussed above, this is caused by a fundamental incompatibility between VFD SWMR and the current variable length data implementation in HDF5, which stores variable length data as metadata. We may be able to mitigate the issue, but the most likely solution is the re-implementation of variable length data, which is currently in the planning stage. Unfortunately, we have no ETA for this re-implementation.

## Iteration

An application that reads in VFD SWMR mode should take care to avoid HDF5 iteration APIs, especially when iterating over large numbers of objects or using long-running application callbacks. While the library is in an iteration routine, it cannot track changes made by the writer. If the library spends more than `max_lag` ticks in the routine, then its view of the HDF5 file will become stale. Under those circumstances, HDF5 content could be misread, or the library could crash with a diagnostic assertion.

NOTE: The HDF5 command-line tools (h5dump, etc.) use iteration routines to do their work, so they should be used carefully with files open for VFD SWMR writing.

## Object handles

At the present level of development, the writer cannot invalidate a reader's HDF5 object handles (`hid_t`s). If a reader holds an object open (that is, it has a valid handle, `hid_t`, for the object) while the writer deletes it, then reading content through the handle may yield corrupted data or the data from some other object, or the library may crash.

## Supported filesystems

A VFD SWMR writer and readers share a couple of files, the HDF5 (`.h5`) file and the metadata file -- which is used to communicate snapshots of the HDF5 file metadata from the writer to the readers.

VFD SWMR relies on writes to the metadata file to take effect in the order described in the POSIX documentation for the `read(2)` and `write(2)` system calls. If the VFD SWMR readers and the writer run on the same POSIX host, this ordering should take effect, regardless of the underlying filesystem.
If the VFD SWMR reader and the writer run on *different* hosts, then the write-ordering rules depend on the shared filesystem. VFD SWMR is not generally expected to work with NFS at this time. Parallel file systems like GPFS and Lustre should order writes according to POSIX convention, so we expect VFD SWMR to work on those file systems, but we have not tested this.

The HDF Group plans to add support for networked file systems like NFS and Windows SMB to VFD SWMR in the future.

## Microsoft Windows

VFD SWMR is not officially supported on Microsoft Windows at this time. The feature should in theory work on Windows and NTFS; however, it has not been tested, as the existing VFD SWMR tests rely on shell scripts. Note that Windows file shares are not supported, as there is no write ordering guarantee (as with NFS, et al.).

## File-opening order

If the file already exists, you can open the file via the writer and readers in any order. If the file does not exist, the reader will wait for the file to be created (until the default timeout expires).

# Reporting bugs

VFD SWMR is still under development, so it is possible that you will encounter bugs. Please report them, along with any performance or design issues you encounter.

To contact the VFD SWMR developers, email vfdswmr@hdfgroup.org.