# Welcome to VFD SWMR  

Thank you for volunteering to test VFD SWMR.

SWMR, which stands for Single Writer/Multiple Reader, is a feature
of the HDF5 library that lets a process write data to an HDF5 file
while one or more processes read the file.  Use cases range from
monitoring data collection and/or steering experiments in progress
to financial applications.

The following diagram illustrates the original version of SWMR.

<img src = SWMRdataflow.png width=400 />

The original version of SWMR functions by ordering metadata writes to
the HDF5 file so as to always maintain a consistent view of metadata
in the HDF5 file -- which requires SWMR-specific modifications to
all code that maintains on-disk metadata.

VFD SWMR is designed to be a more maintainable and more modular 
replacement for the existing SWMR feature.  It functions by taking 
regular snapshots of HDF5 file metadata on the writer side, and using 
a specialized virtual file driver (VFD) on the reader side to 
intercept metadata read requests and satisfy them from the 
snapshots where appropriate -- thus assuring that the readers
see a consistent view of HDF5 file metadata.

This design allowed us to implement VFD SWMR with only minor
modifications to the HDF5 library above the metadata cache and page
buffer.  As a result, not only is VFD SWMR more modular and
easier to maintain, it is also almost "full SWMR" -- that is, it
allows use of almost all HDF5 capabilities by VFD SWMR writers,
with results that become visible to the VFD SWMR readers.

In particular, VFD SWMR allows the writer to create and delete 
both groups and datasets, and to create and delete attributes on 
both groups and datasets while operating in VFD SWMR mode -- 
which is not possible using the original SWMR implementation.  

We say that VFD SWMR is almost "full SWMR" because there are a 
few limitations -- most notably:

* The current implementation of variable length data in datasets
  is fundamentally incompatible with VFD SWMR, as it stores variable 
  length data as metadata.  This shouldn't be a major issue, as the 
  current implementation of variable length data has very poor performance, 
  and thus is not suitable for most SWMR applications.  A new
  implementation of variable length data is in the works, and should
  both offer better performance and be compatible with VFD SWMR.
  However, there is no ETA for delivery.  Variable length attributes
  on datasets and groups should work, but are currently untested.

* VFD SWMR is only tested with, and should only be used with 
  the latest HDF5 file format.  Theoretically, there is no functional
  reason why it will not work with earlier versions of the file format.  
  However, it is possible to construct very large pieces of metadata 
  in early versions of the HDF5 file format, which has the potential to 
  cause major performance issues.

Due to its regular snapshots of metadata, VFD SWMR provides guarantees 
on the maximum time from write to visibility to the readers -- with 
the provisos that the underlying file system is fast enough, that 
the writer makes HDF5 library API calls with sufficient regularity, and 
that both reader and writer avoid long running HDF5 API calls.

For further details on VFD SWMR design and implementation, see 
`VFD_SWMR_RFC_200916.pdf` in the doc directory.

# Quick start

Follow these instructions to download, configure, and build the
VFD SWMR project, then install the HDF5 library and
utilities built by the VFD SWMR project.

## Download

Clone the HDF5 repository in a new directory, then switch to the
`feature/vfd_swmr_beta_2` branch as follows:

```
% git clone https://github.com/HDFGroup/hdf5 swmr
% cd swmr
% git checkout feature/vfd_swmr_beta_2
```

## Build

There are no special instructions for building VFD SWMR. Simply follow
the usual build procedure for CMake or the Autotools using the guides
in the `release_docs` directory.

Some notes:

- The VFD SWMR tests can take some time to run.
- The VFD SWMR acceptance tests will typically emit some output about "expected errors" that you can ignore. Real errors are clearly flagged.
- If the tests do not pass on your system, please let the developers know via the email address given at the end of this document.
- VFD SWMR is not compatible with parallel HDF5 because page buffering is disabled in parallel HDF5.

# Sample programs

## Extensible datasets

For an example of a program that uses VFD SWMR to write/read many
extensible datasets, have a look at `test/vfd_swmr_bigset_writer.c`, the
"bigset" test.  We compile two binaries from that source file, one that
operates in write mode, and a second that operates in read mode.

In write mode, "bigset" creates an HDF5 file containing one or more
datasets that are extensible in either one dimension or two.  Then it
runs for several steps, increasing the size of each dataset in each
dimension once every step.  The dimensions, number of datasets, the
step increase in dataset size, and the number of steps are configurable
using command-line options -d, -s, -r and -c, and -n, respectively---use
the -h option to get a usage message.  Each dataset is written with a
predictable pattern.

In read mode, "bigset" reads each dataset from an HDF5 file created
by a "bigset" writer and verifies the patterns.  It takes the same
command-line parameters as the "bigset" writer.  The reader and writer
may run concurrently; the reader "polls" the content until it is just
shy of complete, given the number of steps expected.

To run a bigset test, open a couple of terminal windows, one for the
reader and one for the writer.  Change to the `test` directory under
your build directory, and run the writer in one window:

```
% ./vfd_swmr_bigset_writer -n 50 -d 2
```

and in the other window, run the reader:

```
% ./vfd_swmr_bigset_reader -n 50 -d 2 -W
```

The writer will wait for a signal before it quits.  You can tap CTRL-C to make
it quit.

The reader and writer programs support several command-line options:

```
usage: vfd_swmr_bigset_writer [-F] [-M] [-S] [-V] [-W] [-a steps] [-b] [-c cols]
    [-d dims]
    [-n iterations] [-r rows] [-s datasets]
    [-u milliseconds]

-F:                   fixed maximal dimension for the chunked datasets
-M:                   use virtual datasets and many source files
-S:                   do not use VFD SWMR
-V:                   use virtual datasets and a single source file
-W:                   do not wait for a signal before exiting
-a steps:             `steps` between adding attributes
-b:                   write data in big-endian byte order
-c cols:              `cols` columns per chunk
-d 1|one|2|two|both:  select dataset expansion in one or
                      both dimensions
-n iterations:        how many times to expand each dataset
-r rows:              `rows` rows per chunk
-s datasets:          number of datasets to create
-u ms:                milliseconds interval between updates
                      to vfd_swmr_bigset_writer.h5
```

## The VFD SWMR demos

The VFD SWMR demos are located in the `examples` directory of this source
tree. Instructions for building the example programs are given in the README
file in that directory. These programs are NOT installed via `make install`
and have to be built by hand with h5cc as described in the README.

Two Gaussian programs are built, `wgaussians` and `rgaussians`.  If you start
both from the same directory in different terminals, you should see the
"bouncing 2-D Gaussian distributions" in the `rgaussians` terminal.  This demo
uses curses, so you may need to install the curses developers library to build
(and this is probably not going to be easy to build on Windows).

The creation-deletion (`credel`) demo is also run in two terminals.
The two command lines are given in the README. You need to use the `h5ls`
installed from the VFD SWMR branch, since only that version has the `--poll`
option. Be careful to not use a non-VFD-SWMR system h5ls here.

# Developer tips

## Configuring VFD SWMR

### File-creation properties

To use VFD SWMR, creating your HDF5 file with a paged allocation strategy
is mandatory.  This call enables the paged allocation strategy:

```
ret = H5Pset_file_space_strategy(fcpl, H5F_FSPACE_STRATEGY_PAGE, false, 1);
```

Allocated storage that is smaller than the page size will
not overlap a page boundary, and allocated storage that is one page or
greater in size will start on a page boundary.  VFD SWMR relies on that
allocation strategy.
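
The rule above can be restated as a small predicate. This is only an illustrative sketch, not library code: the function name `obeys_paged_allocation` and its parameters are hypothetical, and it assumes byte offsets measured from the start of the paged region of the file.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true if an allocation of `size` bytes at
 * byte offset `addr` follows the paged-allocation rule described above.
 */
bool
obeys_paged_allocation(uint64_t addr, uint64_t size, uint64_t page_size)
{
    if (page_size == 0)
        return false; /* no valid page size, nothing can be checked */

    if (size == 0)
        return true; /* an empty allocation cannot cross a boundary */

    if (size < page_size) {
        /* Smaller than a page: the first and last byte must fall in
         * the same page.
         */
        return addr / page_size == (addr + size - 1) / page_size;
    }

    /* One page or larger: must start on a page boundary. */
    return addr % page_size == 0;
}
```

With a 4096-byte page, for example, a 96-byte allocation at offset 4000 passes the check, while a 97-byte allocation at the same offset would straddle the boundary at 4096 and fail it.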

### File-access properties

In this section we show how to configure your application to use VFD
SWMR.

1. Create a file access property list using `H5Pcreate(H5P_FILE_ACCESS)`.
2. Set the latest file format using `H5Pset_libver_bounds(fapl, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST)`. 
3. Enable page buffering using `H5Pset_page_buffer_size()`.
4. Set any VFD SWMR configuration properties using `H5Pset_vfd_swmr_config()`. The struct is documented in H5Fpublic.h, with some additional documentation below. (In the near future, this struct will be documented in the library's Doxygen documentation.)

VFD SWMR requires metadata reads and writes to go through the
page buffer.  Note that the default page size is 4096 bytes. Finding good
values for `buf_size` may take some experimentation. We use 4096 (giving a
single-page buffer) for `buf_size` in our test code.

*Note well*: when VFD SWMR is enabled, the meta-/raw-data pages proportion 
set by `H5Pset_page_buffer_size()` does not actually control the
pages reserved for raw data.  *All* pages are dedicated to buffering
metadata.

### `H5F_vfd_swmr_config_t` fields discussion

Example code:

```
    memset(&config, 0, sizeof(config));

    config.version = H5F__CURR_VFD_SWMR_CONFIG_VERSION;
    config.tick_len = 4;
    config.max_lag = 7;
    config.writer = true;
    config.md_pages_reserved = 128;
    strcpy(config.md_file_path, "./my_md_file");

    H5Pset_vfd_swmr_config(fapl, &config);
```

When VFD SWMR is enabled, changes to the HDF5 metadata accumulate in
RAM until a configurable unit of time known as a *tick* has passed.
At the end of each tick, a snapshot of the metadata at the end of
the tick is "published"---that is, made visible to the readers.

The length of a *tick* is configurable in units of 100 milliseconds
using the `tick_len` parameter.  Here, `tick_len` is set to `4` to
select a tick length of 400ms.

A snapshot does not persist forever, but it expires after a number
of ticks, given by the *maximum lag*, has passed.  Here, `max_lag`
is set to `7` to select a maximum lag of 7 ticks.  After a snapshot
has expired, the writer may overwrite it.

When a reader first enters the API, it starts to use, or "selects,"
the metadata in the newest snapshot, and on every subsequent API
entry, if a tick has passed since the last selection, and if new
snapshots are available, then the reader selects the latest.

If a reader spends longer than `max_lag - 1` ticks (2400ms with
the example configuration) inside the HDF5 API, then its snapshot
may expire, resulting in undefined behavior.  When a snapshot
expires while the reader is using it, we say that the writer has
"overrun" the reader.  The writer cannot detect overruns.
Frequently the reader will detect an overrun and force the program
to exit with a diagnostic assertion failure.
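
The timing arithmetic in the preceding paragraphs can be written out as follows. This is a minimal sketch for planning purposes only; the helper names are hypothetical and are not part of the HDF5 API.

```c
#include <stdint.h>

/* `tick_len` is given in units of 100 milliseconds. */
uint32_t
tick_ms(uint32_t tick_len)
{
    return tick_len * 100u;
}

/* A snapshot expires after `max_lag` ticks. */
uint32_t
snapshot_lifetime_ms(uint32_t tick_len, uint32_t max_lag)
{
    return max_lag * tick_ms(tick_len);
}

/* Longest time a reader should spend inside a single HDF5 API call:
 * `max_lag - 1` ticks, as described above.
 */
uint32_t
reader_budget_ms(uint32_t tick_len, uint32_t max_lag)
{
    if (max_lag == 0)
        return 0; /* avoid unsigned wrap-around on max_lag - 1 */

    return (max_lag - 1u) * tick_ms(tick_len);
}
```

For the example configuration (`tick_len = 4`, `max_lag = 7`), these helpers give a 400ms tick, a 2800ms snapshot lifetime, and a 2400ms reader budget.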

The application tells VFD SWMR whether to configure for
reading or writing a file by setting the `writer` parameter to
`true` for writing or `false` for reading.

VFD SWMR snapshots are stored in a "metadata file" that is shared
between writer and readers.  On a POSIX system, the metadata file
may be placed on any *local* filesystem that the reader and writer
share.  The `md_file_path` parameter tells where to put the metadata
file.

The `md_pages_reserved` parameter tells how many pages to reserve
at the beginning of the metadata file for the metadata-file header
and the metadata index.  The header has an entire page to itself.
The remaining `md_pages_reserved - 1` pages are reserved for the
metadata index.  If the index grows larger than its initial
allocation, then it will move to a new location in the metadata file,
and the initial allocation will be reclaimed.  `md_pages_reserved`
must be at least 2.
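
To get a feel for how much space a given `md_pages_reserved` value sets aside for the index, the split described above can be computed directly. This is a sketch under one stated assumption: that pages in the metadata file are the same size as the HDF5 file's pages (4096 bytes by default). The helper name is hypothetical.

```c
#include <stdint.h>

/* Returns the number of bytes initially reserved for the metadata
 * index, or 0 if `md_pages_reserved` is below the minimum of 2.
 * ASSUMPTION: `page_size` is the page size used by the metadata file.
 */
uint64_t
index_bytes_reserved(uint32_t md_pages_reserved, uint64_t page_size)
{
    if (md_pages_reserved < 2)
        return 0; /* invalid: header page plus at least one index page */

    /* The first page holds the header; the rest hold the index. */
    return (uint64_t)(md_pages_reserved - 1u) * page_size;
}
```

Under that assumption, the example value of 128 with 4096-byte pages reserves 127 pages, or 520192 bytes, for the index.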

The `version` parameter tells what version of VFD SWMR configuration
the parameter struct `config` contains.  For now, it should be
initialized to `H5F__CURR_VFD_SWMR_CONFIG_VERSION`.

## Pushing HDF5 raw data to reader visibility

DISCUSS FLUSH DATA END OF TICK HERE

If the flush of raw data at end of tick is selected,
it should not be necessary to call `H5Fflush()`.  In fact, when VFD SWMR is
active, `H5Fflush()` may require up to `max_lag` ticks to complete due to
metadata consistency issues.

A writer can make its last changes to the HDF5 file visible to all
readers immediately using the new call, `H5Fvfd_swmr_end_tick()`.  Note
that this call should be used sparingly, as it terminates the current 
tick early, thus effectively reducing `max_lag`.  Repeated calls in 
quick succession can force a reader to overrun `max_lag`, and 
read stale metadata.

When the flush of raw data at end of tick is disabled,
the `H5Fvfd_swmr_end_tick()` call will make the writer's current view of metadata
visible to the reader -- which may refer to raw data that hasn't been written to
the HDF5 file yet.

## Reading up-to-date content

One expected use case for VFD SWMR involves an experiment in which instruments
continuously generate 2-dimensional data frames.  These data frames are recorded
in datasets in an HDF5 file that has been opened in VFD SWMR writer mode.  In this
use case, the HDF5 file is opened in VFD SWMR reader mode by a second program
that generates a real-time display of the data as it is being collected -- thus
allowing the experimenters to steer the experiment.

THG developed a demonstration program for this class of application,
and we have some advice based on that experience.

The writer typically will increase a dataset's dimensions by a
frame, using `H5Dset_extent()`, before it writes the data of that
frame with `H5Dwrite()`.  It's possible that a snapshot of the HDF5
file will propagate to the reader between the `H5Dset_extent()`
call and the `H5Dwrite()`.  Values `H5Dread()` from the last frame
at that juncture will not reflect the actual experimental data.
Instead, the reader will see arbitrary values or the fill value.
To display those values would be distracting and misleading to
the experimenter. 

On the reader, a strategy for displaying the most current, bona fide application
data is to read the dimensions of the frames dataset, `d`, compute
the number `n` of full frames contained in `d`, and read the
next-to-last frame, which has zero-based index `n - 2`.  THG uses a variant of
this strategy in its `gaussians` demo.
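
The frame-selection arithmetic can be sketched as below. This is an illustration rather than the code used in the `gaussians` demo; the helper name and its parameters are hypothetical. In a real reader, `extent` would come from the dataset's current dimensions (for example via `H5Sget_simple_extent_dims()`), and the returned index would be used to select the frame to read.

```c
#include <stdint.h>

/* Returns the zero-based index of the next-to-last full frame, or -1
 * if fewer than two full frames are available yet.
 *
 * extent:    current size of the dimension along which the dataset grows
 * frame_len: how many units of that dimension one frame occupies
 */
int64_t
safe_frame_index(uint64_t extent, uint64_t frame_len)
{
    uint64_t n;

    if (frame_len == 0)
        return -1; /* invalid frame length */

    n = extent / frame_len; /* number of full frames, `n` in the text */

    if (n < 2)
        return -1; /* no next-to-last frame exists yet */

    return (int64_t)(n - 2);
}
```

Returning -1 while fewer than two frames exist lets the display loop simply skip an update instead of showing the frame that may still be unwritten.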

On the writer, a strategy for protecting against snapshots between
the `H5Dset_extent()` and `H5Dwrite()` calls is to suspend VFD
SWMR's clock across both of the calls.  The
`H5Fvfd_swmr_disable_end_of_tick()` call takes a file identifier
and stops new snapshots from being taken on the given file until
`H5Fvfd_swmr_enable_end_of_tick()` is called on the same file.
Needless to say, end of tick processing should only be disabled
briefly.

# Known issues

## Variable-length data

A VFD SWMR reader cannot reliably read back a variable-length dataset
written by VFD SWMR.  For example, a variable-length string
created and written as follows

```
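    hid_t dset, space, type;
    /* A variable-length string element is a pointer to the characters. */
    const char *data = "content";

    /* Start from the C string type and make its length variable. */
    type = H5Tcopy(H5T_C_S1);

    H5Tset_size(type, H5T_VARIABLE);

    space = H5Screate(H5S_SCALAR);

    dset = H5Dcreate2(..., "string", type, space, H5P_DEFAULT, H5P_DEFAULT,
        H5P_DEFAULT);

    /* Pass the address of the pointer, not of a character array. */
    H5Dwrite(dset, type, space, space, H5P_DEFAULT, &data);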
```

and read back like this,

```
    char *data;
    herr_t ret;

    ret = H5Dread(..., ..., H5S_ALL, H5S_ALL, H5P_DEFAULT, &data);
```

may produce either an error return from `H5Dread` (`ret < 0`) or
a `NULL` pointer (`data == NULL`).

As discussed above, this is caused by a fundamental incompatibility
between VFD SWMR and the current variable length data implementation in HDF5,
which stores variable length data as metadata.  It is possible we may be able
to mitigate the issue, but the most likely solution is the
re-implementation of variable length data that is currently in the planning
stage.  Unfortunately, we have no ETA for this re-implementation.

## Iteration

An application that reads in VFD SWMR mode should take care to avoid
HDF5 iteration APIs, especially when iterating large numbers of objects
or using long-running application callbacks.  While the library is in an
iteration routine, it cannot track changes made by the writer.  If the
library spends more than `max_lag` ticks in the routine, then its view
of the HDF5 file will become stale.  Under those circumstances, HDF5
content could be mis-read, or the library could crash with a diagnostic
assertion.

NOTE: The HDF5 command-line tools (h5dump, etc.) use iteration routines to do
their work, so they should be used carefully with files open for VFD SWMR
writing.

## Object handles

At the present level of development, the writer cannot invalidate
a reader's HDF5 object handles (`hid_t`s).  If a reader holds an
object open---that is, it has a valid handle (`hid_t`) for the
object---while the writer deletes it, then reading content through
the handle may yield corrupted data or the data from some other
object, or the library may crash.

## Supported filesystems

A VFD SWMR writer and readers share a couple of files, the HDF5 (`.h5`)
file and the metadata file -- which is used to communicate snapshots of 
the HDF5 file metadata from the writer to the readers.  VFD SWMR relies 
on writes to the metadata file to take effect in the order described in 
the POSIX documentation for `read(2)` and `write(2)` system calls.  If 
the VFD SWMR readers and the writer run on the same POSIX host, this 
ordering should take effect, regardless of the underlying filesystem. 

If the VFD SWMR reader and the writer run on *different* hosts, then
the write-ordering rules depend on the shared filesystem.  VFD SWMR is
not generally expected to work with NFS at this time.  Parallel file systems
like GPFS and Lustre should order writes according to POSIX convention, so we
expect VFD SWMR to work on those file systems but we have not tested this.

The HDF Group plans to add support for networked file systems like NFS and
Windows SMB to VFD SWMR in the future.

## Microsoft Windows 

VFD SWMR is not officially supported on Microsoft Windows at this time.  The
feature should in theory work on Windows and NTFS; however, it has not been
tested, as the existing VFD SWMR tests rely on shell scripts.  Note that Windows
file shares are not supported as there is no write ordering guarantee (as with
NFS, et al.).

## File-opening order

If the file already exists, you can open the file via the writer and readers
in any order. If the file does not exist, the reader will wait for the file
to be created (until the default timeout expires).

# Reporting bugs

VFD SWMR is still under development, so it is possible that you will encounter 
bugs.  Please report them, along with any performance or design issues you 
encounter.

To contact the VFD SWMR developers, email vfdswmr@hdfgroup.org.