# Welcome to VFD SWMR  

Thank you for volunteering to test VFD SWMR.

SWMR, which stands for Single Writer/Multiple Reader, is a feature
of the HDF5 library that lets a process write data to an HDF5 file
while one or more processes read the file.  Use cases range from
monitoring data collection and/or steering experiments in progress
to financial applications.

The following diagram illustrates how SWMR works.

<img src = SWMRdataflow.png width=400 />


VFD SWMR is designed to be a more flexible, more modular,
better-performing replacement for the existing SWMR feature.

* VFD SWMR allows HDF5 objects (groups, datasets, attributes) to be
  created and destroyed in the course of a reader-writer session.
  Creating objects is not possible using the existing SWMR feature.
* It compartmentalizes much of the SWMR functionality in a virtual-file
  driver (VFD), thus easing The HDF Group's software-maintenance burden.
* And it makes guarantees for the maximum time from write to availability
  of data for read, provided that the reading and writing systems and
  their interconnections can keep up with the data flow.

For details on how VFD SWMR is implemented, see [TBD: LINK to RFC].

# Quick start

Follow these instructions to download, configure, and build the
VFD SWMR project in a jiffy.  Then install the HDF5 library and
utilities built by the VFD SWMR project.

## Download

The latest source code for VFD SWMR is found on the `multi`
branch of [the VFD SWMR
repository](https://bitbucket.hdfgroup.org/scm/~dyoung/vchoi_fork.git).

Clone the repository in a new directory, then switch to the VFD SWMR branch:

```
% git clone https://bitbucket.hdfgroup.org/scm/~dyoung/vchoi_fork.git swmr
% cd swmr
% git checkout multi
```

## Build

Set up the autotools build:

```
% sh ./autogen.sh
```

Create a build directory, change to that directory, and run the
configure script:

```
% mkdir -p ../build/swmr
% cd ../build/swmr
% ../../swmr/configure
```

Build the project:

```
% make
```

## Test

We recommend that you run the full HDF5 test suite to make sure that VFD
SWMR works correctly on your system.  To test the library and utilities, run

```
% make check
```

If the tests don't pass, please let the developers know!

# Sample programs

## Extensible datasets

For an example of a program that uses VFD SWMR to write/read many
extensible datasets, have a look at `test/vfd_swmr_bigset_writer.c`, the
"bigset" test.  We compile two binaries from that source file, one that
operates in write mode, and a second that operates in read mode.

In write mode, "bigset" creates an HDF5 file containing one or more
datasets that are extensible in either one dimension or two.  Then it
runs for several steps, increasing the size of each dataset in each
dimension once every step.  The dimensions, number of datasets, the
step increase in dataset size, and the number of steps are configurable
using command-line options -d, -s, -r and -c, and -n, respectively---use
the -h option to get a usage message.  Each dataset is written with a
predictable pattern.

In read mode, "bigset" reads each dataset from an HDF5 file created
by a "bigset" writer and verifies the patterns.  It takes the same
command-line parameters as the "bigset" writer.  The reader and writer
may run concurrently; the reader "polls" the content until it is just
shy of complete, given the number of steps expected.

To run a bigset test, I open a couple of terminal windows, one for the
reader and one for the writer.  I change to the `test` directory under
my build directory, and I run the writer in one window:

```
% ./vfd_swmr_bigset_writer -n 50 -d 2
```

and in the other window, I run the reader:

```
% ./vfd_swmr_bigset_reader -n 50 -d 2 -W
```

The writer will wait for a signal before it quits.  You may tap
Control-C to make it quit.

The reader and writer programs support several command-line options:

* `-h`: show program usage

* `-W`: stop the program from waiting for a signal before it quits.

* `-q`: suppress the progress messages that the programs write to the
  standard error stream.

* `-V`: create a virtual dataset with content in three source datasets
  in the same HDF5 file---only available when the writer creates a
  dataset extensible in one dimension (`-d 1`)

* `-M`: like `-V`, the writer creates the virtual dataset on three
  source datasets, but each source dataset is in a different HDF5 file.

## The VFD SWMR demos

The VFD SWMR demos are in a [separate
repository](https://bitbucket.hdfgroup.org/scm/~dyoung/swmr-demo.git).

Before you build the demos, you will need to install the HDF5 library
and utilities built from the VFD SWMR branch in your home directory
somewhere.  In the ./configure step, use the command-line option
`--prefix=$HOME/path/for/library` to set the directory you prefer.
In the demo Makefiles, update the `H5CC` variable with the path to
the `h5cc` installed from the VFD SWMR branch.  Then you should be
able to `make` and `make clean` the demos.

Under `gaussians/`, two programs are built, `wgaussians` and
`rgaussians`.  If you start both from the same directory in different
terminals, you should see the "bouncing 2-D Gaussian distributions"
in the `rgaussians` terminal.

The creation-deletion (`credel`) demo is also run in two terminals.
The two command lines are given in `credel/README.md`.  You need
to use the `h5ls` installed from the VFD SWMR branch, since only
that version has the `--poll` option.

# Developer tips

## Configuring VFD SWMR

### File-creation properties

To use VFD SWMR, you must create your HDF5 file with the paged allocation
strategy.  This call enables the paged allocation strategy:

```
ret = H5Pset_file_space_strategy(fcpl, H5F_FSPACE_STRATEGY_PAGE, false, 1);
```

Allocated storage that is smaller than the page size will
not overlap a page boundary, and allocated storage that is one page or
greater in size will start on a page boundary.  VFD SWMR relies on that
allocation strategy.
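
The placement rule above can be written as a small predicate.  This is
only an illustrative sketch, not library code: `respects_page_rule()` is a
hypothetical helper that is not part of HDF5, and the page size is passed
in by the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of HDF5): returns true if an allocation of
 * `size` bytes at file offset `addr` obeys the paged-allocation rule
 * described above for pages of `page_size` bytes.
 */
static bool
respects_page_rule(uint64_t addr, uint64_t size, uint64_t page_size)
{
    if (page_size == 0 || size == 0)
        return false;

    if (size >= page_size) {
        /* One page or larger: must start on a page boundary. */
        return addr % page_size == 0;
    }

    /* Smaller than a page: first and last byte must share a page. */
    return addr / page_size == (addr + size - 1) / page_size;
}
```

With 4096-byte pages, a 96-byte allocation at offset 4000 passes the check,
while a 97-byte allocation at the same offset does not, because its last
byte falls on the next page.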

### File-access properties

In this section we dissect `vfd_swmr_create_fapl()`, a helper routine in
the VFD SWMR tests, to show how to configure your application to use VFD
SWMR.

```
hid_t
vfd_swmr_create_fapl(bool writer, bool only_meta_pages, bool use_vfd_swmr)
{   
    H5F_vfd_swmr_config_t config;
    hid_t fapl;

```

`h5_fileaccess()` is also a helper routine for the tests.  In your
program, you can replace the `h5_fileaccess()` call with a call to
`H5Pcreate(H5P_FILE_ACCESS)`.

```
    /* Create file access property list */
    if((fapl = h5_fileaccess()) < 0) {
        warnx("h5_fileaccess");
        return badhid;
    }
```


VFD SWMR has only been tested with the latest file format.  It may
malfunction with older formats; we just don't know.  We force the
latest version here.

```
    /* FOR NOW: set to use latest format, the "old" parameter is not used */
    if(H5Pset_libver_bounds(fapl, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST) < 0) {
        warnx("H5Pset_libver_bounds");
        return badhid;
    }

    /*
     * Set up to open the file with VFD SWMR configured.
     */
```

VFD SWMR relies on metadata reads and writes to go through the
page buffer.  Note that the default page size is 4096 bytes.  This
call sets the total page buffer size to 4096 bytes.  So we have
effectively created a one-page page buffer!  That is adequate for
testing, but it may not be best for your application.

If `only_meta_pages` is true, then the entire page buffer is
dedicated to metadata.  That's fine for VFD SWMR.

*Note well*: when VFD SWMR is enabled, the meta-/raw-data pages proportion 
set by `H5Pset_page_buffer_size()` does not actually control the
pages reserved for raw data.  *All* pages are dedicated to buffering
metadata.


```
    /* Enable page buffering */
    if(H5Pset_page_buffer_size(fapl, 4096, only_meta_pages ? 100 : 0, 0) < 0) {
        warnx("H5Pset_page_buffer_size");
        return badhid;
    }
```


Add VFD SWMR-specific configuration to the file-access property list
(`fapl`) using an `H5Pset_vfd_swmr_config()` call.

When VFD SWMR is enabled, changes to the HDF5 metadata accumulate in
RAM until a configurable unit of time known as a *tick* has passed.
At the end of each tick, a snapshot of the metadata at the end of
the tick is "published"---that is, made visible to the readers.

The length of a *tick* is configurable in units of 100 milliseconds
using the `tick_len` parameter.  Below, `tick_len` is set to `4` to
select a tick length of 400ms.

A snapshot does not persist forever, but it expires after a number
of ticks, given by the *maximum lag*, has passed.  Below, `max_lag`
is set to `7` to select a maximum lag of 7 ticks.  After a snapshot
has expired, the writer may overwrite it.

When a reader first enters the API, it starts to use, or "selects,"
the metadata in the newest snapshot, and on every subsequent API
entry, if a tick has passed since the last selection, and if new
snapshots are available, then the reader selects the latest.

If a reader spends longer than `max_lag - 1` ticks (2400ms with
the example configuration) inside the HDF5 API, then its snapshot
may expire, resulting in undefined behavior.  When a snapshot
expires while the reader is using it, we say that the writer has
"overrun" the reader.  The writer cannot currently detect overruns.
Frequently the reader will detect an overrun and force the program
to exit with a diagnostic assertion failure.
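
The timing arithmetic in the paragraphs above can be checked with a few
lines of C.  This is a sketch for reasoning about a configuration, not
library code: `tick_length_ms()` and `reader_api_budget_ms()` are
hypothetical helpers that are not part of HDF5.

```c
#include <stdint.h>

/* Length of one tick in milliseconds; `tick_len` is in units of 100 ms. */
static uint32_t
tick_length_ms(uint32_t tick_len)
{
    return tick_len * 100u;
}

/*
 * Longest time, in milliseconds, that a reader can spend inside the HDF5
 * API before its snapshot may expire: (max_lag - 1) ticks.  Returns 0 if
 * max_lag is 0.
 */
static uint32_t
reader_api_budget_ms(uint32_t tick_len, uint32_t max_lag)
{
    if (max_lag == 0)
        return 0;
    return (max_lag - 1u) * tick_length_ms(tick_len);
}
```

For `tick_len = 4` and `max_lag = 7`, the helpers give a 400ms tick and
a 2400ms budget, matching the figures quoted above.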

The application tells VFD SWMR whether to configure for reading or
writing a file by setting the `writer` parameter to `true` for
writing or `false` for reading.

VFD SWMR snapshots are stored in a "shadow file" that is shared
between writer and readers.  On a POSIX system, the shadow file
may be placed on any *local* filesystem that the reader and writer
share.  The `md_file_path` parameter tells where to put the shadow
file.

The `md_pages_reserved` parameter tells how many pages to reserve
at the beginning of the shadow file for the shadow-file header
and the shadow index.  The header has an entire page to itself.
The remaining `md_pages_reserved - 1` pages are reserved for the
shadow index.  If the index grows larger than its initial
allocation, then it will move to a new location in the shadow file,
and the initial allocation will be reclaimed.  `md_pages_reserved`
must be at least 2.
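
To get a feel for how much room a given `md_pages_reserved` leaves for the
shadow index, the sketch below does the arithmetic.  Both helpers are
hypothetical and not part of HDF5, and the sketch assumes that the caller
knows the page size used in the shadow file (for instance, 4096 bytes).

```c
#include <stdbool.h>
#include <stdint.h>

/* md_pages_reserved must be at least 2: one header page plus the index. */
static bool
md_pages_reserved_is_valid(uint32_t md_pages_reserved)
{
    return md_pages_reserved >= 2;
}

/*
 * Bytes initially reserved for the shadow index: every reserved page
 * except the header page.  `page_size` is an assumed input.  Returns 0
 * for an invalid configuration.
 */
static uint64_t
shadow_index_reserved_bytes(uint32_t md_pages_reserved, uint64_t page_size)
{
    if (!md_pages_reserved_is_valid(md_pages_reserved))
        return 0;
    return (uint64_t)(md_pages_reserved - 1) * page_size;
}
```

Under the 4096-byte assumption, `md_pages_reserved = 128` initially
reserves 127 pages, or 520192 bytes, for the index.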

The `version` parameter tells what version of VFD SWMR configuration
the parameter struct `config` contains.  For now, it should be
initialized to `H5F__CURR_VFD_SWMR_CONFIG_VERSION`.

```
    memset(&config, 0, sizeof(config));

    config.version = H5F__CURR_VFD_SWMR_CONFIG_VERSION;
    config.tick_len = 4;
    config.max_lag = 7;
    config.writer = writer;
    config.md_pages_reserved = 128;
    HDstrcpy(config.md_file_path, "./my_md_file");

    /* Enable VFD SWMR configuration */
    if(use_vfd_swmr && H5Pset_vfd_swmr_config(fapl, &config) < 0) {
        warnx("H5Pset_vfd_swmr_config");
        return badhid;
    }
    return fapl;
}
```

## Using virtual datasets (VDS)

An application may want to use VFD SWMR to create, read, or write
a virtual dataset.  Unfortunately, VDS does not work properly with
VFD SWMR at this time.  In this section, we describe some workarounds
that can be used with great care to make VDS and VFD SWMR cooperate.

A virtual dataset, when it is read or written, will open files on
an application's behalf in order to access the source datasets
inside.  If a virtual dataset resides on file `v.h5`, and one of
its source datasets resides on a second file, `s1.h5`, then the
virtual dataset will try to open `s1.h5` using the same file-access
properties as `v.h5`.  Thus, if `v.h5` is open with VFD SWMR with
shadow file `v.shadow`, then the virtual dataset will try to open
`s1.h5` with the same shadow file, which will fail.

Suppose that `v.h5` is *not* open with VFD SWMR, but it was opened
with default file-access properties.  Then the virtual dataset will
open the source dataset on `s1.h5` with default file-access
properties, too.  This default virtual-dataset behavior is not
helpful to the application that wants to use VFD SWMR to read or
write source datasets.

To use VFD SWMR with VDS, an application should *pre-open* each file
using its preferred file-access properties, including independent shadow
filenames for each source file.  As long as the virtual dataset remains
in use, the application should leave each of the pre-opened files open.
In this way the library, when it tries to open the source files, will
always find them already open and re-use the already-open files with the
file-access properties established on first open.
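
One way to keep shadow filenames independent is to derive each one from
the name of the HDF5 file it accompanies, as in the `v.h5`/`v.shadow`
pairing above.  The sketch below shows that naming scheme only.
`shadow_name_for()` is a hypothetical helper, not part of HDF5, and the
`.shadow` suffix is merely a convention assumed here.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: derive a shadow-file name from an HDF5 file name by
 * replacing a trailing ".h5" with ".shadow" (or appending ".shadow" if the
 * name has no ".h5" suffix).  Returns 0 on success, -1 if `out` is too small.
 */
static int
shadow_name_for(const char *h5_name, char *out, size_t out_len)
{
    size_t len = strlen(h5_name);
    int n;

    if (len >= 3 && strcmp(h5_name + len - 3, ".h5") == 0)
        len -= 3;   /* drop the ".h5" suffix */

    n = snprintf(out, out_len, "%.*s.shadow", (int)len, h5_name);
    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```

An application could copy the result into `config.md_file_path` before
calling `H5Pset_vfd_swmr_config()` on the file-access property list that
it uses to pre-open each source file.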

## Pushing HDF5 content to reader visibility

With VFD SWMR, ordinarily it should not be necessary to call
`H5Fflush()`.  In fact, when VFD SWMR is active, calling `H5Fflush()`
may slow down your program considerably because the call will not
return until after `max_lag` ticks have passed.

A writer can make its last changes to an HDF5 file visible to all
readers immediately using the new call, `H5Fvfd_swmr_end_tick()`.
A writer should use `H5Fvfd_swmr_end_tick()` carefully: by calling
it more frequently than once a tick, a writer may corrupt a reader's
view of the HDF5 file.
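
A writer that ends ticks by hand may want a guard that spaces those calls
at least one tick apart.  The sketch below is one possible shape for such
a guard.  `struct end_tick_guard` and `end_tick_allowed()` are hypothetical
and not part of HDF5.  The guard assumes a non-decreasing millisecond
clock supplied by the caller, and it tracks only the writer's own manual
calls, not the ticks that the library ends on its own schedule.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping for a writer that ends ticks early by hand. */
struct end_tick_guard {
    uint64_t tick_ms;       /* tick length in milliseconds */
    uint64_t last_end_ms;   /* time of the last manual end-of-tick */
    bool     ever_ended;    /* false until the first manual end-of-tick */
};

/*
 * Returns true, and records `now_ms`, if at least one full tick has passed
 * since the last manual end-of-tick; otherwise returns false and the caller
 * should skip the H5Fvfd_swmr_end_tick() call.
 */
static bool
end_tick_allowed(struct end_tick_guard *g, uint64_t now_ms)
{
    if (g->ever_ended && now_ms - g->last_end_ms < g->tick_ms)
        return false;

    g->last_end_ms = now_ms;
    g->ever_ended = true;
    return true;
}
```

With a 400ms tick, a call allowed at time 1000 blocks further manual calls
until time 1400.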

When VFD SWMR is enabled, raw data is not cached in the page buffer.  On
each tick, the content of chunk caches and other unwritten raw data is
flushed directly to the HDF5 file, so that raw data is always available
before the HDF5 structural metadata that describes it.

## Reading up-to-date content

The HDF Group (THG) expects that in one class of VFD SWMR application,
instruments on a particle accelerator will continuously generate
2-dimensional data frames and add them to HDF5 datasets while an
experiment is ongoing.  The datasets will be written to an HDF5
file opened in VFD SWMR mode.  Experimenters will monitor a real-time
display of the datasets while the experiment takes place.  A second
program, possibly running on a second computer, will generate the
display.  The second program will open the HDF5 file in VFD SWMR
mode, too.

THG developed a demonstration program for this class of application,
and we have some advice based on that experience.

The writer typically will increase a dataset's dimensions by a
frame, using `H5Dset_extent()`, before it writes the data of that
frame with `H5Dwrite()`.  It's possible that a snapshot of the HDF5
file will propagate to the reader between the `H5Dset_extent()`
call and the `H5Dwrite()`.  Values read with `H5Dread()` from the last
frame at that juncture will not reflect the actual experimental data.
Instead, the reader will see arbitrary values or the fill value.
To display those values would be distracting and misleading to
the experimenter. 

On the reader, a strategy for displaying the most current, bona fide
application data is to read the dimensions of the frames dataset, `d`,
compute the number `n` of full frames contained in `d`, and read the
next-to-last frame, which has index `n - 2` when frames are counted
from zero.  THG uses a variant of this strategy in its `gaussians` demo.
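
The index computation in the reader strategy is small enough to sketch.
The helper below is hypothetical and not taken from the `gaussians` demo.
It assumes that frames are appended along a single dimension of `d`, and
that the reader has already obtained the current extent along that
dimension, for example with `H5Sget_simple_extent_dims()`.

```c
#include <stdint.h>

/*
 * Hypothetical helper: given the current extent of the frames dataset along
 * the dimension in which frames are appended (`extent`) and the size of one
 * frame along that dimension (`frame_len`), return the zero-based index of
 * the next-to-last full frame, or -1 if fewer than two full frames exist.
 */
static int64_t
frame_to_display(uint64_t extent, uint64_t frame_len)
{
    uint64_t n;

    if (frame_len == 0)
        return -1;

    n = extent / frame_len;     /* number of full frames */
    return n < 2 ? -1 : (int64_t)(n - 2);
}
```

Returning -1 lets the display code wait, rather than show a frame that the
writer may not have filled yet.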

On the writer, a strategy for protecting against snapshots between
the `H5Dset_extent()` and `H5Dwrite()` calls is to suspend VFD
SWMR's clock across both of the calls.  The
`H5Fvfd_swmr_disable_end_of_tick()` call takes a file identifier
and stops new snapshots from being taken on the given file until
`H5Fvfd_swmr_enable_end_of_tick()` is called on the same file.

# Known issues

## Variable-length data

A VFD SWMR reader cannot reliably read back a variable-length dataset
written by VFD SWMR.  For example, a variable-length string
created and written as follows

```
    hid_t dset, space, type;
    const char *data = "content";   /* a variable-length string element is a char * */

    type = H5Tcopy(H5T_C_S1);

    H5Tset_size(type, H5T_VARIABLE);

    space = H5Screate(H5S_SCALAR);

    dset = H5Dcreate2(..., "string", type, space, H5P_DEFAULT, H5P_DEFAULT,
        H5P_DEFAULT);

    H5Dwrite(dset, type, space, space, H5P_DEFAULT, &data);
```

and read back like this,

```
    char *data;
    herr_t ret;

    ret = H5Dread(..., ..., H5S_ALL, H5S_ALL, H5P_DEFAULT, &data);
```

may produce either an error return from `H5Dread` (`ret < 0`) or
a `NULL` pointer (`data == NULL`).

Planned improvements to the HDF5 *global heap* may alleviate this
problem.  There is no schedule for those improvements.

Improvements to VFD SWMR may also alleviate the problem.

## Microsoft Windows 

VFD SWMR is not officially supported on Microsoft Windows at this time.  The
feature should in theory work on Windows and NTFS, but it has not been
tested because the existing VFD SWMR tests rely on shell scripts.  Note that
Windows file shares are not supported because, as with NFS and similar network
filesystems, there is no write-ordering guarantee.

## Supported filesystems

A VFD SWMR writer and readers share a couple of files, the HDF5 (`.h5`)
file and the shadow file.  VFD SWMR relies on writes to the files to
take effect in the order described in the POSIX documentation for
`read(2)` and `write(2)` system calls.  If the VFD SWMR readers and the
writer run on the same POSIX host, this ordering should take effect,
regardless of the underlying filesystem.

If the VFD SWMR reader and the writer run on *different* hosts, then
the write-ordering rules depend on the shared filesystem.  VFD SWMR is
not generally expected to work with NFS at this time.  GPFS is reputed
to order writes according to POSIX convention, so we expect VFD SWMR
to work with GPFS.  (Caveat: we are still looking for an authoritative
description of GPFS I/O semantics.)

The HDF Group plans to add support for NFS to VFD SWMR in the future.

## File-opening order

If an application tries to open a file in VFD SWMR reader mode, and the
file is not already open by a VFD SWMR writer, then the application will
sleep in the `H5Fopen()` call until either the writer opens the same
file (using the same shadow file) or the reader times out after several
seconds.

# Reporting bugs

VFD SWMR is still under construction, so I think that you will find some
bugs. Please do not hesitate to report them.

To contact the VFD SWMR developers, email vfdswmr@hdfgroup.org.