| author | jhendersonHDF <jhenderson@hdfgroup.org> | 2022-07-22 20:03:12 (GMT) |
|---|---|---|
| committer | GitHub <noreply@github.com> | 2022-07-22 20:03:12 (GMT) |
| commit | 27bb358f7ab1d23f3f8ce081c6b4f1602033e4d7 (patch) | |
| tree | e8a69bdfbc16f9acf073ddb3ebff586eccfca009 /src/H5FDsubfiling | |
| parent | 32caa567a26680c5f98d0ea0cc989c45c89dc654 (diff) | |
Subfiling VFD (#1883)
* Added support for vector I/O calls to the VFD layer, and
associated test code. Note that this includes the optimization that
allows shortened sizes and types arrays, giving more space-efficient
representations of vectors in which all entries are of the same size
and/or type. See the Selection I/O RFC for further details.
Tested serial and parallel, debug and production on Charis; serial and
parallel debug only on Jelly.
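As a concrete illustration of the shortened-array convention, here is a minimal sketch against the public wrapper in H5FDdevelop.h (not code from this commit); my reading of the RFC is that a 0 entry in sizes[] and an H5FD_MEM_NOLIST entry in types[] mean "all remaining entries equal the preceding one":
```c
#include "hdf5.h"
#include "H5FDdevelop.h" /* developer-level H5FDwrite_vector() prototype */

/* Write three equally-sized, equally-typed blocks with one vector call.
 * Addresses and sizes here are made up for illustration. */
static herr_t
write_three_blocks(H5FD_t *file, hid_t dxpl_id, haddr_t base,
                   const void *b0, const void *b1, const void *b2)
{
    H5FD_mem_t  types[2] = {H5FD_MEM_DRAW, H5FD_MEM_NOLIST}; /* rest repeat H5FD_MEM_DRAW */
    haddr_t     addrs[3] = {base, base + 512, base + 1024};
    size_t      sizes[2] = {512, 0};                         /* rest repeat 512 */
    const void *bufs[3]  = {b0, b1, b2};

    return H5FDwrite_vector(file, dxpl_id, 3, types, addrs, sizes, bufs);
}
```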
* Ran code formatter.
Quick serial build and test on Jelly.
* Add H5FD_read_selection() and H5FD_write_selection(). Currently these only
translate to scalar calls. Fix const buf in H5FD_write_vector().
* Format source
* Fix comments
* Add selection I/O to chunk code, used when: not using chunk cache, no
datatype conversion, no I/O filters, no page buffer, not using collective
I/O. Requires the global variable H5_use_selection_io_g to be set to TRUE.
Implemented selection-to-vector I/O translation at the file driver
layer.
* Fix formatting unrelated to previous change to stop github from
complaining.
* Add full API support for selection I/O. Add tests for this.
* Implement selection I/O for contiguous datasets. Fix bug in selection
I/O translation. Add const qualifiers to some internal selection I/O
routines to maintain const-correctness while avoiding memcpys.
* Added vector read / write support to the MPIO VFD, with associated
test code (see testpar/t_vfd.c).
Note that this implementation does NOT support vector entries of
size greater than 2 GB. This must be repaired before release,
but it should be good enough for correctness testing.
As MPIO requires vector I/O requests to be sorted in increasing
address order, also added a vector sort utility in H5FDint.c. This
function is tested in passing by the MPIO vector I/O extension.
In passing, repaired a bug in size/type vector extension management
in H5FD_read/write_vector().
Tested parallel debug and production on Charis and Jelly.
* Ran source code formatter
* Add support for independent parallel I/O with selection I/O. Add
HDF5_USE_SELECTION_IO env var to control selection I/O (default off).
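For example, a hypothetical snippet that opts a process into selection I/O, assuming the variable is read when the library initializes (the shell equivalent is `export HDF5_USE_SELECTION_IO=1` before launching):
```c
#include <stdlib.h>
#include "hdf5.h"

int
main(void)
{
    hid_t file_id;

    /* Opt in to selection I/O for this process; must be set before
     * the HDF5 library consults the environment */
    setenv("HDF5_USE_SELECTION_IO", "1", 1);

    file_id = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    /* ... dataset I/O is now eligible for the selection I/O path ... */
    H5Fclose(file_id);

    return 0;
}
```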
* Implement parallel collective support for selection I/O.
* Fix comments and run formatter.
* Update selection IO branch with develop (#1215)
Merged branch 'develop' into selection_io
* Sync with develop (#1262)
Updated the branch with develop changes.
* Implement big I/O support for vector I/O requests in the MPIO file
driver.
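The usual workaround for MPI's int-sized count arguments is to describe large regions with a derived datatype; the sketch below shows the general technique with MPI_Type_contiguous (the MPIO driver's actual implementation may differ):
```c
#include <limits.h>
#include <mpi.h>

/* Write `nbytes` from `buf` at `offset`, even when nbytes > INT_MAX */
static int
big_write_at(MPI_File fh, MPI_Offset offset, const void *buf, MPI_Offset nbytes)
{
    MPI_Datatype chunk_type;
    MPI_Status   status;
    int          mpi_code;
    int          chunk    = 1 << 30; /* 1 GiB pieces */
    MPI_Offset   nchunks  = nbytes / chunk;
    MPI_Offset   leftover = nbytes % chunk;

    /* The derived type stands in for `chunk` bytes, so the count argument
     * below is the (small) number of chunks rather than the byte count */
    if (MPI_SUCCESS != (mpi_code = MPI_Type_contiguous(chunk, MPI_BYTE, &chunk_type)))
        return mpi_code;
    if (MPI_SUCCESS != (mpi_code = MPI_Type_commit(&chunk_type)))
        return mpi_code;

    if (MPI_SUCCESS != (mpi_code = MPI_File_write_at(fh, offset, buf, (int)nchunks, chunk_type, &status)))
        return mpi_code;

    /* Write the tail that didn't fill a whole chunk */
    if (leftover > 0)
        mpi_code = MPI_File_write_at(fh, offset + nchunks * chunk,
                                     (const char *)buf + nchunks * chunk,
                                     (int)leftover, MPI_BYTE, &status);

    MPI_Type_free(&chunk_type);
    return mpi_code;
}
```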
* Free arrays in H5FD__mpio_read/write_vector() as soon as they're not
needed, to cut down on memory usage during I/O.
* Address comments from code review. Fix const warnings with
H5S_SEL_ITER_INIT().
* Committing clang-format changes
* Feature/subfiling (#1464)
* Initial checkin of merged sub-filing VFD.
Passes regression tests (debug/shared/parallel) on Jelly.
However, bugs and many compiler warnings remain -- not suitable
for merge to develop.
* Minor mods to src/H5FDsubfile_mpi.c to address errors reported by autogen.sh
* Code formatting run -- no test
* Merged my subfiling code fixes into the new selection_io_branch
* Forgot to add the FindMERCURY.cmake file. This will probably disappear soon
* Attempt to make subfile opens more reliable so that they don't return errors. For some unknown reason, the regular POSIX open will occasionally fail to create a subfile. Better error handling for file close has also been added.
* added NULL option for H5FD_subfiling_config_t in H5Pset_fapl_subfiling (#1034)
* NULL option automatically stacks IOC VFD for subfiling and returns a valid fapl.
* added doxygen subfiling APIs
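A minimal usage sketch of the NULL-configuration path (file name and setup are illustrative; H5Pset_mpi_params supplies the communicator the subfiling stack needs):
```c
#include <mpi.h>
#include "hdf5.h"
#include "H5FDsubfiling.h"

int
main(int argc, char **argv)
{
    int   provided = 0;
    hid_t fapl_id, file_id;

    /* The subfiling stack requires MPI_THREAD_MULTIPLE */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    fapl_id = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_mpi_params(fapl_id, MPI_COMM_WORLD, MPI_INFO_NULL);
    H5Pset_fapl_subfiling(fapl_id, NULL); /* NULL -> default subfiling + IOC stack */

    file_id = H5Fcreate("subfiled.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl_id);

    H5Fclose(file_id);
    H5Pclose(fapl_id);
    MPI_Finalize();
    return 0;
}
```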
* Various fixes which allow the IOR benchmark to run correctly
* Lots of updates, including packaging up the mercury_util source files to enable easier builds for our benchmarking
* Interim checkin of selection_io_with_subfiling_vfd branch
Modified testpar/t_vfd.c to test the subfiling VFD with its default configuration.
Must update this code to run with a variety of configurations -- most particularly
multiple I/O concentrators, and a stripe depth small enough to exercise the other
I/O concentrators.
testpar/t_vfd.c exposed a large number of race conditions -- symptoms included:
1) Crashes (usually seg faults)
2) Heap corruption
3) Stack corruption
4) Double frees of heap space
5) Hangs
6) Out of order execution of I/O requests / violations of POSIX semantics
7) Swapped write requests
Items 1 - 4 turned out to be primarily caused by file close issues --
specifically, the main I/O concentrator thread and its pool of worker threads
were not being shut down properly on file close. Addressing this issue in
combination with some other minor fixes seems to have addressed these issues.
Items 5 & 6 appear to have been caused by issuing I/O requests to the
thread pool in an order that did not maintain POSIX semantics. A rewrite of
the I/O request dispatch code appears to have solved these issues.
Item 7 seems to have been caused by multiple write requests from a given
rank being read by the wrong worker thread. Code to issue "unique" tags for
each write request via the ACK message appears to have cleaned this up.
Note that the code is still in poor condition. A partial list of known
defects includes:
a) Race condition on file close that allows superblock writes to arrive
at the I/O concentrator after it has been shut down. This defect is
most evident when testpar/t_subfiling_vfd is run with 8 ranks.
b) No error reporting from I/O concentrators -- must design and implement
this. For now, mostly just asserts, which suggests that it should be
run in debug mode.
c) Much commented-out and/or unused code.
d) Code organization
e) Build system with bits of Mercury is awkward -- think of shifting
to pthreads with our own thread pool code.
f) Need to add native support for vector and selection I/O to the subfiling
VFD.
g) Need to review, and possibly rework, configuration code.
h) Need to store subfile configuration data in a superblock extension message,
and add code to use this data on file open.
i) Test code is inadequate -- expect more issues as it is extended.
In particular, there is no unit test code for the I/O request dispatch code.
While I think it is correct at present, we need test code to verify this.
Similarly, we need to test with multiple I/O concentrators and much smaller
stripe depth.
My actual code changes were limited to:
src/H5FDioc.c
src/H5FDioc_threads.c
src/H5FDsubfile_int.c
src/H5FDsubfile_mpi.c
src/H5FDsubfiling.c
src/H5FDsubfiling.h
src/H5FDsubfiling_priv.h
testpar/t_subfiling_vfd.c
testpar/t_vfd.c
I'm not sure what is going on with the deletions in src/mercury/src/util.
Tested parallel/debug on Charis and Jelly
* subfiling with selection IO (#1219)
Merged branch 'selection_io' into subfiling branch.
* Subfile name fixes (#1250)
* Fixed the subfile naming convention and added leading zeros to rank names.
* Merge branch 'selection_io' into selection_io_with_subfiling_vfd (#1265)
* Added script to join subfiles into a single HDF5 file (#1350)
* Modified H5FD__subfiling_query() to report that the sub-filing VFD supports MPI.
This exposed issues with truncate and get EOF in the sub-filing VFD.
I believe I have addressed these issues (get EOF not as fully tested as it should be); however,
it exposed race conditions resulting in hangs. As of this writing, I have not been able
to chase these down.
Note that the tests that expose these race conditions are in testpar/t_subfiling_vfd.c, and
are currently skipped. Unskip these tests to reproduce the race conditions.
Tested (to the extent possible) debug/parallel on Charis and Jelly.
* Committing clang-format changes
* fixed H5MM_free
Co-authored-by: mainzer <mainzer#hdfgroup.org>
Co-authored-by: jrmainzer <72230804+jrmainzer@users.noreply.github.com>
Co-authored-by: Richard Warren <Richard.Warren@hdfgroup.org>
Co-authored-by: Richard.Warren <richard.warren@jelly.ad.hdfgroup.org>
Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
* Move Subfiling VFD components into H5FDsubfiling source directory
* Update Autotools build and add H5_HAVE_SUBFILING_VFD macro to H5pubconf.h
* Tidy up CMake build of subfiling sources
* Merge branch 'develop' into feature/subfiling (#1539)
Merge branch 'develop' into feature/subfiling
* Add VFD interface version field to Subfiling and IOC VFDs
* Merge branch 'develop' into feature/subfiling (#1557)
Merge branch 'develop' into feature/subfiling
* Merge branch 'develop' into feature/subfiling (#1563)
Merge branch 'develop' into feature/subfiling
* Tidy up merge artifacts after rebase on develop
* Fix incorrect variable in mirror VFD utils CMake
* Ensure VFD values are always defined
* Add subfiling to CMake VFD_LIST if built
* Mark MPI I/O driver self-initialization global as static
* Add Subfiling VFD to predefined VFDs for HDF5_DRIVER env. variable
* Initial progress towards separating private vs. public subfiling code
* include libgen.h in t_vfd tests for correct dirname/basename
* Committing clang-format changes
* removed mercury option, included subfiling header path (#1577)
Added subfiling status to configure output, installed h5fuse.sh to build directory for use in future tests.
* added check for stdatomic.h (#1578)
* added check for stdatomic.h with subfiling
* added H5_HAVE_SUBFILING_VFD for cmake
* fix old-style-definition warning (#1582)
* fix old-style-definition warning
* added test for enable parallel with subfiling VFD (#1586)
The configure step fails if the subfiling VFD is enabled without parallel support.
* Subfiling/IOC VFD fixes and tidying (#1619)
* Rename CMake option for Subfiling VFD to be consistent with other VFDs
* Miscellaneous Subfiling fixes
Add error message for unset MPI communicator
Support dynamic loading of subfiling VFD with default configuration
* Temporary fix for subfile name issue
* Added subfile checks (#1634)
* added subfile checks
* Feature/subfiling (#1655)
* Subfiling/IOC VFD cleanup
Fix misuse of MPI_COMM_WORLD in IOC VFD
Propagate Subfiling FAPL MPI settings down to IOC FAPL in default
configuration case
Cleanup IOC VFD debugging code
Change sprintf to snprintf in a few places
* Major work on separating Subfiling and IOC VFDs from each other
* Re-write async_completion func to not overuse stack
* Replace usage of MPI_COMM_WORLD with file's actual MPI communicator
* Refactor H5FDsubfile_mpi.c
* Remove empty file H5FDsubfile_mpi.c
* Separate IOC VFD errors to its own error stack
* Committing clang-format changes
* Remove H5TRACE macros from H5FDioc.c
* Integrate H5FDioc_threads.c with IOC error stack
* Fix for subfile name generation
Use number of I/O concentrators from existing subfiling configuration file, if one exists
* Add temporary barrier in "Get EOF" operation to prevent races on EOF
Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
* Fix for retrieval of machine Host ID
* Default to MPI_COMM_WORLD if no MPI params set
* added libs rt and pthreads (#1673)
* added libs rt and pthreads
* Feature/subfiling (#1689)
* More tidying of IOC VFD and subfiling debug code
* Remove old unused log file code
* Clear FID from active file map on failure
* Fix bug in generation of subfile names when truncating file
* Change subfile names to start from 1 instead of 0
* Use long long for user-specified stripe size from environment variable
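A hedged sketch of that parsing pattern; the environment variable name H5FD_SUBFILING_STRIPE_SIZE and the fallback behavior are assumptions on my part:
```c
#include <errno.h>
#include <stdlib.h>

/* Return the user-specified stripe size in bytes, or `default_size`
 * when unset, unparsable, out of range, or non-positive */
static long long
get_stripe_size(long long default_size)
{
    const char *env = getenv("H5FD_SUBFILING_STRIPE_SIZE"); /* assumed name */

    if (env) {
        char     *end;
        long long size;

        errno = 0;
        size  = strtoll(env, &end, 0);

        if (end != env && errno != ERANGE && size > 0)
            return size;
    }
    return default_size;
}
```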
* Skip 0-sized I/Os in low-level IOC I/O routines
* Don't update EOF on read
* Convert printed warning about data size mismatch to assertion
* Don't add base file address to I/O addresses twice
Base address should already be applied as part of H5FDwrite/read_vector calls
* Account for 0-sized I/O vector entries in subfile write/read functions
* Rewrite init_indep_io for clarity
* Correction for IOC wraparound calculations
* Some corrections to iovec calculations
* Remove temporary barrier on EOF retrieval
* Complete work request queue entry on error instead of skipping over
* Account for stripe size wraparound for sf_col_offset calculation
* Committing clang-format changes
Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
* Re-write and fix bugs in I/O vector filling routines (#1703)
* Rewrite I/O vector filling routines for clarity
* Fix bug with iovec_fill_last when last I/O size is 0
* added subfiling_dir line read (#1714)
* added subfiling_dir line read and use it
* shellcheck fixes
* I/O request dispatch logic update (#1731)
Short-circuit I/O request dispatch when head of I/O queue is an
in-progress get EOF or truncate operation. This prevents an issue where
a write operation can be dispatched alongside a get EOF/truncate
operation, even though all I/O requests are supposed to be ineligible for
dispatch until the get EOF/truncate has completed.
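In pseudocode terms, the rule looks roughly like the following (types and names are invented for illustration, not taken from the commit):
```c
typedef enum { IOC_OP_READ, IOC_OP_WRITE, IOC_OP_GET_EOF, IOC_OP_TRUNC } ioc_op_t;

typedef struct ioc_q_entry {
    ioc_op_t            op;
    int                 in_progress;
    struct ioc_q_entry *next;
} ioc_q_entry_t;

/* Return nonzero if dispatch must be short-circuited: an in-progress
 * get EOF or truncate at the queue head blocks every later entry */
static int
dispatch_blocked(const ioc_q_entry_t *head)
{
    return head && head->in_progress &&
           (head->op == IOC_OP_GET_EOF || head->op == IOC_OP_TRUNC);
}
```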
* h5fuse.sh.in clean-up (#1757)
* Added command-line options
* Committing clang-format changes
* Align with changes from develop
* Mimic MPI I/O VFD for EOF handling
* Initialize context_id field for work request objects
* Use logfile for some debugging information
* Use atomic store to set IOC ready flag
* Use separate communicator for sending file EOF data
Minor IOC cleanup
* Use H5_subfile_fid_to_context to get context ID for file in Subfiling
VFD
* IOVEC calculation fixes
* Updates for debugging code
* Minor fixes for threaded code
* Committing clang-format changes
* Use separate MPI communicator for barrier operations
* Committing clang-format changes
* Rewrite EOF routine to use nonblocking MPI communication
* Committing clang-format changes
* Always dispatch I/O work requests in IOC main loop
* Return distinct MPI communicator to library when requested
* Minor warning cleanup
* Committing clang-format changes
* Generate h5fuse.sh from h5fuse.sh.in in CMake
* Send truncate messages to correct IOC rank
* Committing clang-format changes
* Miscellaneous cleanup
Post some MPI receives before sends
Free some duplicated MPI communicator/Info objects
Remove unnecessary extra MPI_Barrier
* Warning cleanup
* Fix for leaked MPI communicator
* Retrieve file EOF on single rank and bcast it
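This is the standard one-rank-computes, everyone-receives pattern; a minimal sketch (helper name invented):
```c
#include <stdint.h>
#include <mpi.h>

static int
share_eof(MPI_Comm comm, int mpi_rank, uint64_t *eof)
{
    if (mpi_rank == 0) {
        /* rank 0 fills *eof here, e.g. from fstat() on the subfiles */
    }

    /* Everyone else receives rank 0's value; no per-rank fstat needed */
    return MPI_Bcast(eof, 1, MPI_UINT64_T, 0, comm);
}
```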
* Fixes for a few failure paths
* Cleanup of IOC file opens
* Committing clang-format changes
* Use plain MPI_Send for the send of EOF messages
* Always check MPI thread support level during Subfiling init
* Committing clang-format changes
* Handle a hang on failure when IOCs can't open subfiles
* Committing clang-format changes
* Refactor file open status consensus check
* Committing clang-format changes
* Fix for MPI_Comm_free being called after MPI_Finalize
* Fix VFD test by setting MPI params before setting subfiling on FAPL
* Update Subfiling VFD error handling and error stack usage
* Improvements for Subfiling logfiles
* Remove prototypes for currently unused routines
* Disable I/O queue stat collecting by default
* Remove unused serialization mutex variable
* Update VFD testing to take subfiling VFD into account
* Fix usage of global subfiling application layout object
* Minor fixes for failure pathways
* Keep track of the number of failures in an IOC I/O queue
* Make sure not to exceed MPI_TAG_UB value for data communication messages
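For reference, the tag ceiling can be queried as below; this is standard MPI usage (mirrored by the IOC init code in the diff further down), not a claim about the exact fix:
```c
#include <mpi.h>

static int
get_tag_ub(int *tag_ub_out)
{
    int  flag       = 0;
    int *tag_ub_ptr = NULL;

    /* MPI_TAG_UB is a predefined attribute cached on MPI_COMM_WORLD;
     * the attribute value is a pointer to the int */
    if (MPI_SUCCESS != MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, &tag_ub_ptr, &flag) || !flag)
        return -1;

    *tag_ub_out = *tag_ub_ptr; /* tags for data messages must stay <= this */
    return 0;
}
```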
* Committing clang-format changes
* Update for rename of some H5FD 'ctl' opcodes
* Always include Subfiling's public header files in hdf5.h
* Remove old unused code and comments
* Implement support for per-file I/O queues
Allows the subfiling VFD to have multiple HDF5 files open simultaneously
* Use simple MPI_Iprobe over unnecessary MPI_Improbe
* Committing clang-format changes
* Update HDF5 testing to query driver for H5FD_FEAT_DEFAULT_VFD_COMPATIBLE
flag
* Fix a few bugs related to file multi-opens
* Avoid calling MPI routines if subfiling gets reinitialized
* Fix issue when files are closed in a random order
* Update HDF5 testing to query VFD for "using MPI" feature flag
* Register atexit handler in subfiling VFD to call MPI_Finalize after HDF5
closes
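A sketch of that pattern, under the assumption that the handler is only registered when the VFD itself called MPI_Init_thread (names are mine):
```c
#include <stdlib.h>
#include <mpi.h>

/* Runs at process exit; finalizes MPI only if no one else already has */
static void
subfiling_mpi_exit_handler(void)
{
    int mpi_finalized = 0;

    MPI_Finalized(&mpi_finalized);
    if (!mpi_finalized)
        MPI_Finalize();
}

/* Called during driver init, only when the driver initialized MPI itself.
 * atexit() handlers run LIFO, so this must be registered before the
 * library's own cleanup handler for MPI_Finalize to run after HDF5 closes. */
static void
register_mpi_exit_handler(void)
{
    (void)atexit(subfiling_mpi_exit_handler);
}
```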
* Fail for collective I/O requests until support is implemented
* Correct VOL test function prototypes
* Minor cleanup of old code and comments
* Update mercury dependency
* Cleanup of subfiling configuration structure
* Committing clang-format changes
* Build system updates for Subfiling VFD
* Fix possible hang on failure in t_vfd tests caused by mismatched
MPI_Barrier calls
* Copy subfiling IOC fapl in "fapl get" method
* Mirror subfiling superblock writes to stub file for legacy POSIX-y HDF5
applications
* Allow collective I/O for MPI_BYTE types and rank 0 bcast strategy
* Committing clang-format changes
* Use different scheme for subfiling write message MPI tag calculations
* Committing clang-format changes
* Avoid performing fstat calls on all MPI ranks
* Add MPI_Barrier before finalizing IOC threads
* Use try_lock in I/O queue dispatch to minimize contention from worker threads
* Use simple Waitall for nonblocking I/O waits
* Add configurable IOC main thread delay and try_lock option to I/O queue dispatch
* Fix bug that could cause serialization of non-overlapping I/O requests
* Temporarily treat collective subfiling vector I/O calls as independent
* Removed unused mercury bits
* Add stubs for subfiling and IOC file delete callback
* Update VFD testing for Subfiling VFD
* Work around HDF5 metadata cache bug for Subfiling VFD when MPI Comm size
= 1
* Committing clang-format changes
Co-authored-by: mainzer <mainzer#hdfgroup.org>
Co-authored-by: Neil Fortner <nfortne2@hdfgroup.org>
Co-authored-by: Scot Breitenfeld <brtnfld@hdfgroup.org>
Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: jrmainzer <72230804+jrmainzer@users.noreply.github.com>
Co-authored-by: Richard Warren <Richard.Warren@hdfgroup.org>
Co-authored-by: Richard.Warren <richard.warren@jelly.ad.hdfgroup.org>
Diffstat (limited to 'src/H5FDsubfiling')
39 files changed, 16436 insertions, 0 deletions
diff --git a/src/H5FDsubfiling/H5FDioc.c b/src/H5FDsubfiling/H5FDioc.c new file mode 100644 index 0000000..8017cc0 --- /dev/null +++ b/src/H5FDsubfiling/H5FDioc.c @@ -0,0 +1,1813 @@ +/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * + * Copyright by The HDF Group. * + * All rights reserved. * + * * + * This file is part of HDF5. The full HDF5 copyright notice, including * + * terms governing use, modification, and redistribution, is contained in * + * the COPYING file, which can be found at the root of the source code * + * distribution tree, or in https://support.hdfgroup.org/ftp/HDF5/releases. * + * If you do not have access to either file, you may request a copy from * + * help@hdfgroup.org. * + * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */ + +/* + * Purpose: The IOC VFD implements a file driver which relays all the + * VFD calls to an underlying VFD, and send all the write calls to + * another underlying VFD. Maintains two files simultaneously. + */ + +/* This source code file is part of the H5FD driver module */ +#include "H5FDdrvr_module.h" + +#include "H5private.h" /* Generic Functions */ +#include "H5FDpublic.h" /* Basic H5FD definitions */ +#include "H5Eprivate.h" /* Error handling */ +#include "H5FDprivate.h" /* File drivers */ +#include "H5FDioc.h" /* IOC file driver */ +#include "H5FDioc_priv.h" /* IOC file driver */ +#include "H5FDsec2.h" /* Sec2 VFD */ +#include "H5FLprivate.h" /* Free Lists */ +#include "H5Fprivate.h" /* File access */ +#include "H5Iprivate.h" /* IDs */ +#include "H5MMprivate.h" /* Memory management */ +#include "H5Pprivate.h" /* Property lists */ + +/* The driver identification number, initialized at runtime */ +static hid_t H5FD_IOC_g = H5I_INVALID_HID; + +/* Whether the driver initialized MPI on its own */ +static hbool_t H5FD_mpi_self_initialized = FALSE; + +/* Pointer to value for MPI_TAG_UB */ +int *H5FD_IOC_tag_ub_val_ptr = NULL; + +/* The information of this ioc */ +typedef struct H5FD_ioc_t { + H5FD_t pub; /* public stuff, must be first */ + int fd; /* the filesystem file descriptor */ + H5FD_ioc_config_t fa; /* driver-specific file access properties */ + + /* MPI Info */ + MPI_Comm comm; + MPI_Info info; + int mpi_rank; + int mpi_size; + + H5FD_t *ioc_file; /* native HDF5 file pointer (sec2) */ + + int64_t context_id; /* The value used to lookup a subfiling context for the file */ + + char *file_dir; /* Directory where we find files */ + char *file_path; /* The user defined filename */ + +#ifndef H5_HAVE_WIN32_API + /* On most systems the combination of device and i-node number uniquely + * identify a file. Note that Cygwin, MinGW and other Windows POSIX + * environments have the stat function (which fakes inodes) + * and will use the 'device + inodes' scheme as opposed to the + * Windows code further below. + */ + dev_t device; /* file device number */ + ino_t inode; /* file i-node number */ +#else + /* Files in windows are uniquely identified by the volume serial + * number and the file index (both low and high parts). + * + * There are caveats where these numbers can change, especially + * on FAT file systems. On NTFS, however, a file should keep + * those numbers the same until renamed or deleted (though you + * can use ReplaceFile() on NTFS to keep the numbers the same + * while renaming). + * + * See the MSDN "BY_HANDLE_FILE_INFORMATION Structure" entry for + * more information. 
+ * + * http://msdn.microsoft.com/en-us/library/aa363788(v=VS.85).aspx + */ + DWORD nFileIndexLow; + DWORD nFileIndexHigh; + DWORD dwVolumeSerialNumber; + + HANDLE hFile; /* Native windows file handle */ +#endif /* H5_HAVE_WIN32_API */ +} H5FD_ioc_t; + +/* + * These macros check for overflow of various quantities. These macros + * assume that HDoff_t is signed and haddr_t and size_t are unsigned. + * + * ADDR_OVERFLOW: Checks whether a file address of type `haddr_t' + * is too large to be represented by the second argument + * of the file seek function. + * + * SIZE_OVERFLOW: Checks whether a buffer size of type `hsize_t' is too + * large to be represented by the `size_t' type. + * + * REGION_OVERFLOW: Checks whether an address and size pair describe data + * which can be addressed entirely by the second + * argument of the file seek function. + */ +#define MAXADDR (((haddr_t)1 << (8 * sizeof(HDoff_t) - 1)) - 1) +#define ADDR_OVERFLOW(A) (HADDR_UNDEF == (A) || ((A) & ~(haddr_t)MAXADDR)) +#define SIZE_OVERFLOW(Z) ((Z) & ~(hsize_t)MAXADDR) +#define REGION_OVERFLOW(A, Z) \ + (ADDR_OVERFLOW(A) || SIZE_OVERFLOW(Z) || HADDR_UNDEF == (A) + (Z) || (HDoff_t)((A) + (Z)) < (HDoff_t)(A)) + +#ifdef H5FD_IOC_DEBUG +#define H5FD_IOC_LOG_CALL(name) \ + do { \ + HDprintf("called %s()\n", (name)); \ + HDfflush(stdout); \ + } while (0) +#else +#define H5FD_IOC_LOG_CALL(name) /* no-op */ +#endif + +/* Private functions */ +/* Prototypes */ +static herr_t H5FD__ioc_term(void); +static hsize_t H5FD__ioc_sb_size(H5FD_t *_file); +static herr_t H5FD__ioc_sb_encode(H5FD_t *_file, char *name /*out*/, unsigned char *buf /*out*/); +static herr_t H5FD__ioc_sb_decode(H5FD_t *_file, const char *name, const unsigned char *buf); +static void * H5FD__ioc_fapl_get(H5FD_t *_file); +static void * H5FD__ioc_fapl_copy(const void *_old_fa); +static herr_t H5FD__ioc_fapl_free(void *_fapl); +static H5FD_t *H5FD__ioc_open(const char *name, unsigned flags, hid_t fapl_id, haddr_t maxaddr); +static herr_t H5FD__ioc_close(H5FD_t *_file); +static int H5FD__ioc_cmp(const H5FD_t *_f1, const H5FD_t *_f2); +static herr_t H5FD__ioc_query(const H5FD_t *_file, unsigned long *flags /* out */); +static herr_t H5FD__ioc_get_type_map(const H5FD_t *_file, H5FD_mem_t *type_map); +static haddr_t H5FD__ioc_alloc(H5FD_t *file, H5FD_mem_t type, hid_t dxpl_id, hsize_t size); +static herr_t H5FD__ioc_free(H5FD_t *_file, H5FD_mem_t type, hid_t dxpl_id, haddr_t addr, hsize_t size); +static haddr_t H5FD__ioc_get_eoa(const H5FD_t *_file, H5FD_mem_t H5_ATTR_UNUSED type); +static herr_t H5FD__ioc_set_eoa(H5FD_t *_file, H5FD_mem_t H5_ATTR_UNUSED type, haddr_t addr); +static haddr_t H5FD__ioc_get_eof(const H5FD_t *_file, H5FD_mem_t H5_ATTR_UNUSED type); +static herr_t H5FD__ioc_get_handle(H5FD_t *_file, hid_t H5_ATTR_UNUSED fapl, void **file_handle); +static herr_t H5FD__ioc_read(H5FD_t *_file, H5FD_mem_t type, hid_t dxpl_id, haddr_t addr, size_t size, + void *buf); +static herr_t H5FD__ioc_write(H5FD_t *_file, H5FD_mem_t type, hid_t dxpl_id, haddr_t addr, size_t size, + const void *buf); +static herr_t H5FD__ioc_read_vector(H5FD_t *file, hid_t dxpl_id, uint32_t count, H5FD_mem_t types[], + haddr_t addrs[], size_t sizes[], void *bufs[] /* out */); +static herr_t H5FD__ioc_write_vector(H5FD_t *file, hid_t dxpl_id, uint32_t count, H5FD_mem_t types[], + haddr_t addrs[], size_t sizes[], const void *bufs[] /* in */); +static herr_t H5FD__ioc_flush(H5FD_t *_file, hid_t dxpl_id, hbool_t closing); +static herr_t H5FD__ioc_truncate(H5FD_t *_file, hid_t dxpl_id, hbool_t 
closing); +static herr_t H5FD__ioc_lock(H5FD_t *_file, hbool_t rw); +static herr_t H5FD__ioc_unlock(H5FD_t *_file); +static herr_t H5FD__ioc_del(const char *name, hid_t fapl); +/* +static herr_t H5FD__ioc_ctl(H5FD_t *file, uint64_t op_code, uint64_t flags, + const void *input, void **result); +*/ + +static herr_t H5FD__ioc_get_default_config(H5FD_ioc_config_t *config_out); +static herr_t H5FD__ioc_validate_config(const H5FD_ioc_config_t *fa); +static int H5FD__copy_plist(hid_t fapl_id, hid_t *id_out_ptr); + +static herr_t H5FD__ioc_close_int(H5FD_ioc_t *file_ptr); + +static herr_t H5FD__ioc_write_vector_internal(H5FD_t *_file, uint32_t count, H5FD_mem_t types[], + haddr_t addrs[], size_t sizes[], + const void *bufs[] /* data_in */); +static herr_t H5FD__ioc_read_vector_internal(H5FD_t *_file, uint32_t count, haddr_t addrs[], size_t sizes[], + void *bufs[] /* data_out */); + +static const H5FD_class_t H5FD_ioc_g = { + H5FD_CLASS_VERSION, /* VFD interface version */ + H5_VFD_IOC, /* value */ + H5FD_IOC_NAME, /* name */ + MAXADDR, /* maxaddr */ + H5F_CLOSE_WEAK, /* fc_degree */ + H5FD__ioc_term, /* terminate */ + H5FD__ioc_sb_size, /* sb_size */ + H5FD__ioc_sb_encode, /* sb_encode */ + H5FD__ioc_sb_decode, /* sb_decode */ + sizeof(H5FD_ioc_config_t), /* fapl_size */ + H5FD__ioc_fapl_get, /* fapl_get */ + H5FD__ioc_fapl_copy, /* fapl_copy */ + H5FD__ioc_fapl_free, /* fapl_free */ + 0, /* dxpl_size */ + NULL, /* dxpl_copy */ + NULL, /* dxpl_free */ + H5FD__ioc_open, /* open */ + H5FD__ioc_close, /* close */ + H5FD__ioc_cmp, /* cmp */ + H5FD__ioc_query, /* query */ + H5FD__ioc_get_type_map, /* get_type_map */ + H5FD__ioc_alloc, /* alloc */ + H5FD__ioc_free, /* free */ + H5FD__ioc_get_eoa, /* get_eoa */ + H5FD__ioc_set_eoa, /* set_eoa */ + H5FD__ioc_get_eof, /* get_eof */ + H5FD__ioc_get_handle, /* get_handle */ + H5FD__ioc_read, /* read */ + H5FD__ioc_write, /* write */ + H5FD__ioc_read_vector, /* read_vector */ + H5FD__ioc_write_vector, /* write_vector */ + NULL, /* read_selection */ + NULL, /* write_selection */ + H5FD__ioc_flush, /* flush */ + H5FD__ioc_truncate, /* truncate */ + H5FD__ioc_lock, /* lock */ + H5FD__ioc_unlock, /* unlock */ + H5FD__ioc_del, /* del */ + NULL, /* ctl */ + H5FD_FLMAP_DICHOTOMY /* fl_map */ +}; + +/* Declare a free list to manage the H5FD_ioc_t struct */ +H5FL_DEFINE_STATIC(H5FD_ioc_t); + +/* Declare a free list to manage the H5FD_ioc_config_t struct */ +H5FL_DEFINE_STATIC(H5FD_ioc_config_t); + +/*------------------------------------------------------------------------- + * Function: H5FD_ioc_init + * + * Purpose: Initialize the IOC driver by registering it with the + * library. + * + * Return: Success: The driver ID for the ioc driver. 
+ * Failure: Negative + *------------------------------------------------------------------------- + */ +hid_t +H5FD_ioc_init(void) +{ + hid_t ret_value = H5I_INVALID_HID; + + H5FD_IOC_LOG_CALL(__func__); + + /* Register the IOC VFD, if it isn't already registered */ + if (H5I_VFL != H5I_get_type(H5FD_IOC_g)) { + char *env_var; + int key_val_retrieved = 0; + int mpi_code; + + if ((H5FD_IOC_g = H5FD_register(&H5FD_ioc_g, sizeof(H5FD_class_t), FALSE)) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_ID, H5E_CANTREGISTER, H5I_INVALID_HID, "can't register IOC VFD"); + + /* Check if IOC VFD has been loaded dynamically */ + env_var = HDgetenv(HDF5_DRIVER); + if (env_var && !HDstrcmp(env_var, H5FD_IOC_NAME)) { + int mpi_initialized = 0; + int provided = 0; + + /* Initialize MPI if not already initialized */ + if (MPI_SUCCESS != (mpi_code = MPI_Initialized(&mpi_initialized))) + H5_SUBFILING_MPI_GOTO_ERROR(H5I_INVALID_HID, "MPI_Initialized failed", mpi_code); + if (mpi_initialized) { + /* If MPI is initialized, validate that it was initialized with MPI_THREAD_MULTIPLE */ + if (MPI_SUCCESS != (mpi_code = MPI_Query_thread(&provided))) + H5_SUBFILING_MPI_GOTO_ERROR(H5I_INVALID_HID, "MPI_Query_thread failed", mpi_code); + if (provided != MPI_THREAD_MULTIPLE) + H5_SUBFILING_GOTO_ERROR( + H5E_VFL, H5E_CANTINIT, H5I_INVALID_HID, + "IOC VFD requires the use of MPI_Init_thread with MPI_THREAD_MULTIPLE"); + } + else { + int required = MPI_THREAD_MULTIPLE; + + /* Otherwise, initialize MPI */ + if (MPI_SUCCESS != (mpi_code = MPI_Init_thread(NULL, NULL, required, &provided))) + H5_SUBFILING_MPI_GOTO_ERROR(H5I_INVALID_HID, "MPI_Init_thread failed", mpi_code); + + H5FD_mpi_self_initialized = TRUE; + + if (provided != required) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTINIT, H5I_INVALID_HID, + "MPI doesn't support MPI_Init_thread with MPI_THREAD_MULTIPLE"); + } + } + + /* Retrieve upper bound for MPI message tag value */ + if (MPI_SUCCESS != (mpi_code = MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, &H5FD_IOC_tag_ub_val_ptr, + &key_val_retrieved))) + H5_SUBFILING_MPI_GOTO_ERROR(H5I_INVALID_HID, "MPI_Comm_get_attr failed", mpi_code); + + if (!key_val_retrieved) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTINIT, H5I_INVALID_HID, + "couldn't retrieve value for MPI_TAG_UB"); + } + + ret_value = H5FD_IOC_g; + +done: + H5_SUBFILING_FUNC_LEAVE; +} /* end H5FD_ioc_init() */ + +/*--------------------------------------------------------------------------- + * Function: H5FD__ioc_term + * + * Purpose: Shut down the IOC VFD. + * + * Returns: SUCCEED (Can't fail) + *--------------------------------------------------------------------------- + */ +static herr_t +H5FD__ioc_term(void) +{ + herr_t ret_value = SUCCEED; + + H5FD_IOC_LOG_CALL(__func__); + + if (H5FD_IOC_g >= 0) { + /* Terminate MPI if the driver initialized it */ + if (H5FD_mpi_self_initialized) { + int mpi_finalized = 0; + int mpi_code; + + if (MPI_SUCCESS != (mpi_code = MPI_Finalized(&mpi_finalized))) + H5_SUBFILING_MPI_GOTO_ERROR(FAIL, "MPI_Finalized failed", mpi_code); + if (!mpi_finalized) { + if (MPI_SUCCESS != (mpi_code = MPI_Finalize())) + H5_SUBFILING_MPI_GOTO_ERROR(FAIL, "MPI_Finalize failed", mpi_code); + } + + H5FD_mpi_self_initialized = FALSE; + } + } + +done: + /* Reset VFL ID */ + H5FD_IOC_g = H5I_INVALID_HID; + + H5_SUBFILING_FUNC_LEAVE; +} /* end H5FD__ioc_term() */ + +/*------------------------------------------------------------------------- + * Function: H5Pset_fapl_ioc + * + * Purpose: Sets the file access property list to use the + * ioc driver. 
+ * + * Return: SUCCEED/FAIL + *------------------------------------------------------------------------- + */ +herr_t +H5Pset_fapl_ioc(hid_t fapl_id, H5FD_ioc_config_t *vfd_config) +{ + H5FD_ioc_config_t *ioc_conf = NULL; + H5P_genplist_t * plist_ptr = NULL; + herr_t ret_value = SUCCEED; + + H5FD_IOC_LOG_CALL(__func__); + + if (NULL == (plist_ptr = H5P_object_verify(fapl_id, H5P_FILE_ACCESS))) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADTYPE, FAIL, "not a file access property list"); + + if (vfd_config == NULL) { + if (NULL == (ioc_conf = HDcalloc(1, sizeof(*ioc_conf)))) + H5_SUBFILING_GOTO_ERROR(H5E_RESOURCE, H5E_CANTALLOC, FAIL, + "can't allocate IOC VFD configuration"); + ioc_conf->ioc_fapl_id = H5I_INVALID_HID; + + /* Get IOC VFD defaults */ + if (H5FD__ioc_get_default_config(ioc_conf) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_PLIST, H5E_CANTSET, FAIL, "can't get default IOC VFD configuration"); + + vfd_config = ioc_conf; + } + + if (H5FD__ioc_validate_config(vfd_config) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, FAIL, "invalid IOC VFD configuration"); + + ret_value = H5P_set_driver(plist_ptr, H5FD_IOC, vfd_config, NULL); + +done: + if (ioc_conf) { + if (ioc_conf->ioc_fapl_id >= 0 && H5I_dec_ref(ioc_conf->ioc_fapl_id) < 0) + H5_SUBFILING_DONE_ERROR(H5E_PLIST, H5E_CANTDEC, FAIL, "can't close IOC FAPL"); + HDfree(ioc_conf); + } + + H5_SUBFILING_FUNC_LEAVE; +} /* end H5Pset_fapl_ioc() */ + +/*------------------------------------------------------------------------- + * Function: H5Pget_fapl_ioc + * + * Purpose: Returns information about the ioc file access property + * list through the structure config_out. + * + * Will fail if config_out is received without pre-set valid + * magic and version information. + * + * Return: SUCCEED/FAIL + *------------------------------------------------------------------------- + */ +herr_t +H5Pget_fapl_ioc(hid_t fapl_id, H5FD_ioc_config_t *config_out) +{ + const H5FD_ioc_config_t *config_ptr = NULL; + H5P_genplist_t * plist_ptr = NULL; + hbool_t use_default_config = FALSE; + herr_t ret_value = SUCCEED; + + H5FD_IOC_LOG_CALL(__func__); + + /* Check arguments */ + if (config_out == NULL) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, FAIL, "config_out is NULL"); + + if (NULL == (plist_ptr = H5P_object_verify(fapl_id, H5P_FILE_ACCESS))) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADTYPE, FAIL, "not a file access property list"); + + if (H5FD_IOC != H5P_peek_driver(plist_ptr)) + use_default_config = TRUE; + else { + config_ptr = H5P_peek_driver_info(plist_ptr); + if (NULL == config_ptr) + use_default_config = TRUE; + } + + if (use_default_config) { + if (H5FD__ioc_get_default_config(config_out) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_PLIST, H5E_CANTGET, FAIL, "can't get default IOC VFD configuration"); + } + else { + /* Copy the IOC fapl data out */ + HDmemcpy(config_out, config_ptr, sizeof(H5FD_ioc_config_t)); + + /* Copy the driver info value */ + if (H5FD__copy_plist(config_ptr->ioc_fapl_id, &(config_out->ioc_fapl_id)) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_BADVALUE, FAIL, "can't copy IOC FAPL"); + } + +done: + H5_SUBFILING_FUNC_LEAVE; +} /* end H5Pget_fapl_ioc() */ + +/*------------------------------------------------------------------------- + * Function: H5FD__ioc_get_default_config + * + * Purpose: This is called by H5Pset/get_fapl_ioc when called with no + * established configuration info. This simply fills in + * the basics. This avoids the necessity of having the + * user write code to initialize the config structure. 
+ * + * Return: SUCCEED/FAIL + *------------------------------------------------------------------------- + */ +static herr_t +H5FD__ioc_get_default_config(H5FD_ioc_config_t *config_out) +{ + herr_t ret_value = SUCCEED; + + HDassert(config_out); + + HDmemset(config_out, 0, sizeof(*config_out)); + + config_out->magic = H5FD_IOC_FAPL_MAGIC; + config_out->version = H5FD_CURR_IOC_FAPL_VERSION; + config_out->ioc_fapl_id = H5I_INVALID_HID; + config_out->stripe_count = 0; + config_out->stripe_depth = H5FD_DEFAULT_STRIPE_DEPTH; + config_out->ioc_selection = SELECT_IOC_ONE_PER_NODE; + + /* Create a default FAPL and choose an appropriate underlying driver */ + if ((config_out->ioc_fapl_id = H5Pcreate(H5P_FILE_ACCESS)) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_PLIST, H5E_CANTCREATE, FAIL, "can't create default FAPL"); + + /* Currently, only sec2 vfd supported */ + if (H5Pset_fapl_sec2(config_out->ioc_fapl_id) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_PLIST, H5E_CANTSET, FAIL, "can't set Sec2 VFD on IOC FAPL"); + + /* Specific to this I/O Concentrator */ + config_out->thread_pool_count = H5FD_IOC_THREAD_POOL_SIZE; + +done: + if (ret_value < 0) { + if (config_out->ioc_fapl_id >= 0 && H5Pclose(config_out->ioc_fapl_id) < 0) + H5_SUBFILING_DONE_ERROR(H5E_PLIST, H5E_CANTCLOSEOBJ, FAIL, "can't close FAPL"); + } + + H5_SUBFILING_FUNC_LEAVE; +} + +/*------------------------------------------------------------------------- + * Function: H5FD__ioc_validate_config() + * + * Purpose: Test to see if the supplied instance of + * H5FD_ioc_config_t contains internally consistent data. + * Return SUCCEED if so, and FAIL otherwise. + * + * Note the difference between internally consistent and + * correct. As we will have to try to setup the IOC to + * determine whether the supplied data is correct, + * we will settle for internal consistency at this point + * + * Return: SUCCEED if instance of H5FD_ioc_config_t contains + * internally consistent data, FAIL otherwise. + * + *------------------------------------------------------------------------- + */ +static herr_t +H5FD__ioc_validate_config(const H5FD_ioc_config_t *fa) +{ + herr_t ret_value = SUCCEED; + + HDassert(fa != NULL); + + if (fa->version != H5FD_CURR_IOC_FAPL_VERSION) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, FAIL, "Unknown H5FD_ioc_config_t version"); + + if (fa->magic != H5FD_IOC_FAPL_MAGIC) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, FAIL, "invalid H5FD_ioc_config_t magic value"); + + /* TODO: add extra IOC configuration validation code */ + +done: + H5_SUBFILING_FUNC_LEAVE; +} /* end H5FD__ioc_validate_config() */ + +/*------------------------------------------------------------------------- + * Function: H5FD__ioc_sb_size + * + * Purpose: Obtains the number of bytes required to store the driver file + * access data in the HDF5 superblock. + * + * Return: Success: Number of bytes required. + * + * Failure: 0 if an error occurs or if the driver has no + * data to store in the superblock. 
+ * + * NOTE: no public API for H5FD_sb_size, it needs to be added + *------------------------------------------------------------------------- + */ +static hsize_t +H5FD__ioc_sb_size(H5FD_t *_file) +{ + H5FD_ioc_t *file = (H5FD_ioc_t *)_file; + hsize_t ret_value = 0; + + H5FD_IOC_LOG_CALL(__func__); + + /* Sanity check */ + HDassert(file); + HDassert(file->ioc_file); + + if (file->ioc_file) + ret_value = H5FD_sb_size(file->ioc_file); + + H5_SUBFILING_FUNC_LEAVE; +} /* end H5FD__ioc_sb_size */ + +/*------------------------------------------------------------------------- + * Function: H5FD__ioc_sb_encode + * + * Purpose: Encode driver-specific data into the output arguments. + * + * Return: SUCCEED/FAIL + *------------------------------------------------------------------------- + */ +static herr_t +H5FD__ioc_sb_encode(H5FD_t *_file, char *name /*out*/, unsigned char *buf /*out*/) +{ + H5FD_ioc_t *file = (H5FD_ioc_t *)_file; + herr_t ret_value = SUCCEED; /* Return value */ + + H5FD_IOC_LOG_CALL(__func__); + + /* Sanity check */ + HDassert(file); + HDassert(file->ioc_file); + + if (file->ioc_file && H5FD_sb_encode(file->ioc_file, name, buf) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTENCODE, FAIL, "unable to encode the superblock in R/W file"); + +done: + H5_SUBFILING_FUNC_LEAVE; +} /* end H5FD__ioc_sb_encode */ + +/*------------------------------------------------------------------------- + * Function: H5FD__ioc_sb_decode + * + * Purpose: Decodes the driver information block. + * + * Return: SUCCEED/FAIL + * + * NOTE: no public API for H5FD_sb_size, need to add + *------------------------------------------------------------------------- + */ +static herr_t +H5FD__ioc_sb_decode(H5FD_t *_file, const char *name, const unsigned char *buf) +{ + H5FD_ioc_t *file = (H5FD_ioc_t *)_file; + herr_t ret_value = SUCCEED; /* Return value */ + + H5FD_IOC_LOG_CALL(__func__); + + /* Sanity check */ + HDassert(file); + HDassert(file->ioc_file); + + if (H5FD_sb_load(file->ioc_file, name, buf) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTDECODE, FAIL, "unable to decode the superblock in R/W file"); + +done: + H5_SUBFILING_FUNC_LEAVE; +} /* end H5FD__ioc_sb_decode */ + +/*------------------------------------------------------------------------- + * Function: H5FD__ioc_fapl_get + * + * Purpose: Returns a file access property list which indicates how the + * specified file is being accessed. The return list could be + * used to access another file the same way. + * + * Return: Success: Ptr to new file access property list with all + * members copied from the file struct. + * Failure: NULL + *------------------------------------------------------------------------- + */ +static void * +H5FD__ioc_fapl_get(H5FD_t *_file) +{ + H5FD_ioc_t *file = (H5FD_ioc_t *)_file; + void * ret_value = NULL; + + H5FD_IOC_LOG_CALL(__func__); + + ret_value = H5FD__ioc_fapl_copy(&(file->fa)); + + H5_SUBFILING_FUNC_LEAVE; +} /* end H5FD__ioc_fapl_get() */ + +/*------------------------------------------------------------------------- + * Function: H5FD__copy_plist + * + * Purpose: Sanity-wrapped H5P_copy_plist() for each channel. + * Utility function for operation in multiple locations. + * + * Return: 0 on success, -1 on error. 
+ *------------------------------------------------------------------------- + */ +static int +H5FD__copy_plist(hid_t fapl_id, hid_t *id_out_ptr) +{ + int ret_value = 0; + H5P_genplist_t *plist_ptr = NULL; + + H5FD_IOC_LOG_CALL(__func__); + + HDassert(id_out_ptr != NULL); + + if (FALSE == H5P_isa_class(fapl_id, H5P_FILE_ACCESS)) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADTYPE, -1, "not a file access property list"); + + plist_ptr = (H5P_genplist_t *)H5I_object(fapl_id); + if (NULL == plist_ptr) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADTYPE, -1, "unable to get property list"); + + *id_out_ptr = H5P_copy_plist(plist_ptr, FALSE); + if (H5I_INVALID_HID == *id_out_ptr) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_BADTYPE, -1, "unable to copy file access property list"); + +done: + H5_SUBFILING_FUNC_LEAVE; +} /* end H5FD__copy_plist() */ + +/*------------------------------------------------------------------------- + * Function: H5FD__ioc_fapl_copy + * + * Purpose: Copies the file access properties. + * + * Return: Success: Pointer to a new property list info structure. + * Failure: NULL + *------------------------------------------------------------------------- + */ +static void * +H5FD__ioc_fapl_copy(const void *_old_fa) +{ + const H5FD_ioc_config_t *old_fa_ptr = (const H5FD_ioc_config_t *)_old_fa; + H5FD_ioc_config_t * new_fa_ptr = NULL; + void * ret_value = NULL; + + H5FD_IOC_LOG_CALL(__func__); + + HDassert(old_fa_ptr); + + new_fa_ptr = H5FL_CALLOC(H5FD_ioc_config_t); + if (NULL == new_fa_ptr) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTALLOC, NULL, "unable to allocate log file FAPL"); + + HDmemcpy(new_fa_ptr, old_fa_ptr, sizeof(H5FD_ioc_config_t)); + + /* Copy the FAPL */ + if (H5FD__copy_plist(old_fa_ptr->ioc_fapl_id, &(new_fa_ptr->ioc_fapl_id)) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_BADVALUE, NULL, "can't copy the IOC FAPL"); + + ret_value = (void *)new_fa_ptr; + +done: + if (NULL == ret_value) + if (new_fa_ptr) + new_fa_ptr = H5FL_FREE(H5FD_ioc_config_t, new_fa_ptr); + + H5_SUBFILING_FUNC_LEAVE; +} /* end H5FD__ioc_fapl_copy() */ + +/*-------------------------------------------------------------------------- + * Function: H5FD__ioc_fapl_free + * + * Purpose: Releases the file access lists + * + * Return: SUCCEED/FAIL + *-------------------------------------------------------------------------- + */ +static herr_t +H5FD__ioc_fapl_free(void *_fapl) +{ + H5FD_ioc_config_t *fapl = (H5FD_ioc_config_t *)_fapl; + herr_t ret_value = SUCCEED; + + H5FD_IOC_LOG_CALL(__func__); + + /* Check arguments */ + HDassert(fapl); + + if (fapl->ioc_fapl_id >= 0 && H5I_dec_ref(fapl->ioc_fapl_id) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTDEC, FAIL, "can't close FAPL ID"); + + /* Free the property list */ + fapl = H5FL_FREE(H5FD_ioc_config_t, fapl); + +done: + H5_SUBFILING_FUNC_LEAVE; +} /* end H5FD__ioc_fapl_free() */ + +/*------------------------------------------------------------------------- + * Function: H5FD__ioc_open + * + * Purpose: Create and/or opens a file as an HDF5 file. + * + * Return: Success: A pointer to a new file data structure. The + * public fields will be initialized by the + * caller, which is always H5FD_open(). 
+ * Failure: NULL + *------------------------------------------------------------------------- + */ +static H5FD_t * +H5FD__ioc_open(const char *name, unsigned flags, hid_t fapl_id, haddr_t maxaddr) +{ + H5FD_ioc_t * file_ptr = NULL; /* Ioc VFD info */ + const H5FD_ioc_config_t *config_ptr = NULL; /* Driver-specific property list */ + H5FD_ioc_config_t default_config; + H5FD_class_t * driver = NULL; /* VFD for file */ + H5P_genplist_t * plist_ptr = NULL; + H5FD_driver_prop_t driver_prop; /* Property for driver ID & info */ + int mpi_inited = 0; + int mpi_code; /* MPI return code */ + H5FD_t * ret_value = NULL; + + H5FD_IOC_LOG_CALL(__func__); + + /* Check arguments */ + if (!name || !*name) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, NULL, "invalid file name"); + if (0 == maxaddr || HADDR_UNDEF == maxaddr) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADRANGE, NULL, "bogus maxaddr"); + if (ADDR_OVERFLOW(maxaddr)) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_OVERFLOW, NULL, "bogus maxaddr"); + + if (NULL == (file_ptr = (H5FD_ioc_t *)H5FL_CALLOC(H5FD_ioc_t))) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTALLOC, NULL, "unable to allocate file struct"); + file_ptr->comm = MPI_COMM_NULL; + file_ptr->info = MPI_INFO_NULL; + file_ptr->context_id = -1; + file_ptr->fa.ioc_fapl_id = H5I_INVALID_HID; + + /* Get the driver-specific file access properties */ + if (NULL == (plist_ptr = (H5P_genplist_t *)H5I_object(fapl_id))) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADTYPE, NULL, "not a file access property list"); + + if (H5FD_mpi_self_initialized) { + file_ptr->comm = MPI_COMM_WORLD; + file_ptr->info = MPI_INFO_NULL; + + mpi_inited = 1; + } + else { + /* Get the MPI communicator and info object from the property list */ + if (H5P_get(plist_ptr, H5F_ACS_MPI_PARAMS_COMM_NAME, &file_ptr->comm) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTGET, NULL, "can't get MPI communicator"); + if (H5P_get(plist_ptr, H5F_ACS_MPI_PARAMS_INFO_NAME, &file_ptr->info) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTGET, NULL, "can't get MPI info object"); + + if (file_ptr->comm == MPI_COMM_NULL) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_BADVALUE, NULL, "invalid or unset MPI communicator in FAPL"); + + /* Get the status of MPI initialization */ + if (MPI_SUCCESS != (mpi_code = MPI_Initialized(&mpi_inited))) + H5_SUBFILING_MPI_GOTO_ERROR(NULL, "MPI_Initialized failed", mpi_code); + if (!mpi_inited) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_UNINITIALIZED, NULL, "MPI has not been initialized"); + } + + /* Get the MPI rank of this process and the total number of processes */ + if (MPI_SUCCESS != (mpi_code = MPI_Comm_rank(file_ptr->comm, &file_ptr->mpi_rank))) + H5_SUBFILING_MPI_GOTO_ERROR(NULL, "MPI_Comm_rank failed", mpi_code); + if (MPI_SUCCESS != (mpi_code = MPI_Comm_size(file_ptr->comm, &file_ptr->mpi_size))) + H5_SUBFILING_MPI_GOTO_ERROR(NULL, "MPI_Comm_size failed", mpi_code); + + config_ptr = H5P_peek_driver_info(plist_ptr); + if (!config_ptr || (H5P_FILE_ACCESS_DEFAULT == fapl_id)) { + if (H5FD__ioc_get_default_config(&default_config) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_PLIST, H5E_CANTGET, NULL, "can't get default IOC VFD configuration"); + config_ptr = &default_config; + } + + /* Fill in the file config values */ + HDmemcpy(&file_ptr->fa, config_ptr, sizeof(H5FD_ioc_config_t)); + + if (NULL != (file_ptr->file_path = HDrealpath(name, NULL))) { + char *path = NULL; + char *directory = dirname(path); + + if (NULL == (path = HDstrdup(file_ptr->file_path))) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTCOPY, NULL, "can't copy subfiling 
subfile path"); + if (NULL == (file_ptr->file_dir = HDstrdup(directory))) { + HDfree(path); + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTCOPY, NULL, + "can't copy subfiling subfile directory path"); + } + + HDfree(path); + } + else { + if (ENOENT == errno) { + if (NULL == (file_ptr->file_path = HDstrdup(name))) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTCOPY, NULL, "can't copy file name"); + if (NULL == (file_ptr->file_dir = HDstrdup("."))) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTOPENFILE, NULL, "can't set subfile directory path"); + } + else + H5_SUBFILING_SYS_GOTO_ERROR(H5E_VFL, H5E_CANTGET, NULL, "can't resolve subfile path"); + } + + /* Copy the ioc FAPL. */ + if (H5FD__copy_plist(config_ptr->ioc_fapl_id, &(file_ptr->fa.ioc_fapl_id)) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_BADVALUE, NULL, "can't copy FAPL"); + + /* Check the underlying driver (sec2/mpio/etc.) */ + if (NULL == (plist_ptr = (H5P_genplist_t *)H5I_object(config_ptr->ioc_fapl_id))) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADTYPE, NULL, "not a file access property list"); + + if (H5P_peek(plist_ptr, H5F_ACS_FILE_DRV_NAME, &driver_prop) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_PLIST, H5E_CANTGET, NULL, "can't get driver ID & info"); + if (NULL == (driver = (H5FD_class_t *)H5I_object(driver_prop.driver_id))) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_BADVALUE, NULL, + "invalid driver ID in file access property list"); + + if (driver->value != H5_VFD_SEC2) { + H5_SUBFILING_GOTO_ERROR(H5E_FILE, H5E_CANTOPENFILE, NULL, + "unable to open file '%s' - only Sec2 VFD is currently supported", name); + } + else { + subfiling_context_t *sf_context = NULL; + uint64_t inode_id = UINT64_MAX; + int ioc_flags; + int l_error = 0; + int g_error = 0; + + /* Translate the HDF5 file open flags into standard POSIX open flags */ + ioc_flags = (H5F_ACC_RDWR & flags) ? O_RDWR : O_RDONLY; + if (H5F_ACC_TRUNC & flags) + ioc_flags |= O_TRUNC; + if (H5F_ACC_CREAT & flags) + ioc_flags |= O_CREAT; + if (H5F_ACC_EXCL & flags) + ioc_flags |= O_EXCL; + + file_ptr->ioc_file = H5FD_open(file_ptr->file_path, flags, config_ptr->ioc_fapl_id, HADDR_UNDEF); + if (file_ptr->ioc_file) { + h5_stat_t sb; + void * file_handle = NULL; + + if (file_ptr->mpi_rank == 0) { + if (H5FDget_vfd_handle(file_ptr->ioc_file, config_ptr->ioc_fapl_id, &file_handle) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTGET, NULL, "can't get file handle"); + + if (HDfstat(*(int *)file_handle, &sb) < 0) + H5_SUBFILING_SYS_GOTO_ERROR(H5E_FILE, H5E_BADFILE, NULL, "unable to fstat file"); + + HDcompile_assert(sizeof(uint64_t) >= sizeof(ino_t)); + file_ptr->inode = sb.st_ino; + inode_id = (uint64_t)sb.st_ino; + } + + if (MPI_SUCCESS != (mpi_code = MPI_Bcast(&inode_id, 1, MPI_UINT64_T, 0, file_ptr->comm))) + H5_SUBFILING_MPI_GOTO_ERROR(NULL, "MPI_Bcast failed", mpi_code); + + if (file_ptr->mpi_rank != 0) + file_ptr->inode = (ino_t)inode_id; + } + else { + /* The two-step file opening approach may be + * the root cause for the sec2 open to return a NULL. + * It is prudent then, to collectively fail (early) in this case. + */ + l_error = 1; + } + + /* Check if any ranks had an issue opening the file */ + if (MPI_SUCCESS != + (mpi_code = MPI_Allreduce(&l_error, &g_error, 1, MPI_INT, MPI_SUM, file_ptr->comm))) + H5_SUBFILING_MPI_GOTO_ERROR(NULL, "MPI_Allreduce failed", mpi_code); + if (g_error) + H5_SUBFILING_GOTO_ERROR(H5E_FILE, H5E_CANTOPENFILE, NULL, + "one or more MPI ranks were unable to open file '%s'", name); + + /* + * Open the subfiles for this HDF5 file. 
A subfiling + * context ID will be returned, which is used for + * further interactions with this file's subfiles. + */ + if (H5_open_subfiles(file_ptr->file_path, inode_id, file_ptr->fa.ioc_selection, ioc_flags, + file_ptr->comm, &file_ptr->context_id) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_FILE, H5E_CANTOPENFILE, NULL, "unable to open subfiles for file '%s'", + name); + + /* Initialize I/O concentrator threads if this MPI rank is an I/O concentrator */ + sf_context = H5_get_subfiling_object(file_ptr->context_id); + if (sf_context && sf_context->topology->rank_is_ioc) { + if (initialize_ioc_threads(sf_context) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_FILE, H5E_CANTINIT, NULL, + "unable to initialize I/O concentrator threads"); + } + } + + ret_value = (H5FD_t *)file_ptr; + +done: + /* run a barrier just before exit. The objective is to + * ensure that the IOCs are fully up and running before + * we proceed. Note that this barrier is not sufficient + * by itself -- we also need code in initialize_ioc_threads() + * to wait until the main IOC thread has finished its + * initialization. + */ + if (mpi_inited) { + MPI_Comm barrier_comm = MPI_COMM_WORLD; + + if (file_ptr && (file_ptr->comm != MPI_COMM_NULL)) + barrier_comm = file_ptr->comm; + + if (MPI_SUCCESS != (mpi_code = MPI_Barrier(barrier_comm))) + H5_SUBFILING_MPI_DONE_ERROR(NULL, "MPI_Barrier failed", mpi_code); + } + + if (NULL == ret_value) { + if (file_ptr) { + if (H5FD__ioc_close_int(file_ptr) < 0) + H5_SUBFILING_DONE_ERROR(H5E_FILE, H5E_CLOSEERROR, NULL, "can't close IOC file"); + } + } /* end if error */ + + H5_SUBFILING_FUNC_LEAVE; +} /* end H5FD__ioc_open() */ + +static herr_t +H5FD__ioc_close_int(H5FD_ioc_t *file_ptr) +{ + herr_t ret_value = SUCCEED; + + HDassert(file_ptr); + +#ifdef H5FD_IOC_DEBUG + { + subfiling_context_t *sf_context = H5_get_subfiling_object(file_ptr->fa.context_id); + if (sf_context) { + if (sf_context->topology->rank_is_ioc) + HDprintf("[%s %d] fd=%d\n", __func__, file_ptr->mpi_rank, sf_context->sf_fid); + else + HDprintf("[%s %d] fd=*\n", __func__, file_ptr->mpi_rank); + } + else + HDprintf("[%s %d] invalid subfiling context", __func__, file_ptr->mpi_rank); + HDfflush(stdout); + } +#endif + + if (file_ptr->fa.ioc_fapl_id >= 0 && H5I_dec_ref(file_ptr->fa.ioc_fapl_id) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_ARGS, FAIL, "can't close FAPL"); + file_ptr->fa.ioc_fapl_id = H5I_INVALID_HID; + + /* Close underlying file */ + if (file_ptr->ioc_file) { + if (H5FD_close(file_ptr->ioc_file) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTCLOSEFILE, FAIL, "unable to close HDF5 file"); + file_ptr->ioc_file = NULL; + } + + if (file_ptr->context_id >= 0) { + subfiling_context_t *sf_context = H5_get_subfiling_object(file_ptr->context_id); + int mpi_code; + + /* Don't allow IOC threads to be finalized until everyone gets here */ + if (MPI_SUCCESS != (mpi_code = MPI_Barrier(file_ptr->comm))) + H5_SUBFILING_MPI_GOTO_ERROR(FAIL, "MPI_Barrier failed", mpi_code); + + if (sf_context && sf_context->topology->rank_is_ioc) { + if (finalize_ioc_threads(sf_context) < 0) + /* Note that closing of subfiles is collective */ + H5_SUBFILING_DONE_ERROR(H5E_VFL, H5E_CANTCLOSEFILE, FAIL, "unable to finalize IOC threads"); + } + + if (H5_close_subfiles(file_ptr->context_id) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTCLOSEFILE, FAIL, "unable to close subfiling file(s)"); + file_ptr->context_id = -1; + } + + if (H5_mpi_comm_free(&file_ptr->comm) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTFREE, FAIL, "unable to free MPI Communicator"); + if 
(H5_mpi_info_free(&file_ptr->info) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTFREE, FAIL, "unable to free MPI Info object"); + +done: + HDfree(file_ptr->file_path); + file_ptr->file_path = NULL; + + HDfree(file_ptr->file_dir); + file_ptr->file_dir = NULL; + + /* Release the file info */ + file_ptr = H5FL_FREE(H5FD_ioc_t, file_ptr); + + H5_SUBFILING_FUNC_LEAVE; +} + +/*------------------------------------------------------------------------- + * Function: H5FD__ioc_close + * + * Purpose: Closes files + * + * Return: Success: SUCCEED + * Failure: FAIL, file not closed. + *------------------------------------------------------------------------- + */ +static herr_t +H5FD__ioc_close(H5FD_t *_file) +{ + H5FD_ioc_t *file = (H5FD_ioc_t *)_file; + herr_t ret_value = SUCCEED; + + H5FD_IOC_LOG_CALL(__func__); + + if (H5FD__ioc_close_int(file) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_FILE, H5E_CANTCLOSEFILE, FAIL, "can't close IOC file"); + +done: + H5_SUBFILING_FUNC_LEAVE; +} /* end H5FD__ioc_close() */ + +/*------------------------------------------------------------------------- + * Function: H5FD__ioc_cmp + * + * Purpose: Compare the keys of two files. + * + * Return: Success: A value like strcmp() + * Failure: Must never fail + *------------------------------------------------------------------------- + */ +static int +H5FD__ioc_cmp(const H5FD_t *_f1, const H5FD_t *_f2) +{ + const H5FD_ioc_t *f1 = (const H5FD_ioc_t *)_f1; + const H5FD_ioc_t *f2 = (const H5FD_ioc_t *)_f2; + herr_t ret_value = 0; /* Return value */ + + H5FD_IOC_LOG_CALL(__func__); + + HDassert(f1); + HDassert(f2); + + ret_value = H5FD_cmp(f1->ioc_file, f2->ioc_file); + + H5_SUBFILING_FUNC_LEAVE; +} /* end H5FD__ioc_cmp */ + +/*------------------------------------------------------------------------- + * Function: H5FD__ioc_query + * + * Purpose: Set the flags that this VFL driver is capable of supporting. + * (listed in H5FDpublic.h) + * + * Return: SUCCEED/FAIL + *------------------------------------------------------------------------- + */ +static herr_t +H5FD__ioc_query(const H5FD_t *_file, unsigned long *flags /* out */) +{ + const H5FD_ioc_t *file_ptr = (const H5FD_ioc_t *)_file; + herr_t ret_value = SUCCEED; + + H5FD_IOC_LOG_CALL(__func__); + + if (file_ptr == NULL) { + if (flags) + *flags = 0; + } + else if (file_ptr->ioc_file) { + if (H5FDquery(file_ptr->ioc_file, flags) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTLOCK, FAIL, "unable to query R/W file"); + } + else { + /* There is no file. Because this is a pure passthrough VFD, + * it has no features of its own. 
+ */ + if (flags) + *flags = 0; + } + +done: + H5_SUBFILING_FUNC_LEAVE; +} /* end H5FD__ioc_query() */ + +/*------------------------------------------------------------------------- + * Function: H5FD__ioc_get_type_map + * + * Purpose: Retrieve the memory type mapping for this file + * + * Return: SUCCEED/FAIL + *------------------------------------------------------------------------- + */ +static herr_t +H5FD__ioc_get_type_map(const H5FD_t *_file, H5FD_mem_t *type_map) +{ + const H5FD_ioc_t *file = (const H5FD_ioc_t *)_file; + herr_t ret_value = SUCCEED; + + H5FD_IOC_LOG_CALL(__func__); + + /* Check arguments */ + HDassert(file); + HDassert(file->ioc_file); + + /* Retrieve memory type mapping for R/W channel only */ + if (H5FD_get_fs_type_map(file->ioc_file, type_map) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTGET, FAIL, "unable to get type map for R/W file"); + +done: + H5_SUBFILING_FUNC_LEAVE; +} /* end H5FD__ioc_get_type_map() */ + +/*------------------------------------------------------------------------- + * Function: H5FD__ioc_alloc + * + * Purpose: Allocate file memory. + * + * Return: Address of allocated space (HADDR_UNDEF if error). + *------------------------------------------------------------------------- + */ +static haddr_t +H5FD__ioc_alloc(H5FD_t *_file, H5FD_mem_t type, hid_t dxpl_id, hsize_t size) +{ + H5FD_ioc_t *file = (H5FD_ioc_t *)_file; /* VFD file struct */ + haddr_t ret_value = HADDR_UNDEF; /* Return value */ + + H5FD_IOC_LOG_CALL(__func__); + + /* Check arguments */ + HDassert(file); + HDassert(file->ioc_file); + + /* Allocate file memory in the underlying R/W file and return the + * resulting address. + */ + if ((ret_value = H5FDalloc(file->ioc_file, type, dxpl_id, size)) == HADDR_UNDEF) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTINIT, HADDR_UNDEF, "unable to allocate for R/W file"); + +done: + H5_SUBFILING_FUNC_LEAVE; +} /* end H5FD__ioc_alloc() */ + +/*------------------------------------------------------------------------- + * Function: H5FD__ioc_free + * + * Purpose: Free previously allocated file memory. + * + * Return: SUCCEED/FAIL + *------------------------------------------------------------------------- + */ +static herr_t +H5FD__ioc_free(H5FD_t *_file, H5FD_mem_t type, hid_t dxpl_id, haddr_t addr, hsize_t size) +{ + H5FD_ioc_t *file = (H5FD_ioc_t *)_file; /* VFD file struct */ + herr_t ret_value = SUCCEED; /* Return value */ + + H5FD_IOC_LOG_CALL(__func__); + + /* Check arguments */ + HDassert(file); + HDassert(file->ioc_file); + + if (H5FDfree(file->ioc_file, type, dxpl_id, addr, size) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTFREE, FAIL, "unable to free for R/W file"); + +done: + H5_SUBFILING_FUNC_LEAVE; +} /* end H5FD__ioc_free() */ + +/*------------------------------------------------------------------------- + * Function: H5FD__ioc_get_eoa + * + * Purpose: Returns the end-of-address marker for the file. The EOA + * marker is the first address past the last byte allocated in + * the format address space. 
+ * + * Return: Success: The end-of-address-marker + * + * Failure: HADDR_UNDEF + *------------------------------------------------------------------------- + */ +static haddr_t +H5FD__ioc_get_eoa(const H5FD_t *_file, H5FD_mem_t type) +{ + const H5FD_ioc_t *file = (const H5FD_ioc_t *)_file; + haddr_t ret_value = HADDR_UNDEF; + + H5FD_IOC_LOG_CALL(__func__); + + /* Sanity check */ + HDassert(file); + HDassert(file->ioc_file); + + if ((ret_value = H5FD_get_eoa(file->ioc_file, type)) == HADDR_UNDEF) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_BADVALUE, HADDR_UNDEF, "unable to get eoa"); + +done: + H5_SUBFILING_FUNC_LEAVE; +} /* end H5FD__ioc_get_eoa */ + +/*------------------------------------------------------------------------- + * Function: H5FD__ioc_set_eoa + * + * Purpose: Set the end-of-address marker for the file. This function is + * called shortly after an existing HDF5 file is opened in order + * to tell the driver where the end of the HDF5 data is located. + * + * Return: SUCCEED/FAIL + *------------------------------------------------------------------------- + */ +static herr_t +H5FD__ioc_set_eoa(H5FD_t *_file, H5FD_mem_t type, haddr_t addr) +{ + H5FD_ioc_t *file = (H5FD_ioc_t *)_file; + herr_t ret_value = SUCCEED; /* Return value */ + + H5FD_IOC_LOG_CALL(__func__); + + /* Sanity check */ + HDassert(file); + HDassert(file->ioc_file); + + if (H5FD_set_eoa(file->ioc_file, type, addr) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTSET, FAIL, "H5FDset_eoa failed for R/W file"); + +done: + H5_SUBFILING_FUNC_LEAVE; +} /* end H5FD__ioc_set_eoa() */ + +/*------------------------------------------------------------------------- + * Function: H5FD__ioc_get_eof + * + * Purpose: Returns the end-of-file marker for the file. When a + * subfiling context is available, this is the logical EOF + * maintained by that context; otherwise it is retrieved + * from the underlying file. + * + * Return: Success: The end-of-file marker + * + * Failure: HADDR_UNDEF + *------------------------------------------------------------------------- + */ +static haddr_t +H5FD__ioc_get_eof(const H5FD_t *_file, H5FD_mem_t type) +{ + const H5FD_ioc_t * file = (const H5FD_ioc_t *)_file; + haddr_t ret_value = HADDR_UNDEF; /* Return value */ + subfiling_context_t *sf_context = NULL; + + H5FD_IOC_LOG_CALL(__func__); + + /* Sanity check */ + HDassert(file); + HDassert(file->ioc_file); + + sf_context = H5_get_subfiling_object(file->context_id); + if (sf_context) { + ret_value = sf_context->sf_eof; + goto done; + } + + if (HADDR_UNDEF == (ret_value = H5FD_get_eof(file->ioc_file, type))) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTGET, HADDR_UNDEF, "unable to get eof"); + +done: + H5_SUBFILING_FUNC_LEAVE; +} /* end H5FD__ioc_get_eof */ + +/*-------------------------------------------------------------------------- + * Function: H5FD__ioc_get_handle + * + * Purpose: Returns a pointer to the file handle of the low-level virtual + * file driver. 
+ * + * Return: SUCCEED/FAIL + *-------------------------------------------------------------------------- + */ +static herr_t +H5FD__ioc_get_handle(H5FD_t *_file, hid_t H5_ATTR_UNUSED fapl, void **file_handle) +{ + H5FD_ioc_t *file = (H5FD_ioc_t *)_file; + herr_t ret_value = SUCCEED; /* Return value */ + + H5FD_IOC_LOG_CALL(__func__); + + /* Check arguments */ + HDassert(file); + HDassert(file->ioc_file); + HDassert(file_handle); + + if (H5FD_get_vfd_handle(file->ioc_file, file->fa.ioc_fapl_id, file_handle) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTGET, FAIL, "unable to get handle of R/W file"); + +done: + H5_SUBFILING_FUNC_LEAVE; +} /* end H5FD__ioc_get_handle */ + +/*------------------------------------------------------------------------- + * Function: H5FD__ioc_read + * + * Purpose: Reads SIZE bytes of data from the R/W channel, beginning at + * address ADDR into buffer BUF according to data transfer + * properties in DXPL_ID. + * + * Return: Success: SUCCEED + * The read result is written into the BUF buffer + * which should be allocated by the caller. + * Failure: FAIL + * The contents of BUF are undefined. + *------------------------------------------------------------------------- + */ +static herr_t +H5FD__ioc_read(H5FD_t *_file, H5FD_mem_t type, hid_t dxpl_id, haddr_t addr, + size_t size, void *buf) +{ + H5FD_ioc_t *file = (H5FD_ioc_t *)_file; + herr_t ret_value = SUCCEED; + + H5FD_IOC_LOG_CALL(__func__); + + HDassert(file && file->pub.cls); + HDassert(buf); + + /* Check for overflow conditions */ + if (!H5F_addr_defined(addr)) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, FAIL, "addr undefined, addr = %" PRIuHADDR, addr); + if (REGION_OVERFLOW(addr, size)) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_OVERFLOW, FAIL, "addr overflow, addr = %" PRIuHADDR, addr); + + /* Use the public API here so the DXPL is set up on the API context */ + if (H5FDread(file->ioc_file, type, dxpl_id, addr, size, buf) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_READERROR, FAIL, "Reading from R/W channel failed"); + +done: + H5_SUBFILING_FUNC_LEAVE; +} /* end H5FD__ioc_read() */ + +/*------------------------------------------------------------------------- + * Function: H5FD__ioc_write + * + * Purpose: Writes SIZE bytes of data to the IOC file, beginning at address + * ADDR from buffer BUF according to data transfer properties + * in DXPL_ID. 
+ * + * Return: SUCCEED/FAIL + *------------------------------------------------------------------------- + */ +static herr_t +H5FD__ioc_write(H5FD_t *_file, H5FD_mem_t type, hid_t dxpl_id, haddr_t addr, size_t size, const void *buf) +{ + H5P_genplist_t *plist_ptr = NULL; + herr_t ret_value = SUCCEED; + + if (NULL == (plist_ptr = (H5P_genplist_t *)H5I_object(dxpl_id))) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADTYPE, FAIL, "not a property list"); + + addr += _file->base_addr; + + ret_value = H5FD__ioc_write_vector_internal(_file, 1, &type, &addr, &size, &buf); + +done: + H5_SUBFILING_FUNC_LEAVE; +} /* end H5FD__ioc_write() */ + +static herr_t +H5FD__ioc_read_vector(H5FD_t *_file, hid_t dxpl_id, uint32_t count, H5FD_mem_t types[], haddr_t addrs[], + size_t sizes[], void *bufs[] /* out */) +{ + H5FD_ioc_t *file_ptr = (H5FD_ioc_t *)_file; + herr_t ret_value = SUCCEED; /* Return value */ + + /* Check arguments */ + if (!file_ptr) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, FAIL, "file pointer cannot be NULL"); + + if ((!types) && (count > 0)) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, FAIL, + "types parameter can't be NULL if count is positive"); + + if ((!addrs) && (count > 0)) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, FAIL, + "addrs parameter can't be NULL if count is positive"); + + if ((!sizes) && (count > 0)) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, FAIL, + "sizes parameter can't be NULL if count is positive"); + + if ((!bufs) && (count > 0)) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, FAIL, + "bufs parameter can't be NULL if count is positive"); + + /* Get the default dataset transfer property list if the user didn't provide + * one */ + if (H5P_DEFAULT == dxpl_id) { + dxpl_id = H5P_DATASET_XFER_DEFAULT; + } + else { + if (TRUE != H5P_isa_class(dxpl_id, H5P_DATASET_XFER)) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADTYPE, FAIL, "not a data transfer property list"); + } + + ret_value = H5FD__ioc_read_vector_internal(_file, count, addrs, sizes, bufs); + +done: + H5_SUBFILING_FUNC_LEAVE; +} + +static herr_t +H5FD__ioc_write_vector(H5FD_t *_file, hid_t dxpl_id, uint32_t count, H5FD_mem_t types[], haddr_t addrs[], + size_t sizes[], const void *bufs[] /* in */) +{ + H5FD_ioc_t *file = (H5FD_ioc_t *)_file; + herr_t ret_value = SUCCEED; /* Return value */ + + /* Check arguments */ + if (!file) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, FAIL, "file pointer cannot be NULL"); + + if ((!types) && (count > 0)) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, FAIL, + "types parameter can't be NULL if count is positive"); + + if ((!addrs) && (count > 0)) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, FAIL, + "addrs parameter can't be NULL if count is positive"); + + if ((!sizes) && (count > 0)) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, FAIL, + "sizes parameter can't be NULL if count is positive"); + + if ((!bufs) && (count > 0)) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, FAIL, + "bufs parameter can't be NULL if count is positive"); + + /* Get the default dataset transfer property list if the user didn't provide + * one */ + if (H5P_DEFAULT == dxpl_id) { + dxpl_id = H5P_DATASET_XFER_DEFAULT; + } + else { + if (TRUE != H5P_isa_class(dxpl_id, H5P_DATASET_XFER)) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADTYPE, FAIL, "not a data transfer property list"); + } + + ret_value = H5FD__ioc_write_vector_internal(_file, count, types, addrs, sizes, bufs); + +done: + H5_SUBFILING_FUNC_LEAVE; +} /* end H5FD__ioc_write_vector() */ + 
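For orientation, here is a minimal sketch of how this vector write path can be driven from above. It is illustrative only: the helper name and block sizes are invented, and it assumes an H5FD_t pointer already opened through this VFD stack plus the H5FDwrite_vector() public wrapper that accompanies the vector I/O work.

#include <string.h>
#include "hdf5.h"

/* Hypothetical helper: write two 4 KiB raw-data blocks with one vector
 * call. Each entry is routed to an IOC based on its address and the
 * stripe size, as in H5FD__ioc_write_vector_internal() below.
 */
static herr_t
write_two_blocks(H5FD_t *lf, haddr_t base)
{
    H5FD_mem_t    types[2] = {H5FD_MEM_DRAW, H5FD_MEM_DRAW};
    haddr_t       addrs[2] = {base, base + 4096};
    size_t        sizes[2] = {4096, 4096};
    unsigned char block_a[4096], block_b[4096];
    const void   *bufs[2]  = {block_a, block_b};

    memset(block_a, 0xAA, sizeof(block_a));
    memset(block_b, 0xBB, sizeof(block_b));

    /* Both entries are queued asynchronously and completed (via
     * MPI_Waitall in the driver) before the call returns.
     */
    return H5FDwrite_vector(lf, H5P_DATASET_XFER_DEFAULT, 2, types, addrs, sizes, bufs);
}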
+/*------------------------------------------------------------------------- + * Function: H5FD__ioc_flush + * + * Purpose: Flushes all data to disk for the underlying VFD. + * + * Return: SUCCEED/FAIL + *------------------------------------------------------------------------- + */ +static herr_t +H5FD__ioc_flush(H5FD_t *_file, hid_t dxpl_id, hbool_t closing) +{ + H5FD_ioc_t *file = (H5FD_ioc_t *)_file; + herr_t ret_value = SUCCEED; /* Return value */ + + H5FD_IOC_LOG_CALL(__func__); + + if (H5FDflush(file->ioc_file, dxpl_id, closing) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTFLUSH, FAIL, "unable to flush R/W file"); + +done: + H5_SUBFILING_FUNC_LEAVE; +} /* end H5FD__ioc_flush() */ + +/*------------------------------------------------------------------------- + * Function: H5FD__ioc_truncate + * + * Purpose: Notify the driver to truncate the file back to the allocated size. + * + * Return: SUCCEED/FAIL + *------------------------------------------------------------------------- + */ +static herr_t +H5FD__ioc_truncate(H5FD_t *_file, hid_t dxpl_id, hbool_t closing) +{ + H5FD_ioc_t *file = (H5FD_ioc_t *)_file; + herr_t ret_value = SUCCEED; /* Return value */ + + H5FD_IOC_LOG_CALL(__func__); + + HDassert(file); + HDassert(file->ioc_file); + + if (H5FDtruncate(file->ioc_file, dxpl_id, closing) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTUPDATE, FAIL, "unable to truncate R/W file"); + +done: + H5_SUBFILING_FUNC_LEAVE; +} /* end H5FD__ioc_truncate */ + +/*-------------------------------------------------------------------------- + * Function: H5FD__ioc_lock + * + * Purpose: Sets a file lock. + * + * Return: SUCCEED/FAIL + *-------------------------------------------------------------------------- + */ +static herr_t +H5FD__ioc_lock(H5FD_t *_file, hbool_t rw) +{ + H5FD_ioc_t *file = (H5FD_ioc_t *)_file; /* VFD file struct */ + herr_t ret_value = SUCCEED; /* Return value */ + + H5FD_IOC_LOG_CALL(__func__); + + HDassert(file); + HDassert(file->ioc_file); + + if (H5FD_lock(file->ioc_file, rw) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTLOCKFILE, FAIL, "unable to lock file"); + +done: + H5_SUBFILING_FUNC_LEAVE; +} /* end H5FD__ioc_lock */ + +/*-------------------------------------------------------------------------- + * Function: H5FD__ioc_unlock + * + * Purpose: Removes a file lock. + * + * Return: SUCCEED/FAIL + *-------------------------------------------------------------------------- + */ +static herr_t +H5FD__ioc_unlock(H5FD_t *_file) +{ + H5FD_ioc_t *file = (H5FD_ioc_t *)_file; /* VFD file struct */ + herr_t ret_value = SUCCEED; /* Return value */ + + H5FD_IOC_LOG_CALL(__func__); + + /* Check arguments */ + HDassert(file); + HDassert(file->ioc_file); + + if (H5FD_unlock(file->ioc_file) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTUNLOCKFILE, FAIL, "unable to unlock file"); + +done: + H5_SUBFILING_FUNC_LEAVE; +} /* end H5FD__ioc_unlock */ + +static herr_t +H5FD__ioc_del(const char *name, hid_t fapl) +{ + herr_t ret_value = SUCCEED; + + (void)name; + (void)fapl; + + /* TODO: implement later */ + + H5_SUBFILING_FUNC_LEAVE; +} + +/*-------------------------------------------------------------------------- + * Function: H5FD__ioc_write_vector_internal + * + * Purpose: This function takes 'count' vector entries + * and initiates an asynch write operation for each. + * By asynchronous, we mean that MPI_Isends are utilized + * to communicate the write operations to the 'count' + * IO Concentrators. 
The calling function will have + * decomposed the actual user IO request into the + * component segments, each IO having a maximum size + * of "stripe_depth", which is recorded in the + * subfiling_context_t 'sf_context' structure. + * + * Return: SUCCEED if no errors, FAIL otherwise. + *-------------------------------------------------------------------------- + */ +static herr_t +H5FD__ioc_write_vector_internal(H5FD_t *_file, uint32_t count, H5FD_mem_t types[], haddr_t addrs[], + size_t sizes[], const void *bufs[] /* in */) +{ + subfiling_context_t *sf_context = NULL; + MPI_Request * active_reqs = NULL; + H5FD_ioc_t * file_ptr = (H5FD_ioc_t *)_file; + io_req_t ** sf_async_reqs = NULL; + int64_t sf_context_id = -1; + herr_t ret_value = SUCCEED; + struct __mpi_req { + int n_reqs; + MPI_Request *active_reqs; + } *mpi_reqs = NULL; + + HDassert(_file); + HDassert(addrs); + HDassert(sizes); + HDassert(bufs); + + if (count == 0) + H5_SUBFILING_GOTO_DONE(SUCCEED); + + sf_context_id = file_ptr->context_id; + + if (NULL == (sf_context = H5_get_subfiling_object(sf_context_id))) + H5_SUBFILING_GOTO_ERROR(H5E_IO, H5E_CANTGET, FAIL, "can't get subfiling context from ID"); + HDassert(sf_context->topology); + HDassert(sf_context->topology->n_io_concentrators); + + if (NULL == (active_reqs = HDcalloc((size_t)(count + 2), sizeof(struct __mpi_req)))) + H5_SUBFILING_GOTO_ERROR(H5E_RESOURCE, H5E_CANTALLOC, FAIL, + "can't allocate active I/O requests array"); + + if (NULL == (sf_async_reqs = HDcalloc((size_t)count, sizeof(*sf_async_reqs)))) + H5_SUBFILING_GOTO_ERROR(H5E_RESOURCE, H5E_CANTALLOC, FAIL, "can't allocate I/O request array"); + + /* + * Note: We allocated extra space in the active_reqs array (above). + * The extra should be enough for an integer plus a pointer. + */ + mpi_reqs = (struct __mpi_req *)&active_reqs[count]; + mpi_reqs->n_reqs = (int)count; + mpi_reqs->active_reqs = active_reqs; + + /* Each pass through the following should queue an MPI write + * to a new IOC. Both the IOC selection and offset within the + * particular subfile are based on the combination of striping + * factors and the virtual file offset (addrs[i]). + */ + for (size_t i = 0; i < (size_t)count; i++) { + herr_t write_status; + + if (sizes[i] == 0) + H5_SUBFILING_GOTO_ERROR(H5E_IO, H5E_WRITEERROR, FAIL, "invalid size argument of 0"); + + H5_CHECK_OVERFLOW(addrs[i], haddr_t, int64_t); + H5_CHECK_OVERFLOW(sizes[i], size_t, int64_t); + write_status = + ioc__write_independent_async(sf_context_id, sf_context->topology->n_io_concentrators, + (int64_t)addrs[i], (int64_t)sizes[i], bufs[i], &sf_async_reqs[i]); + + if (write_status < 0) + H5_SUBFILING_GOTO_ERROR(H5E_IO, H5E_WRITEERROR, FAIL, "couldn't queue write operation"); + + mpi_reqs->active_reqs[i] = sf_async_reqs[i]->completion_func.io_args.io_req; + } + + /* + * Mirror superblock writes to the stub file so that + * legacy HDF5 applications can check what type of + * file they are reading + */ + for (size_t i = 0; i < (size_t)count; i++) { + if (types[i] == H5FD_MEM_SUPER) { + if (H5FDwrite(file_ptr->ioc_file, H5FD_MEM_SUPER, H5P_DEFAULT, addrs[i], sizes[i], bufs[i]) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_IO, H5E_WRITEERROR, FAIL, + "couldn't write superblock information to stub file"); + } + } + + /* Here, we should have queued 'count' async requests. + * We can now try to complete those before returning + * to the caller for the next set of IO operations. 
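+ * + * Note that only the first request's completion callback is invoked + * below: every entry shares async_completion(), which performs a single + * MPI_Waitall() over all 'count' requests recorded in mpi_reqs.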
+ */ + if (sf_async_reqs[0]->completion_func.io_function) + ret_value = (*sf_async_reqs[0]->completion_func.io_function)(mpi_reqs); + +done: + if (active_reqs) + HDfree(active_reqs); + + if (sf_async_reqs) { + for (size_t i = 0; i < (size_t)count; i++) { + if (sf_async_reqs[i]) { + HDfree(sf_async_reqs[i]); + } + } + HDfree(sf_async_reqs); + } + + H5_SUBFILING_FUNC_LEAVE; +} + +static herr_t +H5FD__ioc_read_vector_internal(H5FD_t *_file, uint32_t count, haddr_t addrs[], size_t sizes[], + void *bufs[] /* out */) +{ + subfiling_context_t *sf_context = NULL; + MPI_Request * active_reqs = NULL; + H5FD_ioc_t * file_ptr = (H5FD_ioc_t *)_file; + io_req_t ** sf_async_reqs = NULL; + int64_t sf_context_id = -1; + herr_t ret_value = SUCCEED; + struct __mpi_req { + int n_reqs; + MPI_Request *active_reqs; + } *mpi_reqs = NULL; + + HDassert(_file); + HDassert(addrs); + HDassert(sizes); + HDassert(bufs); + + if (count == 0) + H5_SUBFILING_GOTO_DONE(SUCCEED); + + sf_context_id = file_ptr->context_id; + + if (NULL == (sf_context = H5_get_subfiling_object(sf_context_id))) + H5_SUBFILING_GOTO_ERROR(H5E_IO, H5E_CANTGET, FAIL, "can't get subfiling context from ID"); + HDassert(sf_context->topology); + HDassert(sf_context->topology->n_io_concentrators); + + if (NULL == (active_reqs = HDcalloc((size_t)(count + 2), sizeof(struct __mpi_req)))) + H5_SUBFILING_GOTO_ERROR(H5E_RESOURCE, H5E_CANTALLOC, FAIL, + "can't allocate active I/O requests array"); + + if (NULL == (sf_async_reqs = HDcalloc((size_t)count, sizeof(*sf_async_reqs)))) + H5_SUBFILING_GOTO_ERROR(H5E_RESOURCE, H5E_CANTALLOC, FAIL, "can't allocate I/O request array"); + + /* + * Note: We allocated extra space in the active_reqs array (above). + * The extra should be enough for an integer plus a pointer. + */ + mpi_reqs = (struct __mpi_req *)&active_reqs[count]; + mpi_reqs->n_reqs = (int)count; + mpi_reqs->active_reqs = active_reqs; + + for (size_t i = 0; i < (size_t)count; i++) { + int read_status; + + H5_CHECK_OVERFLOW(addrs[i], haddr_t, int64_t); + H5_CHECK_OVERFLOW(sizes[i], size_t, int64_t); + read_status = + ioc__read_independent_async(sf_context_id, sf_context->topology->n_io_concentrators, + (int64_t)addrs[i], (int64_t)sizes[i], bufs[i], &sf_async_reqs[i]); + + if (read_status < 0) + H5_SUBFILING_GOTO_ERROR(H5E_IO, H5E_READERROR, FAIL, "couldn't queue read operation"); + + mpi_reqs->active_reqs[i] = sf_async_reqs[i]->completion_func.io_args.io_req; + } + + /* Here, we should have queued 'count' async requests + * (one to each required IOC). + * + * We can now try to complete those before returning + * to the caller for the next set of IO operations. + */ + if (sf_async_reqs[0]->completion_func.io_function) + ret_value = (*sf_async_reqs[0]->completion_func.io_function)(mpi_reqs); + +done: + if (active_reqs) + HDfree(active_reqs); + + if (sf_async_reqs) { + for (size_t i = 0; i < count; i++) { + if (sf_async_reqs[i]) { + HDfree(sf_async_reqs[i]); + } + } + HDfree(sf_async_reqs); + } + + H5_SUBFILING_FUNC_LEAVE; +} diff --git a/src/H5FDsubfiling/H5FDioc.h b/src/H5FDsubfiling/H5FDioc.h new file mode 100644 index 0000000..04850f3 --- /dev/null +++ b/src/H5FDsubfiling/H5FDioc.h @@ -0,0 +1,96 @@ +/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * + * Copyright by The HDF Group. * + * All rights reserved. * + * * + * This file is part of HDF5. 
The full HDF5 copyright notice, including * + * terms governing use, modification, and redistribution, is contained in * + * the COPYING file, which can be found at the root of the source code * + * distribution tree, or in https://support.hdfgroup.org/ftp/HDF5/releases. * + * If you do not have access to either file, you may request a copy from * + * help@hdfgroup.org. * + * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */ + +/* + * Purpose: The public header file for the "io concentrator" driver. + * This provides a similar functionality to that of the subfiling driver + * but introduces the necessary file access functionality via a multi- + * threading MPI service + */ + +#ifndef H5FDioc_H +#define H5FDioc_H + +#ifdef H5_HAVE_IOC_VFD +#define H5FD_IOC (H5FDperform_init(H5FD_ioc_init)) +#else +#define H5FD_IOC (H5I_INVALID_HID) +#endif + +#define H5FD_IOC_NAME "ioc" + +#ifdef H5_HAVE_IOC_VFD + +#ifndef H5FD_IOC_FAPL_MAGIC +#define H5FD_CURR_IOC_FAPL_VERSION 1 +#define H5FD_IOC_FAPL_MAGIC 0xFED21331 +#endif + +#define H5FD_IOC_THREAD_POOL_SIZE 4 + +/* + * Environment variables interpreted by the IOC VFD + */ +#define H5_IOC_THREAD_POOL_COUNT "H5_IOC_THREAD_POOL_COUNT" + +/* + * Define the various constants to allow different allocations + * of subfile ranks. The choices are self explanatory, starting + * with the default of one IO Concentrator (IOC) per node and + * lastly, defining a fixed number. + */ +typedef enum { + SELECT_IOC_ONE_PER_NODE = 0, /* Default */ + SELECT_IOC_EVERY_NTH_RANK, /* Starting at rank 0, select-next += N */ + SELECT_IOC_WITH_CONFIG, /* NOT IMPLEMENTED: Read-from-file */ + SELECT_IOC_TOTAL, /* Starting at rank 0, mpi_size / total */ + ioc_selection_options /* (Uses same selection as every Nth rank) */ +} ioc_selection_t; + +/* + * In addition to the common configuration fields, we can have + * VFD specific fields. Here's one for the IO Concentrator VFD. + * + * thread_pool_count (int32_t) + * Indicate the number of helper threads that we want for + * creating a thread pool + * + * ---------------------------------------------------------------------------- + */ + +typedef struct H5FD_ioc_config_t { + uint32_t magic; /* set to H5FD_IOC_FAPL_MAGIC */ + uint32_t version; /* set to H5FD_CURR_IOC_FAPL_VERSION */ + int32_t stripe_count; /* How many io concentrators */ + int64_t stripe_depth; /* Max # of bytes in contiguous IO to an IOC */ + ioc_selection_t ioc_selection; /* Method to select IO Concentrators */ + hid_t ioc_fapl_id; /* The hid_t value of the stacked VFD */ + int32_t thread_pool_count; +} H5FD_ioc_config_t; + +#ifdef __cplusplus +extern "C" { +#endif + +H5_DLL hid_t H5FD_ioc_init(void); +H5_DLL herr_t H5Pset_fapl_ioc(hid_t fapl_id, H5FD_ioc_config_t *config_ptr); +H5_DLL herr_t H5Pget_fapl_ioc(hid_t fapl_id, H5FD_ioc_config_t *config_ptr); +H5_DLL void begin_thread_exclusive(void); +H5_DLL void end_thread_exclusive(void); + +#ifdef __cplusplus +} +#endif + +#endif /* H5_HAVE_IOC_VFD */ + +#endif diff --git a/src/H5FDsubfiling/H5FDioc_int.c b/src/H5FDsubfiling/H5FDioc_int.c new file mode 100644 index 0000000..c1ce669 --- /dev/null +++ b/src/H5FDsubfiling/H5FDioc_int.c @@ -0,0 +1,382 @@ +/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * + * Copyright by The HDF Group. * + * All rights reserved. * + * * + * This file is part of HDF5. 
The full HDF5 copyright notice, including * + * terms governing use, modification, and redistribution, is contained in * + * the COPYING file, which can be found at the root of the source code * + * distribution tree, or in https://support.hdfgroup.org/ftp/HDF5/releases. * + * If you do not have access to either file, you may request a copy from * + * help@hdfgroup.org. * + * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */ + +/* + * Purpose: This is part of an I/O concentrator driver. + */ + +#include "H5FDioc_priv.h" + +static int async_completion(void *arg); + +/* + * Given a file offset, the stripe size and + * the number of IOCs, calculate the target + * IOC for I/O and the file offset for the + * subfile that IOC controls + */ +static inline void +calculate_target_ioc(int64_t file_offset, int64_t stripe_size, int n_io_concentrators, int64_t *target_ioc, + int64_t *ioc_file_offset) +{ + int64_t stripe_idx; + int64_t subfile_row; + + HDassert(target_ioc); + HDassert(ioc_file_offset); + HDassert(stripe_size > 0); + HDassert(n_io_concentrators > 0); + + stripe_idx = file_offset / stripe_size; + subfile_row = stripe_idx / n_io_concentrators; + + *target_ioc = stripe_idx % n_io_concentrators; + *ioc_file_offset = (subfile_row * stripe_size) + (file_offset % stripe_size); +} + +/* + * Utility routine to hack around casting away const + */ +static inline void * +cast_to_void(const void *data) +{ + union { + const void *const_ptr_to_data; + void * ptr_to_data; + } eliminate_const_warning; + eliminate_const_warning.const_ptr_to_data = data; + return eliminate_const_warning.ptr_to_data; +} + +/*------------------------------------------------------------------------- + * Function: ioc__write_independent_async + * + * Purpose: The IO operations can be striped across a selection of + * IO concentrators. The read and write independent calls + * compute the group of 1 or more IOCs and further create + * derived MPI datatypes when required by the size of the + * contiguous read or write requests. + * + * IOC(0) contains the logical data storage for file offset + * zero and all offsets that reside within modulo range of + * the subfiling stripe_size. + * + * We cycle through all 'n_io_concentrators' and send a + * descriptor to each IOC that has a non-zero sized IO + * request to fulfill. + * + * Sending descriptors to an IOC usually gets an ACK or + * NACK in response. For the write operations, we post + * asynch READs to receive ACKs from IOC ranks that have + * allocated memory to receive the data to write to the + * subfile. Upon receiving an ACK, we send the actual + * user data to the IOC. + * + * Return: Non-negative on success/Negative on failure + * + * Programmer: Richard Warren + * 7/17/2020 + * + * Changes: Initial Version/None. 
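+ * + * Note: the I/O request sent to the selected IOC is a 3-element + * int64_t array {elements, ioc_offset, context_id}, sent with tag + * WRITE_INDEP on sf_msg_comm; the user data then follows on + * sf_data_comm using the tag returned in the IOC's ACK.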
+ *------------------------------------------------------------------------- + */ +herr_t +ioc__write_independent_async(int64_t context_id, int n_io_concentrators, int64_t offset, int64_t elements, + const void *data, io_req_t **io_req) +{ + subfiling_context_t *sf_context = NULL; + MPI_Request ack_request = MPI_REQUEST_NULL; + io_req_t * sf_io_request = NULL; + int64_t ioc_start; + int64_t ioc_offset; + int64_t msg[3] = {0}; + int * io_concentrators = NULL; + int data_tag = 0; + int mpi_code; + herr_t ret_value = SUCCEED; + + HDassert(io_req); + + if (NULL == (sf_context = H5_get_subfiling_object(context_id))) + H5_SUBFILING_GOTO_ERROR(H5E_IO, H5E_WRITEERROR, FAIL, "can't get subfiling context from ID"); + HDassert(sf_context->topology); + HDassert(sf_context->topology->io_concentrators); + + io_concentrators = sf_context->topology->io_concentrators; + + /* + * Calculate the IOC that we'll send the I/O request to + * and the offset within that IOC's subfile + */ + calculate_target_ioc(offset, sf_context->sf_stripe_size, n_io_concentrators, &ioc_start, &ioc_offset); + + /* + * Wait for memory to be allocated on the target IOC before + * beginning send of user data. Once that memory has been + * allocated, we will receive an ACK (or NACK) message from + * the IOC to allow us to proceed. + * + * On ACK, the IOC will send the tag to be used for sending + * data. This allows us to distinguish between multiple + * concurrent writes from a single rank. + * + * Post an early non-blocking receive for the MPI tag here. + */ + if (MPI_SUCCESS != (mpi_code = MPI_Irecv(&data_tag, 1, MPI_INT, io_concentrators[ioc_start], + WRITE_INDEP_ACK, sf_context->sf_data_comm, &ack_request))) + H5_SUBFILING_MPI_GOTO_ERROR(FAIL, "MPI_Irecv failed", mpi_code); + + /* + * Prepare and send an I/O request to the IOC identified + * by the file offset + */ + msg[0] = elements; + msg[1] = ioc_offset; + msg[2] = context_id; + if (MPI_SUCCESS != (mpi_code = MPI_Send(msg, 3, MPI_INT64_T, io_concentrators[ioc_start], WRITE_INDEP, + sf_context->sf_msg_comm))) + H5_SUBFILING_MPI_GOTO_ERROR(FAIL, "MPI_Send failed", mpi_code); + + /* Wait to receive data tag */ + if (MPI_SUCCESS != (mpi_code = MPI_Wait(&ack_request, MPI_STATUS_IGNORE))) + H5_SUBFILING_MPI_GOTO_ERROR(FAIL, "MPI_Wait failed", mpi_code); + + if (data_tag == 0) + H5_SUBFILING_GOTO_ERROR(H5E_IO, H5E_WRITEERROR, FAIL, "received NACK from IOC"); + + /* At this point in the new implementation, we should queue + * the async write so that when the top level VFD tells us + * to complete all pending IO requests, we have all the info + * we need to accomplish that. 
+ */ + if (NULL == (sf_io_request = HDmalloc(sizeof(io_req_t)))) + H5_SUBFILING_GOTO_ERROR(H5E_RESOURCE, H5E_WRITEERROR, FAIL, "couldn't allocate I/O request"); + + H5_CHECK_OVERFLOW(ioc_start, int64_t, int); + sf_io_request->completion_func.io_args.ioc = (int)ioc_start; + sf_io_request->completion_func.io_args.context_id = context_id; + sf_io_request->completion_func.io_args.offset = offset; + sf_io_request->completion_func.io_args.elements = elements; + sf_io_request->completion_func.io_args.data = cast_to_void(data); + sf_io_request->completion_func.io_args.io_req = MPI_REQUEST_NULL; + sf_io_request->completion_func.io_function = async_completion; + sf_io_request->completion_func.pending = 0; + + sf_io_request->prev = sf_io_request->next = NULL; + + /* + * Start the actual data transfer using the ack received + * from the IOC as the tag for the send + */ + H5_CHECK_OVERFLOW(elements, int64_t, int); + if (MPI_SUCCESS != + (mpi_code = MPI_Isend(data, (int)elements, MPI_BYTE, io_concentrators[ioc_start], data_tag, + sf_context->sf_data_comm, &sf_io_request->completion_func.io_args.io_req))) + H5_SUBFILING_MPI_GOTO_ERROR(FAIL, "MPI_Isend failed", mpi_code); + + /* + * NOTE: When we actually have the async I/O support, + * the request should be queued before we return to + * the caller. Having queued the I/O operation, we + * might want to get additional work started before + * allowing the queued I/O requests to make further + * progress and/or to complete, so we just return + * to the caller. + */ + + sf_io_request->completion_func.pending = 1; + *io_req = sf_io_request; + +done: + if (ret_value < 0) { + if (ack_request != MPI_REQUEST_NULL) { + if (MPI_SUCCESS != (mpi_code = MPI_Cancel(&ack_request))) + H5_SUBFILING_MPI_DONE_ERROR(FAIL, "MPI_Cancel failed", mpi_code); + } + + HDfree(sf_io_request); + *io_req = NULL; + } + + H5_SUBFILING_FUNC_LEAVE; +} /* end ioc__write_independent_async() */ + +/*------------------------------------------------------------------------- + * Function: ioc__read_independent_async + * + * Purpose: The IO operations can be striped across a selection of + * IO concentrators. The read and write independent calls + * compute the group of 1 or more IOCs and further create + * derived MPI datatypes when required by the size of the + * contiguous read or write requests. + * + * IOC(0) contains the logical data storage for file offset + * zero and all offsets that reside within modulo range of + * the subfiling stripe_size. + * + * We cycle through all 'n_io_concentrators' and send a + * descriptor to each IOC that has a non-zero sized IO + * request to fulfill. + * + * Sending descriptors to an IOC usually gets an ACK or + * NACK in response. For the read operations, we post + * asynch READs to receive the file data and wait until + * all pending operations have completed. + * + * Return: Non-negative on success/Negative on failure + * + * Programmer: Richard Warren + * 7/17/2020 + * + * Changes: Initial Version/None. 
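+ * + * Note: the request message mirrors the write path -- a 3-element + * int64_t array {elements, ioc_offset, context_id} sent with tag + * READ_INDEP on sf_msg_comm -- while the file data arrives via a + * receive posted with tag READ_INDEP_DATA on sf_data_comm.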
+ *------------------------------------------------------------------------- + */ +herr_t +ioc__read_independent_async(int64_t context_id, int n_io_concentrators, int64_t offset, int64_t elements, + void *data, io_req_t **io_req) +{ + subfiling_context_t *sf_context = NULL; + io_req_t * sf_io_request = NULL; + int64_t ioc_start; + int64_t ioc_offset; + int64_t msg[3] = {0}; + int * io_concentrators = NULL; + int mpi_code; + herr_t ret_value = SUCCEED; + + HDassert(io_req); + + if (NULL == (sf_context = H5_get_subfiling_object(context_id))) + H5_SUBFILING_GOTO_ERROR(H5E_IO, H5E_READERROR, FAIL, "can't get subfiling context from ID"); + HDassert(sf_context->topology); + HDassert(sf_context->topology->io_concentrators); + + io_concentrators = sf_context->topology->io_concentrators; + + /* + * Calculate the IOC that we'll send the I/O request to + * and the offset within that IOC's subfile + */ + calculate_target_ioc(offset, sf_context->sf_stripe_size, n_io_concentrators, &ioc_start, &ioc_offset); + + /* + * At this point in the new implementation, we should queue + * the non-blocking recv so that when the top level VFD tells + * us to complete all pending IO requests, we have all the info + * we need to accomplish that. + * + * Post the early non-blocking receive here. + */ + if (NULL == (sf_io_request = HDmalloc(sizeof(io_req_t)))) + H5_SUBFILING_GOTO_ERROR(H5E_RESOURCE, H5E_READERROR, FAIL, "couldn't allocate I/O request"); + + H5_CHECK_OVERFLOW(ioc_start, int64_t, int); + sf_io_request->completion_func.io_args.ioc = (int)ioc_start; + sf_io_request->completion_func.io_args.context_id = context_id; + sf_io_request->completion_func.io_args.offset = offset; + sf_io_request->completion_func.io_args.elements = elements; + sf_io_request->completion_func.io_args.data = data; + sf_io_request->completion_func.io_args.io_req = MPI_REQUEST_NULL; + sf_io_request->completion_func.io_function = async_completion; + sf_io_request->completion_func.pending = 0; + + sf_io_request->prev = sf_io_request->next = NULL; + + H5_CHECK_OVERFLOW(elements, int64_t, int); + if (MPI_SUCCESS != + (mpi_code = MPI_Irecv(data, (int)elements, MPI_BYTE, io_concentrators[ioc_start], READ_INDEP_DATA, + sf_context->sf_data_comm, &sf_io_request->completion_func.io_args.io_req))) + H5_SUBFILING_MPI_GOTO_ERROR(FAIL, "MPI_Irecv failed", mpi_code); + + sf_io_request->completion_func.pending = 1; + *io_req = sf_io_request; + + /* + * Prepare and send an I/O request to the IOC identified + * by the file offset + */ + msg[0] = elements; + msg[1] = ioc_offset; + msg[2] = context_id; + if (MPI_SUCCESS != (mpi_code = MPI_Send(msg, 3, MPI_INT64_T, io_concentrators[ioc_start], READ_INDEP, + sf_context->sf_msg_comm))) + H5_SUBFILING_MPI_GOTO_ERROR(FAIL, "MPI_Send failed", mpi_code); + +done: + if (ret_value < 0) { + if (sf_io_request && sf_io_request->completion_func.io_args.io_req != MPI_REQUEST_NULL) { + if (MPI_SUCCESS != (mpi_code = MPI_Cancel(&sf_io_request->completion_func.io_args.io_req))) + H5_SUBFILING_MPI_DONE_ERROR(FAIL, "MPI_Cancel failed", mpi_code); + } + + HDfree(sf_io_request); + *io_req = NULL; + } + + H5_SUBFILING_FUNC_LEAVE; +} /* end ioc__read_independent_async() */ + +/*------------------------------------------------------------------------- + * Function: async_completion + * + * Purpose: Given a single io_func_t structure containing the function + * pointer and it's input arguments and a single MPI_Request + * argument which needs to be completed, we make progress + * by calling MPI_Test. 
In this initial example, we loop + * until the request is completed as indicated by a non-zero + * flag variable. + * + * As we go further with the implementation, we anticipate that + * rather than testing a single request variable, we will + * deal with a collection of all pending IO requests (on + * this rank). + * + * Return: an integer status. Zero(0) indicates success. Negative + * values (-1) indicates an error. + *------------------------------------------------------------------------- + */ +static int +async_completion(void *arg) +{ + int n_reqs; + int mpi_code; + int ret_value = 0; + struct async_arg { + int n_reqs; + MPI_Request *sf_reqs; + } *in_progress = (struct async_arg *)arg; + + HDassert(arg); + + n_reqs = in_progress->n_reqs; + + if (n_reqs < 0) { +#ifdef H5FD_IOC_DEBUG + HDprintf("%s: invalid number of in progress I/O requests\n", __func__); +#endif + + ret_value = -1; + goto done; + } + + if (MPI_SUCCESS != (mpi_code = MPI_Waitall(n_reqs, in_progress->sf_reqs, MPI_STATUSES_IGNORE))) { +#ifdef H5FD_IOC_DEBUG + HDprintf("%s: MPI_Waitall failed with rc %d\n", __func__, mpi_code); +#endif + + ret_value = -1; + goto done; + } + +done: + H5_SUBFILING_FUNC_LEAVE; +} diff --git a/src/H5FDsubfiling/H5FDioc_priv.h b/src/H5FDsubfiling/H5FDioc_priv.h new file mode 100644 index 0000000..07eb124 --- /dev/null +++ b/src/H5FDsubfiling/H5FDioc_priv.h @@ -0,0 +1,440 @@ +/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * + * Copyright by The HDF Group. * + * All rights reserved. * + * * + * This file is part of HDF5. The full HDF5 copyright notice, including * + * terms governing use, modification, and redistribution, is contained in * + * the COPYING file, which can be found at the root of the source code * + * distribution tree, or in https://support.hdfgroup.org/ftp/HDF5/releases. * + * If you do not have access to either file, you may request a copy from * + * help@hdfgroup.org. * + * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */ + +/* + * Private definitions for HDF5 IOC VFD + */ + +#ifndef H5FDioc_priv_H +#define H5FDioc_priv_H + +/********************/ +/* Standard Headers */ +/********************/ + +#include <stdatomic.h> +#include <libgen.h> + +/**************/ +/* H5 Headers */ +/**************/ + +#include "H5private.h" /* Generic Functions */ +#include "H5CXprivate.h" /* API Contexts */ +#include "H5Dprivate.h" /* Datasets */ +#include "H5Eprivate.h" /* Error handling */ +#include "H5FDioc.h" /* IOC VFD */ +#include "H5Iprivate.h" /* IDs */ +#include "H5MMprivate.h" /* Memory management */ +#include "H5Pprivate.h" /* Property lists */ + +#include "H5subfiling_common.h" +#include "H5subfiling_err.h" + +#include "mercury_thread.h" +#include "mercury_thread_mutex.h" +#include "mercury_thread_pool.h" + +/* + * Some definitions for debugging the IOC VFD + */ + +/* #define H5FD_IOC_DEBUG */ +/* #define H5FD_IOC_REQUIRE_FLUSH */ +/* #define H5FD_IOC_COLLECT_STATS */ + +/**************************************************************************** + * + * IOC I/O Queue management macros: + * + * The following macros perform the necessary operations on the IOC I/O + * Queue, which is implemented as a doubly linked list of instances of + * ioc_io_queue_entry_t. + * + * WARNING: q_ptr->q_mutex must be held when these macros are executed.. + * + * At present, the necessary operations are append (insert an entry at the + * end of the queue), and delete (remove an entry from the queue). 
+ * + * At least initially, all sanity checking is done with asserts, as the + * the existing I/O concentrator code is not well integrated into the HDF5 + * error reporting system. This will have to be revisited for a production + * version, but it should be sufficient for now. + * + * JRM -- 11/2/21 + * + ****************************************************************************/ + +#define H5FD_IOC__IO_Q_ENTRY_MAGIC 0x1357 + +/* clang-format off */ + +#define H5FD_IOC__Q_APPEND(q_ptr, entry_ptr) \ +do { \ + HDassert(q_ptr); \ + HDassert((q_ptr)->magic == H5FD_IOC__IO_Q_MAGIC); \ + HDassert((((q_ptr)->q_len == 0) && ((q_ptr)->q_head == NULL) && ((q_ptr)->q_tail == NULL)) || \ + (((q_ptr)->q_len > 0) && ((q_ptr)->q_head != NULL) && ((q_ptr)->q_tail != NULL))); \ + HDassert(entry_ptr); \ + HDassert((entry_ptr)->magic == H5FD_IOC__IO_Q_ENTRY_MAGIC); \ + HDassert((entry_ptr)->next == NULL); \ + HDassert((entry_ptr)->prev == NULL); \ + HDassert((entry_ptr)->in_progress == FALSE); \ + \ + if ( ((q_ptr)->q_head) == NULL ) \ + { \ + ((q_ptr)->q_head) = (entry_ptr); \ + ((q_ptr)->q_tail) = (entry_ptr); \ + } \ + else \ + { \ + ((q_ptr)->q_tail)->next = (entry_ptr); \ + (entry_ptr)->prev = ((q_ptr)->q_tail); \ + ((q_ptr)->q_tail) = (entry_ptr); \ + } \ + ((q_ptr)->q_len)++; \ +} while ( FALSE ) /* H5FD_IOC__Q_APPEND() */ + +#define H5FD_IOC__Q_REMOVE(q_ptr, entry_ptr) \ +do { \ + HDassert(q_ptr); \ + HDassert((q_ptr)->magic == H5FD_IOC__IO_Q_MAGIC); \ + HDassert((((q_ptr)->q_len == 1) && ((q_ptr)->q_head ==((q_ptr)->q_tail)) && ((q_ptr)->q_head == (entry_ptr))) || \ + (((q_ptr)->q_len > 0) && ((q_ptr)->q_head != NULL) && ((q_ptr)->q_tail != NULL))); \ + HDassert(entry_ptr); \ + HDassert((entry_ptr)->magic == H5FD_IOC__IO_Q_ENTRY_MAGIC); \ + HDassert((((q_ptr)->q_len == 1) && ((entry_ptr)->next == NULL) && ((entry_ptr)->prev == NULL)) || \ + (((q_ptr)->q_len > 1) && (((entry_ptr)->next != NULL) || ((entry_ptr)->prev != NULL)))); \ + HDassert((entry_ptr)->in_progress == TRUE); \ + \ + { \ + if ( (((q_ptr)->q_head)) == (entry_ptr) ) \ + { \ + (((q_ptr)->q_head)) = (entry_ptr)->next; \ + if ( (((q_ptr)->q_head)) != NULL ) \ + (((q_ptr)->q_head))->prev = NULL; \ + } \ + else \ + { \ + (entry_ptr)->prev->next = (entry_ptr)->next; \ + } \ + if (((q_ptr)->q_tail) == (entry_ptr) ) \ + { \ + ((q_ptr)->q_tail) = (entry_ptr)->prev; \ + if ( ((q_ptr)->q_tail) != NULL ) \ + ((q_ptr)->q_tail)->next = NULL; \ + } \ + else \ + { \ + (entry_ptr)->next->prev = (entry_ptr)->prev; \ + } \ + (entry_ptr)->next = NULL; \ + (entry_ptr)->prev = NULL; \ + ((q_ptr)->q_len)--; \ + } \ +} while ( FALSE ) /* H5FD_IOC__Q_REMOVE() */ + +/* clang-format on */ + +/**************************************************************************** + * + * structure ioc_io_queue_entry + * + * magic: Unsigned 32 bit integer always set to H5FD_IOC__IO_Q_ENTRY_MAGIC. + * This field is used to validate pointers to instances of + * ioc_io_queue_entry_t. + * + * next: Next pointer in the doubly linked list used to implement + * the IOC I/O Queue. This field points to the next entry + * in the queue, or NULL if there is no next entry. + * + * prev: Prev pointer in the doubly linked list used to implement + * the IOC I/O Queue. This field points to the previous entry + * in the queue, or NULL if there is no previous entry. + * + * in_progress: Boolean flag that must be FALSE when the entry is inserted + * into the IOC I/O Queue, and set to TRUE when the entry is dispatched + * to the worker thread pool for execution. 
+ * + * When in_progress is FALSE, the entry is said to be pending. + * + * counter: uint32_t containing a serial number assigned to this IOC + * I/O Queue entry. Note that this will roll over on long + * computations, and thus is not in general unique. + * + * The counter field is used to construct a tag to distinguish + * multiple concurrent I/O requests from a given rank, and thus + * this should not be a problem as long as there is sufficient + * time between roll overs. As only the lower bits of the counter + * are used in tag construction, tag rollover is more frequent than + * the size of the counter field would suggest -- albeit hopefully + * still infrequent enough. + * + * wk_req: Instance of sf_work_request_t. Replace with individual + * fields when convenient. + * + * + * Statistics: + * + * The following fields are only defined if H5FD_IOC_COLLECT_STATS is TRUE. + * They are intended to allow collection of basic statistics on the + * behaviour of the IOC I/O Queue for purposes of debugging and performance + * optimization. + * + * q_time: uint64_t containing the time the entry was placed on the + * IOC I/O Queue in usec after the UNIX epoch. + * + * This value is used to compute the queue wait time, and the + * total processing time for the entry. + * + * dispatch_time: uint64_t containing the time the entry is dispatched in + * usec after the UNIX epoch. This field is undefined if the + * entry is pending. + * + * This value is used to compute the execution time for the + * entry. + * + ****************************************************************************/ + +typedef struct ioc_io_queue_entry { + + uint32_t magic; + struct ioc_io_queue_entry *next; + struct ioc_io_queue_entry *prev; + hbool_t in_progress; + uint32_t counter; + + sf_work_request_t wk_req; + struct hg_thread_work thread_wk; + int wk_ret; + + /* statistics */ +#ifdef H5FD_IOC_COLLECT_STATS + + uint64_t q_time; + uint64_t dispatch_time; + +#endif + +} ioc_io_queue_entry_t; + +/**************************************************************************** + * + * structure ioc_io_queue + * + * This is a temporary structure -- its fields should be moved to an I/O + * concentrator Catchall structure eventually. + * + * The fields of this structure support the io queue used to receive and + * sequence I/O requests for execution by the worker threads. The rules + * for sequencing are as follows: + * + * 1) Non-overlapping I/O requests must be fed to the worker threads in + * the order received, and may execute concurrently. + * + * 2) Overlapping read requests must be fed to the worker threads in + * the order received, but may execute concurrently. + * + * 3) If any pair of I/O requests overlap, and at least one is a write + * request, they must be executed in strict arrival order, and the + * first must complete before the second starts. + * + * Due to the strict ordering requirement in rule 3, entries must be + * inserted at the tail of the queue in receipt order, and retained on + * the queue until completed. Entries in the queue are marked pending + * when inserted on the queue, in progress when handed to a worker + * thread, and deleted from the queue when completed. + * + * The dispatch algorithm is as follows: + * + * 1) Set X equal to the element at the head of the queue. + * + * 2) If X is pending, and there exists no prior element (i.e. between X + * and the head of the queue) that intersects with X, goto 5). + * + * 3) If X is pending, X is a read, and all prior intersecting elements + * are reads, goto 5). 
+ * + * 4) If X is in progress, or if any prior intersecting element is a + * write, or if X is a write, set X equal to its successor in the + * queue (i.e. the next element further down the queue from the head) + * and goto 2). If there is no next element, exit without dispatching + * any I/O request. + * + * 5) If we get to 5, X must be pending. Mark it in progress, and + * dispatch it. If the number of in progress entries is less than + * the number of worker threads, and X has a successor in the queue, + * set X equal to its successor, and goto 2). Otherwise exit without + * dispatching further I/O requests. + * + * Note that the above dispatch algorithm doesn't address collective + * I/O requests -- this should be OK for now, but it will have to be + * addressed prior to production release. + * + * On I/O request completion, worker threads must delete their assigned + * I/O requests from the queue, check to see if there are any pending + * requests, and trigger the dispatch algorithm if there are. + * + * The fields in the structure are discussed individually below. + * + * magic: Unsigned 32 bit integer always set to H5FD_IOC__IO_Q_MAGIC. + * This field is used to validate pointers to instances of + * ioc_io_queue_t. + * + * q_head: Pointer to the head of the doubly linked list of entries in + * the I/O queue. + * + * This field is NULL if the I/O queue is empty. + * + * q_tail: Pointer to the tail of the doubly linked list of entries in + * the I/O queue. + * + * This field is NULL if the I/O queue is empty. + * + * num_pending: Number of I/O requests pending on the I/O queue. + * + * num_in_progress: Number of I/O requests in progress on the I/O queue. + * + * q_len: Number of I/O requests on the I/O queue. Observe that q_len + * must equal (num_pending + num_in_progress). + * + * req_counter: unsigned 32 bit integer used to provide a "unique" tag for + * each I/O request. This value is incremented by 1, and then + * passed to the worker thread where its lower bits are incorporated + * into the tag used to disambiguate multiple, concurrent I/O + * requests from a single rank. The value is 32 bits, as MPI tags + * are limited to 32 bits. The value is unsigned as it is expected + * to wrap around once its maximum value is reached. + * + * q_mutex: Mutex used to ensure that only one thread accesses the IOC I/O + * Queue at once. This mutex must be held to access or modify + * all fields of this structure. + * + * + * Statistics: + * + * The following fields are only defined if H5FD_IOC_COLLECT_STATS is TRUE. + * They are intended to allow collection of basic statistics on the + * behaviour of the IOC I/O Queue for purposes of debugging and performance + * optimization. + * + * max_q_len: Maximum number of requests residing on the IOC I/O Queue at + * any point in time in the current run. + * + * max_num_pending: Maximum number of pending requests residing on the IOC + * I/O Queue at any point in time in the current run. + * + * max_num_in_progress: Maximum number of in progress requests residing on + * the IOC I/O Queue at any point in time in the current run. + * + * ind_read_requests: Number of independent read requests received by the + * IOC to date. + * + * ind_write_requests: Number of independent write requests received by the + * IOC to date. + * + * truncate_requests: Number of truncate requests received by the IOC to + * date. + * + * get_eof_requests: Number of get EOF requests received by the IOC to date. + * + * requests_queued: Number of I/O requests received and placed on the IOC + * I/O queue. 
+ * + * requests_dispatched: Number of I/O requests dispatched for execution by + * the worker threads. + * + * requests_completed: Number of I/O requests completed by the worker threads. + * Observe that on file close, requests_queued, requests_dispatched, + * and requests_completed should be equal. + * + ****************************************************************************/ + +#define H5FD_IOC__IO_Q_MAGIC 0x2468 + +typedef struct ioc_io_queue { + + uint32_t magic; + ioc_io_queue_entry_t *q_head; + ioc_io_queue_entry_t *q_tail; + int32_t num_pending; + int32_t num_in_progress; + int32_t num_failed; + int32_t q_len; + uint32_t req_counter; + hg_thread_mutex_t q_mutex; + + /* statistics */ +#ifdef H5FD_IOC_COLLECT_STATS + int32_t max_q_len; + int32_t max_num_pending; + int32_t max_num_in_progress; + int64_t ind_read_requests; + int64_t ind_write_requests; + int64_t truncate_requests; + int64_t get_eof_requests; + int64_t requests_queued; + int64_t requests_dispatched; + int64_t requests_completed; +#endif + +} ioc_io_queue_t; + +/* + * Structure definitions to enable async io completions + * We first define a structure which contains the basic + * input arguments for the functions which were originally + * invoked. See below. + */ +typedef struct _client_io_args { + int ioc; /* ID of the IO Concentrator handling this IO. */ + int64_t context_id; /* The context id provided for the read or write */ + int64_t offset; /* The file offset for the IO operation */ + int64_t elements; /* How many bytes */ + void * data; /* A pointer to the (contiguous) data segment */ + MPI_Request io_req; /* An MPI request to allow the code to loop while */ + /* making progress on multiple IOs */ +} io_args_t; + +typedef struct _client_io_func { + int (*io_function)(void *this_io); /* pointer to a completion function */ + io_args_t io_args; /* arguments passed to the completion function */ + int pending; /* The function is complete (0) or pending (1)? */ +} io_func_t; + +typedef struct _io_req { + struct _io_req *prev; /* A simple list structure containing completion */ + struct _io_req *next; /* functions. These should get removed as IO ops */ + io_func_t completion_func; /* are completed */ +} io_req_t; + +extern int *H5FD_IOC_tag_ub_val_ptr; + +#ifdef __cplusplus +extern "C" { +#endif + +H5_DLL int initialize_ioc_threads(void *_sf_context); +H5_DLL int finalize_ioc_threads(void *_sf_context); + +H5_DLL herr_t ioc__write_independent_async(int64_t context_id, int n_io_concentrators, int64_t offset, + int64_t elements, const void *data, io_req_t **io_req); +H5_DLL herr_t ioc__read_independent_async(int64_t context_id, int n_io_concentrators, int64_t offset, + int64_t elements, void *data, io_req_t **io_req); + +H5_DLL int wait_for_thread_main(void); + +#ifdef __cplusplus +} +#endif + +#endif /* H5FDioc_priv_H */ diff --git a/src/H5FDsubfiling/H5FDioc_threads.c b/src/H5FDsubfiling/H5FDioc_threads.c new file mode 100644 index 0000000..0d620b5 --- /dev/null +++ b/src/H5FDsubfiling/H5FDioc_threads.c @@ -0,0 +1,1658 @@ +/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * + * Copyright by The HDF Group. * + * All rights reserved. * + * * + * This file is part of HDF5. The full HDF5 copyright notice, including * + * terms governing use, modification, and redistribution, is contained in * + * the COPYING file, which can be found at the root of the source code * + * distribution tree, or in https://support.hdfgroup.org/ftp/HDF5/releases. 
* + * If you do not have access to either file, you may request a copy from * + * help@hdfgroup.org. * + * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */ + +#include "H5FDioc_priv.h" + +#include "H5FDsubfiling.h" + +#ifndef HG_TEST_NUM_THREADS_DEFAULT +#define HG_TEST_NUM_THREADS_DEFAULT 4 +#endif + +#define MIN_READ_RETRIES 10 + +/* + * The amount of time (in nanoseconds) for the IOC main + * thread to sleep when there are no incoming I/O requests + * to process + */ +#define IOC_MAIN_SLEEP_DELAY (20000) + +/* + * IOC data for a file that is stored in that + * file's subfiling context object + */ +typedef struct ioc_data_t { + ioc_io_queue_t io_queue; + hg_thread_t ioc_main_thread; + hg_thread_pool_t *io_thread_pool; + int64_t sf_context_id; + + /* sf_io_ops_pending is used to track the number of I/O operations pending so that we can wait + * until all I/O operations have been serviced before shutting down the worker thread pool. + * The value of this variable must always be non-negative. + * + * Note that this is a convenience variable -- we could use io_queue.q_len instead. + * However, accessing this field requires locking io_queue.q_mutex. + */ + atomic_int sf_ioc_ready; + atomic_int sf_shutdown_flag; + atomic_int sf_io_ops_pending; + atomic_int sf_work_pending; +} ioc_data_t; + +/* + * NOTES: + * Rather than re-create the code for creating and managing a thread pool, + * I'm utilizing a reasonably well-tested implementation from the mercury + * project. At some point, we should revisit this decision or possibly + * directly link against the mercury library. This would make sense if + * we move away from using MPI as the messaging infrastructure and instead + * use mercury for that purpose... + */ + +static hg_thread_mutex_t ioc_thread_mutex = PTHREAD_MUTEX_INITIALIZER; + +#ifdef H5FD_IOC_COLLECT_STATS +static int sf_write_ops = 0; +static int sf_read_ops = 0; +static double sf_pwrite_time = 0.0; +static double sf_pread_time = 0.0; +static double sf_write_wait_time = 0.0; +static double sf_read_wait_time = 0.0; +static double sf_queue_delay_time = 0.0; +#endif + +/* Prototypes */ +static HG_THREAD_RETURN_TYPE ioc_thread_main(void *arg); +static int ioc_main(ioc_data_t *ioc_data); + +static int ioc_file_queue_write_indep(sf_work_request_t *msg, int subfile_rank, int source, MPI_Comm comm, + uint32_t counter); +static int ioc_file_queue_read_indep(sf_work_request_t *msg, int subfile_rank, int source, MPI_Comm comm); + +static int ioc_file_write_data(int fd, int64_t file_offset, void *data_buffer, int64_t data_size, + int subfile_rank); +static int ioc_file_read_data(int fd, int64_t file_offset, void *data_buffer, int64_t data_size, + int subfile_rank); +static int ioc_file_truncate(int fd, int64_t length, int subfile_rank); +static int ioc_file_report_eof(sf_work_request_t *msg, int subfile_rank, int source, MPI_Comm comm); + +static ioc_io_queue_entry_t *ioc_io_queue_alloc_entry(void); +static void ioc_io_queue_complete_entry(ioc_data_t *ioc_data, ioc_io_queue_entry_t *entry_ptr); +static void ioc_io_queue_dispatch_eligible_entries(ioc_data_t *ioc_data, hbool_t try_lock); +static void ioc_io_queue_free_entry(ioc_io_queue_entry_t *q_entry_ptr); +static void ioc_io_queue_add_entry(ioc_data_t *ioc_data, sf_work_request_t *wk_req_ptr); + +/*------------------------------------------------------------------------- + * Function: initialize_ioc_threads + * + * Purpose: The principal entry point to initialize the execution + * context for an I/O Concentrator (IOC). 
The main thread + * is responsible for receiving I/O requests from each + * HDF5 "client" and distributing those to helper threads + * for actual processing. We initialize a fixed number + * of helper threads by creating a thread pool. + * + * Return: SUCCESS (0) or FAIL (-1) if any errors are detected + * for the multi-threaded initialization. + * + * Programmer: Richard Warren + * 7/17/2020 + * + * Changes: Initial Version/None. + * + *------------------------------------------------------------------------- + */ +int +initialize_ioc_threads(void *_sf_context) +{ + subfiling_context_t *sf_context = _sf_context; + ioc_data_t * ioc_data = NULL; + unsigned thread_pool_count = HG_TEST_NUM_THREADS_DEFAULT; + char * env_value; + int ret_value = 0; +#ifdef H5FD_IOC_COLLECT_STATS + double t_start = 0.0, t_end = 0.0; +#endif + + HDassert(sf_context); + + /* + * Allocate and initialize IOC data that will be passed + * to the IOC main thread + */ + if (NULL == (ioc_data = HDmalloc(sizeof(*ioc_data)))) + H5_SUBFILING_GOTO_ERROR(H5E_RESOURCE, H5E_CANTALLOC, (-1), + "can't allocate IOC data for IOC main thread"); + ioc_data->sf_context_id = sf_context->sf_context_id; + ioc_data->io_thread_pool = NULL; + ioc_data->io_queue = (ioc_io_queue_t){/* magic = */ H5FD_IOC__IO_Q_MAGIC, + /* q_head = */ NULL, + /* q_tail = */ NULL, + /* num_pending = */ 0, + /* num_in_progress = */ 0, + /* num_failed = */ 0, + /* q_len = */ 0, + /* req_counter = */ 0, + /* q_mutex = */ + PTHREAD_MUTEX_INITIALIZER, +#ifdef H5FD_IOC_COLLECT_STATS + /* max_q_len = */ 0, + /* max_num_pending = */ 0, + /* max_num_in_progress = */ 0, + /* ind_read_requests = */ 0, + /* ind_write_requests = */ 0, + /* truncate_requests = */ 0, + /* get_eof_requests = */ 0, + /* requests_queued = */ 0, + /* requests_dispatched = */ 0, + /* requests_completed = */ 0 +#endif + }; + + /* Initialize atomic vars */ + atomic_init(&ioc_data->sf_ioc_ready, 0); + atomic_init(&ioc_data->sf_shutdown_flag, 0); + atomic_init(&ioc_data->sf_io_ops_pending, 0); + atomic_init(&ioc_data->sf_work_pending, 0); + +#ifdef H5FD_IOC_COLLECT_STATS + t_start = MPI_Wtime(); +#endif + + if (hg_thread_mutex_init(&ioc_data->io_queue.q_mutex) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_RESOURCE, H5E_CANTINIT, (-1), "can't initialize IOC thread queue mutex"); + + /* Allow experimentation with the number of helper threads */ + if ((env_value = HDgetenv(H5_IOC_THREAD_POOL_COUNT)) != NULL) { + int value_check = HDatoi(env_value); + if (value_check > 0) { + thread_pool_count = (unsigned int)value_check; + } + } + + /* Initialize a thread pool for the I/O concentrator's worker threads */ + if (hg_thread_pool_init(thread_pool_count, &ioc_data->io_thread_pool) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_RESOURCE, H5E_CANTINIT, (-1), "can't initialize IOC worker thread pool"); + + /* Create the main IOC thread that will receive and dispatch I/O requests */ + if (hg_thread_create(&ioc_data->ioc_main_thread, ioc_thread_main, ioc_data) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_RESOURCE, H5E_CANTINIT, (-1), "can't create IOC main thread"); + + /* Wait until ioc_main() reports that it is ready */ + while (atomic_load(&ioc_data->sf_ioc_ready) != 1) { + usleep(20); + } + +#ifdef H5FD_IOC_COLLECT_STATS + t_end = MPI_Wtime(); + +#ifdef H5FD_IOC_DEBUG + if (sf_verbose_flag) { + if (sf_context->topology->subfile_rank == 0) { + HDprintf("%s: time = %lf seconds\n", __func__, (t_end - t_start)); + HDfflush(stdout); + } + } +#endif + +#endif + + sf_context->ioc_data = ioc_data; + +done: + H5_SUBFILING_FUNC_LEAVE; +} + +int 
+finalize_ioc_threads(void *_sf_context)
+{
+    subfiling_context_t *sf_context = _sf_context;
+    ioc_data_t          *ioc_data   = NULL;
+    int                  ret_value  = 0;
+
+    HDassert(sf_context);
+    HDassert(sf_context->topology->rank_is_ioc);
+
+    ioc_data = sf_context->ioc_data;
+    if (ioc_data) {
+        HDassert(0 == atomic_load(&ioc_data->sf_shutdown_flag));
+
+        /* Shutdown the main IOC thread */
+        atomic_store(&ioc_data->sf_shutdown_flag, 1);
+
+        /* Allow ioc_main to exit. */
+        do {
+            usleep(20);
+        } while (0 != atomic_load(&ioc_data->sf_shutdown_flag));
+
+        /* Tear down IOC worker thread pool */
+        HDassert(0 == atomic_load(&ioc_data->sf_io_ops_pending));
+        hg_thread_pool_destroy(ioc_data->io_thread_pool);
+
+        hg_thread_mutex_destroy(&ioc_data->io_queue.q_mutex);
+
+        /* Wait for IOC main thread to exit */
+        hg_thread_join(ioc_data->ioc_main_thread);
+    }
+
+    HDfree(ioc_data);
+
+    H5_SUBFILING_FUNC_LEAVE;
+}
+
+/*-------------------------------------------------------------------------
+ * Function:     ioc_thread_main
+ *
+ * Purpose:      An IO Concentrator instance is initialized with the
+ *               specified subfiling context.
+ *
+ * Return:       The IO concentrator thread executes as long as the HDF5
+ *               file associated with this context is open.  At file close,
+ *               the thread will return from 'ioc_main' and the thread
+ *               exit status will be checked by the main program.
+ *
+ * Programmer:   Richard Warren
+ *               7/17/2020
+ *
+ * Changes:      Initial Version/None.
+ *
+ *-------------------------------------------------------------------------
+ */
+static HG_THREAD_RETURN_TYPE
+ioc_thread_main(void *arg)
+{
+    hg_thread_ret_t thread_ret = (hg_thread_ret_t)0;
+    ioc_data_t     *ioc_data   = (ioc_data_t *)arg;
+
+    /* Pass along the ioc_data_t */
+    ioc_main(ioc_data);
+
+    return thread_ret;
+}
+
+/*-------------------------------------------------------------------------
+ * Function:     ioc_main
+ *
+ * Purpose:      This is the principal function run by the I/O Concentrator
+ *               main thread.  It remains within a loop until allowed to
+ *               exit by means of setting the 'sf_shutdown_flag'.  This is
+ *               usually accomplished as part of the file close operation.
+ *
+ *               The function implements an asynchronous polling approach
+ *               for incoming messages.  These messages can be thought of
+ *               as a primitive RPC which utilizes MPI tags to code and
+ *               implement the desired subfiling functionality.
+ *
+ *               As each incoming message is received, it gets added to
+ *               a queue for processing by a thread_pool thread.  The
+ *               message handlers are dispatched via the
+ *               "handle_work_request" routine.
+ *
+ *               Subfiling is effectively a software RAID-0 implementation
+ *               where having multiple I/O Concentrators and independent
+ *               subfiles is analogous to the multiple disks of a true
+ *               hardware-based RAID implementation.
+ *
+ *               I/O Concentrators are ordered according to their MPI rank.
+ *               In the simplest interpretation, IOC(0) will always contain
+ *               the initial bytes of the logical disk image.  Byte 0 of
+ *               IOC(1) will contain the byte written to the logical disk
+ *               offset "stripe_size" X IOC(number).
+ *
+ *               Example: If the stripe size is defined to be 256K, then
+ *               byte 0 of subfile(1) is at logical offset 262144 of the
+ *               file.  Similarly, byte 0 of subfile(2) represents the
+ *               logical file offset = 524288.  For logical files larger
+ *               than 'N' X stripe_size, we simply "wrap around" back to
+ *               subfile(0).
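+ *
+ *               As an illustrative sketch (these helpers are not code
+ *               from this file), the logical-offset-to-subfile mapping
+ *               just described amounts to:
+ *
+ *                   stripe_idx   = offset / stripe_size;
+ *                   subfile_idx  = stripe_idx % n_io_concentrators;
+ *                   local_stripe = stripe_idx / n_io_concentrators;
+ *                   local_offset = (local_stripe * stripe_size)
+ *                                  + (offset % stripe_size);
+ *
+ *               e.g. with stripe_size = 256K and 3 IOCs, logical offset
+ *               786432 (3 x 256K) lands in stripe 3, which wraps back to
+ *               subfile(0) at local offset 262144.
+ *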
+ *               The following shows the mapping of 30
+ *               logical blocks of data over 3 subfiles:
+ *
+ *               +--------+--------+--------+--------+--------+--------+
+ *               | blk(0 )| blk(1) | blk(2 )| blk(3 )| blk(4 )| blk(5 )|
+ *               | IOC(0) | IOC(1) | IOC(2) | IOC(0) | IOC(1) | IOC(2) |
+ *               +--------+--------+--------+--------+--------+--------+
+ *               | blk(6 )| blk(7) | blk(8 )| blk(9 )| blk(10)| blk(11)|
+ *               | IOC(0) | IOC(1) | IOC(2) | IOC(0) | IOC(1) | IOC(2) |
+ *               +--------+--------+--------+--------+--------+--------+
+ *               | blk(12)| blk(13)| blk(14)| blk(15)| blk(16)| blk(17)|
+ *               | IOC(0) | IOC(1) | IOC(2) | IOC(0) | IOC(1) | IOC(2) |
+ *               +--------+--------+--------+--------+--------+--------+
+ *               | blk(18)| blk(19)| blk(20)| blk(21)| blk(22)| blk(23)|
+ *               | IOC(0) | IOC(1) | IOC(2) | IOC(0) | IOC(1) | IOC(2) |
+ *               +--------+--------+--------+--------+--------+--------+
+ *               | blk(24)| blk(25)| blk(26)| blk(27)| blk(28)| blk(29)|
+ *               | IOC(0) | IOC(1) | IOC(2) | IOC(0) | IOC(1) | IOC(2) |
+ *               +--------+--------+--------+--------+--------+--------+
+ *
+ * Return:       None
+ * Errors:       None
+ *
+ * Programmer:   Richard Warren
+ *               7/17/2020
+ *
+ * Changes:      Initial Version/None.
+ *-------------------------------------------------------------------------
+ */
+static int
+ioc_main(ioc_data_t *ioc_data)
+{
+    subfiling_context_t *context = NULL;
+    sf_work_request_t    wk_req;
+    int                  subfile_rank;
+    int                  shutdown_requested;
+    int                  ret_value = 0;
+
+    HDassert(ioc_data);
+
+    context = H5_get_subfiling_object(ioc_data->sf_context_id);
+    HDassert(context);
+
+    /* We can't have opened any files at this point.
+     * The file open approach has changed so that the normal
+     * application rank (hosting this thread) does the file open.
+     * We can simply utilize the file descriptor (which should now
+     * represent an open file).
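+     *
+     * Note that the while loop below runs until a shutdown has been
+     * requested AND both sf_io_ops_pending and sf_work_pending have
+     * drained to zero, so no queued or in-flight work is abandoned at
+     * file close.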
+ */ + + subfile_rank = context->sf_group_rank; + + /* tell initialize_ioc_threads() that ioc_main() is ready to enter its main loop */ + atomic_store(&ioc_data->sf_ioc_ready, 1); + + shutdown_requested = 0; + + while ((!shutdown_requested) || (0 < atomic_load(&ioc_data->sf_io_ops_pending)) || + (0 < atomic_load(&ioc_data->sf_work_pending))) { + MPI_Status status; + int flag = 0; + int mpi_code; + + /* Probe for incoming work requests */ + if (MPI_SUCCESS != + (mpi_code = (MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, context->sf_msg_comm, &flag, &status)))) + H5_SUBFILING_MPI_GOTO_ERROR(-1, "MPI_Iprobe failed", mpi_code); + + if (flag) { + double queue_start_time; + int count; + int source = status.MPI_SOURCE; + int tag = status.MPI_TAG; + + if ((tag != READ_INDEP) && (tag != WRITE_INDEP) && (tag != TRUNC_OP) && (tag != GET_EOF_OP)) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, -1, "invalid work request operation (%d)", + tag); + + if (MPI_SUCCESS != (mpi_code = MPI_Get_count(&status, MPI_BYTE, &count))) + H5_SUBFILING_MPI_GOTO_ERROR(-1, "MPI_Get_count failed", mpi_code); + + if (count < 0) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, -1, "invalid work request message size (%d)", + count); + + if ((size_t)count > sizeof(sf_work_request_t)) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, -1, "work request message is too large (%d)", + count); + + /* + * Zero out work request, since the received message should + * be smaller than sizeof(sf_work_request_t) + */ + HDmemset(&wk_req, 0, sizeof(sf_work_request_t)); + + if (MPI_SUCCESS != (mpi_code = MPI_Recv(&wk_req, count, MPI_BYTE, source, tag, + context->sf_msg_comm, MPI_STATUS_IGNORE))) + H5_SUBFILING_MPI_GOTO_ERROR(-1, "MPI_Recv failed", mpi_code); + + /* Dispatch work request to worker threads in thread pool */ + + queue_start_time = MPI_Wtime(); + + wk_req.tag = tag; + wk_req.source = source; + wk_req.subfile_rank = subfile_rank; + wk_req.context_id = ioc_data->sf_context_id; + wk_req.start_time = queue_start_time; + wk_req.buffer = NULL; + + ioc_io_queue_add_entry(ioc_data, &wk_req); + + HDassert(atomic_load(&ioc_data->sf_io_ops_pending) >= 0); + } + else { + struct timespec sleep_spec = {0, IOC_MAIN_SLEEP_DELAY}; + + HDnanosleep(&sleep_spec, NULL); + } + + ioc_io_queue_dispatch_eligible_entries(ioc_data, flag ? 0 : 1); + + shutdown_requested = atomic_load(&ioc_data->sf_shutdown_flag); + } + + /* Reset the shutdown flag */ + atomic_store(&ioc_data->sf_shutdown_flag, 0); + +done: + H5_SUBFILING_FUNC_LEAVE; +} /* ioc_main() */ + +#ifdef H5_SUBFILING_DEBUG +static const char * +translate_opcode(io_op_t op) +{ + switch (op) { + case READ_OP: + return "READ_OP"; + break; + case WRITE_OP: + return "WRITE_OP"; + break; + case OPEN_OP: + return "OPEN_OP"; + break; + case CLOSE_OP: + return "CLOSE_OP"; + break; + case TRUNC_OP: + return "TRUNC_OP"; + break; + case GET_EOF_OP: + return "GET_EOF_OP"; + break; + case FINI_OP: + return "FINI_OP"; + break; + case LOGGING_OP: + return "LOGGING_OP"; + break; + } + return "unknown"; +} +#endif + +/*------------------------------------------------------------------------- + * Function: handle_work_request + * + * Purpose: Handle a work request from the thread pool work queue. + * We dispatch the specific function as indicated by the + * TAG that has been added to the work request by the + * IOC main thread (which is just a copy of the MPI tag + * associated with the RPC message) and provide the subfiling + * context associated with the HDF5 file. 
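+ *
+ *               For reference, the switch statement below dispatches
+ *               tags to handlers as follows:
+ *
+ *                   WRITE_INDEP -> ioc_file_queue_write_indep()
+ *                   READ_INDEP  -> ioc_file_queue_read_indep()
+ *                   TRUNC_OP    -> ioc_file_truncate()
+ *                   GET_EOF_OP  -> ioc_file_report_eof()
+ *
+ *               with any other tag being treated as an error.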
+ * + * Any status associated with the function processing is + * returned directly to the client via ACK or NACK messages. + * + * Return: (none) Doesn't fail. + * + * Programmer: Richard Warren + * 7/17/2020 + * + * Changes: Initial Version/None. + * + *------------------------------------------------------------------------- + */ +static HG_THREAD_RETURN_TYPE +handle_work_request(void *arg) +{ + ioc_io_queue_entry_t *q_entry_ptr = (ioc_io_queue_entry_t *)arg; + subfiling_context_t * sf_context = NULL; + sf_work_request_t * msg = &(q_entry_ptr->wk_req); + ioc_data_t * ioc_data = NULL; + int64_t file_context_id = msg->header[2]; + int op_ret; + hg_thread_ret_t ret_value = 0; + + HDassert(q_entry_ptr); + HDassert(q_entry_ptr->magic == H5FD_IOC__IO_Q_ENTRY_MAGIC); + HDassert(q_entry_ptr->in_progress); + + sf_context = H5_get_subfiling_object(file_context_id); + HDassert(sf_context); + + ioc_data = sf_context->ioc_data; + HDassert(ioc_data); + + atomic_fetch_add(&ioc_data->sf_work_pending, 1); + + msg->in_progress = 1; + + switch (msg->tag) { + case WRITE_INDEP: + op_ret = ioc_file_queue_write_indep(msg, msg->subfile_rank, msg->source, sf_context->sf_data_comm, + q_entry_ptr->counter); + break; + + case READ_INDEP: + op_ret = ioc_file_queue_read_indep(msg, msg->subfile_rank, msg->source, sf_context->sf_data_comm); + break; + + case TRUNC_OP: + op_ret = ioc_file_truncate(sf_context->sf_fid, q_entry_ptr->wk_req.header[0], + sf_context->topology->subfile_rank); + break; + + case GET_EOF_OP: + op_ret = ioc_file_report_eof(msg, msg->subfile_rank, msg->source, sf_context->sf_eof_comm); + break; + + default: +#ifdef H5_SUBFILING_DEBUG + H5_subfiling_log(file_context_id, "%s: IOC %d received unknown message with tag %x from rank %d", + __func__, msg->subfile_rank, msg->tag, msg->source); +#endif + + op_ret = -1; + break; + } + + atomic_fetch_sub(&ioc_data->sf_work_pending, 1); + + if (op_ret < 0) { +#ifdef H5_SUBFILING_DEBUG + H5_subfiling_log( + file_context_id, + "%s: IOC %d request(%s) filename=%s from rank(%d), size=%ld, offset=%ld FAILED with ret %d", + __func__, msg->subfile_rank, translate_opcode((io_op_t)msg->tag), sf_context->sf_filename, + msg->source, msg->header[0], msg->header[1], op_ret); +#endif + + q_entry_ptr->wk_ret = op_ret; + } + +#ifdef H5FD_IOC_DEBUG + { + int curr_io_ops_pending = atomic_load(&ioc_data->sf_io_ops_pending); + HDassert(curr_io_ops_pending > 0); + } +#endif + + /* complete the I/O request */ + ioc_io_queue_complete_entry(ioc_data, q_entry_ptr); + + HDassert(atomic_load(&ioc_data->sf_io_ops_pending) >= 0); + + /* Check the I/O Queue to see if there are any dispatchable entries */ + ioc_io_queue_dispatch_eligible_entries(ioc_data, 1); + + H5_SUBFILING_FUNC_LEAVE; +} + +/*------------------------------------------------------------------------- + * Function: begin_thread_exclusive + * + * Purpose: Mutex lock to restrict access to code or variables. + * + * Return: integer result of mutex_lock request. + * + * Programmer: Richard Warren + * 7/17/2020 + * + * Changes: Initial Version/None. + * + *------------------------------------------------------------------------- + */ +void +begin_thread_exclusive(void) +{ + hg_thread_mutex_lock(&ioc_thread_mutex); +} + +/*------------------------------------------------------------------------- + * Function: end_thread_exclusive + * + * Purpose: Mutex unlock. Should only be called by the current holder + * of the locked mutex. + * + * Return: result of mutex_unlock operation. 
+ *
+ * Programmer:   Richard Warren
+ *               7/17/2020
+ *
+ * Changes:      Initial Version/None.
+ *
+ *-------------------------------------------------------------------------
+ */
+void
+end_thread_exclusive(void)
+{
+    hg_thread_mutex_unlock(&ioc_thread_mutex);
+}
+
+static herr_t
+send_ack_to_client(int ack_val, int dest_rank, int source_rank, int msg_tag, MPI_Comm comm)
+{
+    int    mpi_code;
+    herr_t ret_value = SUCCEED;
+
+    HDassert(ack_val > 0);
+
+    (void)source_rank;
+
+    if (MPI_SUCCESS != (mpi_code = MPI_Send(&ack_val, 1, MPI_INT, dest_rank, msg_tag, comm)))
+        H5_SUBFILING_MPI_GOTO_ERROR(FAIL, "MPI_Send", mpi_code);
+
+done:
+    H5_SUBFILING_FUNC_LEAVE;
+}
+
+static herr_t
+send_nack_to_client(int dest_rank, int source_rank, int msg_tag, MPI_Comm comm)
+{
+    int    nack = 0;
+    int    mpi_code;
+    herr_t ret_value = SUCCEED;
+
+    (void)source_rank;
+
+    if (MPI_SUCCESS != (mpi_code = MPI_Send(&nack, 1, MPI_INT, dest_rank, msg_tag, comm)))
+        H5_SUBFILING_MPI_GOTO_ERROR(FAIL, "MPI_Send", mpi_code);
+
+done:
+    H5_SUBFILING_FUNC_LEAVE;
+}
+
+/*
+=========================================
+queue_xxx functions that should be run
+from the thread pool threads...
+=========================================
+*/
+
+/*-------------------------------------------------------------------------
+ * Function:     ioc_file_queue_write_indep
+ *
+ * Purpose:      Implement the IOC independent write function.  The
+ *               function is invoked as a result of the IOC receiving the
+ *               "header"/RPC.  What remains is to allocate memory for the
+ *               data sent by the client and then write the data to our
+ *               subfile.  We utilize pwrite for the actual file writing.
+ *               File flushing is done at file close.
+ *
+ * Return:       The integer status returned by the internal write
+ *               function.  Successful operations will return 0.
+ * Errors:       An MPI related error value.
+ *
+ * Programmer:   Richard Warren
+ *               7/17/2020
+ *
+ * Changes:      Initial Version/None.
+ * + *------------------------------------------------------------------------- + */ +static int +ioc_file_queue_write_indep(sf_work_request_t *msg, int subfile_rank, int source, MPI_Comm comm, + uint32_t counter) +{ + subfiling_context_t *sf_context = NULL; + MPI_Status msg_status; + hbool_t send_nack = FALSE; + int64_t data_size; + int64_t file_offset; + int64_t file_context_id; + int64_t stripe_id; + haddr_t sf_eof; +#ifdef H5FD_IOC_COLLECT_STATS + double t_start; + double t_end; + double t_write; + double t_wait; + double t_queue_delay; +#endif + char *recv_buf = NULL; + int rcv_tag; + int sf_fid; + int data_bytes_received; + int write_ret; + int mpi_code; + int ret_value = 0; + + HDassert(msg); + + /* Retrieve the fields of the RPC message for the write operation */ + data_size = msg->header[0]; + file_offset = msg->header[1]; + file_context_id = msg->header[2]; + + if (data_size < 0) { + send_nack = TRUE; + H5_SUBFILING_GOTO_ERROR(H5E_IO, H5E_BADVALUE, -1, "invalid data size for write"); + } + + sf_context = H5_get_subfiling_object(file_context_id); + HDassert(sf_context); + + stripe_id = file_offset + data_size; + sf_eof = (haddr_t)(stripe_id % sf_context->sf_stripe_size); + + stripe_id /= sf_context->sf_stripe_size; + sf_eof += (haddr_t)((stripe_id * sf_context->sf_blocksize_per_stripe) + sf_context->sf_base_addr); + + /* Flag that we've attempted to write data to the file */ + sf_context->sf_write_count++; + +#ifdef H5FD_IOC_COLLECT_STATS + /* For debugging performance */ + sf_write_ops++; + + t_start = MPI_Wtime(); + t_queue_delay = t_start - msg->start_time; + +#ifdef H5FD_IOC_DEBUG + if (sf_verbose_flag) { + if (sf_logfile) { + HDfprintf(sf_logfile, + "[ioc(%d) %s]: msg from %d: datasize=%ld\toffset=%ld, " + "queue_delay = %lf seconds\n", + subfile_rank, __func__, source, data_size, file_offset, t_queue_delay); + } + } +#endif + +#endif + + /* Allocate space to receive data sent from the client */ + if (NULL == (recv_buf = HDmalloc((size_t)data_size))) { + send_nack = TRUE; + H5_SUBFILING_GOTO_ERROR(H5E_RESOURCE, H5E_CANTALLOC, -1, "couldn't allocate receive buffer for data"); + } + + /* + * Calculate message tag for the client to use for sending + * data, then send an ACK message to the client with the + * calculated message tag. This calculated message tag + * allows us to distinguish between multiple concurrent + * writes from a single rank. 
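+     *
+     * As an illustrative sketch (the actual WRITE_TAG_BASE value is
+     * defined elsewhere in the subfiling code): with WRITE_TAG_BASE =
+     * 100, an MPI tag upper bound of 32767 (the minimum the MPI
+     * standard guarantees), and counter = 7, the arithmetic below
+     * yields
+     *
+     *     rcv_tag = (7 % (INT_MAX - 100))  =   7
+     *     rcv_tag %= (32767 - 100)        ->   7
+     *     rcv_tag += 100                  -> 107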
+ */ + HDassert(H5FD_IOC_tag_ub_val_ptr && (*H5FD_IOC_tag_ub_val_ptr >= WRITE_TAG_BASE)); + rcv_tag = (int)(counter % (INT_MAX - WRITE_TAG_BASE)); + rcv_tag %= (*H5FD_IOC_tag_ub_val_ptr - WRITE_TAG_BASE); + rcv_tag += WRITE_TAG_BASE; + + if (send_ack_to_client(rcv_tag, source, subfile_rank, WRITE_INDEP_ACK, comm) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_IO, H5E_WRITEERROR, -1, "couldn't send ACK to client"); + + /* Receive data from client */ + H5_CHECK_OVERFLOW(data_size, int64_t, int); + if (MPI_SUCCESS != + (mpi_code = MPI_Recv(recv_buf, (int)data_size, MPI_BYTE, source, rcv_tag, comm, &msg_status))) + H5_SUBFILING_MPI_GOTO_ERROR(-1, "MPI_Recv failed", mpi_code); + + if (MPI_SUCCESS != (mpi_code = MPI_Get_count(&msg_status, MPI_BYTE, &data_bytes_received))) + H5_SUBFILING_MPI_GOTO_ERROR(-1, "MPI_Get_count failed", mpi_code); + + if (data_bytes_received != data_size) + H5_SUBFILING_GOTO_ERROR(H5E_IO, H5E_WRITEERROR, -1, + "message size mismatch -- expected = %" PRId64 ", actual = %d", data_size, + data_bytes_received); + +#ifdef H5FD_IOC_COLLECT_STATS + t_end = MPI_Wtime(); + t_wait = t_end - t_start; + sf_write_wait_time += t_wait; + + t_start = t_end; + +#ifdef H5FD_IOC_DEBUG + if (sf_verbose_flag) { + if (sf_logfile) { + HDfprintf(sf_logfile, "[ioc(%d) %s] MPI_Recv(%ld bytes, from = %d) status = %d\n", subfile_rank, + __func__, data_size, source, mpi_code); + } + } +#endif + +#endif + + sf_fid = sf_context->sf_fid; + +#ifdef H5FD_IOC_DEBUG + if (sf_fid < 0) + H5_subfiling_log(file_context_id, "%s: WARNING: attempt to write data to closed subfile FID %d", + __func__, sf_fid); +#endif + + if (sf_fid >= 0) { + /* Actually write data received from client into subfile */ + if ((write_ret = ioc_file_write_data(sf_fid, file_offset, recv_buf, data_size, subfile_rank)) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_IO, H5E_WRITEERROR, -1, + "write function(FID=%d, Source=%d) returned an error (%d)", sf_fid, + source, write_ret); + +#ifdef H5FD_IOC_COLLECT_STATS + t_end = MPI_Wtime(); + t_write = t_end - t_start; + sf_pwrite_time += t_write; +#endif + } + +#ifdef H5FD_IOC_COLLECT_STATS + sf_queue_delay_time += t_queue_delay; +#endif + + begin_thread_exclusive(); + + /* Adjust EOF if necessary */ + if (sf_eof > sf_context->sf_eof) + sf_context->sf_eof = sf_eof; + + end_thread_exclusive(); + +done: + if (send_nack) { + /* Send NACK back to client so client can handle failure gracefully */ + if (send_nack_to_client(source, subfile_rank, WRITE_INDEP_ACK, comm) < 0) + H5_SUBFILING_DONE_ERROR(H5E_IO, H5E_WRITEERROR, -1, "couldn't send NACK to client"); + } + + HDfree(recv_buf); + + H5_SUBFILING_FUNC_LEAVE; +} /* ioc_file_queue_write_indep() */ + +/*------------------------------------------------------------------------- + * Function: ioc_file_queue_read_indep + * + * Purpose: Implement the IOC independent read function. The + * function is invoked as a result of the IOC receiving the + * "header"/RPC. What remains is to allocate memory for + * reading the data and then to send this to the client. + * We utilize pread for the actual file reading. + * + * Return: The integer status returned by the Internal read_independent + * function. Successful operations will return 0. + * Errors: An MPI related error value. + * + * Programmer: Richard Warren + * 7/17/2020 + * + * Changes: Initial Version/None. 
+ * + *------------------------------------------------------------------------- + */ +static int +ioc_file_queue_read_indep(sf_work_request_t *msg, int subfile_rank, int source, MPI_Comm comm) +{ + subfiling_context_t *sf_context = NULL; + hbool_t send_empty_buf = TRUE; + int64_t data_size; + int64_t file_offset; + int64_t file_context_id; +#ifdef H5FD_IOC_COLLECT_STATS + double t_start; + double t_end; + double t_read; + double t_queue_delay; +#endif + char *send_buf = NULL; + int sf_fid; + int read_ret; + int mpi_code; + int ret_value = 0; + + HDassert(msg); + + /* Retrieve the fields of the RPC message for the read operation */ + data_size = msg->header[0]; + file_offset = msg->header[1]; + file_context_id = msg->header[2]; + + if (data_size < 0) + H5_SUBFILING_GOTO_ERROR(H5E_IO, H5E_BADVALUE, -1, "invalid data size for read"); + + sf_context = H5_get_subfiling_object(file_context_id); + HDassert(sf_context); + + /* Flag that we've attempted to read data from the file */ + sf_context->sf_read_count++; + +#ifdef H5FD_IOC_COLLECT_STATS + /* For debugging performance */ + sf_read_ops++; + + t_start = MPI_Wtime(); + t_queue_delay = t_start - msg->start_time; + +#ifdef H5FD_IOC_DEBUG + if (sf_verbose_flag && (sf_logfile != NULL)) { + HDfprintf(sf_logfile, + "[ioc(%d) %s] msg from %d: datasize=%ld\toffset=%ld " + "queue_delay=%lf seconds\n", + subfile_rank, __func__, source, data_size, file_offset, t_queue_delay); + } +#endif + +#endif + + /* Allocate space to send data read from file to client */ + if (NULL == (send_buf = HDmalloc((size_t)data_size))) + H5_SUBFILING_GOTO_ERROR(H5E_RESOURCE, H5E_CANTALLOC, -1, "couldn't allocate send buffer for data"); + + sf_fid = sf_context->sf_fid; + if (sf_fid < 0) + H5_SUBFILING_GOTO_ERROR(H5E_IO, H5E_BADVALUE, -1, "subfile file descriptor %d is invalid", sf_fid); + + /* Read data from the subfile */ + if ((read_ret = ioc_file_read_data(sf_fid, file_offset, send_buf, data_size, subfile_rank)) < 0) { + H5_SUBFILING_GOTO_ERROR(H5E_IO, H5E_READERROR, read_ret, + "read function(FID=%d, Source=%d) returned an error (%d)", sf_fid, source, + read_ret); + } + + send_empty_buf = FALSE; + + /* Send read data to the client */ + H5_CHECK_OVERFLOW(data_size, int64_t, int); + if (MPI_SUCCESS != + (mpi_code = MPI_Send(send_buf, (int)data_size, MPI_BYTE, source, READ_INDEP_DATA, comm))) + H5_SUBFILING_MPI_GOTO_ERROR(-1, "MPI_Send failed", mpi_code); + +#ifdef H5FD_IOC_COLLECT_STATS + t_end = MPI_Wtime(); + t_read = t_end - t_start; + sf_pread_time += t_read; + sf_queue_delay_time += t_queue_delay; + +#ifdef H5FD_IOC_DEBUG + if (sf_verbose_flag && (sf_logfile != NULL)) { + HDfprintf(sf_logfile, "[ioc(%d)] MPI_Send to source(%d) completed\n", subfile_rank, source); + } +#endif + +#endif + +done: + if (send_empty_buf) { + /* + * Send an empty message back to client on failure. The client will + * likely get a message truncation error, but at least shouldn't hang. + */ + if (MPI_SUCCESS != (mpi_code = MPI_Send(NULL, 0, MPI_BYTE, source, READ_INDEP_DATA, comm))) + H5_SUBFILING_MPI_DONE_ERROR(-1, "MPI_Send failed", mpi_code); + } + + HDfree(send_buf); + + return ret_value; +} /* end ioc_file_queue_read_indep() */ + +/* +====================================================== +File functions + +The pread and pwrite posix functions are described as +being thread safe. 
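+
+Note that pread and pwrite may transfer fewer bytes than requested
+("short" reads/writes), so the helper functions below loop -- advancing
+the buffer pointer and file offset -- until the full request has been
+satisfied (or, for reads, until EOF is reached, in which case the
+remainder of the buffer is zero-filled).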
+====================================================== +*/ + +static int +ioc_file_write_data(int fd, int64_t file_offset, void *data_buffer, int64_t data_size, int subfile_rank) +{ + ssize_t bytes_remaining = (ssize_t)data_size; + ssize_t bytes_written = 0; + char * this_data = (char *)data_buffer; + int ret_value = 0; + +#ifndef H5FD_IOC_DEBUG + (void)subfile_rank; +#endif + + HDcompile_assert(H5_SIZEOF_OFF_T == sizeof(file_offset)); + + while (bytes_remaining) { + errno = 0; + + bytes_written = HDpwrite(fd, this_data, (size_t)bytes_remaining, file_offset); + + if (bytes_written >= 0) { + bytes_remaining -= bytes_written; + +#ifdef H5FD_IOC_DEBUG + HDprintf("[ioc(%d) %s]: wrote %ld bytes, remaining=%ld, file_offset=%" PRId64 "\n", subfile_rank, + __func__, bytes_written, bytes_remaining, file_offset); +#endif + + this_data += bytes_written; + file_offset += bytes_written; + } + else { + H5_SUBFILING_SYS_GOTO_ERROR(H5E_IO, H5E_WRITEERROR, -1, "HDpwrite failed"); + } + } + + /* We don't usually use this for each file write. We usually do the file + * flush as part of file close operation. + */ +#ifdef H5FD_IOC_REQUIRE_FLUSH + fdatasync(fd); +#endif + +done: + H5_SUBFILING_FUNC_LEAVE; +} /* end ioc_file_write_data() */ + +static int +ioc_file_read_data(int fd, int64_t file_offset, void *data_buffer, int64_t data_size, int subfile_rank) +{ + useconds_t delay = 100; + ssize_t bytes_remaining = (ssize_t)data_size; + ssize_t bytes_read = 0; + char * this_buffer = (char *)data_buffer; + int retries = MIN_READ_RETRIES; + int ret_value = 0; + +#ifndef H5FD_IOC_DEBUG + (void)subfile_rank; +#endif + + HDcompile_assert(H5_SIZEOF_OFF_T == sizeof(file_offset)); + + while (bytes_remaining) { + errno = 0; + + bytes_read = HDpread(fd, this_buffer, (size_t)bytes_remaining, file_offset); + + if (bytes_read > 0) { + /* Reset retry params */ + retries = MIN_READ_RETRIES; + delay = 100; + + bytes_remaining -= bytes_read; + +#ifdef H5FD_IOC_DEBUG + HDprintf("[ioc(%d) %s]: read %ld bytes, remaining=%ld, file_offset=%" PRId64 "\n", subfile_rank, + __func__, bytes_read, bytes_remaining, file_offset); +#endif + + this_buffer += bytes_read; + file_offset += bytes_read; + } + else if (bytes_read == 0) { + HDassert(bytes_remaining > 0); + + /* end of file but not end of format address space */ + HDmemset(this_buffer, 0, (size_t)bytes_remaining); + break; + } + else { + if (retries == 0) { +#ifdef H5FD_IOC_DEBUG + HDprintf("[ioc(%d) %s]: TIMEOUT: file_offset=%" PRId64 ", data_size=%ld\n", subfile_rank, + __func__, file_offset, data_size); +#endif + + H5_SUBFILING_SYS_GOTO_ERROR(H5E_IO, H5E_READERROR, -1, "HDpread failed"); + } + + retries--; + usleep(delay); + delay *= 2; + } + } + +done: + H5_SUBFILING_FUNC_LEAVE; +} /* end ioc_file_read_data() */ + +static int +ioc_file_truncate(int fd, int64_t length, int subfile_rank) +{ + int ret_value = 0; + +#ifndef H5FD_IOC_DEBUG + (void)subfile_rank; +#endif + + if (HDftruncate(fd, (off_t)length) != 0) + H5_SUBFILING_SYS_GOTO_ERROR(H5E_FILE, H5E_SEEKERROR, -1, "HDftruncate failed"); + +#ifdef H5FD_IOC_DEBUG + HDprintf("[ioc(%d) %s]: truncated subfile to %lld bytes. ret = %d\n", subfile_rank, __func__, + (long long)length, errno); + HDfflush(stdout); +#endif + +done: + H5_SUBFILING_FUNC_LEAVE; +} /* end ioc_file_truncate() */ + +/*------------------------------------------------------------------------- + * Function: ioc_file_report_eof + * + * Purpose: Determine the target sub-file's eof and report this value + * to the requesting rank. 
+ * + * Notes: This function will have to be reworked once we solve + * the IOC error reporting problem. + * + * This function mixes functionality that should be + * in two different VFDs. + * + * Return: 0 if successful, 1 or an MPI error code on failure. + * + * Programmer: John Mainzer + * 7/17/2020 + * + * Changes: Initial Version/None. + * + *------------------------------------------------------------------------- + */ + +static int +ioc_file_report_eof(sf_work_request_t *msg, int subfile_rank, int source, MPI_Comm comm) +{ + subfiling_context_t *sf_context = NULL; + h5_stat_t sb; + int64_t eof_req_reply[3]; + int64_t file_context_id; + int fd; + int mpi_code; + int ret_value = 0; + + HDassert(msg); + + /* first get the EOF of the target file. */ + + file_context_id = msg->header[2]; + + if (NULL == (sf_context = H5_get_subfiling_object(file_context_id))) + H5_SUBFILING_GOTO_ERROR(H5E_FILE, H5E_CANTGET, -1, "couldn't retrieve subfiling context"); + + fd = sf_context->sf_fid; + + if (HDfstat(fd, &sb) < 0) + H5_SUBFILING_SYS_GOTO_ERROR(H5E_FILE, H5E_SYSERRSTR, -1, "HDfstat failed"); + + eof_req_reply[0] = (int64_t)subfile_rank; + eof_req_reply[1] = (int64_t)(sb.st_size); + eof_req_reply[2] = 0; /* not used */ + +#ifdef H5_SUBFILING_DEBUG + H5_subfiling_log(file_context_id, "%s: reporting file EOF as %" PRId64 ".", __func__, eof_req_reply[1]); +#endif + + /* return the subfile EOF to the querying rank */ + if (MPI_SUCCESS != (mpi_code = MPI_Send(eof_req_reply, 3, MPI_INT64_T, source, GET_EOF_COMPLETED, comm))) + H5_SUBFILING_MPI_GOTO_ERROR(-1, "MPI_Send", mpi_code); + +done: + H5_SUBFILING_FUNC_LEAVE; +} /* ioc_file_report_eof() */ + +/*------------------------------------------------------------------------- + * Function: ioc_io_queue_alloc_entry + * + * Purpose: Allocate and initialize an instance of + * ioc_io_queue_entry_t. Return pointer to the new + * instance on success, and NULL on failure. + * + * Return: Pointer to new instance of ioc_io_queue_entry_t + * on success, and NULL on failure. + * + * Programmer: JRM -- 11/6/21 + * + * Changes: None. + * + *------------------------------------------------------------------------- + */ +static ioc_io_queue_entry_t * +ioc_io_queue_alloc_entry(void) +{ + ioc_io_queue_entry_t *q_entry_ptr = NULL; + + q_entry_ptr = (ioc_io_queue_entry_t *)HDmalloc(sizeof(ioc_io_queue_entry_t)); + + if (q_entry_ptr) { + + q_entry_ptr->magic = H5FD_IOC__IO_Q_ENTRY_MAGIC; + q_entry_ptr->next = NULL; + q_entry_ptr->prev = NULL; + q_entry_ptr->in_progress = FALSE; + q_entry_ptr->counter = 0; + + /* will memcpy the wk_req field, so don't bother to initialize */ + /* will initialize thread_wk field before use */ + + q_entry_ptr->wk_ret = 0; + +#ifdef H5FD_IOC_COLLECT_STATS + q_entry_ptr->q_time = 0; + q_entry_ptr->dispatch_time = 0; +#endif + } + + return q_entry_ptr; +} /* ioc_io_queue_alloc_entry() */ + +/*------------------------------------------------------------------------- + * Function: ioc_io_queue_add_entry + * + * Purpose: Add an I/O request to the tail of the IOC I/O Queue. + * + * To do this, we must: + * + * 1) allocate a new instance of ioc_io_queue_entry_t + * + * 2) Initialize the new instance and copy the supplied + * instance of sf_work_request_t into it. + * + * 3) Append it to the IOC I/O queue. + * + * Note that this does not dispatch the request even if it + * is eligible for immediate dispatch. This is done with + * a call to ioc_io_queue_dispatch_eligible_entries(). + * + * Return: void. 
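+ *
+ *               Note that the req_counter assigned to each new entry
+ *               below increases monotonically; for write requests it is
+ *               later used (modulo the available MPI tag space) to
+ *               compute the ACK tag handed back to the client -- see
+ *               ioc_file_queue_write_indep().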
+ * + * Programmer: JRM -- 11/7/21 + * + * Changes: None. + * + *------------------------------------------------------------------------- + */ +static void +ioc_io_queue_add_entry(ioc_data_t *ioc_data, sf_work_request_t *wk_req_ptr) +{ + ioc_io_queue_entry_t *entry_ptr = NULL; + + HDassert(ioc_data); + HDassert(ioc_data->io_queue.magic == H5FD_IOC__IO_Q_MAGIC); + HDassert(wk_req_ptr); + + entry_ptr = ioc_io_queue_alloc_entry(); + + HDassert(entry_ptr); + HDassert(entry_ptr->magic == H5FD_IOC__IO_Q_ENTRY_MAGIC); + + HDmemcpy((void *)(&(entry_ptr->wk_req)), (const void *)wk_req_ptr, sizeof(sf_work_request_t)); + + /* must obtain io_queue mutex before appending */ + hg_thread_mutex_lock(&ioc_data->io_queue.q_mutex); + + HDassert(ioc_data->io_queue.q_len == atomic_load(&ioc_data->sf_io_ops_pending)); + + entry_ptr->counter = ioc_data->io_queue.req_counter++; + + ioc_data->io_queue.num_pending++; + + H5FD_IOC__Q_APPEND(&ioc_data->io_queue, entry_ptr); + + atomic_fetch_add(&ioc_data->sf_io_ops_pending, 1); + +#ifdef H5_SUBFILING_DEBUG + H5_subfiling_log(wk_req_ptr->context_id, + "%s: request %d queued. op = %d, offset/len = %lld/%lld, q-ed/disp/ops_pend = %d/%d/%d.", + __func__, entry_ptr->counter, (entry_ptr->wk_req.tag), + (long long)(entry_ptr->wk_req.header[1]), (long long)(entry_ptr->wk_req.header[0]), + ioc_data->io_queue.num_pending, ioc_data->io_queue.num_in_progress, + atomic_load(&ioc_data->sf_io_ops_pending)); +#endif + + HDassert(ioc_data->io_queue.num_pending + ioc_data->io_queue.num_in_progress == ioc_data->io_queue.q_len); + +#ifdef H5FD_IOC_COLLECT_STATS + entry_ptr->q_time = H5_now_usec(); + + if (ioc_data->io_queue.q_len > ioc_data->io_queue.max_q_len) { + ioc_data->io_queue.max_q_len = ioc_data->io_queue.q_len; + } + + if (ioc_data->io_queue.num_pending > ioc_data->io_queue.max_num_pending) { + ioc_data->io_queue.max_num_pending = ioc_data->io_queue.num_pending; + } + + if (entry_ptr->wk_req.tag == READ_INDEP) { + ioc_data->io_queue.ind_read_requests++; + } + else if (entry_ptr->wk_req.tag == WRITE_INDEP) { + ioc_data->io_queue.ind_write_requests++; + } + else if (entry_ptr->wk_req.tag == TRUNC_OP) { + ioc_data->io_queue.truncate_requests++; + } + else if (entry_ptr->wk_req.tag == GET_EOF_OP) { + ioc_data->io_queue.get_eof_requests++; + } + + ioc_data->io_queue.requests_queued++; +#endif + +#ifdef H5_SUBFILING_DEBUG + if (ioc_data->io_queue.q_len != atomic_load(&ioc_data->sf_io_ops_pending)) { + H5_subfiling_log( + wk_req_ptr->context_id, + "%s: ioc_data->io_queue->q_len = %d != %d = atomic_load(&ioc_data->sf_io_ops_pending).", __func__, + ioc_data->io_queue.q_len, atomic_load(&ioc_data->sf_io_ops_pending)); + } +#endif + + HDassert(ioc_data->io_queue.q_len == atomic_load(&ioc_data->sf_io_ops_pending)); + + hg_thread_mutex_unlock(&ioc_data->io_queue.q_mutex); + + return; +} /* ioc_io_queue_add_entry() */ + +/*------------------------------------------------------------------------- + * Function: ioc_io_queue_dispatch_eligible_entries + * + * Purpose: Scan the IOC I/O Queue for dispatchable entries, and + * dispatch any such entries found. + * + * Do this by scanning the I/O queue from head to tail for + * entries that: + * + * 1) Have not already been dispatched + * + * 2) Either: + * + * a) do not intersect with any prior entries on the + * I/O queue, or + * + * b) Are read requests, and all intersections are with + * prior read requests. + * + * Dispatch any such entries found. + * + * Do this to maintain the POSIX semantics required by + * HDF5. 
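+ *
+ *              For example (illustrative offsets/lengths): given a queue
+ *              of W(0,100), R(50,10), R(200,10), W(200,10), only W(0,100)
+ *              and R(200,10) are dispatchable -- R(50,10) intersects the
+ *              earlier write, and W(200,10) intersects the earlier read.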
+ *
+ *              Note that TRUNC_OPs and GET_EOF_OPs are a special case.
+ *              Specifically, no I/O queue entry can be dispatched if
+ *              there is a truncate or get EOF operation between it and
+ *              the head of the queue.  Further, a truncate or get EOF
+ *              request cannot be executed unless it is at the head of
+ *              the queue.
+ *
+ * Return:      void.
+ *
+ * Programmer:  JRM -- 11/7/21
+ *
+ * Changes:     None.
+ *
+ *-------------------------------------------------------------------------
+ */
+/* TODO: Keep an eye on statistics and optimize this algorithm if necessary.  While it is O(N)
+ *       where N is the number of elements in the I/O Queue if there are no overlaps, it
+ *       can become O(N**2) in the worst case.
+ */
+static void
+ioc_io_queue_dispatch_eligible_entries(ioc_data_t *ioc_data, hbool_t try_lock)
+{
+    hbool_t               conflict_detected;
+    int64_t               entry_offset;
+    int64_t               entry_len;
+    int64_t               scan_offset;
+    int64_t               scan_len;
+    ioc_io_queue_entry_t *entry_ptr = NULL;
+    ioc_io_queue_entry_t *scan_ptr  = NULL;
+
+    HDassert(ioc_data);
+    HDassert(ioc_data->io_queue.magic == H5FD_IOC__IO_Q_MAGIC);
+
+    if (try_lock) {
+        if (hg_thread_mutex_try_lock(&ioc_data->io_queue.q_mutex) < 0)
+            return;
+    }
+    else
+        hg_thread_mutex_lock(&ioc_data->io_queue.q_mutex);
+
+    entry_ptr = ioc_data->io_queue.q_head;
+
+    /* sanity check on first element in the I/O queue */
+    HDassert((entry_ptr == NULL) || (entry_ptr->prev == NULL));
+
+    while ((entry_ptr) && (ioc_data->io_queue.num_pending > 0)) {
+
+        HDassert(entry_ptr->magic == H5FD_IOC__IO_Q_ENTRY_MAGIC);
+
+        /* Check for a get EOF or truncate operation at head of queue */
+        if (ioc_data->io_queue.q_head->in_progress) {
+            if ((ioc_data->io_queue.q_head->wk_req.tag == TRUNC_OP) ||
+                (ioc_data->io_queue.q_head->wk_req.tag == GET_EOF_OP)) {
+
+                /* we have a truncate or get eof operation in progress -- thus no other operations
+                 * can be dispatched until the truncate or get eof operation completes.  Just break
+                 * out of the loop.
+                 */
+
+                break;
+            }
+        }
+
+        if (!entry_ptr->in_progress) {
+
+            entry_offset = entry_ptr->wk_req.header[1];
+            entry_len    = entry_ptr->wk_req.header[0];
+
+            conflict_detected = FALSE;
+
+            scan_ptr = entry_ptr->prev;
+
+            HDassert((scan_ptr == NULL) || (scan_ptr->magic == H5FD_IOC__IO_Q_ENTRY_MAGIC));
+
+            if ((entry_ptr->wk_req.tag == TRUNC_OP) || (entry_ptr->wk_req.tag == GET_EOF_OP)) {
+
+                if (scan_ptr != NULL) {
+
+                    /* the TRUNC_OP or GET_EOF_OP is not at the head of the queue, and thus cannot
+                     * be dispatched.  Further, no operation can be dispatched if a truncate request
+                     * appears before it in the queue.  Thus we have done all we can and will break
+                     * out of the loop.
+                     */
+                    break;
+                }
+            }
+
+            while ((scan_ptr) && (!conflict_detected)) {
+
+                /* check for overlaps */
+                scan_offset = scan_ptr->wk_req.header[1];
+                scan_len    = scan_ptr->wk_req.header[0];
+
+                /* at present, I/O requests are scalar -- i.e. single blocks specified by offset and length.
+                 * when this changes, this if statement will have to be updated accordingly.
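+                 *
+                 * Illustration (half-open intervals [offset, offset+len)):
+                 * requests 0/100 and 50/100 overlap, since (50 + 100) > 0
+                 * and (0 + 100) > 50; requests 0/100 and 100/100 do not,
+                 * since (0 + 100) > 100 is false.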
+                 */
+                if (((scan_offset + scan_len) > entry_offset) && ((entry_offset + entry_len) > scan_offset)) {
+
+                    /* the two requests overlap -- unless they are both reads, we have detected a conflict */
+
+                    /* TODO: update this if statement when we add collective I/O */
+                    if ((entry_ptr->wk_req.tag != READ_INDEP) || (scan_ptr->wk_req.tag != READ_INDEP)) {
+
+                        conflict_detected = TRUE;
+                    }
+                }
+
+                scan_ptr = scan_ptr->prev;
+            }
+
+            if (!conflict_detected) { /* dispatch I/O request */
+
+                HDassert(scan_ptr == NULL);
+                HDassert(!entry_ptr->in_progress);
+
+                entry_ptr->in_progress = TRUE;
+
+                HDassert(ioc_data->io_queue.num_pending > 0);
+
+                ioc_data->io_queue.num_pending--;
+                ioc_data->io_queue.num_in_progress++;
+
+                HDassert(ioc_data->io_queue.num_pending + ioc_data->io_queue.num_in_progress ==
+                         ioc_data->io_queue.q_len);
+
+                entry_ptr->thread_wk.func = handle_work_request;
+                entry_ptr->thread_wk.args = entry_ptr;
+
+#ifdef H5_SUBFILING_DEBUG
+                H5_subfiling_log(entry_ptr->wk_req.context_id,
+                                 "%s: request %d dispatched. op = %d, offset/len = %lld/%lld, "
+                                 "q-ed/disp/ops_pend = %d/%d/%d.",
+                                 __func__, entry_ptr->counter, (entry_ptr->wk_req.tag),
+                                 (long long)(entry_ptr->wk_req.header[1]),
+                                 (long long)(entry_ptr->wk_req.header[0]), ioc_data->io_queue.num_pending,
+                                 ioc_data->io_queue.num_in_progress,
+                                 atomic_load(&ioc_data->sf_io_ops_pending));
+#endif
+
+#ifdef H5FD_IOC_COLLECT_STATS
+                if (ioc_data->io_queue.num_in_progress > ioc_data->io_queue.max_num_in_progress) {
+                    ioc_data->io_queue.max_num_in_progress = ioc_data->io_queue.num_in_progress;
+                }
+
+                ioc_data->io_queue.requests_dispatched++;
+
+                entry_ptr->dispatch_time = H5_now_usec();
+#endif
+
+                hg_thread_pool_post(ioc_data->io_thread_pool, &(entry_ptr->thread_wk));
+            }
+        }
+
+        entry_ptr = entry_ptr->next;
+    }
+
+    HDassert(ioc_data->io_queue.q_len == atomic_load(&ioc_data->sf_io_ops_pending));
+
+    hg_thread_mutex_unlock(&ioc_data->io_queue.q_mutex);
+} /* ioc_io_queue_dispatch_eligible_entries() */
+
+/*-------------------------------------------------------------------------
+ * Function:     ioc_io_queue_complete_entry
+ *
+ * Purpose:      Update the IOC I/O Queue for the completion of an I/O
+ *               request.
+ *
+ *               To do this:
+ *
+ *               1) Remove the entry from the I/O Queue
+ *
+ *               2) If so configured, update statistics
+ *
+ *               3) Discard the instance of ioc_io_queue_entry_t.
+ *
+ * Return:       void.
+ *
+ * Programmer:   JRM -- 11/7/21
+ *
+ * Changes:      None.
+ *
+ *-------------------------------------------------------------------------
+ */
+static void
+ioc_io_queue_complete_entry(ioc_data_t *ioc_data, ioc_io_queue_entry_t *entry_ptr)
+{
+#ifdef H5FD_IOC_COLLECT_STATS
+    uint64_t queued_time;
+    uint64_t execution_time;
+#endif
+
+    HDassert(ioc_data);
+    HDassert(ioc_data->io_queue.magic == H5FD_IOC__IO_Q_MAGIC);
+    HDassert(entry_ptr);
+    HDassert(entry_ptr->magic == H5FD_IOC__IO_Q_ENTRY_MAGIC);
+
+    /* must obtain io_queue mutex before deleting and updating stats */
+    hg_thread_mutex_lock(&ioc_data->io_queue.q_mutex);
+
+    HDassert(ioc_data->io_queue.num_pending + ioc_data->io_queue.num_in_progress == ioc_data->io_queue.q_len);
+    HDassert(ioc_data->io_queue.num_in_progress > 0);
+
+    if (entry_ptr->wk_ret < 0)
+        ioc_data->io_queue.num_failed++;
+
+    H5FD_IOC__Q_REMOVE(&ioc_data->io_queue, entry_ptr);
+
+    ioc_data->io_queue.num_in_progress--;
+
+    HDassert(ioc_data->io_queue.num_pending + ioc_data->io_queue.num_in_progress == ioc_data->io_queue.q_len);
+
+    atomic_fetch_sub(&ioc_data->sf_io_ops_pending, 1);
+
+#ifdef H5_SUBFILING_DEBUG
+    H5_subfiling_log(entry_ptr->wk_req.context_id,
+                     "%s: request %d completed with ret %d. op = %d, offset/len = %lld/%lld, "
+                     "q-ed/disp/ops_pend = %d/%d/%d.",
+                     __func__, entry_ptr->counter, entry_ptr->wk_ret, (entry_ptr->wk_req.tag),
+                     (long long)(entry_ptr->wk_req.header[1]), (long long)(entry_ptr->wk_req.header[0]),
+                     ioc_data->io_queue.num_pending, ioc_data->io_queue.num_in_progress,
+                     atomic_load(&ioc_data->sf_io_ops_pending));
+
+    /*
+     * If this I/O request is a truncate or "get eof" op, make sure
+     * there aren't other operations in progress
+     */
+    if ((entry_ptr->wk_req.tag == GET_EOF_OP) || (entry_ptr->wk_req.tag == TRUNC_OP))
+        HDassert(ioc_data->io_queue.num_in_progress == 0);
+#endif
+
+    HDassert(ioc_data->io_queue.q_len == atomic_load(&ioc_data->sf_io_ops_pending));
+
+#ifdef H5FD_IOC_COLLECT_STATS
+    /* Compute the queued and execution time */
+    queued_time    = entry_ptr->dispatch_time - entry_ptr->q_time;
+    execution_time = H5_now_usec() - entry_ptr->dispatch_time;
+
+    ioc_data->io_queue.requests_completed++;
+
+    entry_ptr->q_time = H5_now_usec();
+
+#endif
+
+    hg_thread_mutex_unlock(&ioc_data->io_queue.q_mutex);
+
+    HDassert(entry_ptr->wk_req.buffer == NULL);
+
+    ioc_io_queue_free_entry(entry_ptr);
+
+    entry_ptr = NULL;
+
+    return;
+} /* ioc_io_queue_complete_entry() */
+
+/*-------------------------------------------------------------------------
+ * Function:     ioc_io_queue_free_entry
+ *
+ * Purpose:      Free the supplied instance of ioc_io_queue_entry_t.
+ *
+ *               Verify that magic field is set to
+ *               H5FD_IOC__IO_Q_ENTRY_MAGIC, and that the next and prev
+ *               fields are NULL.
+ *
+ * Return:       void.
+ *
+ * Programmer:   JRM -- 11/6/21
+ *
+ * Changes:      None.
+ *
+ *-------------------------------------------------------------------------
+ */
+static void
+ioc_io_queue_free_entry(ioc_io_queue_entry_t *q_entry_ptr)
+{
+    /* use assertions for error checking, since the following should never fail. */
+    HDassert(q_entry_ptr);
+    HDassert(q_entry_ptr->magic == H5FD_IOC__IO_Q_ENTRY_MAGIC);
+    HDassert(q_entry_ptr->next == NULL);
+    HDassert(q_entry_ptr->prev == NULL);
+    HDassert(q_entry_ptr->wk_req.buffer == NULL);
+
+    q_entry_ptr->magic = 0;
+
+    HDfree(q_entry_ptr);
+
+    q_entry_ptr = NULL;
+
+    return;
+} /* ioc_io_queue_free_entry() */
diff --git a/src/H5FDsubfiling/H5FDsubfile_int.c b/src/H5FDsubfiling/H5FDsubfile_int.c
new file mode 100644
index 0000000..577762a
--- /dev/null
+++ b/src/H5FDsubfiling/H5FDsubfile_int.c
@@ -0,0 +1,328 @@
+/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
+ * Copyright by The HDF Group.                                               *
+ * All rights reserved.                                                      *
+ *                                                                           *
+ * This file is part of HDF5.  The full HDF5 copyright notice, including    *
+ * terms governing use, modification, and redistribution, is contained in   *
+ * the COPYING file, which can be found at the root of the source code      *
+ * distribution tree, or in https://support.hdfgroup.org/ftp/HDF5/releases. *
+ * If you do not have access to either file, you may request a copy from    *
+ * help@hdfgroup.org.                                                        *
+ * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
+
+/*
+ * Programmer:  Richard Warren
+ *              Wednesday, July 1, 2020
+ *
+ * Purpose:     This is part of a parallel subfiling I/O driver.
+ *
+ */
+
+/***********/
+/* Headers */
+/***********/
+
+#include "H5FDsubfiling_priv.h"
+
+/*-------------------------------------------------------------------------
+ * Function:    H5FD__subfiling__truncate_sub_files
+ *
+ * Note:        This code should be moved -- most likely to the IOC
+ *              code files.
+ *
+ * Purpose:     Apply a truncate operation to the sub-files.
+ *
+ *              In the context of the I/O concentrators, the eof must be
+ *              translated into the appropriate value for each of the
+ *              sub-files, and then applied to same.
+ *
+ *              Further, we must ensure that all prior I/O requests complete
+ *              before the truncate is applied.
+ *
+ *              We do this as follows:
+ *
+ *              1) Run a barrier on entry.
+ *
+ *              2) Determine if this rank is an IOC.  If it is, compute
+ *                 the correct EOF for this sub-file, and send a truncate
+ *                 request to the IOC.
+ *
+ *              3) On the IOC thread, allow all pending I/O requests
+ *                 received prior to the truncate request to complete
+ *                 before performing the truncate.
+ *
+ *              4) Run a barrier on exit.
+ *
+ *              Observe that the barrier on entry ensures that any prior
+ *              I/O requests will have been queued before the truncate
+ *              request is sent to the IOC.
+ *
+ *              Similarly, the barrier on exit ensures that no subsequent
+ *              I/O request will reach the IOC before the truncate request
+ *              has been queued.
+ *
+ * Return:      SUCCEED/FAIL
+ *
+ * Programmer:  JRM -- 12/13/21
+ *
+ * Changes:     None.
+ *
+ *-------------------------------------------------------------------------
+ */
+herr_t
+H5FD__subfiling__truncate_sub_files(hid_t context_id, int64_t logical_file_eof, MPI_Comm comm)
+{
+    int                  mpi_code; /* MPI return code */
+    subfiling_context_t *sf_context = NULL;
+    int64_t              msg[3]     = {
+        0,
+    };
+    herr_t ret_value = SUCCEED; /* Return value */
+
+    /* Barrier on entry */
+    if (MPI_SUCCESS != (mpi_code = MPI_Barrier(comm)))
+        H5_SUBFILING_MPI_GOTO_ERROR(FAIL, "MPI_Barrier failed", mpi_code);
+
+    if (NULL == (sf_context = (subfiling_context_t *)H5_get_subfiling_object(context_id)))
+        H5_SUBFILING_GOTO_ERROR(H5E_FILE, H5E_BADVALUE, FAIL, "can't get subfile context");
+
+    /* Test to see if this rank is running an I/O concentrator.
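+     *
+     * Worked example (illustrative numbers): with sf_stripe_size = 1024,
+     * 3 IOCs (so sf_blocksize_per_stripe = 3072), and logical_file_eof =
+     * 7268, the code below computes num_full_stripes = 2 and
+     * partial_stripe_len = 1124, giving sub-file EOFs of:
+     *
+     *     subfile_rank 0: 2 * 1024 + 1024          = 3072
+     *     subfile_rank 1: 2 * 1024 + (1124 % 1024) = 2148
+     *     subfile_rank 2: 2 * 1024                 = 2048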
+     */
+
+    if (sf_context->topology->rank_is_ioc) {
+
+        int     i;
+        int64_t subfile_eof;
+        int64_t num_full_stripes;
+        int64_t partial_stripe_len;
+#ifndef NDEBUG
+        int64_t test_file_eof;
+#endif /* NDEBUG */
+
+        /* if it is, first compute the sub-file EOF */
+
+        num_full_stripes   = logical_file_eof / sf_context->sf_blocksize_per_stripe;
+        partial_stripe_len = logical_file_eof % sf_context->sf_blocksize_per_stripe;
+
+        subfile_eof = num_full_stripes * sf_context->sf_stripe_size;
+
+        if (sf_context->topology->subfile_rank < (partial_stripe_len / sf_context->sf_stripe_size)) {
+
+            subfile_eof += sf_context->sf_stripe_size;
+        }
+        else if (sf_context->topology->subfile_rank == (partial_stripe_len / sf_context->sf_stripe_size)) {
+
+            subfile_eof += partial_stripe_len % sf_context->sf_stripe_size;
+        }
+
+        /* sanity check -- compute the file eof using the same mechanism used to
+         * compute the sub-file eof.  Assert that the computed value and the
+         * actual value match.
+         *
+         * Do this only for debug builds -- probably delete this before release.
+         *
+         *                                           JRM -- 12/15/21
+         */
+
+#ifndef NDEBUG
+        test_file_eof = 0;
+
+        for (i = 0; i < sf_context->topology->n_io_concentrators; i++) {
+
+            test_file_eof += num_full_stripes * sf_context->sf_stripe_size;
+
+            if (i < (partial_stripe_len / sf_context->sf_stripe_size)) {
+
+                test_file_eof += sf_context->sf_stripe_size;
+            }
+            else if (i == (partial_stripe_len / sf_context->sf_stripe_size)) {
+
+                test_file_eof += partial_stripe_len % sf_context->sf_stripe_size;
+            }
+        }
+        HDassert(test_file_eof == logical_file_eof);
+#endif /* NDEBUG */
+
+        /* then direct the IOC to truncate the sub-file to the correct EOF */
+
+        msg[0] = subfile_eof;
+        msg[1] = 0; /* padding -- not used in this message */
+        msg[2] = context_id;
+
+        if (MPI_SUCCESS !=
+            (mpi_code = MPI_Send(msg, 3, MPI_INT64_T,
+                                 sf_context->topology->io_concentrators[sf_context->topology->subfile_rank],
+                                 TRUNC_OP, sf_context->sf_msg_comm)))
+            H5_SUBFILING_MPI_GOTO_ERROR(FAIL, "MPI_Send failed", mpi_code);
+    }
+
+    /* Barrier on exit */
+    if (MPI_SUCCESS != (mpi_code = MPI_Barrier(comm)))
+        H5_SUBFILING_MPI_GOTO_ERROR(FAIL, "MPI_Barrier failed", mpi_code);
+
+done:
+
+    H5_SUBFILING_FUNC_LEAVE;
+} /* H5FD__subfiling__truncate_sub_files() */
+
+/*-------------------------------------------------------------------------
+ * Function:    H5FD__subfiling__get_real_eof
+ *
+ * Note:        This code should be moved -- most likely to the IOC
+ *              code files.
+ *
+ * Purpose:     Query each subfile to get its local EOF, and then use this
+ *              data to calculate the actual EOF.
+ *
+ *              Do this as follows:
+ *
+ *              1) allocate an array of int64_t of length equal to the
+ *                 number of IOCs, and initialize all fields to -1.
+ *
+ *              2) Send each IOC a message requesting that sub-file's EOF.
+ *
+ *              3) Await reply from each IOC, storing the reply in
+ *                 the appropriate entry in the array allocated in 1.
+ *
+ *              4) After all IOCs have replied, compute the offset of
+ *                 each subfile in the logical file.  Take the maximum
+ *                 of these values, and report this value as the overall
+ *                 EOF.
+ *
+ *              Note that this operation is not collective, and can return
+ *              invalid data if other ranks perform writes while this
+ *              operation is in progress.
+ *
+ * Return:      SUCCEED/FAIL
+ *
+ * Programmer:  JRM -- 1/18/22
+ *
+ * Changes:     None.
+ * + *------------------------------------------------------------------------- + */ +herr_t +H5FD__subfiling__get_real_eof(hid_t context_id, int64_t *logical_eof_ptr) +{ + subfiling_context_t *sf_context = NULL; + MPI_Request * recv_reqs = NULL; + int64_t * recv_msg = NULL; + int64_t * sf_eofs = NULL; /* dynamically allocated array for subfile EOFs */ + int64_t msg[3] = {0, 0, 0}; + int64_t logical_eof = 0; + int64_t sf_logical_eof; + int n_io_concentrators = 0; /* copy of value in topology */ + int mpi_code; /* MPI return code */ + herr_t ret_value = SUCCEED; /* Return value */ + + HDassert(logical_eof_ptr); + + if (NULL == (sf_context = (subfiling_context_t *)H5_get_subfiling_object(context_id))) + H5_SUBFILING_GOTO_ERROR(H5E_FILE, H5E_BADVALUE, FAIL, "can't get subfile context"); + + HDassert(sf_context->topology); + + n_io_concentrators = sf_context->topology->n_io_concentrators; + + HDassert(n_io_concentrators > 0); + + if (NULL == (sf_eofs = HDmalloc((size_t)n_io_concentrators * sizeof(int64_t)))) + H5_SUBFILING_GOTO_ERROR(H5E_RESOURCE, H5E_CANTALLOC, FAIL, "can't allocate sub-file EOFs array"); + if (NULL == (recv_reqs = HDmalloc((size_t)n_io_concentrators * sizeof(*recv_reqs)))) + H5_SUBFILING_GOTO_ERROR(H5E_RESOURCE, H5E_CANTALLOC, FAIL, "can't allocate receive requests array"); + if (NULL == (recv_msg = HDmalloc((size_t)n_io_concentrators * 3 * sizeof(*recv_msg)))) + H5_SUBFILING_GOTO_ERROR(H5E_RESOURCE, H5E_CANTALLOC, FAIL, "can't allocate message array"); + + for (int i = 0; i < n_io_concentrators; i++) { + sf_eofs[i] = -1; + recv_reqs[i] = MPI_REQUEST_NULL; + } + + /* Post early non-blocking receives for replies from each IOC */ + for (int i = 0; i < n_io_concentrators; i++) { + int ioc_rank = sf_context->topology->io_concentrators[i]; + + if (MPI_SUCCESS != (mpi_code = MPI_Irecv(&recv_msg[3 * i], 3, MPI_INT64_T, ioc_rank, + GET_EOF_COMPLETED, sf_context->sf_eof_comm, &recv_reqs[i]))) + H5_SUBFILING_MPI_GOTO_ERROR(FAIL, "MPI_Irecv", mpi_code); + } + + /* Send each IOC a message requesting that subfile's EOF */ + + msg[0] = 0; /* padding -- not used in this message */ + msg[1] = 0; /* padding -- not used in this message */ + msg[2] = context_id; + + for (int i = 0; i < n_io_concentrators; i++) { + int ioc_rank = sf_context->topology->io_concentrators[i]; + + if (MPI_SUCCESS != + (mpi_code = MPI_Send(msg, 3, MPI_INT64_T, ioc_rank, GET_EOF_OP, sf_context->sf_msg_comm))) + H5_SUBFILING_MPI_GOTO_ERROR(FAIL, "MPI_Send", mpi_code); + } + + /* Wait for EOF communication to complete */ + if (MPI_SUCCESS != (mpi_code = MPI_Waitall(n_io_concentrators, recv_reqs, MPI_STATUSES_IGNORE))) + H5_SUBFILING_MPI_GOTO_ERROR(FAIL, "MPI_Waitall", mpi_code); + + for (int i = 0; i < n_io_concentrators; i++) { + int ioc_rank = (int)recv_msg[3 * i]; + + HDassert(ioc_rank >= 0); + HDassert(ioc_rank < n_io_concentrators); + HDassert(sf_eofs[ioc_rank] == -1); + + sf_eofs[ioc_rank] = recv_msg[(3 * i) + 1]; + } + + /* 4) After all IOCs have replied, compute the offset of + * each subfile in the logical file. Take the maximum + * of these values, and report this value as the overall + * EOF. 
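+     *
+     *    Continuing the illustrative numbers from the truncate example
+     *    above (sf_stripe_size = 1024, 3 IOCs): if sf_eofs[1] == 2148,
+     *    subfile 1 holds 2 full stripes plus a 100 byte partial stripe,
+     *    so the loop below computes (2148 / 1024) * 1024 * 3 = 6144,
+     *    then adds 1 * 1024 + (2148 % 1024) = 1124, for a logical EOF
+     *    of 7268.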
+ */ + + for (int i = 0; i < n_io_concentrators; i++) { + + /* compute number of complete stripes */ + sf_logical_eof = sf_eofs[i] / sf_context->sf_stripe_size; + + /* multiply by stripe size */ + sf_logical_eof *= sf_context->sf_stripe_size * n_io_concentrators; + + /* if the sub-file doesn't end on a stripe size boundary, must add in a partial stripe */ + if (sf_eofs[i] % sf_context->sf_stripe_size > 0) { + + /* add in the size of the partial stripe up to but not including this subfile */ + sf_logical_eof += i * sf_context->sf_stripe_size; + + /* finally, add in the number of bytes in the last partial stripe depth in the sub-file */ + sf_logical_eof += sf_eofs[i] % sf_context->sf_stripe_size; + } + + if (sf_logical_eof > logical_eof) { + + logical_eof = sf_logical_eof; + } + } + +#ifdef H5_SUBFILING_DEBUG + H5_subfiling_log(context_id, "%s: calculated logical EOF = %" PRId64 ".", __func__, logical_eof); +#endif + + *logical_eof_ptr = logical_eof; + +done: + if (ret_value < 0) { + for (int i = 0; i < n_io_concentrators; i++) { + if (recv_reqs && (recv_reqs[i] != MPI_REQUEST_NULL)) { + if (MPI_SUCCESS != (mpi_code = MPI_Cancel(&recv_reqs[i]))) + H5_SUBFILING_MPI_DONE_ERROR(FAIL, "MPI_Cancel", mpi_code); + } + } + } + + HDfree(recv_msg); + HDfree(recv_reqs); + HDfree(sf_eofs); + + H5_SUBFILING_FUNC_LEAVE; +} /* H5FD__subfiling__get_real_eof() */ diff --git a/src/H5FDsubfiling/H5FDsubfiling.c b/src/H5FDsubfiling/H5FDsubfiling.c new file mode 100644 index 0000000..32ac6a8 --- /dev/null +++ b/src/H5FDsubfiling/H5FDsubfiling.c @@ -0,0 +1,3386 @@ +/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * + * Copyright by The HDF Group. * + * All rights reserved. * + * * + * This file is part of HDF5. The full HDF5 copyright notice, including * + * terms governing use, modification, and redistribution, is contained in * + * the COPYING file, which can be found at the root of the source code * + * distribution tree, or in https://support.hdfgroup.org/ftp/HDF5/releases. * + * If you do not have access to either file, you may request a copy from * + * help@hdfgroup.org. * + * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */ + +/* + * Programmer: Richard Warren + * + * + * Purpose: An initial implementation of a subfiling VFD which is + * derived from other "stacked" VFDs such as the splitter, + * mirror, and family VFDs. + */ + +#include "H5FDdrvr_module.h" /* This source code file is part of the H5FD driver module */ + +#include "H5private.h" /* Generic Functions */ +#include "H5CXprivate.h" /* API contexts, etc. */ +#include "H5Dprivate.h" /* Dataset stuff */ +#include "H5Eprivate.h" /* Error handling */ +#include "H5FDprivate.h" /* File drivers */ +#include "H5FDsubfiling.h" /* Subfiling file driver */ +#include "H5FDsubfiling_priv.h" /* Subfiling file driver */ +#include "H5FDsec2.h" /* Sec2 VFD */ +#include "H5FLprivate.h" /* Free Lists */ +#include "H5Fprivate.h" /* File access */ +#include "H5Iprivate.h" /* IDs */ +#include "H5MMprivate.h" /* Memory management */ +#include "H5Pprivate.h" /* Property lists */ + +/* The driver identification number, initialized at runtime */ +static hid_t H5FD_SUBFILING_g = H5I_INVALID_HID; + +/* Whether the driver initialized MPI on its own */ +static hbool_t H5FD_mpi_self_initialized = FALSE; + +/* The description of a file belonging to this driver. 
+ * determine the amount of HDF5 address space in use and the high-water mark
+ * of the file (the current size of the underlying filesystem file).  The
+ * 'pos' value is used to eliminate file position updates when they would be a
+ * no-op.  Unfortunately, we've found systems that use separate file position
+ * indicators for reading and writing, so the lseek can only be eliminated if
+ * the current operation is the same as the previous operation.  When opening
+ * a file the 'eof' will be set to the current file size, `eoa' will be set
+ * to zero, 'pos' will be set to H5F_ADDR_UNDEF (as it is when an error
+ * occurs), and 'op' will be set to H5F_OP_UNKNOWN.
+ */
+/***************************************************************************
+ *
+ * Structure: H5FD_subfiling_t
+ *
+ * Purpose:
+ *
+ *     H5FD_subfiling_t is a structure used to store all information needed
+ *     to set up, manage, and take down subfiling for an HDF5 file.
+ *
+ *     This structure is created when such a file is "opened" and
+ *     discarded when it is "closed".
+ *
+ *     Presents a system of subfiles as a single file to the HDF5 library.
+ *
+ *
+ * `pub` (H5FD_t)
+ *
+ *     Instance of H5FD_t which contains all fields common to all VFDs.
+ *     It must be the first item in this structure, since at higher levels,
+ *     this structure will be treated as an instance of H5FD_t.
+ *
+ * `fa` (H5FD_subfiling_config_t)
+ *
+ *     Instance of `H5FD_subfiling_config_t` containing the subfiling
+ *     configuration data needed to "open" the HDF5 file.
+ *
+ *
+ * Document additional subfiling fields here.
+ *
+ * Recall that the existing fields are inherited from the sec2 driver
+ * and should be kept or not as appropriate for the sub-filing VFD.
+ *
+ *
+ * Programmer: Richard Warren
+ *
+ ***************************************************************************/
+
+typedef struct H5FD_subfiling_t {
+    H5FD_t                  pub; /* public stuff, must be first */
+    int                     fd;  /* the filesystem file descriptor */
+    H5FD_subfiling_config_t fa;  /* driver-specific file access properties */
+
+    /* MPI Info */
+    MPI_Comm comm;
+    MPI_Comm ext_comm;
+    MPI_Info info;
+    int      mpi_rank;
+    int      mpi_size;
+
+    H5FD_t *sf_file;
+
+    int64_t context_id; /* The value used to look up a subfiling context for the file */
+
+    char *file_dir;  /* Directory where we find files */
+    char *file_path; /* The user-defined filename */
+
+#ifndef H5_HAVE_WIN32_API
+    /* On most systems the combination of device and i-node number uniquely
+     * identify a file.  Note that Cygwin, MinGW and other Windows POSIX
+     * environments have the stat function (which fakes inodes)
+     * and will use the 'device + inodes' scheme as opposed to the
+     * Windows code further below.
+     */
+    dev_t device; /* file device number */
+    ino_t inode;  /* file i-node number */
+#else
+    /* Files in Windows are uniquely identified by the volume serial
+     * number and the file index (both low and high parts).
+     *
+     * There are caveats where these numbers can change, especially
+     * on FAT file systems.  On NTFS, however, a file should keep
+     * those numbers the same until renamed or deleted (though you
+     * can use ReplaceFile() on NTFS to keep the numbers the same
+     * while renaming).
+     *
+     * See the MSDN "BY_HANDLE_FILE_INFORMATION Structure" entry for
+     * more information.
+ *
+ * http://msdn.microsoft.com/en-us/library/aa363788(v=VS.85).aspx
+ */
+    DWORD nFileIndexLow;
+    DWORD nFileIndexHigh;
+    DWORD dwVolumeSerialNumber;
+
+    HANDLE hFile; /* Native Windows file handle */
+#endif /* H5_HAVE_WIN32_API */
+
+    /*
+     * The element layouts above this point are identical to those of the
+     * H5FD_ioc_t structure.  Everything which follows is unique to
+     * H5FD_subfiling_t.
+     */
+    haddr_t        eoa;                             /* end of allocated region */
+    haddr_t        eof;                             /* end of file; current file size */
+    haddr_t        last_eoa;                        /* Last known end-of-address marker */
+    haddr_t        local_eof;                       /* Local end-of-file address for each process */
+    haddr_t        pos;                             /* current file I/O position */
+    H5FD_file_op_t op;                              /* last operation */
+    char           filename[H5FD_MAX_FILENAME_LEN]; /* Copy of file name from open operation */
+} H5FD_subfiling_t;
+
+/*
+ * These macros check for overflow of various quantities.  These macros
+ * assume that HDoff_t is signed and haddr_t and size_t are unsigned.
+ *
+ * ADDR_OVERFLOW:   Checks whether a file address of type `haddr_t'
+ *                  is too large to be represented by the second argument
+ *                  of the file seek function.
+ *
+ * SIZE_OVERFLOW:   Checks whether a buffer size of type `hsize_t' is too
+ *                  large to be represented by the `size_t' type.
+ *
+ * REGION_OVERFLOW: Checks whether an address and size pair describe data
+ *                  which can be addressed entirely by the second
+ *                  argument of the file seek function.
+ */
+#define MAXADDR          (((haddr_t)1 << (8 * sizeof(HDoff_t) - 1)) - 1)
+#define ADDR_OVERFLOW(A) (HADDR_UNDEF == (A) || ((A) & ~(haddr_t)MAXADDR))
+#define SIZE_OVERFLOW(Z) ((Z) & ~(hsize_t)MAXADDR)
+#define REGION_OVERFLOW(A, Z)                                                                                \
+    (ADDR_OVERFLOW(A) || SIZE_OVERFLOW(Z) || HADDR_UNDEF == (A) + (Z) || (HDoff_t)((A) + (Z)) < (HDoff_t)(A))
+
+#define H5FD_SUBFILING_DEBUG_OP_CALLS 0 /* debugging print toggle; 0 disables */
+
+#if H5FD_SUBFILING_DEBUG_OP_CALLS
+#define H5FD_SUBFILING_LOG_CALL(name)                                                                        \
+    do {                                                                                                     \
+        HDprintf("called %s()\n", (name));                                                                   \
+        HDfflush(stdout);                                                                                    \
+    } while (0)
+#else
+#define H5FD_SUBFILING_LOG_CALL(name) /* no-op */
+#endif /* H5FD_SUBFILING_DEBUG_OP_CALLS */
+
+/* Prototypes */
+static herr_t  H5FD__subfiling_term(void);
+static void *  H5FD__subfiling_fapl_get(H5FD_t *_file);
+static void *  H5FD__subfiling_fapl_copy(const void *_old_fa);
+static herr_t  H5FD__subfiling_fapl_free(void *_fa);
+static H5FD_t *H5FD__subfiling_open(const char *name, unsigned flags, hid_t fapl_id, haddr_t maxaddr);
+static herr_t  H5FD__subfiling_close(H5FD_t *_file);
+static int     H5FD__subfiling_cmp(const H5FD_t *_f1, const H5FD_t *_f2);
+static herr_t  H5FD__subfiling_query(const H5FD_t *_f1, unsigned long *flags);
+static haddr_t H5FD__subfiling_get_eoa(const H5FD_t *_file, H5FD_mem_t type);
+static herr_t  H5FD__subfiling_set_eoa(H5FD_t *_file, H5FD_mem_t type, haddr_t addr);
+static haddr_t H5FD__subfiling_get_eof(const H5FD_t *_file, H5FD_mem_t type);
+static herr_t  H5FD__subfiling_get_handle(H5FD_t *_file, hid_t fapl, void **file_handle);
+static herr_t  H5FD__subfiling_read(H5FD_t *_file, H5FD_mem_t type, hid_t fapl_id, haddr_t addr, size_t size,
+                                    void *buf);
+static herr_t  H5FD__subfiling_write(H5FD_t *_file, H5FD_mem_t type, hid_t dxpl_id, haddr_t addr, size_t size,
+                                     const void *buf);
+static herr_t  H5FD__subfiling_read_vector(H5FD_t *file, hid_t dxpl_id, uint32_t count, H5FD_mem_t types[],
+                                           haddr_t addrs[], size_t sizes[], void *bufs[] /* out */);
+static herr_t  H5FD__subfiling_write_vector(H5FD_t *file, hid_t dxpl_id, uint32_t count, H5FD_mem_t types[],
+                                            haddr_t
addrs[], size_t sizes[], const void *bufs[] /* in */); +static herr_t H5FD__subfiling_truncate(H5FD_t *_file, hid_t dxpl_id, hbool_t closing); +static herr_t H5FD__subfiling_lock(H5FD_t *_file, hbool_t rw); +static herr_t H5FD__subfiling_unlock(H5FD_t *_file); +static herr_t H5FD__subfiling_del(const char *name, hid_t fapl); +static herr_t H5FD__subfiling_ctl(H5FD_t *_file, uint64_t op_code, uint64_t flags, const void *input, + void **output); + +static herr_t H5FD__subfiling_get_default_config(hid_t fapl_id, H5FD_subfiling_config_t *config_out); +static herr_t H5FD__subfiling_validate_config(const H5FD_subfiling_config_t *fa); +static int H5FD__copy_plist(hid_t fapl_id, hid_t *id_out_ptr); + +static herr_t H5FD__subfiling_close_int(H5FD_subfiling_t *file_ptr); + +static herr_t init_indep_io(subfiling_context_t *sf_context, int64_t file_offset, size_t io_nelemts, + size_t dtype_extent, size_t max_iovec_len, int64_t *mem_buf_offset, + int64_t *target_file_offset, int64_t *io_block_len, int *first_ioc_index, + int *n_iocs_used, int64_t *max_io_req_per_ioc); +static herr_t iovec_fill_first(subfiling_context_t *sf_context, int64_t iovec_depth, int64_t target_datasize, + int64_t start_mem_offset, int64_t start_file_offset, int64_t first_io_len, + int64_t *mem_offset_out, int64_t *target_file_offset_out, + int64_t *io_block_len_out); +static herr_t iovec_fill_last(subfiling_context_t *sf_context, int64_t iovec_depth, int64_t target_datasize, + int64_t start_mem_offset, int64_t start_file_offset, int64_t last_io_len, + int64_t *mem_offset_out, int64_t *target_file_offset_out, + int64_t *io_block_len_out); +static herr_t iovec_fill_first_last(subfiling_context_t *sf_context, int64_t iovec_depth, + int64_t target_datasize, int64_t start_mem_offset, + int64_t start_file_offset, int64_t first_io_len, int64_t last_io_len, + int64_t *mem_offset_out, int64_t *target_file_offset_out, + int64_t *io_block_len_out); +static herr_t iovec_fill_uniform(subfiling_context_t *sf_context, int64_t iovec_depth, + int64_t target_datasize, int64_t start_mem_offset, int64_t start_file_offset, + int64_t *mem_offset_out, int64_t *target_file_offset_out, + int64_t *io_block_len_out); + +void H5FD__subfiling_mpi_finalize(void); + +static const H5FD_class_t H5FD_subfiling_g = { + H5FD_CLASS_VERSION, /* VFD interface version */ + H5_VFD_SUBFILING, /* value */ + H5FD_SUBFILING_NAME, /* name */ + MAXADDR, /* maxaddr */ + H5F_CLOSE_WEAK, /* fc_degree */ + H5FD__subfiling_term, /* terminate */ + NULL, /* sb_size */ + NULL, /* sb_encode */ + NULL, /* sb_decode */ + sizeof(H5FD_subfiling_config_t), /* fapl_size */ + H5FD__subfiling_fapl_get, /* fapl_get */ + H5FD__subfiling_fapl_copy, /* fapl_copy */ + H5FD__subfiling_fapl_free, /* fapl_free */ + 0, /* dxpl_size */ + NULL, /* dxpl_copy */ + NULL, /* dxpl_free */ + H5FD__subfiling_open, /* open */ + H5FD__subfiling_close, /* close */ + H5FD__subfiling_cmp, /* cmp */ + H5FD__subfiling_query, /* query */ + NULL, /* get_type_map */ + NULL, /* alloc */ + NULL, /* free */ + H5FD__subfiling_get_eoa, /* get_eoa */ + H5FD__subfiling_set_eoa, /* set_eoa */ + H5FD__subfiling_get_eof, /* get_eof */ + H5FD__subfiling_get_handle, /* get_handle */ + H5FD__subfiling_read, /* read */ + H5FD__subfiling_write, /* write */ + H5FD__subfiling_read_vector, /* read_vector */ + H5FD__subfiling_write_vector, /* write_vector */ + NULL, /* read_selection */ + NULL, /* write_selection */ + NULL, /* flush */ + H5FD__subfiling_truncate, /* truncate */ + H5FD__subfiling_lock, /* lock */ + H5FD__subfiling_unlock, 
/* unlock */ + H5FD__subfiling_del, /* del */ + H5FD__subfiling_ctl, /* ctl */ + H5FD_FLMAP_DICHOTOMY /* fl_map */ +}; + +/* Declare a free list to manage the H5FD_subfiling_t struct */ +H5FL_DEFINE_STATIC(H5FD_subfiling_t); + +/* + * If this VFD initialized MPI, this routine will be registered + * as an atexit handler in order to finalize MPI before the + * application exits. + */ +void +H5FD__subfiling_mpi_finalize(void) +{ + H5close(); + MPI_Finalize(); +} + +/*------------------------------------------------------------------------- + * Function: H5FD_subfiling_init + * + * Purpose: Initialize this driver by registering the driver with the + * library. + * + * Return: Success: The driver ID for the subfiling driver + * Failure: H5I_INVALID_HID + * + * Programmer: Richard Warren + * + *------------------------------------------------------------------------- + */ +hid_t +H5FD_subfiling_init(void) +{ + hid_t ret_value = H5I_INVALID_HID; /* Return value */ + + /* Register the Subfiling VFD, if it isn't already registered */ + if (H5I_VFL != H5I_get_type(H5FD_SUBFILING_g)) { + int mpi_initialized = 0; + int provided = 0; + int mpi_code; + + if ((H5FD_SUBFILING_g = H5FD_register(&H5FD_subfiling_g, sizeof(H5FD_class_t), FALSE)) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_ID, H5E_CANTREGISTER, H5I_INVALID_HID, + "can't register subfiling VFD"); + + /* Initialize error reporting */ + if ((H5subfiling_err_stack_g = H5Ecreate_stack()) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTINIT, H5I_INVALID_HID, "can't create HDF5 error stack"); + if ((H5subfiling_err_class_g = H5Eregister_class(H5SUBFILING_ERR_CLS_NAME, H5SUBFILING_ERR_LIB_NAME, + H5SUBFILING_ERR_VER)) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTINIT, H5I_INVALID_HID, + "can't register error class with HDF5 error API"); + + /* Initialize MPI if not already initialized */ + if (MPI_SUCCESS != (mpi_code = MPI_Initialized(&mpi_initialized))) + H5_SUBFILING_MPI_GOTO_ERROR(H5I_INVALID_HID, "MPI_Initialized failed", mpi_code); + if (mpi_initialized) { + /* If MPI is initialized, validate that it was initialized with MPI_THREAD_MULTIPLE */ + if (MPI_SUCCESS != (mpi_code = MPI_Query_thread(&provided))) + H5_SUBFILING_MPI_GOTO_ERROR(H5I_INVALID_HID, "MPI_Query_thread failed", mpi_code); + if (provided != MPI_THREAD_MULTIPLE) + H5_SUBFILING_GOTO_ERROR( + H5E_VFL, H5E_CANTINIT, H5I_INVALID_HID, + "Subfiling VFD requires the use of MPI_Init_thread with MPI_THREAD_MULTIPLE"); + } + else { + char *env_var; + int required = MPI_THREAD_MULTIPLE; + + /* Ensure that Subfiling VFD has been loaded dynamically */ + env_var = HDgetenv(HDF5_DRIVER); + if (!env_var || HDstrcmp(env_var, H5FD_SUBFILING_NAME)) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTINIT, H5I_INVALID_HID, "MPI isn't initialized"); + + if (MPI_SUCCESS != (mpi_code = MPI_Init_thread(NULL, NULL, required, &provided))) + H5_SUBFILING_MPI_GOTO_ERROR(H5I_INVALID_HID, "MPI_Init_thread failed", mpi_code); + + H5FD_mpi_self_initialized = TRUE; + + if (provided != required) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTINIT, H5I_INVALID_HID, + "MPI doesn't support MPI_Init_thread with MPI_THREAD_MULTIPLE"); + + if (HDatexit(H5FD__subfiling_mpi_finalize) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTINIT, H5I_INVALID_HID, + "can't register atexit handler for MPI_Finalize"); + } + } + + /* Set return value */ + ret_value = H5FD_SUBFILING_g; + +done: + H5_SUBFILING_FUNC_LEAVE_API; +} /* end H5FD_subfiling_init() */ + +/*--------------------------------------------------------------------------- + * Function: 
H5FD__subfiling_term
+ *
+ * Purpose:     Shut down the VFD
+ *
+ * Returns:     SUCCEED/FAIL
+ *
+ *---------------------------------------------------------------------------
+ */
+static herr_t
+H5FD__subfiling_term(void)
+{
+    herr_t ret_value = SUCCEED;
+
+    if (H5FD_SUBFILING_g >= 0) {
+        /* Free the subfiling application layout information */
+        if (sf_app_layout) {
+            HDfree(sf_app_layout->layout);
+            sf_app_layout->layout = NULL;
+
+            HDfree(sf_app_layout->node_ranks);
+            sf_app_layout->node_ranks = NULL;
+
+            HDfree(sf_app_layout);
+            sf_app_layout = NULL;
+        }
+
+        /* Unregister from HDF5 error API */
+        if (H5subfiling_err_class_g >= 0) {
+            if (H5Eunregister_class(H5subfiling_err_class_g) < 0)
+                H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CLOSEERROR, FAIL,
+                                        "can't unregister error class from HDF5 error API");
+        }
+        if (H5subfiling_err_stack_g >= 0) {
+            /* Print the current error stack before destroying it */
+            PRINT_ERROR_STACK;
+
+            /* Destroy the error stack */
+            if (H5Eclose_stack(H5subfiling_err_stack_g) < 0) {
+                H5_SUBFILING_DONE_ERROR(H5E_VFL, H5E_CLOSEERROR, FAIL, "can't close HDF5 error stack");
+                PRINT_ERROR_STACK;
+            } /* end if */
+
+            H5subfiling_err_stack_g = H5I_INVALID_HID;
+            H5subfiling_err_class_g = H5I_INVALID_HID;
+        }
+    }
+
+done:
+    /* Reset VFL ID */
+    H5FD_SUBFILING_g = H5I_INVALID_HID;
+
+    H5_SUBFILING_FUNC_LEAVE_API;
+} /* end H5FD__subfiling_term() */
+
+/*-------------------------------------------------------------------------
+ * Function:    H5Pset_fapl_subfiling
+ *
+ * Purpose:     Modify the file access property list to use the
+ *              H5FD_SUBFILING driver defined in this source file.  All
+ *              driver-specific properties are passed in as a pointer to
+ *              a suitably initialized instance of H5FD_subfiling_config_t.
+ *              If NULL is passed for the H5FD_subfiling_config_t
+ *              structure, a default structure will be used instead.
+ *
+ * Return:      SUCCEED/FAIL
+ *
+ * Programmer:  John Mainzer
+ *              9/10/17
+ *
+ * Changes:     None.
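+ *
+ *              Usage sketch (hypothetical values, error checking
+ *              omitted): fetch the default configuration, adjust it,
+ *              and set it back on the FAPL:
+ *
+ *                  H5FD_subfiling_config_t cfg;
+ *                  hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS);
+ *
+ *                  H5Pset_mpi_params(fapl_id, MPI_COMM_WORLD, MPI_INFO_NULL);
+ *                  H5Pget_fapl_subfiling(fapl_id, &cfg);  -- fills in defaults
+ *                  cfg.stripe_depth = 1048576;            -- hypothetical 1 MiB stripe
+ *                  H5Pset_fapl_subfiling(fapl_id, &cfg);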
+ *
+ *-------------------------------------------------------------------------
+ */
+herr_t
+H5Pset_fapl_subfiling(hid_t fapl_id, H5FD_subfiling_config_t *vfd_config)
+{
+    H5FD_subfiling_config_t *subfiling_conf = NULL;
+    H5P_genplist_t *         plist          = NULL;
+    herr_t                   ret_value      = SUCCEED;
+
+    /*NO TRACE*/
+
+    if (NULL == (plist = H5P_object_verify(fapl_id, H5P_FILE_ACCESS)))
+        H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADTYPE, FAIL, "not a file access property list");
+
+    if (vfd_config == NULL) {
+        if (NULL == (subfiling_conf = HDcalloc(1, sizeof(*subfiling_conf))))
+            H5_SUBFILING_GOTO_ERROR(H5E_RESOURCE, H5E_CANTALLOC, FAIL,
+                                    "can't allocate subfiling VFD configuration");
+        subfiling_conf->ioc_fapl_id = H5I_INVALID_HID;
+
+        /* Get subfiling VFD defaults */
+        if (H5FD__subfiling_get_default_config(fapl_id, subfiling_conf) < 0)
+            H5_SUBFILING_GOTO_ERROR(H5E_PLIST, H5E_CANTSET, FAIL,
+                                    "can't get default subfiling VFD configuration");
+
+        vfd_config = subfiling_conf;
+    }
+
+    if (H5FD__subfiling_validate_config(vfd_config) < 0)
+        H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, FAIL, "invalid subfiling VFD configuration");
+
+    ret_value = H5P_set_driver(plist, H5FD_SUBFILING, (void *)vfd_config, NULL);
+
+done:
+    if (subfiling_conf) {
+        if (subfiling_conf->ioc_fapl_id >= 0 && H5I_dec_ref(subfiling_conf->ioc_fapl_id) < 0)
+            H5_SUBFILING_DONE_ERROR(H5E_PLIST, H5E_CANTDEC, FAIL, "can't close IOC FAPL");
+        HDfree(subfiling_conf);
+    }
+
+    H5_SUBFILING_FUNC_LEAVE_API;
+} /* end H5Pset_fapl_subfiling() */
+
+/*-------------------------------------------------------------------------
+ * Function:    H5Pget_fapl_subfiling
+ *
+ * Purpose:     Returns information about the subfiling file access
+ *              property list through the function arguments.
+ *
+ * Return:      Non-negative on success/Negative on failure
+ *
+ * Programmer:  John Mainzer
+ *              9/10/17
+ *
+ *-------------------------------------------------------------------------
+ */
+herr_t
+H5Pget_fapl_subfiling(hid_t fapl_id, H5FD_subfiling_config_t *config_out)
+{
+    const H5FD_subfiling_config_t *config_ptr         = NULL;
+    H5P_genplist_t *               plist              = NULL;
+    hbool_t                        use_default_config = FALSE;
+    herr_t                         ret_value          = SUCCEED;
+
+    /*NO TRACE*/
+
+    if (config_out == NULL)
+        H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, FAIL, "config_out is NULL");
+
+    if (NULL == (plist = H5P_object_verify(fapl_id, H5P_FILE_ACCESS)))
+        H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADTYPE, FAIL, "not a file access property list");
+
+    if (H5FD_SUBFILING != H5P_peek_driver(plist))
+        use_default_config = TRUE;
+    else {
+        config_ptr = H5P_peek_driver_info(plist);
+        if (NULL == config_ptr)
+            use_default_config = TRUE;
+    }
+
+    if (use_default_config) {
+        if (H5FD__subfiling_get_default_config(fapl_id, config_out) < 0)
+            H5_SUBFILING_GOTO_ERROR(H5E_PLIST, H5E_CANTGET, FAIL,
+                                    "can't get default Subfiling VFD configuration");
+    }
+    else {
+        /* Copy the subfiling fapl data out */
+        HDmemcpy(config_out, config_ptr, sizeof(H5FD_subfiling_config_t));
+
+        /* Copy the driver info value */
+        if (H5FD__copy_plist(config_ptr->ioc_fapl_id, &(config_out->ioc_fapl_id)) < 0)
+            H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_BADVALUE, FAIL, "can't copy IOC FAPL");
+    }
+
+done:
+    H5_SUBFILING_FUNC_LEAVE_API;
+} /* end H5Pget_fapl_subfiling() */
+
+static herr_t
+H5FD__subfiling_get_default_config(hid_t fapl_id, H5FD_subfiling_config_t *config_out)
+{
+    MPI_Comm comm = MPI_COMM_NULL;
+    MPI_Info info = MPI_INFO_NULL;
+    char *   h5_require_ioc;
+    herr_t   ret_value = SUCCEED;
+
+    HDassert(config_out);
+
+    HDmemset(config_out, 0,
sizeof(*config_out)); + + config_out->magic = H5FD_SUBFILING_FAPL_MAGIC; + config_out->version = H5FD_CURR_SUBFILING_FAPL_VERSION; + config_out->ioc_fapl_id = H5I_INVALID_HID; + config_out->stripe_count = 0; + config_out->stripe_depth = H5FD_DEFAULT_STRIPE_DEPTH; + config_out->ioc_selection = SELECT_IOC_ONE_PER_NODE; + config_out->require_ioc = TRUE; + + if ((h5_require_ioc = HDgetenv("H5_REQUIRE_IOC")) != NULL) { + int value_check = HDatoi(h5_require_ioc); + if (value_check == 0) + config_out->require_ioc = FALSE; + } + + /* Check if any MPI parameters were set on the FAPL */ + if (H5Pget_mpi_params(fapl_id, &comm, &info) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_PLIST, H5E_CANTGET, FAIL, "can't get MPI Comm/Info"); + if (comm == MPI_COMM_NULL) { + comm = MPI_COMM_WORLD; + + /* Set MPI_COMM_WORLD on FAPL if no MPI parameters were set */ + if (H5Pset_mpi_params(fapl_id, comm, info) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_PLIST, H5E_CANTSET, FAIL, "can't set MPI Comm/Info"); + } + + /* Create a default FAPL and choose an appropriate underlying driver */ + if ((config_out->ioc_fapl_id = H5Pcreate(H5P_FILE_ACCESS)) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_PLIST, H5E_CANTCREATE, FAIL, "can't create default FAPL"); + + if (config_out->require_ioc) { + if (H5Pset_mpi_params(config_out->ioc_fapl_id, comm, info) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_PLIST, H5E_CANTSET, FAIL, "can't get MPI Comm/Info on IOC FAPL"); + + if (H5Pset_fapl_ioc(config_out->ioc_fapl_id, NULL) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_PLIST, H5E_CANTSET, FAIL, "can't set IOC VFD on IOC FAPL"); + } + else { + if (H5Pset_fapl_sec2(config_out->ioc_fapl_id) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_PLIST, H5E_CANTSET, FAIL, "can't set Sec2 VFD on IOC FAPL"); + } + +done: + if (H5_mpi_comm_free(&comm) < 0) + H5_SUBFILING_DONE_ERROR(H5E_PLIST, H5E_CANTFREE, FAIL, "can't free MPI Communicator"); + if (H5_mpi_info_free(&info) < 0) + H5_SUBFILING_DONE_ERROR(H5E_PLIST, H5E_CANTFREE, FAIL, "can't free MPI Info object"); + + if (ret_value < 0) { + if (config_out->ioc_fapl_id >= 0 && H5Pclose(config_out->ioc_fapl_id) < 0) + H5_SUBFILING_DONE_ERROR(H5E_PLIST, H5E_CANTCLOSEOBJ, FAIL, "can't close FAPL"); + config_out->ioc_fapl_id = H5I_INVALID_HID; + } + + H5_SUBFILING_FUNC_LEAVE; +} + +/*------------------------------------------------------------------------- + * Function: H5FD__subfiling_validate_config() + * + * Purpose: Test to see if the supplied instance of + * H5FD_subfiling_config_t contains internally consistent data. + * Return SUCCEED if so, and FAIL otherwise. + * + * Note the difference between internally consistent and + * correct. As we will have to try to setup subfiling to + * determine whether the supplied data is correct, + * we will settle for internal consistency at this point + * + * Return: SUCCEED if instance of H5FD_subfiling_config_t contains + * internally consistent data, FAIL otherwise. 
+ * + *------------------------------------------------------------------------- + */ +static herr_t +H5FD__subfiling_validate_config(const H5FD_subfiling_config_t *fa) +{ + herr_t ret_value = SUCCEED; + + HDassert(fa != NULL); + + if (fa->version != H5FD_CURR_SUBFILING_FAPL_VERSION) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, FAIL, "Unknown H5FD_subfiling_config_t version"); + + if (fa->magic != H5FD_SUBFILING_FAPL_MAGIC) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, FAIL, "invalid H5FD_subfiling_config_t magic value"); + + /* TODO: add extra subfiling configuration validation code */ + +done: + H5_SUBFILING_FUNC_LEAVE; +} /* end H5FD__subfiling_validate_config() */ + +/*------------------------------------------------------------------------- + * Function: H5FD__subfiling_fapl_get + * + * Purpose: Gets a file access property list which could be used to + * create an identical file. + * + * Return: Success: Ptr to new file access property list value. + * + * Failure: NULL + * + * Programmer: John Mainzer + * 9/8/17 + * + * Modifications: + * + *------------------------------------------------------------------------- + */ +static void * +H5FD__subfiling_fapl_get(H5FD_t *_file) +{ + H5FD_subfiling_t * file = (H5FD_subfiling_t *)_file; + H5FD_subfiling_config_t *fa = NULL; + void * ret_value = NULL; + + fa = (H5FD_subfiling_config_t *)H5MM_calloc(sizeof(H5FD_subfiling_config_t)); + + if (fa == NULL) { + H5_SUBFILING_GOTO_ERROR(H5E_RESOURCE, H5E_NOSPACE, NULL, "memory allocation failed"); + } + + /* Copy the fields of the structure */ + HDmemcpy(fa, &(file->fa), sizeof(H5FD_subfiling_config_t)); + + /* Copy the driver info value */ + if (H5FD__copy_plist(file->fa.ioc_fapl_id, &(fa->ioc_fapl_id)) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_BADVALUE, NULL, "can't copy IOC FAPL"); + + /* Set return value */ + ret_value = fa; + +done: + if (ret_value == NULL) { + + if (fa != NULL) { + H5MM_xfree(fa); + } + } + + H5_SUBFILING_FUNC_LEAVE_API; +} /* end H5FD__subfiling_fapl_get() */ + +/*------------------------------------------------------------------------- + * Function: H5FD__copy_plist + * + * Purpose: Sanity-wrapped H5P_copy_plist() for each channel. + * Utility function for operation in multiple locations. + * + * Return: 0 on success, -1 on error. + *------------------------------------------------------------------------- + */ +/* TODO: no need for this function */ +static int +H5FD__copy_plist(hid_t fapl_id, hid_t *id_out_ptr) +{ + int ret_value = 0; + H5P_genplist_t *plist_ptr = NULL; + + H5FD_SUBFILING_LOG_CALL(__func__); + + HDassert(id_out_ptr != NULL); + + if (FALSE == H5P_isa_class(fapl_id, H5P_FILE_ACCESS)) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADTYPE, -1, "not a file access property list"); + + plist_ptr = (H5P_genplist_t *)H5I_object(fapl_id); + if (NULL == plist_ptr) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADTYPE, -1, "unable to get property list"); + + *id_out_ptr = H5P_copy_plist(plist_ptr, FALSE); + if (H5I_INVALID_HID == *id_out_ptr) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_BADTYPE, -1, "unable to copy file access property list"); + +done: + H5_SUBFILING_FUNC_LEAVE; +} /* end H5FD__copy_plist() */ + +/*------------------------------------------------------------------------- + * Function: H5FD__subfiling_fapl_copy + * + * Purpose: Copies the subfiling-specific file access properties. 
+ *
+ * Return:      Success:        Ptr to a new property list
+ *
+ *              Failure:        NULL
+ *
+ * Programmer:  John Mainzer
+ *              9/8/17
+ *
+ * Modifications:
+ *
+ *-------------------------------------------------------------------------
+ */
+static void *
+H5FD__subfiling_fapl_copy(const void *_old_fa)
+{
+    const H5FD_subfiling_config_t *old_fa    = (const H5FD_subfiling_config_t *)_old_fa;
+    H5FD_subfiling_config_t *      new_fa    = NULL;
+    void *                         ret_value = NULL;
+
+    new_fa = (H5FD_subfiling_config_t *)H5MM_malloc(sizeof(H5FD_subfiling_config_t));
+    if (new_fa == NULL) {
+        H5_SUBFILING_GOTO_ERROR(H5E_RESOURCE, H5E_NOSPACE, NULL, "memory allocation failed");
+    }
+
+    HDmemcpy(new_fa, old_fa, sizeof(H5FD_subfiling_config_t));
+
+    if (H5FD__copy_plist(old_fa->ioc_fapl_id, &(new_fa->ioc_fapl_id)) < 0)
+        H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_BADVALUE, NULL, "can't copy the IOC FAPL");
+
+    ret_value = new_fa;
+
+done:
+    if (ret_value == NULL) {
+
+        if (new_fa != NULL) {
+            H5MM_xfree(new_fa);
+        }
+    }
+
+    H5_SUBFILING_FUNC_LEAVE_API;
+} /* end H5FD__subfiling_fapl_copy() */
+
+/*-------------------------------------------------------------------------
+ * Function:    H5FD__subfiling_fapl_free
+ *
+ * Purpose:     Frees the subfiling-specific file access properties.
+ *
+ * Return:      SUCCEED/FAIL
+ *
+ * Programmer:  John Mainzer
+ *              9/8/17
+ *
+ * Modifications:
+ *
+ *-------------------------------------------------------------------------
+ */
+static herr_t
+H5FD__subfiling_fapl_free(void *_fa)
+{
+    H5FD_subfiling_config_t *fa        = (H5FD_subfiling_config_t *)_fa;
+    herr_t                   ret_value = SUCCEED;
+
+    HDassert(fa != NULL); /* sanity check */
+
+    if (fa->ioc_fapl_id >= 0 && H5I_dec_ref(fa->ioc_fapl_id) < 0)
+        H5_SUBFILING_DONE_ERROR(H5E_PLIST, H5E_CANTDEC, FAIL, "can't close IOC FAPL");
+    fa->ioc_fapl_id = H5I_INVALID_HID;
+
+    H5MM_xfree(fa);
+
+    H5_SUBFILING_FUNC_LEAVE_API;
+} /* end H5FD__subfiling_fapl_free() */
+
+/*-------------------------------------------------------------------------
+ * Function:    H5FD__subfiling_open
+ *
+ * Purpose:     Creates and/or opens a file as an HDF5 file.
+ *
+ * Return:      Success:        A pointer to a new file data structure.  The
+ *                              public fields will be initialized by the
+ *                              caller, which is always H5FD_open().
+ * Failure: NULL + * + * Programmer: Richard Warren + * + *------------------------------------------------------------------------- + */ +static H5FD_t * +H5FD__subfiling_open(const char *name, unsigned flags, hid_t fapl_id, haddr_t maxaddr) +{ + H5FD_subfiling_t * file_ptr = NULL; /* Subfiling VFD info */ + const H5FD_subfiling_config_t *config_ptr = NULL; /* Driver-specific property list */ + H5FD_subfiling_config_t default_config; + H5FD_class_t * driver = NULL; /* VFD for file */ + H5P_genplist_t * plist_ptr = NULL; + H5FD_driver_prop_t driver_prop; /* Property for driver ID & info */ + hbool_t bcasted_inode = FALSE; + hbool_t bcasted_eof = FALSE; + int64_t sf_eof = -1; + int mpi_code; /* MPI return code */ + H5FD_t * ret_value = NULL; + + /* Check arguments */ + if (!name || !*name) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, NULL, "invalid file name"); + if (0 == maxaddr || HADDR_UNDEF == maxaddr) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADRANGE, NULL, "bogus maxaddr"); + if (ADDR_OVERFLOW(maxaddr)) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_OVERFLOW, NULL, "bogus maxaddr"); + + if (NULL == (file_ptr = (H5FD_subfiling_t *)H5FL_CALLOC(H5FD_subfiling_t))) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTALLOC, NULL, "unable to allocate file struct"); + file_ptr->comm = MPI_COMM_NULL; + file_ptr->info = MPI_INFO_NULL; + file_ptr->context_id = -1; + file_ptr->fa.ioc_fapl_id = H5I_INVALID_HID; + file_ptr->ext_comm = MPI_COMM_NULL; + + /* Get the driver-specific file access properties */ + if (NULL == (plist_ptr = (H5P_genplist_t *)H5I_object(fapl_id))) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADTYPE, NULL, "not a file access property list"); + + if (H5FD_mpi_self_initialized) { + file_ptr->comm = MPI_COMM_WORLD; + file_ptr->info = MPI_INFO_NULL; + } + else { + /* Get the MPI communicator and info object from the property list */ + if (H5P_get(plist_ptr, H5F_ACS_MPI_PARAMS_COMM_NAME, &file_ptr->comm) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTGET, NULL, "can't get MPI communicator"); + if (H5P_get(plist_ptr, H5F_ACS_MPI_PARAMS_INFO_NAME, &file_ptr->info) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTGET, NULL, "can't get MPI info object"); + + if (file_ptr->comm == MPI_COMM_NULL) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_BADVALUE, NULL, "invalid or unset MPI communicator in FAPL"); + } + + /* Get the MPI rank of this process and the total number of processes */ + if (MPI_SUCCESS != (mpi_code = MPI_Comm_rank(file_ptr->comm, &file_ptr->mpi_rank))) + H5_SUBFILING_MPI_GOTO_ERROR(NULL, "MPI_Comm_rank failed", mpi_code); + if (MPI_SUCCESS != (mpi_code = MPI_Comm_size(file_ptr->comm, &file_ptr->mpi_size))) + H5_SUBFILING_MPI_GOTO_ERROR(NULL, "MPI_Comm_size failed", mpi_code); + + /* Work around an HDF5 metadata cache bug with distributed metadata writes when MPI size == 1 */ + if (file_ptr->mpi_size == 1) { + H5AC_cache_config_t mdc_config; + + /* Get the current initial metadata cache resize configuration */ + if (H5P_get(plist_ptr, H5F_ACS_META_CACHE_INIT_CONFIG_NAME, &mdc_config) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_PLIST, H5E_CANTGET, NULL, "can't get metadata cache initial config"); + mdc_config.metadata_write_strategy = H5AC_METADATA_WRITE_STRATEGY__PROCESS_0_ONLY; + if (H5P_set(plist_ptr, H5F_ACS_META_CACHE_INIT_CONFIG_NAME, &mdc_config) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_PLIST, H5E_CANTSET, NULL, "can't set metadata cache initial config"); + } + + config_ptr = H5P_peek_driver_info(plist_ptr); + if (!config_ptr || (H5P_FILE_ACCESS_DEFAULT == fapl_id)) { + if 
(H5FD__subfiling_get_default_config(fapl_id, &default_config) < 0)
+            H5_SUBFILING_GOTO_ERROR(H5E_PLIST, H5E_CANTGET, NULL,
+                                    "can't get default subfiling VFD configuration");
+        config_ptr = &default_config;
+    }
+
+    HDmemcpy(&file_ptr->fa, config_ptr, sizeof(H5FD_subfiling_config_t));
+
+    if (NULL != (file_ptr->file_path = HDrealpath(name, NULL))) {
+        char *path      = NULL;
+        char *directory = NULL;
+
+        if (NULL == (path = HDstrdup(file_ptr->file_path)))
+            H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTCOPY, NULL, "can't copy subfiling subfile path");
+
+        /* Duplicate the path before calling dirname() -- dirname may modify
+         * its argument in place and must not be handed a NULL pointer
+         */
+        directory = dirname(path);
+
+        if (NULL == (file_ptr->file_dir = HDstrdup(directory))) {
+            HDfree(path);
+            H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTCOPY, NULL,
+                                    "can't copy subfiling subfile directory path");
+        }
+
+        HDfree(path);
+    }
+    else {
+        if (ENOENT == errno) {
+            if (NULL == (file_ptr->file_path = HDstrdup(name)))
+                H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTCOPY, NULL, "can't copy file name");
+            if (NULL == (file_ptr->file_dir = HDstrdup(".")))
+                H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTOPENFILE, NULL, "can't set subfile directory path");
+        }
+        else
+            H5_SUBFILING_SYS_GOTO_ERROR(H5E_VFL, H5E_CANTGET, NULL, "can't resolve subfile path");
+    }
+
+    if (H5FD__copy_plist(config_ptr->ioc_fapl_id, &(file_ptr->fa.ioc_fapl_id)) < 0)
+        H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_BADVALUE, NULL, "can't copy FAPL");
+
+    file_ptr->sf_file = H5FD_open(name, flags, file_ptr->fa.ioc_fapl_id, HADDR_UNDEF);
+    if (!file_ptr->sf_file)
+        H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTOPENFILE, NULL, "unable to open IOC file");
+
+    /* Check the "native" driver (IOC/sec2/etc.) */
+    if (NULL == (plist_ptr = H5I_object(file_ptr->fa.ioc_fapl_id)))
+        H5_SUBFILING_GOTO_ERROR(H5E_PLIST, H5E_BADVALUE, NULL, "invalid IOC FAPL");
+
+    if (H5P_peek(plist_ptr, H5F_ACS_FILE_DRV_NAME, &driver_prop) < 0)
+        H5_SUBFILING_GOTO_ERROR(H5E_PLIST, H5E_CANTGET, NULL, "can't get driver ID & info");
+    if (NULL == (driver = (H5FD_class_t *)H5I_object(driver_prop.driver_id)))
+        H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_BADVALUE, NULL,
+                                "invalid driver ID in file access property list");
+
+    if (driver->value != H5_VFD_IOC && driver->value != H5_VFD_SEC2)
+        H5_SUBFILING_GOTO_ERROR(
+            H5E_FILE, H5E_CANTOPENFILE, NULL,
+            "unable to open file '%s' - only IOC and Sec2 VFDs are currently supported for subfiles", name);
+
+    if (driver->value == H5_VFD_IOC) {
+        h5_stat_t sb;
+        uint64_t  fid;
+        void *    file_handle = NULL;
+
+        if (file_ptr->mpi_rank == 0) {
+            if (H5FDget_vfd_handle(file_ptr->sf_file, file_ptr->fa.ioc_fapl_id, &file_handle) < 0)
+                H5_SUBFILING_GOTO_ERROR(H5E_FILE, H5E_CANTGET, NULL, "can't get file handle");
+
+            if (HDfstat(*(int *)file_handle, &sb) < 0)
+                H5_SUBFILING_SYS_GOTO_ERROR(H5E_FILE, H5E_BADFILE, NULL, "unable to fstat file");
+
+            HDcompile_assert(sizeof(uint64_t) >= sizeof(ino_t));
+            fid = (uint64_t)sb.st_ino;
+        }
+
+        if (MPI_SUCCESS != (mpi_code = MPI_Bcast(&fid, 1, MPI_UINT64_T, 0, file_ptr->comm)))
+            H5_SUBFILING_MPI_GOTO_ERROR(NULL, "MPI_Bcast failed", mpi_code);
+
+        bcasted_inode = TRUE;
+
+        /* Get a copy of the context ID for later use */
+        file_ptr->context_id     = H5_subfile_fid_to_context(fid);
+        file_ptr->fa.require_ioc = TRUE;
+    }
+    else if (driver->value == H5_VFD_SEC2) {
+        uint64_t inode_id = (uint64_t)-1;
+        int      ioc_flags;
+
+        /* Translate the HDF5 file open flags into standard POSIX open flags */
+        ioc_flags = (H5F_ACC_RDWR & flags) ? O_RDWR : O_RDONLY;
+        if (H5F_ACC_TRUNC & flags)
+            ioc_flags |= O_TRUNC;
+        if (H5F_ACC_CREAT & flags)
+            ioc_flags |= O_CREAT;
+        if (H5F_ACC_EXCL & flags)
+            ioc_flags |= O_EXCL;
+
+        /* Let MPI rank 0 do the file stat operation and broadcast the result */
+        if (file_ptr->mpi_rank == 0) {
+            if (file_ptr->sf_file) {
+                h5_stat_t sb;
+                void *    file_handle = NULL;
+
+                if (H5FDget_vfd_handle(file_ptr->sf_file, file_ptr->fa.ioc_fapl_id, &file_handle) < 0)
+                    H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTGET, NULL, "can't get file handle");
+
+                /* We create a new file descriptor for our file structure.
+                 * Basically, we want these separate so that sec2 can
+                 * deal with the opened file for additional operations
+                 * (especially close) without interfering with subfiling.
+                 */
+                file_ptr->fd = HDdup(*(int *)file_handle);
+
+                if (HDfstat(*(int *)file_handle, &sb) < 0)
+                    H5_SUBFILING_GOTO_ERROR(H5E_FILE, H5E_BADFILE, NULL, "unable to fstat file");
+                inode_id = sb.st_ino;
+            }
+        }
+
+        if (MPI_SUCCESS != (mpi_code = MPI_Bcast(&inode_id, 1, MPI_UNSIGNED_LONG_LONG, 0, file_ptr->comm)))
+            H5_SUBFILING_MPI_GOTO_ERROR(NULL, "MPI_Bcast failed", mpi_code);
+
+        file_ptr->inode = inode_id;
+
+        bcasted_inode = TRUE;
+
+        /* All ranks can now detect an error and fail. */
+        if (inode_id == (uint64_t)-1)
+            H5_SUBFILING_GOTO_ERROR(H5E_FILE, H5E_CANTOPENFILE, NULL, "unable to open file = %s\n", name);
+
+        /*
+         * Open the subfiles for this HDF5 file.  A subfiling
+         * context ID will be returned, which is used for
+         * further interactions with this file's subfiles.
+         */
+        if (H5_open_subfiles(file_ptr->file_path, inode_id, file_ptr->fa.ioc_selection, ioc_flags,
+                             file_ptr->comm, &file_ptr->context_id) < 0)
+            H5_SUBFILING_GOTO_ERROR(H5E_FILE, H5E_CANTOPENFILE, NULL, "unable to open subfiling files = %s\n",
+                                    name);
+    }
+
+    if (file_ptr->mpi_rank == 0) {
+        if (H5FD__subfiling__get_real_eof(file_ptr->context_id, &sf_eof) < 0)
+            sf_eof = -1;
+    }
+
+    if (MPI_SUCCESS != (mpi_code = MPI_Bcast(&sf_eof, 1, MPI_INT64_T, 0, file_ptr->comm)))
+        H5_SUBFILING_MPI_GOTO_ERROR(NULL, "MPI_Bcast", mpi_code);
+
+    bcasted_eof = TRUE;
+
+    if (sf_eof < 0)
+        H5_SUBFILING_GOTO_ERROR(H5E_FILE, H5E_CANTGET, NULL, "lead MPI process failed to get file EOF");
+
+    file_ptr->eof       = (haddr_t)sf_eof;
+    file_ptr->local_eof = file_ptr->eof;
+
+    ret_value = (H5FD_t *)file_ptr;
+
+done:
+    if (NULL == ret_value) {
+        if (file_ptr) {
+            /* Participate in possible MPI collectives on failure */
+            if (file_ptr->comm != MPI_COMM_NULL) {
+                if (!bcasted_inode) {
+                    uint64_t tmp_inode = UINT64_MAX;
+
+                    if (MPI_SUCCESS !=
+                        (mpi_code = MPI_Bcast(&tmp_inode, 1, MPI_UNSIGNED_LONG_LONG, 0, file_ptr->comm)))
+                        H5_SUBFILING_MPI_DONE_ERROR(NULL, "MPI_Bcast failed", mpi_code);
+                }
+                if (!bcasted_eof) {
+                    sf_eof = -1;
+
+                    if (MPI_SUCCESS != (mpi_code = MPI_Bcast(&sf_eof, 1, MPI_INT64_T, 0, file_ptr->comm)))
+                        H5_SUBFILING_MPI_DONE_ERROR(NULL, "MPI_Bcast failed", mpi_code);
+                }
+            }
+
+            if (H5FD__subfiling_close_int(file_ptr) < 0)
+                H5_SUBFILING_DONE_ERROR(H5E_FILE, H5E_CLOSEERROR, NULL, "couldn't close file");
+        }
+    }
+
+    H5_SUBFILING_FUNC_LEAVE_API;
+} /* end H5FD__subfiling_open() */
+
+static herr_t
+H5FD__subfiling_close_int(H5FD_subfiling_t *file_ptr)
+{
+    herr_t ret_value = SUCCEED;
+
+    HDassert(file_ptr);
+
+#if H5FD_SUBFILING_DEBUG_OP_CALLS
+    {
+        subfiling_context_t *sf_context = H5_get_subfiling_object(file_ptr->context_id);
+
+        HDassert(sf_context);
+        HDassert(sf_context->topology);
+
+        if (sf_context->topology->rank_is_ioc)
+            HDprintf("[%s %d] fd=%d\n", __func__, file_ptr->mpi_rank, sf_context->sf_fid);
+        else
+            HDprintf("[%s %d] fd=*\n", __func__,
file_ptr->mpi_rank); + HDfflush(stdout); + } +#endif + + if (file_ptr->sf_file && H5FD_close(file_ptr->sf_file) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_IO, H5E_CANTCLOSEFILE, FAIL, "unable to close subfile"); + + if (!file_ptr->fa.require_ioc) { + if (file_ptr->context_id >= 0 && H5_free_subfiling_object(file_ptr->context_id) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTFREE, FAIL, "can't free subfiling context object"); + } + + /* if set, close the copy of the plist for the underlying VFD. */ + if ((file_ptr->fa.ioc_fapl_id >= 0) && (H5I_dec_ref(file_ptr->fa.ioc_fapl_id) < 0)) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_ARGS, FAIL, "can't close IOC FAPL"); + file_ptr->fa.ioc_fapl_id = H5I_INVALID_HID; + + if (H5_mpi_comm_free(&file_ptr->comm) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTFREE, FAIL, "unable to free MPI Communicator"); + if (H5_mpi_info_free(&file_ptr->info) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTFREE, FAIL, "unable to free MPI Info object"); + + if (H5_mpi_comm_free(&file_ptr->ext_comm) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTFREE, FAIL, "can't free MPI communicator"); + +done: + HDfree(file_ptr->file_path); + file_ptr->file_path = NULL; + + HDfree(file_ptr->file_dir); + file_ptr->file_dir = NULL; + + /* Release the file info */ + file_ptr = H5FL_FREE(H5FD_subfiling_t, file_ptr); + + H5_SUBFILING_FUNC_LEAVE; +} + +/*------------------------------------------------------------------------- + * Function: H5FD__subfiling_close + * + * Purpose: Closes an HDF5 file. + * + * Return: Success: SUCCEED + * Failure: FAIL, file not closed. + * + * Programmer: Richard Warren + * + *------------------------------------------------------------------------- + */ +static herr_t +H5FD__subfiling_close(H5FD_t *_file) +{ + H5FD_subfiling_t *file_ptr = (H5FD_subfiling_t *)_file; + herr_t ret_value = SUCCEED; + + if (H5FD__subfiling_close_int(file_ptr) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_FILE, H5E_CANTCLOSEFILE, FAIL, "unable to close file"); + +done: + H5_SUBFILING_FUNC_LEAVE_API; +} /* end H5FD__subfiling_close() */ + +/*------------------------------------------------------------------------- + * Function: H5FD__subfiling_cmp + * + * Purpose: Compares two files belonging to this driver using an + * arbitrary (but consistent) ordering. + * + * Return: Success: A value like strcmp() + * Failure: never fails (arguments were checked by the + * caller). + * + * Programmer: Richard Warren + * + *------------------------------------------------------------------------- + */ +static int +H5FD__subfiling_cmp(const H5FD_t *_f1, const H5FD_t *_f2) +{ + const H5FD_subfiling_t *f1 = (const H5FD_subfiling_t *)_f1; + const H5FD_subfiling_t *f2 = (const H5FD_subfiling_t *)_f2; + int ret_value = 0; + + HDassert(f1); + HDassert(f2); + + ret_value = H5FD_cmp(f1->sf_file, f2->sf_file); + + H5_SUBFILING_FUNC_LEAVE_API; +} /* end H5FD__subfiling_cmp() */ + +/*------------------------------------------------------------------------- + * Function: H5FD__subfiling_query + * + * Purpose: Set the flags that this VFL driver is capable of supporting. + * (listed in H5FDpublic.h) + * + * For now, duplicate the flags used for the MPIO VFD. + * Revisit this when we have a version of the subfiling VFD + * that is usable in serial builds. 
+ * + * Return: SUCCEED (Can't fail) + * + * Programmer: John Mainzer + * 11/15/21 + * + *------------------------------------------------------------------------- + */ +static herr_t +H5FD__subfiling_query(const H5FD_t H5_ATTR_UNUSED *_file, unsigned long *flags /* out */) +{ + herr_t ret_value = SUCCEED; + + /* Set the VFL feature flags that this driver supports */ + if (flags) { + *flags = 0; + *flags |= H5FD_FEAT_AGGREGATE_METADATA; /* OK to aggregate metadata allocations */ + *flags |= H5FD_FEAT_AGGREGATE_SMALLDATA; /* OK to aggregate "small" raw data allocations */ + *flags |= H5FD_FEAT_HAS_MPI; /* This driver uses MPI */ + *flags |= H5FD_FEAT_ALLOCATE_EARLY; /* Allocate space early instead of late */ + } + + H5_SUBFILING_FUNC_LEAVE_API; +} /* end H5FD__subfiling_query() */ + +/*------------------------------------------------------------------------- + * Function: H5FD__subfiling_get_eoa + * + * Purpose: Gets the end-of-address marker for the file. The EOA marker + * is the first address past the last byte allocated in the + * format address space. + * + * Return: The end-of-address marker. + * + * Programmer: Richard Warren + * + *------------------------------------------------------------------------- + */ +static haddr_t +H5FD__subfiling_get_eoa(const H5FD_t *_file, H5FD_mem_t H5_ATTR_UNUSED type) +{ + const H5FD_subfiling_t *file = (const H5FD_subfiling_t *)_file; + haddr_t ret_value = HADDR_UNDEF; + + ret_value = file->eoa; + + H5_SUBFILING_FUNC_LEAVE_API; +} /* end H5FD__subfiling_get_eoa() */ + +/*------------------------------------------------------------------------- + * Function: H5FD__subfiling_set_eoa + * + * Purpose: Set the end-of-address marker for the file. This function is + * called shortly after an existing HDF5 file is opened in order + * to tell the driver where the end of the HDF5 data is located. + * + * Return: SUCCEED (Can't fail) + * + * Programmer: Richard Warren + * + *------------------------------------------------------------------------- + */ +static herr_t +H5FD__subfiling_set_eoa(H5FD_t *_file, H5FD_mem_t H5_ATTR_UNUSED type, haddr_t addr) +{ + H5FD_subfiling_t *file_ptr = (H5FD_subfiling_t *)_file; + herr_t ret_value = SUCCEED; + + file_ptr->eoa = addr; + + ret_value = H5FD_set_eoa(file_ptr->sf_file, type, addr); + + H5_SUBFILING_FUNC_LEAVE_API; +} /* end H5FD__subfiling_set_eoa() */ + +/*------------------------------------------------------------------------- + * Function: H5FD__subfiling_get_eof + * + * Purpose: Returns the end-of-file marker from the filesystem + * perspective. + * + * Return: End of file address, the first address past the end of the + * "file", either the filesystem file or the HDF5 file. + * + * SUBFILING NOTE: + * The EOF calculation for subfiling is somewhat different + * than for the more traditional HDF5 file implementations. + * This statement derives from the fact that unlike "normal" + * HDF5 files, subfiling introduces a multi-file representation + * of a single HDF5 file. The plurality of sub-files represents + * a software RAID-0 based HDF5 file. As such, each sub-file + * contains a designated portion of the address space of the + * virtual HDF5 storage. We have no notion of HDF5 datatypes, + * datasets, metadata, or other HDF5 structures; only BYTES. + * + * The organization of the bytes within sub-files is consistent + * with the RAID-0 striping, i.e. there are IO Concentrators + * (IOCs) which correspond to a stripe-count (in Lustre) as + * well as a stripe_size. 
The combination of these two
+ *              variables determines the "address" (a combination of IOC
+ *              and a file offset) of any storage operation.
+ *
+ *              Having a defined storage layout, the virtual file EOF
+ *              calculation should be the MAXIMUM value returned by the
+ *              collection of IOCs.  Every MPI rank which hosts an IOC
+ *              maintains its own EOF by updating that value for each
+ *              WRITE operation that completes, i.e. if a new local EOF
+ *              is greater than the existing local EOF, the new EOF
+ *              will replace the old.  The local EOF calculation is as
+ *              follows.
+ *              1. At file creation, each IOC is assigned a rank value
+ *                 (0 to N-1, where N is the total number of IOCs) and
+ *                 a 'sf_base_addr' (= 'subfile_rank' * 'sf_stripe_size').
+ *                 We also determine the 'sf_blocksize_per_stripe', which
+ *                 is simply 'sf_stripe_size' * 'n_io_concentrators'.
+ *
+ *              2. For every write operation, the IOC receives a message
+ *                 containing a file_offset and the data_size.
+ *
+ *              3. The file_offset and data_size are in turn used to
+ *                 create a stripe_id:
+ *                   IOC-(ioc_rank)       IOC-(ioc_rank+1)
+ *                 |<- sf_base_address  |<- sf_base_address  |
+ *            ID   +--------------------+--------------------+
+ *             0:  |<- sf_stripe_size ->|<- sf_stripe_size ->|
+ *             1:  |<- sf_stripe_size ->|<- sf_stripe_size ->|
+ *             ~   ~                    ~                    ~
+ *             N:  |<- sf_stripe_size ->|<- sf_stripe_size ->|
+ *                 +--------------------+--------------------+
+ *
+ *                 The new 'stripe_id' is then used to calculate a
+ *                 potential new EOF:
+ *                   sf_eof  = (stripe_id * sf_blocksize_per_stripe) + sf_base_addr
+ *                           + ((file_offset + data_size) % sf_stripe_size)
+ *
+ *              4. If (sf_eof > current_sf_eof), then current_sf_eof = sf_eof.
+ *
+ *
+ * Programmer:  Richard Warren
+ *
+ *-------------------------------------------------------------------------
+ */
+static haddr_t
+H5FD__subfiling_get_eof(const H5FD_t *_file, H5FD_mem_t H5_ATTR_UNUSED type)
+{
+    const H5FD_subfiling_t *file = (const H5FD_subfiling_t *)_file;
+#if 0
+    int64_t logical_eof = -1;
+#endif
+    haddr_t ret_value = HADDR_UNDEF;
+
+#if 0
+    /*
+     * TODO: this is a heavyweight implementation.  We need something like this
+     * for file open, and probably for file close.  However, in between, something
+     * similar to the current solution in the MPIIO VFD might be more appropriate.
+     */
+    if (H5FD__subfiling__get_real_eof(file->context_id, &logical_eof) < 0)
+        H5_SUBFILING_GOTO_ERROR(H5E_INTERNAL, H5E_CANTGET, HADDR_UNDEF, "can't get EOF");
+
+    /* Return the global max of all the subfile EOF values */
+    ret_value = (haddr_t)(logical_eof);
+
+done:
+#endif
+
+    ret_value = file->eof;
+
+    H5_SUBFILING_FUNC_LEAVE_API;
+} /* end H5FD__subfiling_get_eof() */
+
+/*-------------------------------------------------------------------------
+ * Function:    H5FD__subfiling_get_handle
+ *
+ * Purpose:     Returns the file handle of the subfiling file driver.
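+ *              (When the sub-files are backed by the sec2 VFD, the handle
+ *              returned is the descriptor dup'ed from the underlying
+ *              sub-file at open time; see H5FD__subfiling_open().)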
+ * + * Returns: SUCCEED/FAIL + * + *------------------------------------------------------------------------- + */ +static herr_t +H5FD__subfiling_get_handle(H5FD_t *_file, hid_t H5_ATTR_UNUSED fapl, void **file_handle) +{ + H5FD_subfiling_t *file = (H5FD_subfiling_t *)_file; + herr_t ret_value = SUCCEED; + + if (!file_handle) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, FAIL, "file handle not valid"); + + *file_handle = &(file->fd); + +done: + H5_SUBFILING_FUNC_LEAVE_API; +} /* end H5FD__subfiling_get_handle() */ + +/*------------------------------------------------------------------------- + * Function: H5FD__subfiling_read + * + * Purpose: Reads SIZE bytes of data from FILE beginning at address ADDR + * into buffer BUF according to data transfer properties in + * DXPL_ID. + * + * Return: Success: SUCCEED. Result is stored in caller-supplied + * buffer BUF. + * Failure: FAIL, Contents of buffer BUF are undefined. + * + * Programmer: Richard Warren + * + *------------------------------------------------------------------------- + */ +static herr_t +H5FD__subfiling_read(H5FD_t *_file, H5FD_mem_t type, hid_t dxpl_id, haddr_t addr, size_t size, + void *buf /*out*/) +{ + subfiling_context_t *sf_context = NULL; + H5FD_subfiling_t * file_ptr = (H5FD_subfiling_t *)_file; + H5FD_mem_t * io_types = NULL; + haddr_t * io_addrs = NULL; + size_t * io_sizes = NULL; + void ** io_bufs = NULL; + int64_t * source_data_offset = NULL; + int64_t * sf_data_size = NULL; + int64_t * sf_offset = NULL; + hbool_t rank0_bcast = FALSE; + int ioc_total; + herr_t ret_value = SUCCEED; + + HDassert(file_ptr && file_ptr->pub.cls); + HDassert(buf); + + /* Check for overflow conditions */ + if (!H5F_addr_defined(addr)) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, FAIL, "addr undefined, addr = %" PRIuHADDR, addr); + if (REGION_OVERFLOW(addr, size)) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_OVERFLOW, FAIL, + "addr overflow, addr = %" PRIuHADDR ", size = %" PRIuHADDR, addr, size); + + /* TODO: Temporarily reject collective I/O until support is implemented (unless types are simple MPI_BYTE) + */ + { + H5FD_mpio_xfer_t xfer_mode; + + if (H5CX_get_io_xfer_mode(&xfer_mode) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_CONTEXT, H5E_CANTGET, FAIL, + "can't determine I/O collectivity setting"); + + if (xfer_mode == H5FD_MPIO_COLLECTIVE) { + MPI_Datatype btype, ftype; + + if (H5CX_get_mpi_coll_datatypes(&btype, &ftype) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTGET, FAIL, "can't get MPI-I/O datatypes"); + if (MPI_BYTE != btype || MPI_BYTE != ftype) + H5_SUBFILING_GOTO_ERROR(H5E_IO, H5E_UNSUPPORTED, FAIL, + "collective I/O is currently unsupported"); + } + + /* Determine whether a rank 0 bcast approach has been requested */ + rank0_bcast = H5CX_get_mpio_rank0_bcast(); + + /* + * If we reached here, we're still doing independent I/O regardless + * of collectivity setting, so set that. + */ + H5CX_set_io_xfer_mode(H5FD_MPIO_INDEPENDENT); + } + +#if H5FD_SUBFILING_DEBUG_OP_CALLS + HDprintf("[%s %d] addr=%ld, size=%ld\n", __func__, file_ptr->mpi_rank, addr, size); + HDfflush(stdout); +#endif + + /* + * Retrieve the subfiling context object and the number + * of I/O concentrators. + * + * Given the current I/O and the I/O concentrator info, + * we can determine some I/O transaction parameters. + * In particular, for large I/O operations, each IOC + * may require multiple I/Os to fulfill the user I/O + * request. 
The block size and number of IOCs are used + * to size the vectors that will be used to invoke the + * underlying I/O operations. + */ + sf_context = (subfiling_context_t *)H5_get_subfiling_object(file_ptr->context_id); + HDassert(sf_context); + HDassert(sf_context->topology); + + ioc_total = sf_context->topology->n_io_concentrators; + +#if H5FD_SUBFILING_DEBUG_OP_CALLS + if (sf_context->topology->rank_is_ioc) + HDprintf("[%s %d] fd=%d\n", __func__, file_ptr->mpi_rank, sf_context->sf_fid); + else + HDprintf("[%s %d] fd=*\n", __func__, file_ptr->mpi_rank); + HDfflush(stdout); +#endif + + if (ioc_total == 0) { + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, FAIL, "invalid number of I/O concentrators (%d)", + ioc_total); + } + else if (ioc_total == 1) { + /*********************************** + * No striping - just a single IOC * + ***********************************/ + + /* Make vector read call to subfile */ + if (H5FDread_vector(file_ptr->sf_file, dxpl_id, 1, &type, &addr, &size, &buf) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_READERROR, FAIL, "read from subfile failed"); + } + else { + int64_t max_io_req_per_ioc; + int64_t file_offset; + int64_t block_size; + size_t max_depth; + herr_t status; + int ioc_count = 0; + int ioc_start = -1; + + /********************************* + * Striping across multiple IOCs * + *********************************/ + + block_size = sf_context->sf_blocksize_per_stripe; + max_depth = (size / (size_t)block_size) + 2; + + /* + * Given the number of I/O concentrators, allocate vectors (one per IOC) + * to contain the translation of the I/O request into a collection of I/O + * requests. + */ + if (NULL == + (source_data_offset = HDcalloc(1, (size_t)ioc_total * max_depth * sizeof(*source_data_offset)))) + H5_SUBFILING_GOTO_ERROR(H5E_RESOURCE, H5E_CANTALLOC, FAIL, + "can't allocate source data offset I/O vector"); + if (NULL == (sf_data_size = HDcalloc(1, (size_t)ioc_total * max_depth * sizeof(*sf_data_size)))) + H5_SUBFILING_GOTO_ERROR(H5E_RESOURCE, H5E_CANTALLOC, FAIL, + "can't allocate subfile data size I/O vector"); + if (NULL == (sf_offset = HDcalloc(1, (size_t)ioc_total * max_depth * sizeof(*sf_offset)))) + H5_SUBFILING_GOTO_ERROR(H5E_RESOURCE, H5E_CANTALLOC, FAIL, + "can't allocate subfile offset I/O vector"); + + H5_CHECKED_ASSIGN(file_offset, int64_t, addr, haddr_t); + + /* + * Get the potential set of IOC transactions; e.g., data sizes, + * offsets and datatypes. These can all be used by either the + * underlying IOC or by the sec2 driver. + * + * For now, assume we're dealing with contiguous datasets. Vector + * I/O will probably handle the non-contiguous case. 
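+         *
+         * As a hypothetical example of the translation (numbers invented
+         * for illustration): with 2 IOCs and a 1 MiB stripe (a 2 MiB
+         * block size), a 3 MiB read at file offset 0 covers stripes 0-2.
+         * IOC 0 receives two requests (memory offsets 0 and 2 MiB at
+         * sub-file offsets 0 and 1 MiB) and IOC 1 receives one (memory
+         * offset 1 MiB at sub-file offset 0), giving ioc_start = 0,
+         * ioc_count = 2 and max_io_req_per_ioc = 2.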
+ */ + status = init_indep_io(sf_context, /* IN: Context used to look up config info */ + file_offset, /* IN: Starting file offset */ + size, /* IN: I/O size */ + 1, /* IN: Data extent of the 'type' assumes byte */ + max_depth, /* IN: Maximum stripe depth */ + source_data_offset, /* OUT: Memory offset */ + sf_offset, /* OUT: File offset */ + sf_data_size, /* OUT: Length of this contiguous block */ + &ioc_start, /* OUT: IOC index corresponding to starting offset */ + &ioc_count, /* OUT: Number of actual IOCs used */ + &max_io_req_per_ioc); /* OUT: Maximum number of requests to any IOC */ + + if (status < 0) + H5_SUBFILING_GOTO_ERROR(H5E_IO, H5E_CANTINIT, FAIL, "can't initialize IOC transactions"); + + if (max_io_req_per_ioc > 0) { + uint32_t vector_len; + + H5_CHECKED_ASSIGN(vector_len, uint32_t, ioc_count, int); + + /* Allocate I/O vectors */ + if (NULL == (io_types = HDmalloc(vector_len * sizeof(*io_types)))) + H5_SUBFILING_GOTO_ERROR(H5E_RESOURCE, H5E_CANTALLOC, FAIL, + "can't allocate subfile I/O types vector"); + if (NULL == (io_addrs = HDmalloc(vector_len * sizeof(*io_addrs)))) + H5_SUBFILING_GOTO_ERROR(H5E_RESOURCE, H5E_CANTALLOC, FAIL, + "can't allocate subfile I/O addresses vector"); + if (NULL == (io_sizes = HDmalloc(vector_len * sizeof(*io_sizes)))) + H5_SUBFILING_GOTO_ERROR(H5E_RESOURCE, H5E_CANTALLOC, FAIL, + "can't allocate subfile I/O sizes vector"); + if (NULL == (io_bufs = HDmalloc(vector_len * sizeof(*io_bufs)))) + H5_SUBFILING_GOTO_ERROR(H5E_RESOURCE, H5E_CANTALLOC, FAIL, + "can't allocate subfile I/O buffers vector"); + + /* TODO: The following is left for future work */ + /* + * Set ASYNC MODE + * H5FD_class_aio_t *async_file_ptr = (H5FD_class_aio_t *)file_ptr->sf_file; + * uint64_t op_code_begin = OPC_BEGIN; + * uint64_t op_code_complete = OPC_COMPLETE; + * const void *input = NULL; + * void *output = NULL; + * H5FDctl(file_ptr->sf_file, op_code_begin, flags, input, &output); + * (*async_file_ptr->h5fdctl)(file_ptr->sf_file, op_code_begin, flags, input, &output); + */ + + for (int64_t i = 0; i < max_io_req_per_ioc; i++) { + uint32_t final_vec_len = vector_len; + int next_ioc = ioc_start; + + /* Fill in I/O types, offsets, sizes and buffers vectors */ + for (uint32_t k = 0, vec_idx = 0; k < vector_len; k++) { + size_t idx = (size_t)next_ioc * max_depth + (size_t)i; + + io_types[vec_idx] = type; + H5_CHECKED_ASSIGN(io_addrs[vec_idx], haddr_t, sf_offset[idx], int64_t); + H5_CHECKED_ASSIGN(io_sizes[vec_idx], size_t, sf_data_size[idx], int64_t); + io_bufs[vec_idx] = ((char *)buf + source_data_offset[idx]); + + next_ioc = (next_ioc + 1) % ioc_total; + + /* Skip 0-sized I/Os */ + if (io_sizes[vec_idx] == 0) { + final_vec_len--; + continue; + } + + vec_idx++; + } + + if (!rank0_bcast || (file_ptr->mpi_rank == 0)) { + /* Make vector read call to subfile */ + if (H5FDread_vector(file_ptr->sf_file, dxpl_id, final_vec_len, io_types, io_addrs, + io_sizes, io_bufs) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_READERROR, FAIL, "read from subfile failed"); + } + } + + if (rank0_bcast) { + H5_CHECK_OVERFLOW(size, size_t, int); + if (MPI_SUCCESS != MPI_Bcast(buf, (int)size, MPI_BYTE, 0, file_ptr->comm)) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_READERROR, FAIL, "can't broadcast data from rank 0"); + } + + /* TODO: The following is left for future work */ + /* H5FDctl(file_ptr->sf_file, op_code_complete, flags, input, &output); */ + } + } + + /* Point to the end of the current I/O */ + addr += (haddr_t)size; + + /* Update current file position and EOF */ + file_ptr->pos = addr; + 
file_ptr->op = OP_READ; + +done: + HDfree(io_bufs); + HDfree(io_sizes); + HDfree(io_addrs); + HDfree(io_types); + HDfree(sf_offset); + HDfree(sf_data_size); + HDfree(source_data_offset); + + if (ret_value < 0) { + /* Reset last file I/O information */ + file_ptr->pos = HADDR_UNDEF; + file_ptr->op = OP_UNKNOWN; + } /* end if */ + + H5_SUBFILING_FUNC_LEAVE_API; +} /* end H5FD__subfiling_read() */ + +/*------------------------------------------------------------------------- + * Function: H5FD__subfiling_write + * + * Purpose: Writes SIZE bytes of data to FILE beginning at address ADDR + * from buffer BUF according to data transfer properties in + * DXPL_ID. + * + * Return: SUCCEED/FAIL + * + * Programmer: Richard Warren + * + *------------------------------------------------------------------------- + */ +static herr_t +H5FD__subfiling_write(H5FD_t *_file, H5FD_mem_t type, hid_t dxpl_id, haddr_t addr, size_t size, + const void *buf /*in*/) +{ + subfiling_context_t *sf_context = NULL; + H5FD_subfiling_t * file_ptr = (H5FD_subfiling_t *)_file; + const void ** io_bufs = NULL; + H5FD_mem_t * io_types = NULL; + haddr_t * io_addrs = NULL; + size_t * io_sizes = NULL; + int64_t * source_data_offset = NULL; + int64_t * sf_data_size = NULL; + int64_t * sf_offset = NULL; + int ioc_total; + herr_t ret_value = SUCCEED; + + HDassert(file_ptr && file_ptr->pub.cls); + HDassert(buf); + + /* Check for overflow conditions */ + if (!H5F_addr_defined(addr)) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, FAIL, "addr undefined, addr = %" PRIuHADDR, addr); + if (REGION_OVERFLOW(addr, size)) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_OVERFLOW, FAIL, + "addr overflow, addr = %" PRIuHADDR ", size = %" PRIuHADDR, addr, size); + + /* TODO: Temporarily reject collective I/O until support is implemented (unless types are simple MPI_BYTE) + */ + { + H5FD_mpio_xfer_t xfer_mode; + + if (H5CX_get_io_xfer_mode(&xfer_mode) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_CONTEXT, H5E_CANTGET, FAIL, + "can't determine I/O collectivity setting"); + + if (xfer_mode == H5FD_MPIO_COLLECTIVE) { + MPI_Datatype btype, ftype; + + if (H5CX_get_mpi_coll_datatypes(&btype, &ftype) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTGET, FAIL, "can't get MPI-I/O datatypes"); + if (MPI_BYTE != btype || MPI_BYTE != ftype) + H5_SUBFILING_GOTO_ERROR(H5E_IO, H5E_UNSUPPORTED, FAIL, + "collective I/O is currently unsupported"); + } + + /* + * If we reached here, we're still doing independent I/O regardless + * of collectivity setting, so set that. + */ + H5CX_set_io_xfer_mode(H5FD_MPIO_INDEPENDENT); + } + +#if H5FD_SUBFILING_DEBUG_OP_CALLS + HDprintf("[%s %d] addr=%ld, size=%ld\n", __func__, file_ptr->mpi_rank, addr, size); + HDfflush(stdout); +#endif + + /* + * Retrieve the subfiling context object and the number + * of I/O concentrators. + * + * Given the current I/O and the I/O concentrator info, + * we can determine some I/O transaction parameters. + * In particular, for large I/O operations, each IOC + * may require multiple I/Os to fulfill the user I/O + * request. The block size and number of IOCs are used + * to size the vectors that will be used to invoke the + * underlying I/O operations. 
+ */ + sf_context = (subfiling_context_t *)H5_get_subfiling_object(file_ptr->context_id); + HDassert(sf_context); + HDassert(sf_context->topology); + + ioc_total = sf_context->topology->n_io_concentrators; + +#if H5FD_SUBFILING_DEBUG_OP_CALLS + if (sf_context->topology->rank_is_ioc) + HDprintf("[%s %d] fd=%d\n", __func__, file_ptr->mpi_rank, sf_context->sf_fid); + else + HDprintf("[%s %d] fd=*\n", __func__, file_ptr->mpi_rank); + HDfflush(stdout); +#endif + + if (ioc_total == 0) { + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, FAIL, "invalid number of I/O concentrators (%d)", + ioc_total); + } + else if (ioc_total == 1) { + /*********************************** + * No striping - just a single IOC * + ***********************************/ + + /* Make vector write call to subfile */ + if (H5FDwrite_vector(file_ptr->sf_file, dxpl_id, 1, &type, &addr, &size, &buf) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_WRITEERROR, FAIL, "write to subfile failed"); + } + else { + int64_t max_io_req_per_ioc; + int64_t file_offset; + int64_t block_size; + size_t max_depth; + herr_t status; + int ioc_count = 0; + int ioc_start = -1; + + /********************************* + * Striping across multiple IOCs * + *********************************/ + + block_size = sf_context->sf_blocksize_per_stripe; + max_depth = (size / (size_t)block_size) + 2; + + /* + * Given the number of I/O concentrators, allocate vectors (one per IOC) + * to contain the translation of the I/O request into a collection of I/O + * requests. + */ + if (NULL == + (source_data_offset = HDcalloc(1, (size_t)ioc_total * max_depth * sizeof(*source_data_offset)))) + H5_SUBFILING_GOTO_ERROR(H5E_RESOURCE, H5E_CANTALLOC, FAIL, + "can't allocate source data offset I/O vector"); + if (NULL == (sf_data_size = HDcalloc(1, (size_t)ioc_total * max_depth * sizeof(*sf_data_size)))) + H5_SUBFILING_GOTO_ERROR(H5E_RESOURCE, H5E_CANTALLOC, FAIL, + "can't allocate subfile data size I/O vector"); + if (NULL == (sf_offset = HDcalloc(1, (size_t)ioc_total * max_depth * sizeof(*sf_offset)))) + H5_SUBFILING_GOTO_ERROR(H5E_RESOURCE, H5E_CANTALLOC, FAIL, + "can't allocate subfile offset I/O vector"); + + H5_CHECKED_ASSIGN(file_offset, int64_t, addr, haddr_t); + + /* + * Get the potential set of IOC transactions; e.g., data sizes, + * offsets and datatypes. These can all be used by either the + * underlying IOC or by the sec2 driver. + * + * For now, assume we're dealing with contiguous datasets. Vector + * I/O will probably handle the non-contiguous case. 
+ */ + status = init_indep_io(sf_context, /* IN: Context used to look up config info */ + file_offset, /* IN: Starting file offset */ + size, /* IN: I/O size */ + 1, /* IN: Data extent of the 'type' assumes byte */ + max_depth, /* IN: Maximum stripe depth */ + source_data_offset, /* OUT: Memory offset */ + sf_offset, /* OUT: File offset */ + sf_data_size, /* OUT: Length of this contiguous block */ + &ioc_start, /* OUT: IOC index corresponding to starting offset */ + &ioc_count, /* OUT: Number of actual IOCs used */ + &max_io_req_per_ioc); /* OUT: Maximum number of requests to any IOC */ + + if (status < 0) + H5_SUBFILING_GOTO_ERROR(H5E_IO, H5E_CANTINIT, FAIL, "can't initialize IOC transactions"); + + if (max_io_req_per_ioc > 0) { + uint32_t vector_len; + + H5_CHECKED_ASSIGN(vector_len, uint32_t, ioc_count, int); + + /* Allocate I/O vectors */ + if (NULL == (io_types = HDmalloc(vector_len * sizeof(*io_types)))) + H5_SUBFILING_GOTO_ERROR(H5E_RESOURCE, H5E_CANTALLOC, FAIL, + "can't allocate subfile I/O types vector"); + if (NULL == (io_addrs = HDmalloc(vector_len * sizeof(*io_addrs)))) + H5_SUBFILING_GOTO_ERROR(H5E_RESOURCE, H5E_CANTALLOC, FAIL, + "can't allocate subfile I/O addresses vector"); + if (NULL == (io_sizes = HDmalloc(vector_len * sizeof(*io_sizes)))) + H5_SUBFILING_GOTO_ERROR(H5E_RESOURCE, H5E_CANTALLOC, FAIL, + "can't allocate subfile I/O sizes vector"); + if (NULL == (io_bufs = HDmalloc(vector_len * sizeof(*io_bufs)))) + H5_SUBFILING_GOTO_ERROR(H5E_RESOURCE, H5E_CANTALLOC, FAIL, + "can't allocate subfile I/O buffers vector"); + + /* TODO: The following is left for future work */ + /* + * Set ASYNC MODE + * H5FD_class_aio_t *async_file_ptr = (H5FD_class_aio_t *)file_ptr->sf_file; + * uint64_t op_code_begin = OPC_BEGIN; + * uint64_t op_code_complete = OPC_COMPLETE; + * const void *input = NULL; + * void *output = NULL; + * H5FDctl(file_ptr->sf_file, op_code_begin, flags, input, &output); + * (*async_file_ptr->h5fdctl)(file_ptr->sf_file, op_code_begin, flags, input, &output); + */ + + for (int64_t i = 0; i < max_io_req_per_ioc; i++) { + uint32_t final_vec_len = vector_len; + int next_ioc = ioc_start; + + /* Fill in I/O types, offsets, sizes and buffers vectors */ + for (uint32_t k = 0, vec_idx = 0; k < vector_len; k++) { + size_t idx = (size_t)next_ioc * max_depth + (size_t)i; + + io_types[vec_idx] = type; + H5_CHECKED_ASSIGN(io_addrs[vec_idx], haddr_t, sf_offset[idx], int64_t); + H5_CHECKED_ASSIGN(io_sizes[vec_idx], size_t, sf_data_size[idx], int64_t); + io_bufs[vec_idx] = ((const char *)buf + source_data_offset[idx]); + + next_ioc = (next_ioc + 1) % ioc_total; + + /* Skip 0-sized I/Os */ + if (io_sizes[vec_idx] == 0) { + final_vec_len--; + continue; + } + + vec_idx++; + } + + /* Make vector write call to subfile */ + if (H5FDwrite_vector(file_ptr->sf_file, dxpl_id, final_vec_len, io_types, io_addrs, io_sizes, + io_bufs) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_WRITEERROR, FAIL, "write to subfile failed"); + } + + /* TODO: The following is left for future work */ + /* H5FDctl(file_ptr->sf_file, op_code_complete, flags, input, &output); */ + } + } + + /* Point to the end of the current I/O */ + addr += (haddr_t)size; + + /* Update current file position and EOF */ + file_ptr->pos = addr; + file_ptr->op = OP_WRITE; + +#if 1 /* Mimic the MPI I/O VFD */ + file_ptr->eof = HADDR_UNDEF; + + if (file_ptr->pos > file_ptr->local_eof) + file_ptr->local_eof = file_ptr->pos; +#else + if (file_ptr->pos > file_ptr->eof) + file_ptr->eof = file_ptr->pos; +#endif + +done: + HDfree(io_bufs); + 
HDfree(io_sizes); + HDfree(io_addrs); + HDfree(io_types); + HDfree(sf_offset); + HDfree(sf_data_size); + HDfree(source_data_offset); + + if (ret_value < 0) { + /* Reset last file I/O information */ + file_ptr->pos = HADDR_UNDEF; + file_ptr->op = OP_UNKNOWN; + } /* end if */ + + H5_SUBFILING_FUNC_LEAVE_API; +} /* end H5FD__subfiling_write() */ + +/*------------------------------------------------------------------------- + * Function: H5FD__subfiling_read_vector (internal function) + * + * Purpose: Vector Read function for the sub-filing VFD. + * + * Perform count reads from the specified file at the offsets + * provided in the addrs array, with the lengths and memory + * types provided in the sizes and types arrays. Data read + * is returned in the buffers provided in the bufs array. + * + * All reads are done according to the data transfer property + * list dxpl_id (which may be the constant H5P_DEFAULT). + * + * Return: Success: SUCCEED + * All reads have completed successfully, and + * the results have been placed into the supplied + * buffers. + * + * Failure: FAIL + * The contents of supplied buffers are undefined. + * + * Programmer: RAW -- ??/??/21 + * + * Changes: None. + * + * Notes: This function doesn't actually implement vector read. + * Instead, it converts the vector read call into a series + * of scalar read calls. Fix this when time permits. + * + * Also, it originally didn't support the sizes and types + * optimization. I implemented a version of this which is + * more generous than that currently defined in the RFC. + * This is good enough for now, but the final version + * should follow the RFC. + * JRM -- 10/5/21 + * + *------------------------------------------------------------------------- + */ +static herr_t +H5FD__subfiling_read_vector(H5FD_t *_file, hid_t dxpl_id, uint32_t count, H5FD_mem_t types[], haddr_t addrs[], + size_t sizes[], void *bufs[] /* out */) +{ + H5FD_subfiling_t *file_ptr = (H5FD_subfiling_t *)_file; + H5FD_mpio_xfer_t xfer_mode = H5FD_MPIO_INDEPENDENT; + herr_t ret_value = SUCCEED; /* Return value */ + + /* Check arguments + * RAW - Do we really need to check arguments once again? + * These have already been checked at the H5FD_read_vector() level + * before this callback is invoked!
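+ * + * Note the shortened-vector convention handled below (the concrete + * numbers here are only an illustration): a 0 entry in sizes[] or an + * H5FD_MEM_NOLIST entry in types[] means "repeat the previous entry + * for the remainder of the vector". E.g., with count = 4 and + * sizes = {1024, 0}, entries 1-3 all reuse the 1024-byte size from + * sizes[0].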
+ */ + if (!file_ptr) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, FAIL, "file pointer cannot be NULL"); + + if ((!types) && (count > 0)) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, FAIL, + "types parameter can't be NULL if count is positive"); + + if ((!addrs) && (count > 0)) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, FAIL, + "addrs parameter can't be NULL if count is positive"); + + if ((!sizes) && (count > 0)) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, FAIL, + "sizes parameter can't be NULL if count is positive"); + + if ((!bufs) && (count > 0)) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, FAIL, + "bufs parameter can't be NULL if count is positive"); + + /* Get the default dataset transfer property list if the user didn't provide one */ + if (H5P_DEFAULT == dxpl_id) { + dxpl_id = H5P_DATASET_XFER_DEFAULT; + } + else { + if (TRUE != H5P_isa_class(dxpl_id, H5P_DATASET_XFER)) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADTYPE, FAIL, "not a data transfer property list"); + } + + /* Set DXPL for operation */ + H5CX_set_dxpl(dxpl_id); + + /* TODO: set up real support for vector I/O */ + if (file_ptr->fa.require_ioc) { + + hbool_t extend_sizes = FALSE; + hbool_t extend_types = FALSE; + int k; + size_t size; + H5FD_mem_t type; + haddr_t eoa; + + HDassert((count == 0) || (sizes[0] != 0)); + HDassert((count == 0) || (types[0] != H5FD_MEM_NOLIST)); + + if (H5CX_get_io_xfer_mode(&xfer_mode) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_CONTEXT, H5E_CANTGET, FAIL, + "can't determine I/O collectivity setting"); + + /* Currently, treat collective calls as independent */ + if (xfer_mode != H5FD_MPIO_INDEPENDENT) + if (H5CX_set_io_xfer_mode(H5FD_MPIO_INDEPENDENT) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_CONTEXT, H5E_CANTSET, FAIL, "can't set I/O collectivity setting"); + + /* Note that the following code does not let the sub-filing VFD participate + * in collective calls when there is no data to read. This is not an issue + * now, as we don't do anything special with collective operations. However + * this needs to be fixed. + */ + for (k = 0; k < (int)count; k++) { + + if (!extend_sizes) { + + if (sizes[k] == 0) { + + extend_sizes = TRUE; + size = sizes[k - 1]; + } + else { + + size = sizes[k]; + } + } + + if (!extend_types) { + + if (types[k] == H5FD_MEM_NOLIST) { + + extend_types = TRUE; + type = types[k - 1]; + } + else { + + type = types[k]; + } + } + + if (HADDR_UNDEF == (eoa = H5FD__subfiling_get_eoa(_file, type))) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTINIT, FAIL, "driver get_eoa request failed"); + + if ((addrs[k] + size) > eoa) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_OVERFLOW, FAIL, + "addr overflow, addrs[%d] = %llu, sizes[%d] = %llu, eoa = %llu", + (int)k, (unsigned long long)(addrs[k]), (int)k, + (unsigned long long)size, (unsigned long long)eoa); + + if (H5FD__subfiling_read(_file, type, dxpl_id, addrs[k], size, bufs[k]) != SUCCEED) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_READERROR, FAIL, "file vector read request failed"); + } + } + else { + /* sec2 driver -- + * make the vector read call + * directly.
+ */ + if (H5FD_read_vector(_file, count, types, addrs, sizes, bufs) != SUCCEED) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_READERROR, FAIL, "file vector read request failed"); + } + +done: + if (xfer_mode != H5FD_MPIO_INDEPENDENT) + if (H5CX_set_io_xfer_mode(xfer_mode) < 0) + H5_SUBFILING_DONE_ERROR(H5E_CONTEXT, H5E_CANTSET, FAIL, "can't set I/O collectivity setting"); + + H5_SUBFILING_FUNC_LEAVE_API; +} /* end H5FD__subfiling_read_vector() */ + +/*------------------------------------------------------------------------- + * Function: H5FD__subfiling_write_vector (internal function) + * + * Purpose: Perform count writes to the specified file at the offsets + * provided in the addrs array. The lengths and memory + * types are provided in the sizes and types arrays. Data to + * be written is referenced by the bufs array. + * + * All writes are done according to the data transfer property + * list dxpl_id (which may be the constant H5P_DEFAULT). + * + * Return: Success: SUCCEED + * All writes have completed successfully. + * + * Failure: FAIL + * An internal error was encountered, e.g. the + * input arguments are not valid, or the actual + * subfiling writes have failed for some reason. + * + * Programmer: RAW -- ??/??/21 + * + * Changes: None. + * + * Notes: This function doesn't actually implement vector write. + * Instead, it converts the vector write call into a series + * of scalar write calls. Fix this when time permits. + * + * Also, it originally didn't support the sizes and types + * optimization. I implemented a version of this which is + * more generous than that currently defined in the RFC. + * This is good enough for now, but the final version + * should follow the RFC. + * JRM -- 10/5/21 + * + *------------------------------------------------------------------------- + */ +static herr_t +H5FD__subfiling_write_vector(H5FD_t *_file, hid_t dxpl_id, uint32_t count, H5FD_mem_t types[], + haddr_t addrs[], size_t sizes[], const void *bufs[] /* in */) +{ + H5FD_subfiling_t *file_ptr = (H5FD_subfiling_t *)_file; + H5FD_mpio_xfer_t xfer_mode = H5FD_MPIO_INDEPENDENT; + herr_t ret_value = SUCCEED; /* Return value */ + + HDassert(file_ptr != NULL); /* sanity check */ + + /* Check arguments + * RAW - Do we really need to check arguments once again? + * These have already been checked at the H5FD_write_vector() level + * before this callback is invoked!
+ */ + if (!file_ptr) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, FAIL, "file pointer cannot be NULL"); + + if ((!types) && (count > 0)) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, FAIL, + "types parameter can't be NULL if count is positive"); + + if ((!addrs) && (count > 0)) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, FAIL, + "addrs parameter can't be NULL if count is positive"); + + if ((!sizes) && (count > 0)) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, FAIL, + "sizes parameter can't be NULL if count is positive"); + + if ((!bufs) && (count > 0)) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADVALUE, FAIL, + "bufs parameter can't be NULL if count is positive"); + + /* Get the default dataset transfer property list if the user didn't provide one */ + if (H5P_DEFAULT == dxpl_id) { + dxpl_id = H5P_DATASET_XFER_DEFAULT; + } + else { + if (TRUE != H5P_isa_class(dxpl_id, H5P_DATASET_XFER)) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADTYPE, FAIL, "not a data transfer property list"); + } + /* Call the subfiling IOC write */ + if (file_ptr->fa.require_ioc) { + + hbool_t extend_sizes = FALSE; + hbool_t extend_types = FALSE; + int k; + size_t size; + H5FD_mem_t type; + haddr_t eoa; + + HDassert((count == 0) || (sizes[0] != 0)); + HDassert((count == 0) || (types[0] != H5FD_MEM_NOLIST)); + + if (H5CX_get_io_xfer_mode(&xfer_mode) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_CONTEXT, H5E_CANTGET, FAIL, + "can't determine I/O collectivity setting"); + + /* Currently, treat collective calls as independent */ + if (xfer_mode != H5FD_MPIO_INDEPENDENT) + if (H5CX_set_io_xfer_mode(H5FD_MPIO_INDEPENDENT) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_CONTEXT, H5E_CANTSET, FAIL, "can't set I/O collectivity setting"); + + /* Note that the following code does not let the sub-filing VFD participate + * in collective calls when there is no data to write. This is not an issue + * now, as we don't do anything special with collective operations. However + * this needs to be fixed. + */ + for (k = 0; k < (int)count; k++) { + + if (!extend_sizes) { + + if (sizes[k] == 0) { + + extend_sizes = TRUE; + size = sizes[k - 1]; + } + else { + + size = sizes[k]; + } + } + + if (!extend_types) { + + if (types[k] == H5FD_MEM_NOLIST) { + + extend_types = TRUE; + type = types[k - 1]; + } + else { + + type = types[k]; + } + } + + if (HADDR_UNDEF == (eoa = H5FD__subfiling_get_eoa(_file, type))) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTINIT, FAIL, "driver get_eoa request failed"); + + if ((addrs[k] + size) > eoa) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_OVERFLOW, FAIL, + "addr overflow, addrs[%d] = %llu, sizes[%d] = %llu, eoa = %llu", + (int)k, (unsigned long long)(addrs[k]), (int)k, + (unsigned long long)size, (unsigned long long)eoa); + + if (H5FD__subfiling_write(_file, type, dxpl_id, addrs[k], size, bufs[k]) != SUCCEED) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_WRITEERROR, FAIL, "file vector write request failed"); + } + } + else { + /* sec2 driver -- + * make the vector write call + * directly.
+ */ + if (H5FD_write_vector(_file, count, types, addrs, sizes, bufs) != SUCCEED) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_WRITEERROR, FAIL, "file vector write request failed"); + } + +done: + if (xfer_mode != H5FD_MPIO_INDEPENDENT) + if (H5CX_set_io_xfer_mode(xfer_mode) < 0) + H5_SUBFILING_DONE_ERROR(H5E_CONTEXT, H5E_CANTSET, FAIL, "can't set I/O collectivity setting"); + + H5_SUBFILING_FUNC_LEAVE_API; +} /* end H5FD__subfiling_write_vector() */ + +/*------------------------------------------------------------------------- + * Function: H5FD__subfiling_truncate + * + * Purpose: Makes sure that the true file size is the same as + * the end-of-allocation. + * + * Return: SUCCEED/FAIL + * + * Programmer: Richard Warren + * + *------------------------------------------------------------------------- + */ +static herr_t +H5FD__subfiling_truncate(H5FD_t *_file, hid_t H5_ATTR_UNUSED dxpl_id, hbool_t H5_ATTR_UNUSED closing) +{ + H5FD_subfiling_t *file = (H5FD_subfiling_t *)_file; + herr_t ret_value = SUCCEED; /* Return value */ + + HDassert(file); + + /* Extend the file to make sure it's large enough */ +#if 1 /* Mimic the MPI I/O VFD */ + if (!H5F_addr_eq(file->eoa, file->last_eoa)) { + int64_t sf_eof; + int64_t eoa; + int mpi_code; + + if (!H5CX_get_mpi_file_flushing()) + if (MPI_SUCCESS != (mpi_code = MPI_Barrier(file->comm))) + H5_SUBFILING_MPI_GOTO_ERROR(FAIL, "MPI_Barrier failed", mpi_code); + + if (0 == file->mpi_rank) { + if (H5FD__subfiling__get_real_eof(file->context_id, &sf_eof) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_FILE, H5E_CANTGET, FAIL, "can't get EOF"); + } + + if (MPI_SUCCESS != (mpi_code = MPI_Bcast(&sf_eof, 1, MPI_INT64_T, 0, file->comm))) + H5_SUBFILING_MPI_GOTO_ERROR(FAIL, "MPI_Bcast failed", mpi_code); + + if (sf_eof < 0) + H5_SUBFILING_GOTO_ERROR(H5E_FILE, H5E_BADVALUE, FAIL, "invalid EOF"); + + H5_CHECKED_ASSIGN(eoa, int64_t, file->eoa, haddr_t); + + /* truncate sub-files */ + /* This is a hack. We should be doing the truncate of the sub-files via calls to + * H5FD_truncate() with the IOC. However, that system is messed up at present. + * Thus the following hack. + * JRM -- 12/18/21 + */ + if (H5FD__subfiling__truncate_sub_files(file->context_id, eoa, file->comm) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTUPDATE, FAIL, "sub-file truncate request failed"); + + /* Reset last file I/O information */ + file->pos = HADDR_UNDEF; + file->op = OP_UNKNOWN; + + /* Update the 'last' eoa value */ + file->last_eoa = file->eoa; + } +#else + if (!H5F_addr_eq(file->eoa, file->eof)) { + + /* Update the eof value */ + file->eof = file->eoa; + + /* Reset last file I/O information */ + file->pos = HADDR_UNDEF; + file->op = OP_UNKNOWN; + + /* Update the 'last' eoa value */ + file->last_eoa = file->eoa; + } /* end if */ + + /* truncate sub-files */ + /* This is a hack. We should be doing the truncate of the sub-files via calls to + * H5FD_truncate() with the IOC. However, that system is messed up at present. + * Thus the following hack. + * JRM -- 12/18/21 + */ + if (H5FD__subfiling__truncate_sub_files(file->context_id, file->eof, file->comm) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTUPDATE, FAIL, "sub-file truncate request failed"); +#endif + +done: + H5_SUBFILING_FUNC_LEAVE_API; +} /* end H5FD__subfiling_truncate() */ + +/*------------------------------------------------------------------------- + * Function: H5FD__subfiling_lock + * + * Purpose: To place an advisory lock on a file.
+ * The lock type to apply depends on the parameter "rw": + * TRUE--opens for write: an exclusive lock + * FALSE--opens for read: a shared lock + * + * Return: SUCCEED/FAIL + * + * Programmer: Vailin Choi; May 2013 + * + *------------------------------------------------------------------------- + */ +static herr_t +H5FD__subfiling_lock(H5FD_t *_file, hbool_t rw) +{ + H5FD_subfiling_t *file = (H5FD_subfiling_t *)_file; /* VFD file struct */ + herr_t ret_value = SUCCEED; /* Return value */ + + HDassert(file); + + /* TODO: Consider lock only on IOC ranks for one IOC per subfile case */ + if (file->fa.require_ioc) { +#ifdef VERBOSE + HDputs("Subfiling driver doesn't support file locking"); +#endif + } + else { + if (H5FD_lock(file->sf_file, rw) < 0) + H5_SUBFILING_SYS_GOTO_ERROR(H5E_FILE, H5E_BADFILE, FAIL, "unable to lock file"); + } /* end if */ + +done: + H5_SUBFILING_FUNC_LEAVE_API; +} /* end H5FD__subfiling_lock() */ + +/*------------------------------------------------------------------------- + * Function: H5FD__subfiling_unlock + * + * Purpose: To remove the existing lock on the file. + * + * Return: SUCCEED/FAIL + * + * Programmer: Vailin Choi; May 2013 + * + *------------------------------------------------------------------------- + */ +static herr_t +H5FD__subfiling_unlock(H5FD_t *_file) +{ + H5FD_subfiling_t *file = (H5FD_subfiling_t *)_file; /* VFD file struct */ + herr_t ret_value = SUCCEED; /* Return value */ + + HDassert(file); + + if (H5FD_unlock(file->sf_file) < 0) + H5_SUBFILING_SYS_GOTO_ERROR(H5E_FILE, H5E_BADFILE, FAIL, "unable to unlock file"); + +done: + H5_SUBFILING_FUNC_LEAVE_API; +} /* end H5FD__subfiling_unlock() */ + +static herr_t +H5FD__subfiling_del(const char *name, hid_t fapl) +{ + const H5FD_subfiling_config_t *subfiling_config = NULL; + H5FD_subfiling_config_t default_config; + H5P_genplist_t * plist = NULL; + herr_t ret_value = SUCCEED; + + if (NULL == (plist = H5P_object_verify(fapl, H5P_FILE_ACCESS))) + H5_SUBFILING_GOTO_ERROR(H5E_ARGS, H5E_BADTYPE, FAIL, "not a file access property list"); + + if (H5FD_SUBFILING != H5P_peek_driver(plist)) + H5_SUBFILING_GOTO_ERROR(H5E_PLIST, H5E_BADVALUE, FAIL, "incorrect driver set on FAPL"); + + if (NULL == (subfiling_config = H5P_peek_driver_info(plist))) { + if (H5FD__subfiling_get_default_config(fapl, &default_config) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_PLIST, H5E_CANTGET, FAIL, + "can't get default Subfiling VFD configuration"); + subfiling_config = &default_config; + } + + if (H5FD_delete(name, subfiling_config->ioc_fapl_id) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_FILE, H5E_CANTDELETE, FAIL, "unable to delete file"); + +done: + H5_SUBFILING_FUNC_LEAVE_API; +} + +/*------------------------------------------------------------------------- + * Function: H5FD__subfiling_ctl + * + * Purpose: Subfiling version of the ctl callback. + * + * The desired operation is specified by the op_code + * parameter. + * + * The flags parameter controls management of op_codes that + * are unknown to the callback. + * + * The input and output parameters allow op_code specific + * input and output. + * + * At present, the supported op codes are: + * + * H5FD_CTL_GET_MPI_COMMUNICATOR_OPCODE + * H5FD_CTL_GET_MPI_RANK_OPCODE + * H5FD_CTL_GET_MPI_SIZE_OPCODE + * + * Note that these opcodes must be supported by all VFDs that + * support MPI.
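+ * + * For example, a caller might retrieve the MPI rank through the + * public ctl interface roughly as follows (a sketch only; 'file' + * and 'mpi_rank' are hypothetical caller-side variables): + * + * int mpi_rank = -1; + * int *out = &mpi_rank; + * if (H5FDctl(file, H5FD_CTL_GET_MPI_RANK_OPCODE, + * H5FD_CTL_FAIL_IF_UNKNOWN_FLAG, NULL, (void **)&out) < 0) + * // ... handle the error ...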
+ * + * Return: Non-negative on success/Negative on failure + * + * Programmer: JRM -- 8/3/21 + * + *------------------------------------------------------------------------- + */ +static herr_t +H5FD__subfiling_ctl(H5FD_t *_file, uint64_t op_code, uint64_t flags, const void H5_ATTR_UNUSED *input, + void **output) +{ + H5FD_subfiling_t *file = (H5FD_subfiling_t *)_file; + herr_t ret_value = SUCCEED; /* Return value */ + + /* Sanity checks */ + HDassert(file); + HDassert(H5FD_SUBFILING == file->pub.driver_id); + + switch (op_code) { + + case H5FD_CTL_GET_MPI_COMMUNICATOR_OPCODE: + HDassert(output); + HDassert(*output); + + /* + * Return a separate MPI communicator to the caller so + * that our own MPI calls won't have a chance to conflict + */ + if (file->ext_comm == MPI_COMM_NULL) { + if (H5_mpi_comm_dup(file->comm, &file->ext_comm) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_CANTGET, FAIL, "can't duplicate MPI communicator"); + } + + **((MPI_Comm **)output) = file->ext_comm; + break; + + case H5FD_CTL_GET_MPI_RANK_OPCODE: + HDassert(output); + HDassert(*output); + **((int **)output) = file->mpi_rank; + break; + + case H5FD_CTL_GET_MPI_SIZE_OPCODE: + HDassert(output); + HDassert(*output); + **((int **)output) = file->mpi_size; + break; + + default: /* unknown op code */ + if (flags & H5FD_CTL_FAIL_IF_UNKNOWN_FLAG) { + H5_SUBFILING_GOTO_ERROR(H5E_VFL, H5E_FCNTL, FAIL, "unknown op_code and fail-if-unknown flag is set"); + } + break; + } + +done: + H5_SUBFILING_FUNC_LEAVE_API; +} /* end H5FD__subfiling_ctl() */ + +/*------------------------------------------------------------------------- + * Function: init_indep_io + * + * Purpose: Utility function to initialize the set of I/O transactions + * used to communicate with I/O concentrators for read and + * write I/O operations. + * + * Fills the I/O vectors contained in the output arrays + * `mem_buf_offset`, `target_file_offset` and `io_block_len`. + * As a consequence of not allowing use of MPI derived + * datatypes in the VFD layer, we need to accommodate the + * possibility that large I/O transactions will be required to + * use multiple I/Os per IOC. + * + * Example: Using 4 IOCs, each with 1M stripe-depth; when + * presented with an I/O request for 8MB, at a minimum each + * IOC will require 2 I/Os of 1MB each. Depending on the + * starting file offset, the 2 I/Os may instead be 3... + * + * To fully describe the I/O transactions for reads and writes + * the output arrays are therefore arrays of I/O vectors, + * where each vector has a length which corresponds to the + * max number of I/O transactions per IOC. In the example + * above, these vector lengths can be 2 or 3. The actual + * length is determined by the 'container_depth' variable. + * + * For I/O operations which involve a subset of I/O + * concentrators, the vector entries for the unused I/O + * concentrators will have lengths of zero and be empty. + * The 'container_depth' in this case will always be 1. + * + * sf_context (IN) + * - the subfiling context for the file + * + * file_offset (IN) + * - the starting file offset for I/O + * + * io_nelemts (IN) + * - the number of data elements for the I/O operation + * + * dtype_extent (IN) + * - the extent of the datatype of each data element for + * the I/O operation + * + * max_iovec_len (IN) + * - the maximum size for a single I/O vector in each of + * the output arrays `mem_buf_offset`, `target_file_offset` + * and `io_block_len`.
NOTE that this routine expects each + * of these output arrays to have enough space allocated + * for one I/O vector PER I/O concentrator. Therefore, + * the total size of each output array should be at least + * `max_iovec_len * n_io_concentrators`. + * + * mem_buf_offset (OUT) + * - output array of vectors (one vector for each IOC) + * containing the set of offsets into the memory buffer + * for I/O + * + * target_file_offset (OUT) + * - output array of vectors (one vector for each IOC) + * containing the set of offsets into the target file + * + * io_block_len (OUT) + * - output array of vectors (one vector for each IOC) + * containing the set of block lengths for each source + * buffer/target file offset. + * + * first_ioc_index (OUT) + * - the index of the first I/O concentrator that this I/O + * operation begins at + * + * n_iocs_used (OUT) + * - the number of I/O concentrators actually used for this + * I/O operation, which may be different from the total + * number of I/O concentrators for the file + * + * max_io_req_per_ioc (OUT) + * - the maximum number of I/O requests to any particular + * I/O concentrator, or the maximum "depth" of each I/O + * vector in the output arrays. + * + * Return: Non-negative on success/Negative on failure + * + *------------------------------------------------------------------------- + */ +static herr_t +init_indep_io(subfiling_context_t *sf_context, int64_t file_offset, size_t io_nelemts, size_t dtype_extent, + size_t max_iovec_len, int64_t *mem_buf_offset, int64_t *target_file_offset, + int64_t *io_block_len, int *first_ioc_index, int *n_iocs_used, int64_t *max_io_req_per_ioc) +{ + int64_t stripe_size = 0; + int64_t block_size = 0; + int64_t data_size = 0; + int64_t stripe_idx = 0; + int64_t final_stripe_idx = 0; + int64_t curr_stripe_idx = 0; + int64_t offset_in_stripe = 0; + int64_t offset_in_block = 0; + int64_t final_offset = 0; + int64_t start_length = 0; + int64_t final_length = 0; + int64_t ioc_start = 0; + int64_t ioc_final = 0; + int64_t start_row = 0; + int64_t row_offset = 0; + int64_t row_stripe_idx_start = 0; + int64_t row_stripe_idx_final = 0; + int64_t max_iovec_depth = 0; + int64_t curr_max_iovec_depth = 0; + int64_t total_bytes = 0; + int64_t mem_offset = 0; + int ioc_count = 0; + herr_t ret_value = SUCCEED; + + HDassert(sf_context); + HDassert(sf_context->topology); + HDassert(sf_context->topology->n_io_concentrators > 0); + HDassert(sf_context->sf_stripe_size > 0); + HDassert(sf_context->sf_blocksize_per_stripe > 0); + HDassert(mem_buf_offset); + HDassert(target_file_offset); + HDassert(io_block_len); + HDassert(first_ioc_index); + HDassert(n_iocs_used); + HDassert(max_io_req_per_ioc); + + *first_ioc_index = 0; + *n_iocs_used = 0; + *max_io_req_per_ioc = 0; + + /* + * Retrieve the needed fields from the subfiling context. + * + * ioc_count + * - the total number of I/O concentrators in the + * application topology + * stripe_size + * - the size of the data striping across the file's subfiles + * block_size + * - the size of a "block" across the IOCs, as calculated + * by the stripe size multiplied by the number of I/O + * concentrators + */ + ioc_count = sf_context->topology->n_io_concentrators; + stripe_size = sf_context->sf_stripe_size; + block_size = sf_context->sf_blocksize_per_stripe; + + H5_CHECKED_ASSIGN(data_size, int64_t, (io_nelemts * dtype_extent), size_t); + + /* + * Calculate the following from the starting file offset: + * + * stripe_idx + * - a stripe "index" given by the file offset divided by the + * stripe size. 
Note that when the file offset equals or exceeds + * the block size, we simply wrap around. So, for example, if 4 + * I/O concentrators are being used with a stripe size of 1KiB, + * the block size would be 4KiB and file offset 4096 would have + * a stripe index of 4 and reside in the same subfile as stripe + * index 0 (offsets 0-1023) + * offset_in_stripe + * - the relative offset in the stripe that the starting file + * offset resides in + * offset_in_block + * - the relative offset in the "block" of stripes across the I/O + * concentrators + * final_offset + * - the last offset in the virtual file covered by this I/O + * operation. Simply the I/O size added to the starting file + * offset. + */ + stripe_idx = file_offset / stripe_size; + offset_in_stripe = file_offset % stripe_size; + offset_in_block = file_offset % block_size; + final_offset = file_offset + data_size; + + /* Determine the size of data written to the first and last stripes */ + start_length = MIN(data_size, (stripe_size - offset_in_stripe)); + final_length = (start_length == data_size ? 0 : final_offset % stripe_size); + HDassert(start_length <= stripe_size); + HDassert(final_length <= stripe_size); + + /* + * Determine which I/O concentrator the I/O request begins + * in and which "row" the I/O request begins in within the + * "block" of stripes across the I/O concentrators. Note that + * "row" here is just a conceptual way to think of how a block + * of data stripes is laid out across the I/O concentrator + * subfiles. A block's "column" size in bytes is equal to the + * stripe size multiplied by the number of I/O concentrators. + * Therefore, file offsets that are multiples of the block size + * begin a new "row". + */ + start_row = stripe_idx / ioc_count; + ioc_start = stripe_idx % ioc_count; + H5_CHECK_OVERFLOW(ioc_start, int64_t, int); + + /* + * Set initial file offset for starting "row" + * based on the start row index + */ + row_offset = start_row * block_size; + + /* + * Determine the stripe "index" of the last offset in the + * virtual file and, from that, determine the I/O concentrator + * that the I/O request ends in. + */ + final_stripe_idx = final_offset / stripe_size; + ioc_final = final_stripe_idx % ioc_count; + + /* + * Determine how "deep" the resulting I/O vectors are at + * most by calculating the maximum number of "rows" spanned + * for any particular subfile; e.g. the maximum number of + * I/O requests for any particular I/O concentrator + */ + row_stripe_idx_start = stripe_idx - ioc_start; + row_stripe_idx_final = final_stripe_idx - ioc_final; + max_iovec_depth = ((row_stripe_idx_final - row_stripe_idx_start) / ioc_count) + 1; + + if (ioc_final < ioc_start) + max_iovec_depth--; + + /* Set returned parameters early */ + *first_ioc_index = (int)ioc_start; + *n_iocs_used = ioc_count; + *max_io_req_per_ioc = max_iovec_depth; + +#ifdef H5_SUBFILING_DEBUG + H5_subfiling_log(sf_context->sf_context_id, + "%s: FILE OFFSET = %" PRId64 ", DATA SIZE = %zu, STRIPE SIZE = %" PRId64, __func__, + file_offset, io_nelemts, stripe_size); + H5_subfiling_log(sf_context->sf_context_id, + "%s: IOC START = %" PRId64 ", IOC FINAL = %" PRId64 ", " + "MAX IOVEC DEPTH = %" PRId64 ", START LENGTH = %" PRId64 ", FINAL LENGTH = %" PRId64, + __func__, ioc_start, ioc_final, max_iovec_depth, start_length, final_length); +#endif + + /* + * Loop through the set of I/O concentrators to determine + * the various vector components for each. I/O concentrators + * whose data size is zero will not have I/O requests passed + * to them.
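+ * + * A worked example with hypothetical numbers: 4 IOCs, a 1MiB stripe + * and a 3MiB I/O at file offset 5MiB. Then stripe_idx = 5, + * ioc_start = 1, final_stripe_idx = 8 and ioc_final = 0; since + * ioc_final < ioc_start, max_iovec_depth collapses to 1, and stripes + * 5-7 map one full stripe each onto IOCs 1-3 while IOC 0 is unused.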
+ */ + curr_stripe_idx = stripe_idx; + curr_max_iovec_depth = max_iovec_depth; + for (int i = 0, k = (int)ioc_start; i < ioc_count; i++) { + int64_t *_mem_buf_offset; + int64_t *_target_file_offset; + int64_t *_io_block_len; + int64_t ioc_bytes = 0; + int64_t iovec_depth; + hbool_t is_first = FALSE; + hbool_t is_last = FALSE; + size_t output_offset; + + iovec_depth = curr_max_iovec_depth; + + /* + * Setup the pointers to the next set of I/O vectors in + * the output arrays and clear those vectors + */ + output_offset = (size_t)(k)*max_iovec_len; + _mem_buf_offset = mem_buf_offset + output_offset; + _target_file_offset = target_file_offset + output_offset; + _io_block_len = io_block_len + output_offset; + + HDmemset(_mem_buf_offset, 0, (max_iovec_len * sizeof(*_mem_buf_offset))); + HDmemset(_target_file_offset, 0, (max_iovec_len * sizeof(*_target_file_offset))); + HDmemset(_io_block_len, 0, (max_iovec_len * sizeof(*_io_block_len))); + + if (total_bytes == data_size) { + *n_iocs_used = i; + goto done; + } + + if (total_bytes < data_size) { + int64_t num_full_stripes = iovec_depth; + + if (k == ioc_start) { + is_first = TRUE; + + /* + * Add partial segment length if not + * starting on a stripe boundary + */ + if (start_length < stripe_size) { + ioc_bytes += start_length; + num_full_stripes--; + } + } + + if (k == ioc_final) { + is_last = TRUE; + + /* + * Add partial segment length if not + * ending on a stripe boundary + */ + if (final_length < stripe_size) { + ioc_bytes += final_length; + if (num_full_stripes) + num_full_stripes--; + } + } + + /* Account for IOCs with uniform segments */ + if (!is_first && !is_last) { + hbool_t thin_uniform_section = FALSE; + + if (ioc_final >= ioc_start) { + /* + * When an IOC has an index value that is greater + * than both the starting IOC and ending IOC indices, + * it is a "thinner" section with a smaller I/O vector + * depth. + */ + thin_uniform_section = (k > ioc_start) && (k > ioc_final); + } + + if (ioc_final < ioc_start) { + /* + * This can also happen when the IOC with the final + * data segment has a smaller IOC index than the IOC + * with the first data segment and the current IOC + * index falls between the two. 
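+ * + * For instance (illustrative values only): with ioc_start = 2 and + * ioc_final = 0, the IOC with index k = 1 satisfies + * (ioc_final < k) && (k < ioc_start) and is such a thin section.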
+ */ + thin_uniform_section = thin_uniform_section || ((ioc_final < k) && (k < ioc_start)); + } + + if (thin_uniform_section) { + HDassert(iovec_depth > 1); + HDassert(num_full_stripes > 1); + + iovec_depth--; + num_full_stripes--; + } + } + + /* + * After accounting for the length of the initial + * and/or final data segments, add the combined + * size of the fully selected I/O stripes to the + * running bytes total + */ + ioc_bytes += num_full_stripes * stripe_size; + total_bytes += ioc_bytes; + } + + _mem_buf_offset[0] = mem_offset; + _target_file_offset[0] = row_offset + offset_in_block; + _io_block_len[0] = ioc_bytes; + + if (ioc_count > 1) { + int64_t curr_file_offset = row_offset + offset_in_block; + + /* Fill the I/O vectors */ + if (is_first) { + if (is_last) { /* First + Last */ + if (iovec_fill_first_last(sf_context, iovec_depth, ioc_bytes, mem_offset, + curr_file_offset, start_length, final_length, _mem_buf_offset, + _target_file_offset, _io_block_len) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_IO, H5E_CANTINIT, FAIL, "can't fill I/O vectors"); + } + else { /* First ONLY */ + if (iovec_fill_first(sf_context, iovec_depth, ioc_bytes, mem_offset, curr_file_offset, + start_length, _mem_buf_offset, _target_file_offset, + _io_block_len) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_IO, H5E_CANTINIT, FAIL, "can't fill I/O vectors"); + } + /* Move the memory pointer to the starting location + * for next IOC request. + */ + mem_offset += start_length; + } + else if (is_last) { /* Last ONLY */ + if (iovec_fill_last(sf_context, iovec_depth, ioc_bytes, mem_offset, curr_file_offset, + final_length, _mem_buf_offset, _target_file_offset, _io_block_len) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_IO, H5E_CANTINIT, FAIL, "can't fill I/O vectors"); + + mem_offset += stripe_size; + } + else { /* Everything else (uniform) */ + if (iovec_fill_uniform(sf_context, iovec_depth, ioc_bytes, mem_offset, curr_file_offset, + _mem_buf_offset, _target_file_offset, _io_block_len) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_IO, H5E_CANTINIT, FAIL, "can't fill I/O vectors"); + + mem_offset += stripe_size; + } + } + + offset_in_block += _io_block_len[0]; + + k++; + curr_stripe_idx++; + + if (k == ioc_count) { + k = 0; + offset_in_block = 0; + curr_max_iovec_depth = ((final_stripe_idx - curr_stripe_idx) / ioc_count) + 1; + + row_offset += block_size; + } + + HDassert(offset_in_block <= block_size); + } + + if (total_bytes != data_size) + H5_SUBFILING_GOTO_ERROR(H5E_IO, H5E_CANTINIT, FAIL, + "total bytes (%" PRId64 ") didn't match data size (%" PRId64 ")!", + total_bytes, data_size); + +done: + return ret_value; +} + +/*------------------------------------------------------------------------- + * Function: iovec_fill_first + * + * Purpose: Fills I/O vectors for the case where the IOC has the first + * data segment of the I/O operation. + * + * If the 'first_io_len' is sufficient to complete the I/O to + * the IOC, then the first entry in the I/O vectors is simply + * filled out with the given starting memory/file offsets and + * the first I/O size. Otherwise, the remaining entries in the + * I/O vectors are filled out as data segments with size equal + * to the stripe size. Each data segment is separated from a + * previous or following segment by 'sf_blocksize_per_stripe' + * bytes of data. 
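+ * + * A sketch with hypothetical numbers: with a 1MiB stripe, a 4MiB + * block and a memory offset starting at 0, a first segment of + * 0.5MiB at file offset 1.5MiB toward a 1.5MiB total fills entry 0 + * with the 0.5MiB partial stripe, then entry 1 with one full stripe + * at memory offset block_size - offset_in_stripe = 3.5MiB and file + * offset 5MiB.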
+ * + * Return: Non-negative on success/Negative on failure + * + * Programmer: Richard Warren + * 7/17/2020 + * + *------------------------------------------------------------------------- + */ +static herr_t +iovec_fill_first(subfiling_context_t *sf_context, int64_t iovec_depth, int64_t target_datasize, + int64_t start_mem_offset, int64_t start_file_offset, int64_t first_io_len, + int64_t *mem_offset_out, int64_t *target_file_offset_out, int64_t *io_block_len_out) +{ + int64_t stripe_size; + int64_t block_size; + int64_t total_bytes = 0; + herr_t ret_value = SUCCEED; + + HDassert(sf_context); + HDassert(mem_offset_out); + HDassert(target_file_offset_out); + HDassert(io_block_len_out); + HDassert(iovec_depth > 0); + + stripe_size = sf_context->sf_stripe_size; + block_size = sf_context->sf_blocksize_per_stripe; + +#ifdef H5_SUBFILING_DEBUG + H5_subfiling_log(sf_context->sf_context_id, + "%s: start_mem_offset = %" PRId64 ", start_file_offset = %" PRId64 + ", first_io_len = %" PRId64, + __func__, start_mem_offset, start_file_offset, first_io_len); +#endif + + mem_offset_out[0] = start_mem_offset; + target_file_offset_out[0] = start_file_offset; + io_block_len_out[0] = first_io_len; + +#ifdef H5_SUBFILING_DEBUG + H5_subfiling_log(sf_context->sf_context_id, + "%s: mem_offset[0] = %" PRId64 ", file_offset[0] = %" PRId64 + ", io_block_len[0] = %" PRId64, + __func__, mem_offset_out[0], target_file_offset_out[0], io_block_len_out[0]); +#endif + + if (first_io_len == target_datasize) + H5_SUBFILING_GOTO_DONE(SUCCEED); + + if (first_io_len > 0) { + int64_t offset_in_stripe = start_file_offset % stripe_size; + int64_t next_mem_offset = block_size - offset_in_stripe; + int64_t next_file_offset = start_file_offset + (block_size - offset_in_stripe); + + total_bytes = first_io_len; + + for (int64_t i = 1; i < iovec_depth; i++) { + mem_offset_out[i] = next_mem_offset; + target_file_offset_out[i] = next_file_offset; + io_block_len_out[i] = stripe_size; + +#ifdef H5_SUBFILING_DEBUG + H5_subfiling_log(sf_context->sf_context_id, + "%s: mem_offset[%" PRId64 "] = %" PRId64 ", file_offset[%" PRId64 "] = %" PRId64 + ", io_block_len[%" PRId64 "] = %" PRId64, + __func__, i, mem_offset_out[i], i, target_file_offset_out[i], i, + io_block_len_out[i]); +#endif + + next_mem_offset += block_size; + next_file_offset += block_size; + total_bytes += stripe_size; + } + + if (total_bytes != target_datasize) + H5_SUBFILING_GOTO_ERROR(H5E_IO, H5E_CANTINIT, FAIL, + "total bytes (%" PRId64 ") didn't match target data size (%" PRId64 ")!", + total_bytes, target_datasize); + } + +done: + return ret_value; +} + +/*------------------------------------------------------------------------- + * Function: iovec_fill_last + * + * Purpose: Fills I/O vectors for the case where the IOC has the last + * data segment of the I/O operation. + * + * If the 'last_io_len' is sufficient to complete the I/O to + * the IOC, then the first entry in the I/O vectors is simply + * filled out with the given starting memory/file offsets and + * the last I/O size. Otherwise, all entries in the I/O + * vectors except the last entry are filled out as data + * segments with size equal to the stripe size. Each data + * segment is separated from a previous or following segment + * by 'sf_blocksize_per_stripe' bytes of data. Then, the last + * entry in the I/O vectors is filled out with the final + * memory/file offsets and the last I/O size. 
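+ * + * A sketch with hypothetical numbers: with a 1MiB stripe, a 4MiB + * block, a 1.5MiB target and a 0.5MiB final segment, starting on a + * stripe boundary at offsets (m0, f0), entry 0 becomes one full + * 1MiB stripe at (m0, f0) and entry 1 the trailing 0.5MiB at + * (m0 + 4MiB, f0 + 4MiB).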
+ * + * Return: Non-negative on success/Negative on failure + * + * Programmer: Richard Warren + * 7/17/2020 + * + *------------------------------------------------------------------------- + */ +static herr_t +iovec_fill_last(subfiling_context_t *sf_context, int64_t iovec_depth, int64_t target_datasize, + int64_t start_mem_offset, int64_t start_file_offset, int64_t last_io_len, + int64_t *mem_offset_out, int64_t *target_file_offset_out, int64_t *io_block_len_out) +{ + int64_t stripe_size; + int64_t block_size; + int64_t total_bytes = 0; + herr_t ret_value = SUCCEED; + + HDassert(sf_context); + HDassert(mem_offset_out); + HDassert(target_file_offset_out); + HDassert(io_block_len_out); + HDassert(iovec_depth > 0); + + stripe_size = sf_context->sf_stripe_size; + block_size = sf_context->sf_blocksize_per_stripe; + +#ifdef H5_SUBFILING_DEBUG + H5_subfiling_log(sf_context->sf_context_id, + "%s: start_mem_offset = %" PRId64 ", start_file_offset = %" PRId64 + ", last_io_len = %" PRId64, + __func__, start_mem_offset, start_file_offset, last_io_len); +#endif + + mem_offset_out[0] = start_mem_offset; + target_file_offset_out[0] = start_file_offset; + io_block_len_out[0] = last_io_len; + + if (last_io_len == target_datasize) { +#ifdef H5_SUBFILING_DEBUG + H5_subfiling_log(sf_context->sf_context_id, + "%s: mem_offset[0] = %" PRId64 ", file_offset[0] = %" PRId64 + ", io_block_len[0] = %" PRId64, + __func__, mem_offset_out[0], target_file_offset_out[0], io_block_len_out[0]); +#endif + + H5_SUBFILING_GOTO_DONE(SUCCEED); + } + else { + int64_t next_mem_offset = start_mem_offset + block_size; + int64_t next_file_offset = start_file_offset + block_size; + int64_t i; + + /* + * If the last I/O size doesn't cover the target data + * size, there is at least one full stripe preceding + * the last I/O block + */ + io_block_len_out[0] = stripe_size; + +#ifdef H5_SUBFILING_DEBUG + H5_subfiling_log(sf_context->sf_context_id, + "%s: mem_offset[0] = %" PRId64 ", file_offset[0] = %" PRId64 + ", io_block_len[0] = %" PRId64, + __func__, mem_offset_out[0], target_file_offset_out[0], io_block_len_out[0]); +#endif + + total_bytes = stripe_size; + + for (i = 1; i < iovec_depth - 1;) { + mem_offset_out[i] = next_mem_offset; + target_file_offset_out[i] = next_file_offset; + io_block_len_out[i] = stripe_size; + +#ifdef H5_SUBFILING_DEBUG + H5_subfiling_log(sf_context->sf_context_id, + "%s: mem_offset[%" PRId64 "] = %" PRId64 ", file_offset[%" PRId64 "] = %" PRId64 + ", io_block_len[%" PRId64 "] = %" PRId64, + __func__, i, mem_offset_out[i], i, target_file_offset_out[i], i, + io_block_len_out[i]); +#endif + + next_mem_offset += block_size; + next_file_offset += block_size; + total_bytes += stripe_size; + + i++; + } + + mem_offset_out[i] = next_mem_offset; + target_file_offset_out[i] = next_file_offset; + io_block_len_out[i] = last_io_len; + +#ifdef H5_SUBFILING_DEBUG + H5_subfiling_log(sf_context->sf_context_id, + "%s: mem_offset[%" PRId64 "] = %" PRId64 ", file_offset[%" PRId64 "] = %" PRId64 + ", io_block_len[%" PRId64 "] = %" PRId64, + __func__, i, mem_offset_out[i], i, target_file_offset_out[i], i, + io_block_len_out[i]); +#endif + + total_bytes += last_io_len; + + if (total_bytes != target_datasize) + H5_SUBFILING_GOTO_ERROR(H5E_IO, H5E_CANTINIT, FAIL, + "total bytes (%" PRId64 ") didn't match target data size (%" PRId64 ")!", + total_bytes, target_datasize); + } + +done: + return ret_value; +} + +/*------------------------------------------------------------------------- + * Function: iovec_fill_first_last + * + * 
Purpose: Fills I/O vectors for the case where the IOC has the first + * and last data segments of the I/O operation. This function + * is essentially a merge of the iovec_fill_first and + * iovec_fill_last functions. + * + * If the 'first_io_len' is sufficient to complete the I/O to + * the IOC, then the first entry in the I/O vectors is simply + * filled out with the given starting memory/file offsets and + * the first I/O size. Otherwise, the remaining entries in the + * I/O vectors except the last are filled out as data segments + * with size equal to the stripe size. Each data segment is + * separated from a previous or following segment by + * 'sf_blocksize_per_stripe' bytes of data. Then, the last + * entry in the I/O vectors is filled out with the final + * memory/file offsets and the last I/O size. + * + * Return: Non-negative on success/Negative on failure + * + * Programmer: Richard Warren + * 7/17/2020 + * + *------------------------------------------------------------------------- + */ +static herr_t +iovec_fill_first_last(subfiling_context_t *sf_context, int64_t iovec_depth, int64_t target_datasize, + int64_t start_mem_offset, int64_t start_file_offset, int64_t first_io_len, + int64_t last_io_len, int64_t *mem_offset_out, int64_t *target_file_offset_out, + int64_t *io_block_len_out) +{ + int64_t stripe_size; + int64_t block_size; + int64_t total_bytes = 0; + herr_t ret_value = SUCCEED; + + HDassert(sf_context); + HDassert(mem_offset_out); + HDassert(target_file_offset_out); + HDassert(io_block_len_out); + HDassert(iovec_depth > 0); + + stripe_size = sf_context->sf_stripe_size; + block_size = sf_context->sf_blocksize_per_stripe; + +#ifdef H5_SUBFILING_DEBUG + H5_subfiling_log(sf_context->sf_context_id, + "%s: start_mem_offset = %" PRId64 ", start_file_offset = %" PRId64 + ", first_io_len = %" PRId64 ", last_io_len = %" PRId64, + __func__, start_mem_offset, start_file_offset, first_io_len, last_io_len); +#endif + + mem_offset_out[0] = start_mem_offset; + target_file_offset_out[0] = start_file_offset; + io_block_len_out[0] = first_io_len; + +#ifdef H5_SUBFILING_DEBUG + H5_subfiling_log(sf_context->sf_context_id, + "%s: mem_offset[0] = %" PRId64 ", file_offset[0] = %" PRId64 + ", io_block_len[0] = %" PRId64, + __func__, mem_offset_out[0], target_file_offset_out[0], io_block_len_out[0]); +#endif + + if (first_io_len == target_datasize) + H5_SUBFILING_GOTO_DONE(SUCCEED); + + if (first_io_len > 0) { + int64_t offset_in_stripe = start_file_offset % stripe_size; + int64_t next_mem_offset = block_size - offset_in_stripe; + int64_t next_file_offset = start_file_offset + (block_size - offset_in_stripe); + int64_t i; + + total_bytes = first_io_len; + + for (i = 1; i < iovec_depth - 1;) { + mem_offset_out[i] = next_mem_offset; + target_file_offset_out[i] = next_file_offset; + io_block_len_out[i] = stripe_size; + +#ifdef H5_SUBFILING_DEBUG + H5_subfiling_log(sf_context->sf_context_id, + "%s: mem_offset[%" PRId64 "] = %" PRId64 ", file_offset[%" PRId64 "] = %" PRId64 + ", io_block_len[%" PRId64 "] = %" PRId64, + __func__, i, mem_offset_out[i], i, target_file_offset_out[i], i, + io_block_len_out[i]); +#endif + + next_mem_offset += block_size; + next_file_offset += block_size; + total_bytes += stripe_size; + + i++; + } + + mem_offset_out[i] = next_mem_offset; + target_file_offset_out[i] = next_file_offset; + io_block_len_out[i] = last_io_len; + +#ifdef H5_SUBFILING_DEBUG + H5_subfiling_log(sf_context->sf_context_id, + "%s: mem_offset[%" PRId64 "] = %" PRId64 ", file_offset[%" PRId64 "] = %" 
PRId64 + ", io_block_len[%" PRId64 "] = %" PRId64, + __func__, i, mem_offset_out[i], i, target_file_offset_out[i], i, + io_block_len_out[i]); +#endif + + total_bytes += last_io_len; + + if (total_bytes != target_datasize) + H5_SUBFILING_GOTO_ERROR(H5E_IO, H5E_CANTINIT, FAIL, + "total bytes (%" PRId64 ") didn't match target data size (%" PRId64 ")!", + total_bytes, target_datasize); + } + +done: + return ret_value; +} + +/*------------------------------------------------------------------------- + * Function: iovec_fill_uniform + * + * Purpose: Fills I/O vectors for the typical I/O operation when + * reading data from or writing data to an I/O Concentrator + * (IOC). + * + * Each data segment is of 'stripe_size' length and will be + * separated from a previous or following segment by + * 'sf_blocksize_per_stripe' bytes of data. + * + * Return: Non-negative on success/Negative on failure + * + * Programmer: Richard Warren + * 7/17/2020 + * + *------------------------------------------------------------------------- + */ +static herr_t +iovec_fill_uniform(subfiling_context_t *sf_context, int64_t iovec_depth, int64_t target_datasize, + int64_t start_mem_offset, int64_t start_file_offset, int64_t *mem_offset_out, + int64_t *target_file_offset_out, int64_t *io_block_len_out) +{ + int64_t stripe_size; + int64_t block_size; + int64_t total_bytes = 0; + herr_t ret_value = SUCCEED; + + HDassert(sf_context); + HDassert(mem_offset_out); + HDassert(target_file_offset_out); + HDassert(io_block_len_out); + HDassert((iovec_depth > 0) || (target_datasize == 0)); + + stripe_size = sf_context->sf_stripe_size; + block_size = sf_context->sf_blocksize_per_stripe; + +#ifdef H5_SUBFILING_DEBUG + H5_subfiling_log(sf_context->sf_context_id, + "%s: start_mem_offset = %" PRId64 ", start_file_offset = %" PRId64 + ", segment size = %" PRId64, + __func__, start_mem_offset, start_file_offset, stripe_size); +#endif + + mem_offset_out[0] = start_mem_offset; + target_file_offset_out[0] = start_file_offset; + io_block_len_out[0] = stripe_size; + +#ifdef H5_SUBFILING_DEBUG + H5_subfiling_log(sf_context->sf_context_id, + "%s: mem_offset[0] = %" PRId64 ", file_offset[0] = %" PRId64 + ", io_block_len[0] = %" PRId64, + __func__, mem_offset_out[0], target_file_offset_out[0], io_block_len_out[0]); +#endif + + if (target_datasize == 0) { +#ifdef H5_SUBFILING_DEBUG + H5_subfiling_log(sf_context->sf_context_id, "%s: target_datasize = 0", __func__); +#endif + + io_block_len_out[0] = 0; + H5_SUBFILING_GOTO_DONE(SUCCEED); + } + + if (target_datasize > stripe_size) { + int64_t next_mem_offset = start_mem_offset + block_size; + int64_t next_file_offset = start_file_offset + block_size; + + total_bytes = stripe_size; + + for (int64_t i = 1; i < iovec_depth; i++) { + mem_offset_out[i] = next_mem_offset; + target_file_offset_out[i] = next_file_offset; + io_block_len_out[i] = stripe_size; + +#ifdef H5_SUBFILING_DEBUG + H5_subfiling_log(sf_context->sf_context_id, + "%s: mem_offset[%" PRId64 "] = %" PRId64 ", file_offset[%" PRId64 "] = %" PRId64 + ", io_block_len[%" PRId64 "] = %" PRId64, + __func__, i, mem_offset_out[i], i, target_file_offset_out[i], i, + io_block_len_out[i]); +#endif + + next_mem_offset += block_size; + next_file_offset += block_size; + total_bytes += stripe_size; + } + + if (total_bytes != target_datasize) + H5_SUBFILING_GOTO_ERROR(H5E_IO, H5E_CANTINIT, FAIL, + "total bytes (%" PRId64 ") didn't match target data size (%" PRId64 ")!", + total_bytes, target_datasize); + } + +done: + return ret_value; +} diff --git 
a/src/H5FDsubfiling/H5FDsubfiling.h b/src/H5FDsubfiling/H5FDsubfiling.h new file mode 100644 index 0000000..3de5155 --- /dev/null +++ b/src/H5FDsubfiling/H5FDsubfiling.h @@ -0,0 +1,183 @@ +/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * + * Copyright by The HDF Group. * + * All rights reserved. * + * * + * This file is part of HDF5. The full HDF5 copyright notice, including * + * terms governing use, modification, and redistribution, is contained in * + * the COPYING file, which can be found at the root of the source code * + * distribution tree, or in https://support.hdfgroup.org/ftp/HDF5/releases. * + * If you do not have access to either file, you may request a copy from * + * help@hdfgroup.org. * + * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */ + +/* Purpose: The public header file for the subfiling driver. */ +#ifndef H5FDsubfiling_H +#define H5FDsubfiling_H + +#include "H5FDioc.h" + +#ifdef H5_HAVE_SUBFILING_VFD +#define H5FD_SUBFILING (H5FDperform_init(H5FD_subfiling_init)) +#else +#define H5FD_SUBFILING (H5I_INVALID_HID) +#endif + +#define H5FD_SUBFILING_NAME "subfiling" + +#ifdef H5_HAVE_SUBFILING_VFD + +#ifndef H5FD_SUBFILING_FAPL_MAGIC +#define H5FD_CURR_SUBFILING_FAPL_VERSION 1 +#define H5FD_SUBFILING_FAPL_MAGIC 0xFED01331 +#endif + +/**************************************************************************** + * + * Structure: H5FD_subfiling_config_t + * + * Purpose: + * + * H5FD_subfiling_config_t is a public structure that is used to pass + * subfiling configuration data to the appropriate subfiling VFD via + * the FAPL. A pointer to an instance of this structure is a parameter + * to H5Pset_fapl_subfiling() and H5Pget_fapl_subfiling(). + * + * `magic` (uint32_t) + * + * Magic is a somewhat unique number which distinguishes this VFD from + * other VFDs. Used in combination with a version number, it allows us + * to validate a user-generated file access property list (fapl). + * This field should be set to H5FD_SUBFILING_FAPL_MAGIC. + * + * `version` (uint32_t) + * + * Version number of the H5FD_subfiling_config_t structure. Any instance + * passed to the above calls must have a recognized version number, or + * an error will be flagged. + * + * This field should be set to H5FD_CURR_SUBFILING_FAPL_VERSION. + * + *** IO Concentrator Info *** + *** These fields will be replicated in the stacked IOC VFD which + *** provides the extended support for aggregating reads and writes + *** and allows global file access to node-local storage containers. + * + * `stripe_count` (int32_t) + * + * The integer value which identifies the total number of + * subfiles that have been algorithmically selected to + * contain the segments of raw data which make up an HDF5 + * file. This value is used to implement the RAID-0 functionality + * when reading or writing datasets. + * + * `stripe_depth` (int64_t) + * + * The stripe depth defines a limit on the maximum number of contiguous + * bytes that can be read or written in a single operation on any + * selected subfile. Larger IO operations can exceed this limit + * by utilizing MPI derived types to construct an IO request which + * gathers additional data segments from memory for the IO request. + * + * `ioc_selection` (enum io_selection datatype) + * + * The io_selection_t defines a specific algorithm by which IO + * concentrators (IOCs) and sub-files are identified.
The available + * algorithms are: SELECT_IOC_ONE_PER_NODE, SELECT_IOC_EVERY_NTH_RANK, + * SELECT_IOC_WITH_CONFIG, and SELECT_IOC_TOTAL. + * + *** STACKING and other VFD support + *** i.e. FAPL caching + *** + * + * `ioc_fapl_id` (hid_t) + * + * A valid file access property list (fapl) is cached on each + * process and thus enables selection of an alternative provider + * for subsequent file operations. + * By default, Sub-filing employs an additional support VFD that + * provides file IO proxy capabilities to all MPI ranks in a + * distributed parallel application. This IO indirection + * thus allows the application to access all sub-files even while + * these may actually be node-local and thus not directly + * accessible to remote ranks. + * + ****************************************************************************/ + +/* + * In addition to the common configuration fields, we can have + * VFD specific fields. Here's one for the subfiling VFD. + * + * `require_ioc` (hbool_t) + * + * Require_IOC is a boolean flag with a default value of TRUE. + * This flag indicates that the stacked H5FDioc VFD should be + * employed for sub-filing operations. The default flag can be + * overridden with an environment variable: H5_REQUIRE_IOC=0 + * + */ + +//! <!-- [H5FD_subfiling_config_t_snip] --> +/** + * Configuration structure for H5Pset_fapl_subfiling() / H5Pget_fapl_subfiling() + */ +typedef struct H5FD_subfiling_config_t { + uint32_t magic; /* set to H5FD_SUBFILING_FAPL_MAGIC */ + uint32_t version; /* set to H5FD_CURR_SUBFILING_FAPL_VERSION */ + int32_t stripe_count; /* How many io concentrators */ + int64_t stripe_depth; /* Max # of bytes in contiguous IO to an IOC */ + ioc_selection_t ioc_selection; /* Method to select IO Concentrators */ + hid_t ioc_fapl_id; /* The hid_t value of the stacked VFD */ + hbool_t require_ioc; +} H5FD_subfiling_config_t; +//! <!-- [H5FD_subfiling_config_t_snip] --> + +#ifdef __cplusplus +extern "C" { +#endif + +H5_DLL hid_t H5FD_subfiling_init(void); +/** + * \ingroup FAPL + * + * \brief Modifies the file access property list to use the #H5FD_SUBFILING driver + * + * \fapl_id + * \param[in] vfd_config #H5FD_SUBFILING driver specific properties. If NULL, then + * the IO concentrator VFD will be used. + * \returns \herr_t + * + * \details H5Pset_fapl_subfiling() modifies the file access property list to use the + * #H5FD_SUBFILING driver. + * + * \todo Expand details! + * + * \since 1.14.0 + * + */ +H5_DLL herr_t H5Pset_fapl_subfiling(hid_t fapl_id, H5FD_subfiling_config_t *vfd_config); +/** + * \ingroup FAPL + * + * \brief Queries subfiling file driver properties + * + * \fapl_id + * \param[out] config_out The subfiling fapl data. + * + * \returns \herr_t + * + * \details H5Pget_fapl_subfiling() queries the #H5FD_SUBFILING driver properties as set + * by H5Pset_fapl_subfiling(). If the #H5FD_SUBFILING driver has not been set on + * the File Access Property List, a default configuration is returned. + * + * \since 1.14.0 + * + */ +H5_DLL herr_t H5Pget_fapl_subfiling(hid_t fapl_id, H5FD_subfiling_config_t *config_out); + +#ifdef __cplusplus +} +#endif + +#endif /* H5_HAVE_SUBFILING_VFD */ + +#endif /* H5FDsubfiling_H */ diff --git a/src/H5FDsubfiling/H5FDsubfiling_priv.h b/src/H5FDsubfiling/H5FDsubfiling_priv.h new file mode 100644 index 0000000..86507a6 --- /dev/null +++ b/src/H5FDsubfiling/H5FDsubfiling_priv.h @@ -0,0 +1,72 @@ +/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * + * Copyright by The HDF Group. * + * All rights reserved.
* + * * + * This file is part of HDF5. The full HDF5 copyright notice, including * + * terms governing use, modification, and redistribution, is contained in * + * the COPYING file, which can be found at the root of the source code * + * distribution tree, or in https://support.hdfgroup.org/ftp/HDF5/releases. * + * If you do not have access to either file, you may request a copy from * + * help@hdfgroup.org. * + * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */ + +/* + * Private definitions for HDF5 Subfiling VFD + */ + +#ifndef H5FDsubfiling_priv_H +#define H5FDsubfiling_priv_H + +/********************/ +/* Standard Headers */ +/********************/ + +#include <stdatomic.h> +#include <libgen.h> + +/**************/ +/* H5 Headers */ +/**************/ + +#include "H5private.h" /* Generic Functions */ +#include "H5CXprivate.h" /* API Contexts */ +#include "H5Dprivate.h" /* Datasets */ +#include "H5Eprivate.h" /* Error handling */ +#include "H5FDsubfiling.h" /* Subfiling VFD */ +#include "H5FDioc.h" /* IOC VFD */ +#include "H5Iprivate.h" /* IDs */ +#include "H5MMprivate.h" /* Memory management */ +#include "H5Pprivate.h" /* Property lists */ + +#include "H5subfiling_common.h" +#include "H5subfiling_err.h" + +/* + * Some definitions for debugging the Subfiling VFD + */ +/* #define H5FD_SUBFILING_DEBUG */ + +#define DRIVER_INFO_MESSAGE_MAX_INFO 65536 +#define DRIVER_INFO_MESSAGE_MAX_LENGTH 65552 /* MAX_INFO + sizeof(info_header_t) */ + +typedef struct _info_header { /* Header for a driver info message */ + uint8_t version; + uint8_t unused_1; + uint8_t unused_2; + uint8_t unused_3; /* Actual info message length, but */ + int32_t info_length; /* CANNOT exceed 64k (65552) bytes */ + char vfd_key[8]; /* 's' 'u' 'b' 'f' 'i' 'l' 'i' 'n' */ +} info_header_t; + +#ifdef __cplusplus +extern "C" { +#endif + +H5_DLL herr_t H5FD__subfiling__truncate_sub_files(hid_t context_id, int64_t logical_file_eof, MPI_Comm comm); +H5_DLL herr_t H5FD__subfiling__get_real_eof(hid_t context_id, int64_t *logical_eof_ptr); + +#ifdef __cplusplus +} +#endif + +#endif /* H5FDsubfiling_priv_H */ diff --git a/src/H5FDsubfiling/H5subfiling_common.c b/src/H5FDsubfiling/H5subfiling_common.c new file mode 100644 index 0000000..980a1b3 --- /dev/null +++ b/src/H5FDsubfiling/H5subfiling_common.c @@ -0,0 +1,2896 @@ +/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * + * Copyright by The HDF Group. * + * All rights reserved. * + * * + * This file is part of HDF5. The full HDF5 copyright notice, including * + * terms governing use, modification, and redistribution, is contained in * + * the COPYING file, which can be found at the root of the source code * + * distribution tree, or in https://support.hdfgroup.org/ftp/HDF5/releases. * + * If you do not have access to either file, you may request a copy from * + * help@hdfgroup.org. 
* + * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */ + +/* + * Generic code for integrating an HDF5 VFD with the subfiling feature + */ + +#include <libgen.h> + +#include "H5subfiling_common.h" +#include "H5subfiling_err.h" + +typedef struct { /* Format of a context map entry */ + uint64_t h5_file_id; /* key value (linear search of the cache) */ + int64_t sf_context_id; /* The return value if matching h5_file_id */ +} file_map_to_context_t; + +typedef struct stat_record { + int64_t op_count; /* How many ops in total */ + double min; /* minimum (time) */ + double max; /* maximum (time) */ + double total; /* total (time) */ +} stat_record_t; + +/* Stat (OP) Categories */ +typedef enum stat_category { + WRITE_STAT = 0, + WRITE_WAIT, + READ_STAT, + READ_WAIT, + FOPEN_STAT, + FCLOSE_STAT, + QUEUE_STAT, + TOTAL_STAT_COUNT +} stat_category_t; + +/* Identifiers for HDF5's error API */ +hid_t H5subfiling_err_stack_g = H5I_INVALID_HID; +hid_t H5subfiling_err_class_g = H5I_INVALID_HID; +char H5subfiling_mpi_error_str[MPI_MAX_ERROR_STRING]; +int H5subfiling_mpi_error_str_len; + +static subfiling_context_t *sf_context_cache = NULL; +static sf_topology_t * sf_topology_cache = NULL; + +static size_t sf_context_cache_limit = 16; +static size_t sf_topology_cache_limit = 4; + +app_layout_t *sf_app_layout = NULL; + +static file_map_to_context_t *sf_open_file_map = NULL; +static int sf_file_map_size = 0; +#define DEFAULT_FILE_MAP_ENTRIES 8 + +/* Definitions for recording subfiling statistics */ +static stat_record_t subfiling_stats[TOTAL_STAT_COUNT]; +#define SF_WRITE_OPS (subfiling_stats[WRITE_STAT].op_count) +#define SF_WRITE_TIME (subfiling_stats[WRITE_STAT].total / (double)subfiling_stats[WRITE_STAT].op_count) +#define SF_WRITE_WAIT_TIME (subfiling_stats[WRITE_WAIT].total / (double)subfiling_stats[WRITE_WAIT].op_count) +#define SF_READ_OPS (subfiling_stats[READ_STAT].op_count) +#define SF_READ_TIME (subfiling_stats[READ_STAT].total / (double)subfiling_stats[READ_STAT].op_count) +#define SF_READ_WAIT_TIME (subfiling_stats[READ_WAIT].total / (double)subfiling_stats[READ_WAIT].op_count) +#define SF_QUEUE_DELAYS (subfiling_stats[QUEUE_STAT].total) + +int sf_verbose_flag = 0; + +#ifdef H5_SUBFILING_DEBUG +char sf_logfile_name[PATH_MAX]; +FILE *sf_logfile = NULL; + +static int sf_open_file_count = 0; +#endif + +static herr_t H5_free_subfiling_object_int(subfiling_context_t *sf_context); +static herr_t H5_free_subfiling_topology(sf_topology_t *topology); + +static herr_t init_subfiling(ioc_selection_t ioc_selection_type, MPI_Comm comm, int64_t *context_id_out); +static herr_t init_app_topology(ioc_selection_t ioc_selection_type, MPI_Comm comm, + sf_topology_t **app_topology_out); +static herr_t init_subfiling_context(subfiling_context_t *sf_context, sf_topology_t *app_topology, + MPI_Comm file_comm); +static herr_t open_subfile_with_context(subfiling_context_t *sf_context, int file_acc_flags); +static herr_t record_fid_to_subfile(uint64_t h5_file_id, int64_t subfile_context_id, int *next_index); +static herr_t ioc_open_file(sf_work_request_t *msg, int file_acc_flags); +static herr_t generate_subfile_name(subfiling_context_t *sf_context, int file_acc_flags, char *filename_out, + size_t filename_out_len, char **filename_basename_out, + char **subfile_dir_out); +static herr_t create_config_file(subfiling_context_t *sf_context, const char *base_filename, + const char *subfile_dir, hbool_t truncate_if_exists); +static herr_t open_config_file(subfiling_context_t *sf_context, const char
*base_filename, + const char *subfile_dir, const char *mode, FILE **config_file_out); + +static void initialize_statistics(void); +static int numDigits(int n); +static int get_next_fid_map_index(void); +static void clear_fid_map_entry(uint64_t sf_fid, int64_t sf_context_id); +static int compare_hostid(const void *h1, const void *h2); +static herr_t get_ioc_selection_criteria_from_env(ioc_selection_t *ioc_selection_type, + char ** ioc_sel_info_str); +static int count_nodes(sf_topology_t *info, MPI_Comm comm); +static herr_t gather_topology_info(sf_topology_t *info, MPI_Comm comm); +static int identify_ioc_ranks(sf_topology_t *info, int node_count, int iocs_per_node); +static inline void assign_ioc_ranks(sf_topology_t *app_topology, int ioc_count, int rank_multiple); + +static void +initialize_statistics(void) +{ + HDmemset(subfiling_stats, 0, sizeof(subfiling_stats)); +} + +static int +numDigits(int n) +{ + if (n < 0) + n = (n == INT_MIN) ? INT_MAX : -n; + if (n < 10) + return 1; + if (n < 100) + return 2; + if (n < 1000) + return 3; + if (n < 10000) + return 4; + if (n < 100000) + return 5; + if (n < 1000000) + return 6; + if (n < 10000000) + return 7; + if (n < 100000000) + return 8; + if (n < 1000000000) + return 9; + return 10; +} + +/*------------------------------------------------------------------------- + * Function: set_verbose_flag + * + * Purpose: For debugging purposes, a verbose setting may be enabled to + * direct printing of relevant information into an IOC-specific + * log file; the file is opened when the flag is enabled and + * closed when the verbose setting is disabled. + * + * Return: None + * Errors: None + * + * Programmer: Richard Warren + * + * Changes: Initial Version/None. + *------------------------------------------------------------------------- + */ +void +set_verbose_flag(int subfile_rank, int new_value) +{ +#ifdef H5_SUBFILING_DEBUG + sf_verbose_flag = (int)(new_value & 0x0FF); + if (sf_verbose_flag) { + char logname[64]; + HDsnprintf(logname, sizeof(logname), "ioc_%d.log", subfile_rank); + if (sf_open_file_count > 1) + sf_logfile = fopen(logname, "a+"); + else + sf_logfile = fopen(logname, "w+"); + } + else if (sf_logfile) { + fclose(sf_logfile); + sf_logfile = NULL; + } +#else + (void)subfile_rank; + (void)new_value; +#endif + + return; +} + +static int +get_next_fid_map_index(void) +{ + int index = 0; + + HDassert(sf_open_file_map || (sf_file_map_size == 0)); + + for (int i = 0; i < sf_file_map_size; i++) { + if (sf_open_file_map[i].h5_file_id == UINT64_MAX) { + index = i; + break; + } + } + + /* A valid index should always be found here */ + HDassert(index >= 0); + HDassert((sf_file_map_size == 0) || (index < sf_file_map_size)); + + return index; +} + +/*------------------------------------------------------------------------- + * Function: clear_fid_map_entry + * + * Purpose: Remove the map entry associated with the file->inode. + * This is done at file close. + * + * Return: None + * Errors: Cannot fail. + * + * Programmer: Richard Warren + * 7/17/2020 + * + * Changes: Initial Version/None.
+ * + *------------------------------------------------------------------------- + */ +static void +clear_fid_map_entry(uint64_t sf_fid, int64_t sf_context_id) +{ + if (sf_open_file_map) { + int i; + for (i = 0; i < sf_file_map_size; i++) { + if ((sf_open_file_map[i].h5_file_id == sf_fid) && + (sf_open_file_map[i].sf_context_id == sf_context_id)) { + sf_open_file_map[i].h5_file_id = UINT64_MAX; + sf_open_file_map[i].sf_context_id = -1; + return; + } + } + } +} /* end clear_fid_map_entry() */ + +/* + * --------------------------------------------------- + * Topology discovery related functions for choosing + * I/O Concentrator (IOC) ranks. + * Currently, the default approach for assigning an IOC + * is to select the lowest MPI rank on each node. + * + * The approach collectively generates N tuples + * consisting of the MPI rank and hostid. This + * collection is then sorted by hostid and scanned + * to identify the IOC ranks. + * + * As time permits, additional assignment methods will + * be implemented, e.g. 1-per-Nranks or via a config + * option. Additional selection methodologies can + * be included as users get more experience using the + * subfiling implementation. + * --------------------------------------------------- + */ + +/*------------------------------------------------------------------------- + * Function: compare_hostid + * + * Purpose: qsort sorting function. + * Compares tuples of 'layout_t'. The sorting is based on + * the long hostid values. + * + * Return: Negative, zero or positive if hostid1 is less than, + * equal to or greater than hostid2, respectively + * + * Programmer: Richard Warren + * 7/17/2020 + * + * Changes: Initial Version/None. + * + *------------------------------------------------------------------------- + */ +static int +compare_hostid(const void *h1, const void *h2) +{ + const layout_t *host1 = (const layout_t *)h1; + const layout_t *host2 = (const layout_t *)h2; + return (host1->hostid > host2->hostid) - (host1->hostid < host2->hostid); +} + +/* +------------------------------------------------------------------------- + Programmer: Richard Warren + Purpose: Determine the IOC selection criteria: either the default + selection method, SELECT_IOC_ONE_PER_NODE, or, if the user + has selected a method via the environment variable + (H5_IOC_SELECTION_CRITERIA), that method along with any + optional qualifier for it. + + Errors: None.
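 + Illustrative example values for the environment variable named by + H5_IOC_SELECTION_CRITERIA (the path and count shown are assumptions), + using the integer:[integer|string] form parsed below: + + "1:64" -> SELECT_IOC_EVERY_NTH_RANK, one IOC every 64 ranks + "2:/path/to/config" -> SELECT_IOC_WITH_CONFIG +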
+ + Revision History -- Initial implementation +------------------------------------------------------------------------- +*/ +static herr_t +get_ioc_selection_criteria_from_env(ioc_selection_t *ioc_selection_type, char **ioc_sel_info_str) +{ + char *opt_value = NULL; + char *env_value = HDgetenv(H5_IOC_SELECTION_CRITERIA); + + HDassert(ioc_selection_type); + HDassert(ioc_sel_info_str); + + *ioc_sel_info_str = NULL; + + if (env_value) { + long check_value; + + /* + * For non-default options, the environment variable + * should have the following form: integer:[integer|string] + * In particular, EveryNthRank == 1:64 or every 64 ranks assign an IOC + * or WithConfig == 2:/<full_path_to_config_file> + */ + if ((opt_value = HDstrchr(env_value, ':'))) + *opt_value++ = '\0'; + + errno = 0; + check_value = HDstrtol(env_value, NULL, 0); + + if (errno == ERANGE) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't parse value from " H5_IOC_SELECTION_CRITERIA " environment variable\n", + __func__); +#endif + + return FAIL; + } + + if ((check_value < 0) || (check_value >= ioc_selection_options)) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: invalid IOC selection type value %ld from " H5_IOC_SELECTION_CRITERIA + " environment variable\n", + __func__, check_value); +#endif + + return FAIL; + } + + *ioc_selection_type = (ioc_selection_t)check_value; + *ioc_sel_info_str = opt_value; + } + + return SUCCEED; +} + +/*------------------------------------------------------------------------- + * Function: count_nodes + * + * Purpose: Initializes the sorted collection of hostid+mpi_rank + * tuples. After initialization, the collection is scanned + * to determine the number of unique hostid entries. This + * value will determine the number of actual I/O concentrators + * that are available to the application. A side effect is to + * identify the 'node_index' of the current process. + * + * Return: The number of unique hostid's (nodes). + * Errors: MPI_Abort if memory cannot be allocated. + * + * Programmer: Richard Warren + * 7/17/2020 + * + * Changes: Initial Version/None. + * + *------------------------------------------------------------------------- + */ +static int +count_nodes(sf_topology_t *info, MPI_Comm comm) +{ + app_layout_t *app_layout = NULL; + long nextid; + int node_count; + int hostid_index = -1; + int my_rank; + int mpi_code; + + HDassert(info); + HDassert(info->app_layout); + HDassert(info->app_layout->layout); + HDassert(info->app_layout->node_ranks); + HDassert(MPI_COMM_NULL != comm); + + if (MPI_SUCCESS != (mpi_code = MPI_Comm_rank(comm, &my_rank))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't get MPI communicator rank; rc = %d\n", __func__, mpi_code); +#endif + + return -1; + } + + app_layout = info->app_layout; + node_count = app_layout->node_count; + + if (node_count == 0) + gather_topology_info(info, comm); + + nextid = app_layout->layout[0].hostid; + /* Possibly record my hostid_index */ + if (app_layout->layout[0].rank == my_rank) { + hostid_index = 0; + } + + app_layout->node_ranks[0] = 0; /* Add index */ + node_count = 1; + + /* Recall that the topology array has been sorted!
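 * (An illustrative walk-through, with assumed values: for world_size = 4 + * and sorted hostids {A, A, B, B}, the scan yields node_ranks = {0, 2}, + * then node_ranks[2] = 4 is set as the end marker below and node_count = 2 + * is returned.)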
*/ + for (int k = 1; k < app_layout->world_size; k++) { + /* Possibly record my hostid_index */ + if (app_layout->layout[k].rank == my_rank) + hostid_index = k; + if (app_layout->layout[k].hostid != nextid) { + nextid = app_layout->layout[k].hostid; + /* Record the index of new hostid */ + app_layout->node_ranks[node_count++] = k; + } + } + + /* Mark the end of the node_ranks */ + app_layout->node_ranks[node_count] = app_layout->world_size; + /* Save the index where we first located my hostid */ + app_layout->node_index = hostid_index; + + app_layout->node_count = node_count; + + return node_count; +} + +/*------------------------------------------------------------------------- + * Function: gather_topology_info + * + * Purpose: Collectively generate a sorted collection of hostid+mpi_rank + * tuples. The result is returned in the 'topology' field + * of the sf_topology_t structure. + * + * Return: Non-negative on success/Negative on failure + * + * Programmer: Richard Warren + * 7/17/2020 + * + * Changes: Initial Version/None. + * + *------------------------------------------------------------------------- + */ +static herr_t +gather_topology_info(sf_topology_t *info, MPI_Comm comm) +{ + app_layout_t *app_layout = NULL; + layout_t my_hostinfo; + long hostid; + int sf_world_size; + int sf_world_rank; + + HDassert(info); + HDassert(info->app_layout); + HDassert(info->app_layout->layout); + HDassert(MPI_COMM_NULL != comm); + + app_layout = info->app_layout; + sf_world_size = app_layout->world_size; + sf_world_rank = app_layout->world_rank; + + hostid = gethostid(); + + my_hostinfo.hostid = hostid; + my_hostinfo.rank = sf_world_rank; + + app_layout->hostid = hostid; + app_layout->layout[sf_world_rank] = my_hostinfo; + + if (sf_world_size > 1) { + int mpi_code; + + if (MPI_SUCCESS != + (mpi_code = MPI_Allgather(&my_hostinfo, 2, MPI_LONG, app_layout->layout, 2, MPI_LONG, comm))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: MPI_Allgather failed with rc %d\n", __func__, mpi_code); +#endif + + return FAIL; + } + + qsort(app_layout->layout, (size_t)sf_world_size, sizeof(layout_t), compare_hostid); + } + + return SUCCEED; +} + +/*------------------------------------------------------------------------- + * Function: identify_ioc_ranks + * + * Purpose: We've already identified the number of unique nodes and + * have a sorted list of layout_t structures. Under normal + * conditions, we only utilize a single IOC per node. Under + * that circumstance, we simply fill the io_concentrators + * vector from the node_ranks array (which contains the index + * into the layout array of the lowest MPI rank on each node). + * Otherwise, while determining the number of local_peers per + * node, we can also select one or more additional IOCs. + * + * As a side effect, we fill the 'io_concentrators' vector + * and set the 'rank_is_ioc' flag to TRUE if our rank is + * identified as owning an I/O Concentrator (IOC).
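 * + * An illustrative walk-through (values assumed): with node_ranks = + * {0, 4, 8} describing two nodes of four ranks each and iocs_per_node + * = 2, the first two layout entries in each node's span are selected, + * leaving n_io_concentrators == 4.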
+ * + *------------------------------------------------------------------------- + */ +static int +identify_ioc_ranks(sf_topology_t *info, int node_count, int iocs_per_node) +{ + app_layout_t *app_layout = NULL; + int total_ioc_count = 0; + + HDassert(info); + HDassert(info->app_layout); + + app_layout = info->app_layout; + + for (int n = 0; n < node_count; n++) { + int node_index = app_layout->node_ranks[n]; + int local_peer_count = app_layout->node_ranks[n + 1] - app_layout->node_ranks[n]; + + info->io_concentrators[total_ioc_count++] = (int)(app_layout->layout[node_index++].rank); + + if (app_layout->layout[node_index - 1].rank == app_layout->world_rank) { + info->subfile_rank = total_ioc_count - 1; + info->rank_is_ioc = TRUE; + } + + for (int k = 1; k < iocs_per_node; k++) { + if (k < local_peer_count) { + if (app_layout->layout[node_index].rank == app_layout->world_rank) { + info->rank_is_ioc = TRUE; + info->subfile_rank = total_ioc_count; + } + info->io_concentrators[total_ioc_count++] = (int)(app_layout->layout[node_index++].rank); + } + } + } + + info->n_io_concentrators = total_ioc_count; + + return total_ioc_count; +} /* end identify_ioc_ranks() */ + +static inline void +assign_ioc_ranks(sf_topology_t *app_topology, int ioc_count, int rank_multiple) +{ + app_layout_t *app_layout = NULL; + int * io_concentrators = NULL; + + HDassert(app_topology); + HDassert(app_topology->app_layout); + HDassert(app_topology->io_concentrators); + + app_layout = app_topology->app_layout; + io_concentrators = app_topology->io_concentrators; + + /* fill the io_concentrators values based on the application layout */ + if (io_concentrators) { + int ioc_index; + for (int k = 0, ioc_next = 0; ioc_next < ioc_count; ioc_next++) { + ioc_index = rank_multiple * k++; + io_concentrators[ioc_next] = (int)(app_layout->layout[ioc_index].rank); + if (io_concentrators[ioc_next] == app_layout->world_rank) + app_topology->rank_is_ioc = TRUE; + } + app_topology->n_io_concentrators = ioc_count; + } +} /* end assign_ioc_ranks() */ + +/*------------------------------------------------------------------------- + * Function: H5_new_subfiling_object_id + * + * Purpose: Given a subfiling object type and an index value, generates + * a new subfiling object ID. + * + * Return: Non-negative object ID on success/Negative on failure + * + *------------------------------------------------------------------------- + */ +int64_t +H5_new_subfiling_object_id(sf_obj_type_t obj_type, int64_t index_val) +{ + if (obj_type != SF_CONTEXT && obj_type != SF_TOPOLOGY) + return -1; + if (index_val < 0) + return -1; + + return (((int64_t)obj_type << 32) | index_val); +} + +/*------------------------------------------------------------------------- + * Function: H5_get_subfiling_object + * + * Purpose: Given a subfiling object ID, returns a pointer to the + * underlying object, which can be either a subfiling context + * object (subfiling_context_t) or a subfiling topology + * object (sf_topology_t). + * + * A subfiling object ID contains the object type in the upper + * 32 bits and an index value in the lower 32 bits. + * + * Subfiling contexts are 1 per open file. If only one file is + * open at a time, then only a single subfiling context cache + * entry will be used. + * + * Topologies are static, e.g. for any one I/O concentrator + * allocation strategy, the results should always be the same. + * + * TODO: The one exception to this being the 1 IOC per N MPI + * ranks strategy. 
The value of N can be changed on a per-file + * basis, so we need to address that at some point. + * + * Return: Pointer to underlying subfiling object if subfiling object + * ID is valid + * + * NULL if subfiling object ID is invalid or an internal + * failure occurs + * + *------------------------------------------------------------------------- + */ +/* + * TODO: we don't appear to ever use this for retrieving a subfile topology + * object. Might be able to refactor to just return a subfile context + * object. + */ +/* TODO: no way of freeing caches on close currently */ +void * +H5_get_subfiling_object(int64_t object_id) +{ + int64_t obj_type = (object_id >> 32) & 0x0FFFF; + int64_t obj_index = object_id & 0x0FFFF; + + if (obj_index < 0) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: invalid object index for subfiling object ID %" PRId64 "\n", __func__, object_id); +#endif + + return NULL; + } + + if (obj_type == SF_CONTEXT) { + /* Contexts provide information principally about + * the application and how the data layout is managed + * over some number of sub-files. The important + * parameters are the number of subfiles (or, in the + * context of IOCs, the MPI ranks and count of the + * processes which host an I/O Concentrator). We + * also provide a map of IOC rank to MPI rank + * to facilitate the communication of I/O requests. + */ + + /* Create subfiling context cache if it doesn't exist */ + if (!sf_context_cache) { + if (NULL == (sf_context_cache = HDcalloc(sf_context_cache_limit, sizeof(subfiling_context_t)))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't allocate space for subfiling context cache\n", __func__); +#endif + + return NULL; + } + } + + /* Make more space in context cache if needed */ + if ((size_t)obj_index == sf_context_cache_limit) { + size_t old_num_entries; + void * tmp_realloc; + + old_num_entries = sf_context_cache_limit; + + sf_context_cache_limit *= 2; + + if (NULL == (tmp_realloc = HDrealloc(sf_context_cache, + sf_context_cache_limit * sizeof(subfiling_context_t)))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't allocate space for subfiling context cache\n", __func__); +#endif + + return NULL; + } + + sf_context_cache = tmp_realloc; + + /* Clear newly-allocated entries */ + HDmemset(&sf_context_cache[obj_index], 0, + (sf_context_cache_limit - old_num_entries) * sizeof(subfiling_context_t)); + } + + /* Return direct pointer to the context cache entry */ + return (void *)&sf_context_cache[obj_index]; + } + else if (obj_type == SF_TOPOLOGY) { + /* Create subfiling topology cache if it doesn't exist */ + if (!sf_topology_cache) { + if (NULL == (sf_topology_cache = HDcalloc(sf_topology_cache_limit, sizeof(sf_topology_t)))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't allocate space for subfiling topology cache\n", __func__); +#endif + + return NULL; + } + } + + /* We will likely only cache a single topology + * which is that of the original parallel application. + * In that context, we will identify the number of + * nodes along with the number of MPI ranks on a node.
+ */ + if ((size_t)obj_index >= sf_topology_cache_limit) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: invalid object index for subfiling topology object ID\n", __func__); +#endif + + return NULL; + } + + /* Return direct pointer to the topology cache entry */ + return (void *)&sf_topology_cache[obj_index]; + } + +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: Unknown subfiling object type for ID %" PRId64 "\n", __func__, object_id); +#endif + + return NULL; +} + +/*------------------------------------------------------------------------- + * Function: H5_free_subfiling_object + * + * Purpose: Frees the underlying subfiling object for a given subfiling + * object ID. + * + * Return: Non-negative on success/Negative on failure + * + *------------------------------------------------------------------------- + */ +herr_t +H5_free_subfiling_object(int64_t object_id) +{ + subfiling_context_t *sf_context = NULL; + int64_t obj_type = (object_id >> 32) & 0x0FFFF; + + if (obj_type != SF_CONTEXT) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: invalid subfiling object type for ID %" PRId64 "\n", __func__, object_id); +#endif + + return FAIL; + } + + sf_context = H5_get_subfiling_object(object_id); + if (!sf_context) + return FAIL; + + if (H5_free_subfiling_object_int(sf_context) < 0) + return FAIL; + + return SUCCEED; +} + +static herr_t +H5_free_subfiling_object_int(subfiling_context_t *sf_context) +{ + HDassert(sf_context); + +#ifdef H5_SUBFILING_DEBUG + if (sf_context->sf_logfile) { + struct tm *tm = NULL; + time_t cur_time; + + cur_time = time(NULL); + tm = localtime(&cur_time); + + H5_subfiling_log(sf_context->sf_context_id, "\n-- LOGGING FINISH - %s", asctime(tm)); + + HDfclose(sf_context->sf_logfile); + sf_context->sf_logfile = NULL; + } +#endif + + sf_context->sf_context_id = -1; + sf_context->h5_file_id = UINT64_MAX; + sf_context->sf_fid = -1; + sf_context->sf_write_count = 0; + sf_context->sf_read_count = 0; + sf_context->sf_eof = HADDR_UNDEF; + sf_context->sf_stripe_size = -1; + sf_context->sf_blocksize_per_stripe = -1; + sf_context->sf_base_addr = -1; + + if (sf_context->sf_msg_comm != MPI_COMM_NULL) { + if (H5_mpi_comm_free(&sf_context->sf_msg_comm) < 0) + return FAIL; + sf_context->sf_msg_comm = MPI_COMM_NULL; + } + if (sf_context->sf_data_comm != MPI_COMM_NULL) { + if (H5_mpi_comm_free(&sf_context->sf_data_comm) < 0) + return FAIL; + sf_context->sf_data_comm = MPI_COMM_NULL; + } + if (sf_context->sf_eof_comm != MPI_COMM_NULL) { + if (H5_mpi_comm_free(&sf_context->sf_eof_comm) < 0) + return FAIL; + sf_context->sf_eof_comm = MPI_COMM_NULL; + } + if (sf_context->sf_barrier_comm != MPI_COMM_NULL) { + if (H5_mpi_comm_free(&sf_context->sf_barrier_comm) < 0) + return FAIL; + sf_context->sf_barrier_comm = MPI_COMM_NULL; + } + if (sf_context->sf_group_comm != MPI_COMM_NULL) { + if (H5_mpi_comm_free(&sf_context->sf_group_comm) < 0) + return FAIL; + sf_context->sf_group_comm = MPI_COMM_NULL; + } + if (sf_context->sf_intercomm != MPI_COMM_NULL) { + if (H5_mpi_comm_free(&sf_context->sf_intercomm) < 0) + return FAIL; + sf_context->sf_intercomm = MPI_COMM_NULL; + } + + sf_context->sf_group_size = -1; + sf_context->sf_group_rank = -1; + sf_context->sf_intercomm_root = -1; + + HDfree(sf_context->subfile_prefix); + sf_context->subfile_prefix = NULL; + + HDfree(sf_context->sf_filename); + sf_context->sf_filename = NULL; + + HDfree(sf_context->h5_filename); + sf_context->h5_filename = NULL; + + if (H5_free_subfiling_topology(sf_context->topology) < 0) + return FAIL; + sf_context->topology = NULL; + + return 
SUCCEED; +} + +static herr_t +H5_free_subfiling_topology(sf_topology_t *topology) +{ + HDassert(topology); + + topology->subfile_rank = -1; + topology->n_io_concentrators = 0; + + HDfree(topology->subfile_fd); + topology->subfile_fd = NULL; + + /* + * The below assumes that the subfiling application layout + * is retrieved once and used for subsequent file opens for + * the duration that the Subfiling VFD is in use + */ + HDassert(topology->app_layout == sf_app_layout); + +#if 0 + if (topology->app_layout && (topology->app_layout != sf_app_layout)) { + HDfree(topology->app_layout->layout); + topology->app_layout->layout = NULL; + + HDfree(topology->app_layout->node_ranks); + topology->app_layout->node_ranks = NULL; + + HDfree(topology->app_layout); + } +#endif + + topology->app_layout = NULL; + + HDfree(topology->io_concentrators); + topology->io_concentrators = NULL; + + HDfree(topology); + + return SUCCEED; +} + +/*------------------------------------------------------------------------- + * Function: H5_open_subfiles + * + * Purpose: Wrapper for the internal 'open_subfile_with_context' + * function. Similar to the other public wrapper functions, we + * discover (via the sf_context) the number of io concentrators + * and pass that to the internal function so that vector + * storage arrays can be stack based rather than explicitly + * allocated and freed. + * + * The internal function is responsible for sending the + * (sub)file open requests to all IOC instances. + * + * Prior to calling the internal open function, we initialize + * a new subfiling context that contains topology info and + * new MPI communicators that facilitate messaging between + * HDF5 clients and the IOCs. + * + * Return: Success (0) or Failure (non-zero) + * Errors: If MPI operations fail for some reason. + * + * Programmer: Richard Warren + * 7/17/2020 + * + * Changes: Initial Version/None.
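 * + * An illustrative call (a sketch only; the flags and communicator + * shown are assumptions, and error handling is elided): + * + * int64_t context_id = -1; + * herr_t status = H5_open_subfiles(name, file_id, + * SELECT_IOC_ONE_PER_NODE, + * O_CREAT | O_RDWR, + * MPI_COMM_WORLD, &context_id);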
+ *------------------------------------------------------------------------- + */ +/* TODO: revise description */ +herr_t +H5_open_subfiles(const char *base_filename, uint64_t h5_file_id, ioc_selection_t ioc_selection_type, + int file_acc_flags, MPI_Comm file_comm, int64_t *context_id_out) +{ + subfiling_context_t *sf_context = NULL; + int64_t context_id = -1; + int l_errors = 0; + int g_errors = 0; + int mpi_code; + herr_t ret_value = SUCCEED; + + if (!base_filename) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: invalid base filename\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + if (!context_id_out) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: context_id_out is NULL\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + initialize_statistics(); + +#if 0 /* TODO */ + /* Maybe set the verbose flag for more debugging info */ + envValue = HDgetenv("H5_SF_VERBOSE_FLAG"); + if (envValue != NULL) { + int check_value = atoi(envValue); + if (check_value > 0) + sf_verbose_flag = 1; + } +#endif + + /* Initialize new subfiling context ID based on configuration information */ + if (init_subfiling(ioc_selection_type, file_comm, &context_id) < 0) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't initialize subfiling context\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + /* Retrieve the subfiling object for the newly-created context ID */ + if (NULL == (sf_context = H5_get_subfiling_object(context_id))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't get subfiling object from context ID\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + /* Save some basic things in the new subfiling context */ + sf_context->h5_file_id = h5_file_id; + + if (NULL == (sf_context->h5_filename = HDstrdup(base_filename))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't copy base HDF5 filename\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + /* + * If we're actually using the IOCs, we will + * start the service threads on the identified + * ranks as part of the subfile opening. 
+ */ + if (open_subfile_with_context(sf_context, file_acc_flags) < 0) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't open subfiles\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + +#ifdef H5_SUBFILING_DEBUG + { + struct tm *tm = NULL; + time_t cur_time; + int mpi_rank; + + /* Open debugging logfile */ + + if (MPI_SUCCESS != MPI_Comm_rank(file_comm, &mpi_rank)) { + HDprintf("%s: couldn't get MPI rank\n", __func__); + ret_value = FAIL; + goto done; + } + + HDsnprintf(sf_context->sf_logfile_name, PATH_MAX, "%s.log.%d", sf_context->h5_filename, mpi_rank); + + if (NULL == (sf_context->sf_logfile = HDfopen(sf_context->sf_logfile_name, "a"))) { + HDprintf("%s: couldn't open subfiling debug logfile\n", __func__); + ret_value = FAIL; + goto done; + } + + cur_time = time(NULL); + tm = localtime(&cur_time); + + H5_subfiling_log(context_id, "-- LOGGING BEGIN - %s", asctime(tm)); + } +#endif + + *context_id_out = context_id; + +done: + if (ret_value < 0) { + l_errors = 1; + } + + /* + * Form consensus on whether opening subfiles was + * successful + */ + if (MPI_SUCCESS != (mpi_code = MPI_Allreduce(&l_errors, &g_errors, 1, MPI_INT, MPI_SUM, file_comm))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("[%s %d]: MPI_Allreduce failed with rc %d\n", __func__, + sf_context->topology->app_layout->world_rank, mpi_code); +#endif + + ret_value = FAIL; + } + + if (g_errors > 0) { +#ifdef H5_SUBFILING_DEBUG + if (sf_context->topology->app_layout->world_rank == 0) { + HDprintf("%s: one or more IOC ranks couldn't open subfiles\n", __func__); + } +#endif + + ret_value = FAIL; + } + + if (ret_value < 0) { + clear_fid_map_entry(h5_file_id, context_id); + + if (context_id >= 0 && H5_free_subfiling_object(context_id) < 0) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't free subfiling object\n", __func__); +#endif + } + + *context_id_out = -1; + } + + return ret_value; +} + +/* +------------------------------------------------------------------------- + Programmer: Richard Warren + Purpose: Called as part of a file open operation, we initialize a + subfiling context which includes the application topology + along with other relevant info such as the MPI objects + (communicators) for communicating with IO concentrators. + We also identify which MPI ranks will have IOC threads + started on them. + + We return a context ID via the 'sf_context' variable. + + Errors: returns an error if we detect any initialization errors, + including malloc failures or any resource allocation + problems. 
+ + Revision History -- Initial implementation +------------------------------------------------------------------------- +*/ +static herr_t +init_subfiling(ioc_selection_t ioc_selection_type, MPI_Comm comm, int64_t *context_id_out) +{ + subfiling_context_t *new_context = NULL; + sf_topology_t * app_topology = NULL; + int64_t context_id = -1; + int file_index = -1; + herr_t ret_value = SUCCEED; + + HDassert(context_id_out); + + file_index = get_next_fid_map_index(); + HDassert(file_index >= 0); + + /* Use the file's index to create a new subfiling context ID */ + if ((context_id = H5_new_subfiling_object_id(SF_CONTEXT, file_index)) < 0) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't create new subfiling context ID\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + /* Create a new subfiling context object with the created context ID */ + if (NULL == (new_context = H5_get_subfiling_object(context_id))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't create new subfiling object\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + /* + * Setup the application topology information, including the computed + * number and distribution map of the set of I/O concentrators + */ + if (init_app_topology(ioc_selection_type, comm, &app_topology) < 0) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't initialize application topology\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + new_context->sf_context_id = context_id; + + if (init_subfiling_context(new_context, app_topology, comm) < 0) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't initialize subfiling topology object\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + new_context->sf_base_addr = 0; + if (new_context->topology->rank_is_ioc) { + new_context->sf_base_addr = + (int64_t)(new_context->topology->subfile_rank * new_context->sf_stripe_size); + } + + *context_id_out = context_id; + +done: + if (ret_value < 0) { + HDfree(app_topology); + + if (context_id >= 0 && H5_free_subfiling_object(context_id) < 0) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't free subfiling object\n", __func__); +#endif + } + } + + return ret_value; +} + +/*------------------------------------------------------------------------- + * Function: init_app_topology + * + * Purpose: Once a sorted collection of hostid/mpi_rank tuples has been + * created and the number of unique hostids (nodes) has + * been determined, we may modify this "default" value for + * the number of IO Concentrators for this application. + * + * The default of one(1) IO concentrator per node can be + * changed (principally for testing) by environment variable: + * if H5_IOC_COUNT_PER_NODE is set, then that integer value + * is utilized as a multiplier to modify the set of + * IO Concentrator ranks. + * + * The cached results will be replicated within the + * subfiling_context_t structure and are utilized as a map from + * io concentrator rank to MPI communicator rank for message + * sends and receives. + * + * Return: Non-negative on success/Negative on failure. The MPI ranks + * of the selected IOCs are cached in the 'io_concentrators' + * vector variable, and the length of this vector is cached + * as 'n_io_concentrators'. + * Errors: Fails if memory cannot be allocated or MPI queries fail. + * + * Programmer: Richard Warren + * 7/17/2020 + * + * Changes: - Initial Version/None. + * - Updated the API to allow a variety of methods for + * determining the number and MPI ranks that will have + * IO Concentrators. The default approach will define
The default approach will define + * a single IOC per node. + * + *------------------------------------------------------------------------- + */ +static herr_t +init_app_topology(ioc_selection_t ioc_selection_type, MPI_Comm comm, sf_topology_t **app_topology_out) +{ + sf_topology_t *app_topology = NULL; + app_layout_t * app_layout = sf_app_layout; + char * env_value = NULL; + char * ioc_sel_str = NULL; + int * io_concentrators = NULL; + long ioc_select_val = -1; + long iocs_per_node = 1; + int ioc_count = 0; + int comm_rank; + int comm_size; + int mpi_code; + herr_t ret_value = SUCCEED; + + HDassert(MPI_COMM_NULL != comm); + HDassert(app_topology_out); + HDassert(!*app_topology_out); + + if (MPI_SUCCESS != (mpi_code = MPI_Comm_rank(comm, &comm_rank))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't get MPI communicator rank; rc = %d\n", __func__, mpi_code); +#endif + + ret_value = FAIL; + goto done; + } + + if (MPI_SUCCESS != (mpi_code = MPI_Comm_size(comm, &comm_size))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't get MPI communicator size; rc = %d\n", __func__, mpi_code); +#endif + + ret_value = FAIL; + goto done; + } + + /* Check if an IOC selection type was specified by environment variable */ + if (get_ioc_selection_criteria_from_env(&ioc_selection_type, &ioc_sel_str) < 0) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't get IOC selection type from environment\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + /* Sanity checking on different IOC selection strategies */ + switch (ioc_selection_type) { + case SELECT_IOC_EVERY_NTH_RANK: { + errno = 0; + + ioc_select_val = 1; + if (ioc_sel_str) { + ioc_select_val = HDstrtol(ioc_sel_str, NULL, 0); + if ((ERANGE == errno) || (ioc_select_val <= 0)) { + HDprintf("invalid IOC selection strategy string '%s' for strategy " + "SELECT_IOC_EVERY_NTH_RANK; defaulting to SELECT_IOC_ONE_PER_NODE\n", + ioc_sel_str); + ioc_select_val = 1; + ioc_selection_type = SELECT_IOC_ONE_PER_NODE; + } + } + + break; + } + + case SELECT_IOC_WITH_CONFIG: + HDprintf("SELECT_IOC_WITH_CONFIG IOC selection strategy not supported yet; defaulting to " + "SELECT_IOC_ONE_PER_NODE\n"); + ioc_selection_type = SELECT_IOC_ONE_PER_NODE; + break; + + case SELECT_IOC_TOTAL: { + errno = 0; + + ioc_select_val = 1; + if (ioc_sel_str) { + ioc_select_val = HDstrtol(ioc_sel_str, NULL, 0); + if ((ERANGE == errno) || (ioc_select_val <= 0) || (ioc_select_val >= comm_size)) { + HDprintf("invalid IOC selection strategy string '%s' for strategy SELECT_IOC_TOTAL; " + "defaulting to SELECT_IOC_ONE_PER_NODE\n", + ioc_sel_str); + ioc_select_val = 1; + ioc_selection_type = SELECT_IOC_ONE_PER_NODE; + } + } + + break; + } + + default: + break; + } + + /* Allocate new application topology information object */ + if (NULL == (app_topology = HDcalloc(1, sizeof(*app_topology)))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't create new subfiling topology object\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + app_topology->subfile_rank = -1; + app_topology->selection_type = ioc_selection_type; + + if (NULL == (app_topology->io_concentrators = HDcalloc((size_t)comm_size, sizeof(int)))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't allocate array of I/O concentrator ranks\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + io_concentrators = app_topology->io_concentrators; + HDassert(io_concentrators); + + if (!app_layout) { + /* TODO: this is dangerous if a new comm size is greater than what + * was allocated. 
Can't reuse app layout. + */ + + if (NULL == (app_layout = HDcalloc(1, sizeof(*app_layout)))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't allocate application layout structure\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + if (NULL == (app_layout->node_ranks = HDcalloc(1, ((size_t)comm_size + 1) * sizeof(int)))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't allocate application layout node rank array\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + if (NULL == (app_layout->layout = HDcalloc(1, ((size_t)comm_size + 1) * sizeof(layout_t)))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't allocate application layout array\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + /* + * Once the application layout has been filled once, any additional + * file open operations won't be required to gather that information. + */ + sf_app_layout = app_layout; + } + + app_layout->world_size = comm_size; + app_layout->world_rank = comm_rank; + + app_topology->app_layout = app_layout; + + /* + * Determine which ranks are I/O concentrator ranks, based on the + * given IOC selection strategy and MPI information. + */ + switch (ioc_selection_type) { + case SELECT_IOC_ONE_PER_NODE: { + int node_count; + + app_topology->selection_type = SELECT_IOC_ONE_PER_NODE; + + if ((node_count = count_nodes(app_topology, comm)) < 0) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't determine number of nodes used\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + /* Check for an IOC-per-node value set in the environment */ + /* TODO: should this env. var. be interpreted for other selection types? */ + if ((env_value = HDgetenv(H5_IOC_COUNT_PER_NODE))) { + errno = 0; + ioc_select_val = HDstrtol(env_value, NULL, 0); + if ((ERANGE == errno)) { + HDprintf("invalid value '%s' for " H5_IOC_COUNT_PER_NODE "\n", env_value); + ioc_select_val = 1; + } + + if (ioc_select_val > 0) + iocs_per_node = ioc_select_val; + } + + H5_CHECK_OVERFLOW(iocs_per_node, long, int); + ioc_count = identify_ioc_ranks(app_topology, node_count, (int)iocs_per_node); + + break; + } + + case SELECT_IOC_EVERY_NTH_RANK: { + /* + * User specifies a rank multiple value. Selection starts + * with rank 0 and then the user-specified stride is applied\ + * to identify other IOC ranks. + */ + + H5_CHECK_OVERFLOW(ioc_select_val, long, int); + ioc_count = (comm_size / (int)ioc_select_val); + + if ((comm_size % ioc_select_val) != 0) { + ioc_count++; + } + + assign_ioc_ranks(app_topology, ioc_count, (int)ioc_select_val); + + break; + } + + case SELECT_IOC_TOTAL: { + int rank_multiple = 0; + + /* + * User specifies a total number of I/O concentrators. + * Starting with rank 0, a stride of (mpi_size / total) + * is applied to identify other IOC ranks. 
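 * For example (illustrative numbers): comm_size = 16 with a + * requested total of 4 IOCs gives rank_multiple = 4, so the layout + * entries at indices 0, 4, 8 and 12 become the IOCs.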
*/ + + H5_CHECK_OVERFLOW(ioc_select_val, long, int); + ioc_count = (int)ioc_select_val; + + rank_multiple = (comm_size / ioc_count); + + assign_ioc_ranks(app_topology, ioc_count, rank_multiple); + + break; + } + + case SELECT_IOC_WITH_CONFIG: + default: +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: invalid IOC selection strategy\n", __func__); +#endif + ret_value = FAIL; + goto done; + break; + } + + HDassert(ioc_count > 0); + app_topology->n_io_concentrators = ioc_count; + + /* + * Create a vector of "potential" file descriptors + * which can be indexed by the IOC ID + */ + if (NULL == (app_topology->subfile_fd = HDcalloc((size_t)ioc_count, sizeof(int)))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't allocate subfile file descriptor array\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + *app_topology_out = app_topology; + +done: + if (ret_value < 0) { + if (app_layout) { + HDfree(app_layout->layout); + HDfree(app_layout->node_ranks); + HDfree(app_layout); + } + if (app_topology) { + HDfree(app_topology->subfile_fd); + HDfree(app_topology->io_concentrators); + HDfree(app_topology); + } + } + + return ret_value; +} + +/*------------------------------------------------------------------------- + * Function: init_subfiling_context + * + * Purpose: Called as part of the HDF5 file + subfiling opening. + * This initializes the subfiling context and associates + * this context with the specific HDF5 file. + * + * Return: Non-negative on success/Negative on failure + * + * Programmer: Richard Warren + * 7/17/2020 + * + * Changes: Initial Version/None. + *------------------------------------------------------------------------- + */ +static herr_t +init_subfiling_context(subfiling_context_t *sf_context, sf_topology_t *app_topology, MPI_Comm file_comm) +{ + char * env_value = NULL; + int comm_rank; + int mpi_code; + herr_t ret_value = SUCCEED; + + HDassert(sf_context); + HDassert(sf_context->topology == NULL); + HDassert(app_topology); + HDassert(app_topology->n_io_concentrators > 0); + HDassert(MPI_COMM_NULL != file_comm); + + sf_context->topology = app_topology; + sf_context->sf_msg_comm = MPI_COMM_NULL; + sf_context->sf_data_comm = MPI_COMM_NULL; + sf_context->sf_eof_comm = MPI_COMM_NULL; + sf_context->sf_barrier_comm = MPI_COMM_NULL; + sf_context->sf_group_comm = MPI_COMM_NULL; + sf_context->sf_intercomm = MPI_COMM_NULL; + sf_context->sf_stripe_size = H5FD_DEFAULT_STRIPE_DEPTH; + sf_context->sf_write_count = 0; + sf_context->sf_read_count = 0; + sf_context->sf_eof = HADDR_UNDEF; + sf_context->sf_fid = -1; + sf_context->sf_group_size = 1; + sf_context->sf_group_rank = 0; + sf_context->h5_filename = NULL; + sf_context->sf_filename = NULL; + sf_context->subfile_prefix = NULL; + sf_context->ioc_data = NULL; + +#ifdef H5_SUBFILING_DEBUG + sf_context->sf_logfile = NULL; +#endif + + /* Check for an IOC stripe size setting in the environment */ + if ((env_value = HDgetenv(H5_IOC_STRIPE_SIZE))) { + long long stripe_size = -1; + + errno = 0; + + stripe_size = HDstrtoll(env_value, NULL, 0); + if (ERANGE == errno) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: invalid stripe size setting '%s' for " H5_IOC_STRIPE_SIZE "\n", __func__, + env_value); +#endif + + ret_value = FAIL; + goto done; + } + + if (stripe_size > 0) { + sf_context->sf_stripe_size = (int64_t)stripe_size; + } + } + + /* + * Set blocksize per stripe value after possibly adjusting + * for user-specified subfile stripe size + */ + sf_context->sf_blocksize_per_stripe = sf_context->sf_stripe_size *
app_topology->n_io_concentrators; + + /* Check for a subfile name prefix setting in the environment */ + if ((env_value = HDgetenv(H5_IOC_SUBFILE_PREFIX))) { + if (NULL == (sf_context->subfile_prefix = HDstrdup(env_value))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't copy subfile prefix value\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + } + + /* + * Set up various MPI sub-communicators for MPI operations + * to/from IOC ranks + */ + + if (MPI_SUCCESS != (mpi_code = MPI_Comm_rank(file_comm, &comm_rank))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't get MPI communicator rank; rc = %d\n", __func__, mpi_code); +#endif + + ret_value = FAIL; + goto done; + } + + if (MPI_SUCCESS != (mpi_code = MPI_Comm_dup(file_comm, &sf_context->sf_msg_comm))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't create sub-communicator for IOC messages; rc = %d\n", __func__, mpi_code); +#endif + + ret_value = FAIL; + goto done; + } + + if (MPI_SUCCESS != (mpi_code = MPI_Comm_set_errhandler(sf_context->sf_msg_comm, MPI_ERRORS_RETURN))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't set MPI error handler on IOC message sub-communicator; rc = %d\n", __func__, + mpi_code); +#endif + + ret_value = FAIL; + goto done; + } + + if (MPI_SUCCESS != (mpi_code = MPI_Comm_dup(file_comm, &sf_context->sf_data_comm))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't create sub-communicator for IOC data; rc = %d\n", __func__, mpi_code); +#endif + + ret_value = FAIL; + goto done; + } + + if (MPI_SUCCESS != (mpi_code = MPI_Comm_set_errhandler(sf_context->sf_data_comm, MPI_ERRORS_RETURN))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't set MPI error handler on IOC data sub-communicator; rc = %d\n", __func__, + mpi_code); +#endif + + ret_value = FAIL; + goto done; + } + + if (MPI_SUCCESS != (mpi_code = MPI_Comm_dup(file_comm, &sf_context->sf_eof_comm))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't create sub-communicator for IOC EOF; rc = %d\n", __func__, mpi_code); +#endif + + ret_value = FAIL; + goto done; + } + + if (MPI_SUCCESS != (mpi_code = MPI_Comm_set_errhandler(sf_context->sf_eof_comm, MPI_ERRORS_RETURN))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't set MPI error handler on IOC EOF sub-communicator; rc = %d\n", __func__, + mpi_code); +#endif + + ret_value = FAIL; + goto done; + } + + if (MPI_SUCCESS != (mpi_code = MPI_Comm_dup(file_comm, &sf_context->sf_barrier_comm))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't create sub-communicator for barriers; rc = %d\n", __func__, mpi_code); +#endif + + ret_value = FAIL; + goto done; + } + + if (MPI_SUCCESS != (mpi_code = MPI_Comm_set_errhandler(sf_context->sf_barrier_comm, MPI_ERRORS_RETURN))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't set MPI error handler on barrier sub-communicator; rc = %d\n", __func__, + mpi_code); +#endif + + ret_value = FAIL; + goto done; + } + + /* Create an MPI sub-communicator for IOC ranks */ + if (app_topology->n_io_concentrators > 1) { + if (MPI_SUCCESS != (mpi_code = MPI_Comm_split(file_comm, app_topology->rank_is_ioc, comm_rank, + &sf_context->sf_group_comm))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't create sub-communicator for IOC ranks; rc = %d\n", __func__, mpi_code); +#endif + + ret_value = FAIL; + goto done; + } + + if (MPI_SUCCESS != + (mpi_code = MPI_Comm_rank(sf_context->sf_group_comm, &sf_context->sf_group_rank))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't get MPI rank from IOC rank sub-communicator; rc = %d\n", __func__, + 
mpi_code); +#endif + + ret_value = FAIL; + goto done; + } + + if (MPI_SUCCESS != + (mpi_code = MPI_Comm_size(sf_context->sf_group_comm, &sf_context->sf_group_size))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't get MPI comm size from IOC rank sub-communicator; rc = %d\n", __func__, + mpi_code); +#endif + + ret_value = FAIL; + goto done; + } + } + +done: + if (ret_value < 0) { + H5_free_subfiling_object_int(sf_context); + } + + return ret_value; +} + +/*------------------------------------------------------------------------- + * Function: open_subfile_with_context + * + * Purpose: While we cannot know a priori whether an HDF5 client will + * need to access data across the entirety of a file (e.g. + * an individual MPI rank may read or write only small + * segments of the entire file space), this function sends + * a file OPEN_OP to every IO concentrator. + * + * Prior to opening any subfiles, the H5FDopen will have + * created an HDF5 file with the user-specified name. + * A path prefix will be selected and is available as + * an input argument. + * + * The opened HDF5 file handle will contain device and + * inode values, these being constant for all processes + * opening the shared file. The inode value is utilized + * as a key value and is associated with the sf_context + * which we receive as one of the input arguments. + * + * IO Concentrator threads will be initialized on MPI ranks + * which have been identified via application topology + * discovery. The number and mapping of IOC to MPI_rank + * is part of the sf_context->topology structure. + * + * Return: Success (0) or Failure (non-zero) + * Errors: If MPI operations fail for some reason. + * + * Programmer: Richard Warren + * 7/17/2020 + * + * Changes: Initial Version/None. + *------------------------------------------------------------------------- + */ +static herr_t +open_subfile_with_context(subfiling_context_t *sf_context, int file_acc_flags) +{ + double start_time; + herr_t ret_value = SUCCEED; + + HDassert(sf_context); + + start_time = MPI_Wtime(); + + /* + * Save the HDF5 file ID (fid) to subfile context mapping. + * There shouldn't be any issue, but check the status and + * return if there was a problem. + */ + if (record_fid_to_subfile(sf_context->h5_file_id, sf_context->sf_context_id, NULL) < 0) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't record HDF5 file ID to subfile context mapping\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + /* + * If this rank is an I/O concentrator, actually open + * the subfile belonging to this IOC rank + */ + if (sf_context->topology->rank_is_ioc) { + sf_work_request_t msg = {{file_acc_flags, (int64_t)sf_context->h5_file_id, sf_context->sf_context_id}, + OPEN_OP, + sf_context->topology->app_layout->world_rank, + sf_context->topology->subfile_rank, + sf_context->sf_context_id, + start_time, + NULL, + 0, + 0, + 0, + 0}; + + if (ioc_open_file(&msg, file_acc_flags) < 0) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("[%s %d]: couldn't open subfile\n", __func__, + sf_context->topology->app_layout->world_rank); +#endif + + ret_value = FAIL; + goto done; + } + } + +done: + if (ret_value < 0) { + clear_fid_map_entry(sf_context->h5_file_id, sf_context->sf_context_id); + } + + return ret_value; +} + +/*------------------------------------------------------------------------- + * Function: record_fid_to_subfile + * + * Purpose: Every opened HDF5 file will have (if utilizing subfiling) + * a subfiling context associated with it.
It is important that + * the HDF5 file index be a constant rather than a POSIX + * file handle, since a file can be opened multiple times + * and each file open will be assigned a new file handle. + * Note that in such a case, the actual filesystem id will be + * retained. + * + * We utilize that filesystem id (ino_t inode) so that + * irrespective of what process opens a common file, the + * subfiling system will generate a consistent context for this + * file across all parallel ranks. + * + * This function simply records the filesystem handle to + * subfiling context mapping. + * + * Return: SUCCEED or FAIL. + * Errors: Fails only if storage for the mapping entry cannot + * be allocated. + * + * Programmer: Richard Warren + * 7/17/2020 + * + * Changes: Initial Version/None. + * + *------------------------------------------------------------------------- + */ +static herr_t +record_fid_to_subfile(uint64_t h5_file_id, int64_t subfile_context_id, int *next_index) +{ + int index; + herr_t ret_value = SUCCEED; + + if (sf_file_map_size == 0) { + if (NULL == + (sf_open_file_map = HDmalloc((size_t)DEFAULT_FILE_MAP_ENTRIES * sizeof(*sf_open_file_map)))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't allocate open file map\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + sf_file_map_size = DEFAULT_FILE_MAP_ENTRIES; + for (int i = 0; i < sf_file_map_size; i++) { + sf_open_file_map[i].h5_file_id = UINT64_MAX; + sf_open_file_map[i].sf_context_id = -1; + } + } + + for (index = 0; index < sf_file_map_size; index++) { + if (sf_open_file_map[index].h5_file_id == h5_file_id) + goto done; + + if (sf_open_file_map[index].h5_file_id == UINT64_MAX) { + sf_open_file_map[index].h5_file_id = h5_file_id; + sf_open_file_map[index].sf_context_id = subfile_context_id; + + if (next_index) { + *next_index = index; + } + + goto done; + } + } + + if (index == sf_file_map_size) { + void *tmp_realloc; + + /* Map is full - double its size and initialize the new entries */ + if (NULL == (tmp_realloc = HDrealloc(sf_open_file_map, + ((size_t)(sf_file_map_size * 2) * sizeof(*sf_open_file_map))))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't reallocate open file map\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + sf_open_file_map = tmp_realloc; + sf_file_map_size *= 2; + + for (int i = index; i < sf_file_map_size; i++) { + sf_open_file_map[i].h5_file_id = UINT64_MAX; + } + + if (next_index) { + *next_index = index; + } + + sf_open_file_map[index].h5_file_id = h5_file_id; + sf_open_file_map[index++].sf_context_id = subfile_context_id; + } + +done: + return ret_value; +} + +/*------------------------------------------------------------------------- + * Function: ioc_open_file + * + * Purpose: This function is called by an I/O concentrator in order to + * open the subfile it is responsible for. + * + * The name of the subfile to be opened is generated based on + * values from either: + * + * - The corresponding subfiling configuration file, if one + * exists and the HDF5 file isn't being truncated + * - The current subfiling context object for the file, if a + * subfiling configuration file doesn't exist or the HDF5 + * file is being truncated + * + * After the subfile has been opened, a subfiling + * configuration file will be created if this is a file + * creation operation. If the truncate flag is specified, the + * subfiling configuration file will be re-created in order to + * account for any possible changes in the subfiling + * configuration.
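As a rough sketch of how this mapping is meant to be consumed (illustrative only: the file ID value is made up, error handling is elided, and the SELECT_IOC_ONE_PER_NODE enumerator is assumed from H5FDioc.h, which is not shown in this patch), the public entry points declared in H5subfiling_common.h fit together along these lines:

    /* Hypothetical usage sketch of the subfiling entry points */
    int64_t  ctx_id     = -1;
    uint64_t h5_file_id = 1234; /* assumed inode-derived file ID */

    if (H5_open_subfiles("ABC.h5", h5_file_id, SELECT_IOC_ONE_PER_NODE,
                         O_CREAT | O_RDWR, MPI_COMM_WORLD, &ctx_id) >= 0) {
        /* A later lookup can recover the context ID from the file ID alone */
        if (H5_subfile_fid_to_context(h5_file_id) == ctx_id) {
            subfiling_context_t *ctx = H5_get_subfiling_object(ctx_id);
            /* ... use ctx->sf_stripe_size, ctx->topology, etc. ... */
        }

        H5_close_subfiles(ctx_id);
    }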
+ * + * Note that the HDF5 file opening protocol may attempt to + * open a file twice. A first open attempt is made without any + * truncate or other flags which would modify the file state + * if it already exists. Then, if this tentative open wasn't + * sufficient, the file is closed and a second file open using + * the user supplied open flags is invoked. + * + * Return: Non-negative on success/Negative on failure + * + * Programmer: Richard Warren + * 7/17/2020 + * + * Changes: Initial Version/None. + * + *------------------------------------------------------------------------- + */ +static herr_t +ioc_open_file(sf_work_request_t *msg, int file_acc_flags) +{ + subfiling_context_t *sf_context = NULL; + int64_t file_context_id; + hbool_t mutex_locked = FALSE; + mode_t mode = S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH; + char * filepath = NULL; + char * subfile_dir = NULL; + char * base = NULL; + int fd = -1; + herr_t ret_value = SUCCEED; + + HDassert(msg); + + /* Retrieve subfiling context ID from RPC message */ + file_context_id = msg->header[2]; + + if (NULL == (sf_context = H5_get_subfiling_object(file_context_id))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't get subfiling object from context ID\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + /* Only IOC ranks should be here */ + HDassert(sf_context->topology); + HDassert(sf_context->topology->subfile_rank >= 0); + + if (NULL == (filepath = HDcalloc(1, PATH_MAX))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't allocate space for subfile filename\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + /* Generate the name of the subfile that this IOC rank will open */ + if (generate_subfile_name(sf_context, file_acc_flags, filepath, PATH_MAX, &base, &subfile_dir) < 0) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't generate name for subfile\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + if (NULL == (sf_context->sf_filename = HDstrdup(filepath))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't copy subfile name\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + begin_thread_exclusive(); + mutex_locked = TRUE; + + /* Attempt to create/open the subfile for this IOC rank */ + if ((fd = HDopen(filepath, file_acc_flags, mode)) < 0) + H5_SUBFILING_SYS_GOTO_ERROR(H5E_FILE, H5E_CANTOPENFILE, FAIL, "failed to open subfile"); + + sf_context->sf_fid = fd; + if (file_acc_flags & O_CREAT) + sf_context->sf_eof = 0; + + /* + * If subfiles were created (rather than simply opened), + * check if we also need to create a config file. + */ + if ((file_acc_flags & O_CREAT) && (sf_context->topology->subfile_rank == 0)) { + if (create_config_file(sf_context, base, subfile_dir, (file_acc_flags & O_TRUNC)) < 0) + H5_SUBFILING_GOTO_ERROR(H5E_FILE, H5E_CANTCREATE, FAIL, + "couldn't create subfiling configuration file"); + } + +done: + if (mutex_locked) { + end_thread_exclusive(); + mutex_locked = FALSE; + } + + if (ret_value < 0) { + if (sf_context) { + HDfree(sf_context->sf_filename); + sf_context->sf_filename = NULL; + + if (sf_context->sf_fid >= 0) { + HDclose(sf_context->sf_fid); + sf_context->sf_fid = -1; + } + } + } + + HDfree(base); + HDfree(subfile_dir); + HDfree(filepath); + + return ret_value; +} + +/* + * Generate the name of the subfile this IOC rank will open, + * based on available information. 
+ * + * This may include: + * - the subfiling configuration (from a subfiling configuration + * file if one exists, or from the subfiling context object + * otherwise) + * - the base file's name and ID (inode or similar) + * - the IOC's rank value within the set of I/O concentrators + * - an optional filename prefix specified by the user + */ +static herr_t +generate_subfile_name(subfiling_context_t *sf_context, int file_acc_flags, char *filename_out, + size_t filename_out_len, char **filename_basename_out, char **subfile_dir_out) +{ + FILE * config_file = NULL; + char * config_buf = NULL; + char * subfile_dir = NULL; + char * prefix = NULL; + char * base = NULL; + int n_io_concentrators; + int num_digits; + herr_t ret_value = SUCCEED; + + HDassert(sf_context); + HDassert(sf_context->h5_filename); + HDassert(filename_out); + HDassert(filename_basename_out); + HDassert(subfile_dir_out); + + *filename_basename_out = NULL; + *subfile_dir_out = NULL; + + /* + * Initially use the number of I/O concentrators specified in the + * subfiling context. However, if there's an existing subfiling + * configuration file (and we aren't truncating it) we will use + * the number specified there instead, as that should be the actual + * number that the subfile names were originally generated with. + * The current subfiling context may have a different number of I/O + * concentrators specified; e.g. a simple serial file open for + * reading purposes (think h5dump) might only be using 1 I/O + * concentrator, whereas the file was created with several I/O + * concentrators. + */ + n_io_concentrators = sf_context->topology->n_io_concentrators; + + if (NULL == (prefix = HDmalloc(PATH_MAX))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't allocate space for subfile prefix\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + /* Under normal operation, we co-locate subfiles with the HDF5 file. + * Note that basename()/dirname() may modify their argument, so they + * operate on this copy of the filename and the results are + * duplicated below. + */ + HDstrncpy(prefix, sf_context->h5_filename, PATH_MAX); + + base = basename(prefix); + + if (NULL == (*filename_basename_out = HDstrdup(base))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't allocate space for subfile basename\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + if (sf_context->subfile_prefix) { + /* Note: Users may specify a directory name which is inaccessible + * from where the current process is running. In particular, "node-local" + * storage is not uniformly available to all processes. + * We would like to check whether the user's pathname is unavailable + * and, if so, default to creating the subfiles in the + * current directory. + */ + if (NULL == (*subfile_dir_out = HDstrdup(sf_context->subfile_prefix))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't copy subfile prefix\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + subfile_dir = *subfile_dir_out; + } + else { + subfile_dir = dirname(prefix); + + if (NULL == (*subfile_dir_out = HDstrdup(subfile_dir))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't copy subfile prefix\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + } + + /* + * Open the file's subfiling configuration file, if it exists and + * we aren't truncating the file.
+ */ + if (0 == (file_acc_flags & O_TRUNC)) { + if (open_config_file(sf_context, base, subfile_dir, "r", &config_file) < 0) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't open existing subfiling configuration file\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + } + + /* + * If a subfiling configuration file exists and we aren't truncating + * it, read the number of I/O concentrators used at file creation time + * in order to generate the correct subfile names. + */ + if (config_file) { + char *ioc_substr = NULL; + long config_file_len = 0; + + if (HDfseek(config_file, 0, SEEK_END) < 0) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't seek to end of subfiling configuration file; errno = %d\n", __func__, + errno); +#endif + + ret_value = FAIL; + goto done; + } + + if ((config_file_len = HDftell(config_file)) < 0) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't get size of subfiling configuration file; errno = %d\n", __func__, errno); +#endif + + ret_value = FAIL; + goto done; + } + + if (HDfseek(config_file, 0, SEEK_SET) < 0) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't seek to beginning of subfiling configuration file; errno = %d\n", __func__, + errno); +#endif + + ret_value = FAIL; + goto done; + } + + if (NULL == (config_buf = HDmalloc((size_t)config_file_len + 1))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't allocate space for reading subfiling configuration file\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + if (HDfread(config_buf, (size_t)config_file_len, 1, config_file) != 1) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't read from subfiling configuration file; errno = %d\n", __func__, errno); +#endif + + ret_value = FAIL; + goto done; + } + + config_buf[config_file_len] = '\0'; + + if (NULL == (ioc_substr = HDstrstr(config_buf, "aggregator_count"))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: malformed subfiling configuration file - no aggregator_count entry\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + /* sscanf returns the number of successful conversions; check for + * exactly one rather than just for EOF, which would miss a failed + * %d conversion */ + if (1 != HDsscanf(ioc_substr, "aggregator_count=%d", &n_io_concentrators)) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't get number of I/O concentrators from subfiling configuration file\n", + __func__); +#endif + + ret_value = FAIL; + goto done; + } + + if (n_io_concentrators <= 0) { + HDprintf("%s: invalid number of I/O concentrators (%d) read from subfiling configuration file\n", + __func__, n_io_concentrators); + ret_value = FAIL; + goto done; + } + } + + /* + * Generate the name of the subfile. The subfile naming should + * produce files of the following form: + * If we assume the HDF5 file is named ABC.h5, and 20 I/O + * concentrators are used, then the subfiles will have names: + * ABC.h5.subfile_<file-number>_01_of_20, + * ABC.h5.subfile_<file-number>_02_of_20, etc.
+ * + * and the configuration file will be named: + * ABC.h5.subfile_<file-number>.config + */ + num_digits = numDigits(n_io_concentrators); + HDsnprintf(filename_out, filename_out_len, "%s/%s" SF_FILENAME_TEMPLATE, subfile_dir, base, + sf_context->h5_file_id, num_digits, sf_context->topology->subfile_rank + 1, + n_io_concentrators); + +done: + if (config_file && (EOF == HDfclose(config_file))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: fclose failed to close subfiling configuration file\n", __func__); +#endif + + ret_value = FAIL; + } + + if (ret_value < 0) { + if (*filename_basename_out) { + HDfree(*filename_basename_out); + *filename_basename_out = NULL; + } + if (*subfile_dir_out) { + HDfree(*subfile_dir_out); + *subfile_dir_out = NULL; + } + } + + HDfree(config_buf); + HDfree(prefix); + + return ret_value; +} + +/*------------------------------------------------------------------------- + * Function: create_config_file + * + * Purpose: Creates a configuration file that contains + * subfiling-related information for a file. This file + * includes information such as: + * + * - the stripe size for the file's subfiles + * - the number of I/O concentrators used for I/O to the file's subfiles + * - the base HDF5 filename + * - the optional directory prefix where the file's subfiles are placed + * - the names of each of the file's subfiles + * + * Return: Non-negative on success/Negative on failure + *------------------------------------------------------------------------- + */ +static herr_t +create_config_file(subfiling_context_t *sf_context, const char *base_filename, const char *subfile_dir, + hbool_t truncate_if_exists) +{ + hbool_t config_file_exists = FALSE; + FILE * config_file = NULL; + char * config_filename = NULL; + char * line_buf = NULL; + int ret = 0; + herr_t ret_value = SUCCEED; + + HDassert(sf_context); + HDassert(base_filename); + HDassert(subfile_dir); + + if (sf_context->h5_file_id == UINT64_MAX) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: invalid HDF5 file ID %" PRIu64 "\n", __func__, sf_context->h5_file_id); +#endif + + ret_value = FAIL; + goto done; + } + if (*base_filename == '\0') { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: invalid base HDF5 filename %s\n", __func__, base_filename); +#endif + + ret_value = FAIL; + goto done; + } + if (*subfile_dir == '\0') + subfile_dir = "."; + + if (NULL == (config_filename = HDmalloc(PATH_MAX))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't allocate space for subfiling configuration file filename\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + HDsnprintf(config_filename, PATH_MAX, "%s/%s" SF_CONFIG_FILENAME_TEMPLATE, subfile_dir, base_filename, + sf_context->h5_file_id); + + /* Determine whether a subfiling configuration file exists */ + errno = 0; + ret = HDaccess(config_filename, F_OK); + + config_file_exists = (ret == 0) || ((ret < 0) && (ENOENT != errno)); + + if (config_file_exists && (ret != 0)) { +#ifdef H5_SUBFILING_DEBUG + HDperror("couldn't check existence of configuration file"); +#endif + + ret_value = FAIL; + goto done; + } + + /* + * If a config file doesn't exist, create one. If a + * config file does exist, don't touch it unless the + * O_TRUNC flag was specified. In this case, truncate + * the existing config file and create a new one. + */ + /* TODO: if truncating, consider removing old stale config files. 
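To make the naming scheme concrete, here is a hedged sketch of expanding SF_FILENAME_TEMPLATE directly, using illustrative values (the file ID 1234 is made up; numDigits() and HDsnprintf() are the helpers used by generate_subfile_name() above):

    char name[PATH_MAX];
    int  width = numDigits(20); /* zero-padded width of the rank field; 2 here */

    /* For base file "ABC.h5" in the current directory with 20 I/O
     * concentrators, IOC rank 0 yields "./ABC.h5.subfile_1234_01_of_20" */
    HDsnprintf(name, sizeof(name), "%s/%s" SF_FILENAME_TEMPLATE, ".", "ABC.h5",
               (uint64_t)1234, width, 0 + 1, 20);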
*/ + if (!config_file_exists || truncate_if_exists) { + int n_io_concentrators = sf_context->topology->n_io_concentrators; + int num_digits; + + if (NULL == (config_file = HDfopen(config_filename, "w+"))) { +#ifdef H5_SUBFILING_DEBUG + HDperror("couldn't open subfiling configuration file"); +#endif + + ret_value = FAIL; + goto done; + } + + if (NULL == (line_buf = HDmalloc(PATH_MAX))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't allocate buffer for writing to subfiling configuration file\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + /* Write the subfiling stripe size to the configuration file */ + HDsnprintf(line_buf, PATH_MAX, "stripe_size=%" PRId64 "\n", sf_context->sf_stripe_size); + if (HDfwrite(line_buf, HDstrlen(line_buf), 1, config_file) != 1) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: fwrite failed to write to subfiling configuration file\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + /* Write the number of I/O concentrators to the configuration file */ + HDsnprintf(line_buf, PATH_MAX, "aggregator_count=%d\n", n_io_concentrators); + if (HDfwrite(line_buf, HDstrlen(line_buf), 1, config_file) != 1) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: fwrite failed to write to subfiling configuration file\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + /* Write the base HDF5 filename to the configuration file */ + HDsnprintf(line_buf, PATH_MAX, "hdf5_file=%s\n", sf_context->h5_filename); + if (HDfwrite(line_buf, HDstrlen(line_buf), 1, config_file) != 1) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: fwrite failed to write to subfiling configuration file\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + /* Write the optional subfile directory prefix to the configuration file */ + HDsnprintf(line_buf, PATH_MAX, "subfile_dir=%s\n", subfile_dir); + if (HDfwrite(line_buf, HDstrlen(line_buf), 1, config_file) != 1) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: fwrite failed to write to subfiling configuration file\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + /* Write out each subfile name to the configuration file */ + num_digits = numDigits(n_io_concentrators); + for (int k = 0; k < n_io_concentrators; k++) { + HDsnprintf(line_buf, PATH_MAX, "%s" SF_FILENAME_TEMPLATE "\n", base_filename, + sf_context->h5_file_id, num_digits, k + 1, n_io_concentrators); + + if (HDfwrite(line_buf, HDstrlen(line_buf), 1, config_file) != 1) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: fwrite failed to write to subfiling configuration file\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + } + } + +done: + if (config_file) { + if (EOF == HDfclose(config_file)) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: fclose failed to close subfiling configuration file\n", __func__); +#endif + + ret_value = FAIL; + } + } + + HDfree(line_buf); + HDfree(config_filename); + + return ret_value; +} + +/*------------------------------------------------------------------------- + * Function: open_config_file + * + * Purpose: Opens the subfiling configuration file for a given HDF5 + * file and sets `config_file_out`, if a configuration file + * exists. Otherwise, `config_file_out` is set to NULL. + * + * It is the caller's responsibility to check + * `config_file_out` on success and close an opened file as + * necessary. 
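Putting the writes above together, a configuration file generated by create_config_file() for a file "ABC.h5" striped across two I/O concentrators would look roughly like the following (the file ID and the default 32 MiB stripe size are illustrative):

    stripe_size=33554432
    aggregator_count=2
    hdf5_file=ABC.h5
    subfile_dir=.
    ABC.h5.subfile_1234_1_of_2
    ABC.h5.subfile_1234_2_of_2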
+ * + * Return: Non-negative on success/Negative on failure + *------------------------------------------------------------------------- + */ +static herr_t +open_config_file(subfiling_context_t *sf_context, const char *base_filename, const char *subfile_dir, + const char *mode, FILE **config_file_out) +{ + hbool_t config_file_exists = FALSE; + FILE * config_file = NULL; + char * config_filename = NULL; + int ret = 0; + herr_t ret_value = SUCCEED; + + HDassert(sf_context); + HDassert(base_filename); + HDassert(subfile_dir); + HDassert(mode); + HDassert(config_file_out); + + *config_file_out = NULL; + + if (sf_context->h5_file_id == UINT64_MAX) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: invalid HDF5 file ID %" PRIu64 "\n", __func__, sf_context->h5_file_id); +#endif + + ret_value = FAIL; + goto done; + } + if (*base_filename == '\0') { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: invalid base HDF5 filename %s\n", __func__, base_filename); +#endif + + ret_value = FAIL; + goto done; + } + if (*subfile_dir == '\0') + subfile_dir = "."; + + if (NULL == (config_filename = HDmalloc(PATH_MAX))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't allocate space for subfiling configuration file filename\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + HDsnprintf(config_filename, PATH_MAX, "%s/%s" SF_CONFIG_FILENAME_TEMPLATE, subfile_dir, base_filename, + sf_context->h5_file_id); + + /* Determine whether a subfiling configuration file exists */ + errno = 0; + ret = HDaccess(config_filename, F_OK); + + config_file_exists = (ret == 0) || ((ret < 0) && (ENOENT != errno)); + + if (!config_file_exists) + goto done; + + if (config_file_exists && (ret != 0)) { +#ifdef H5_SUBFILING_DEBUG + HDperror("couldn't check existence of configuration file"); +#endif + + ret_value = FAIL; + goto done; + } + + if (NULL == (config_file = HDfopen(config_filename, mode))) { +#ifdef H5_SUBFILING_DEBUG + HDperror("couldn't open subfiling configuration file"); +#endif + + ret_value = FAIL; + goto done; + } + + *config_file_out = config_file; + +done: + if (ret_value < 0) { + if (config_file && (EOF == HDfclose(config_file))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: fclose failed to close subfiling configuration file\n", __func__); +#endif + + ret_value = FAIL; + } + } + + HDfree(config_filename); + + return ret_value; +} + +/*------------------------------------------------------------------------- + * Function: H5_close_subfiles + * + * Purpose: When closing an HDF5 file, we need to close any associated + * subfiles as well. This function cycles through all known + * IO Concentrators to send a file CLOSE_OP command. + * + * This function is collective across all MPI ranks which + * have opened the HDF5 file associated with the provided + * sf_context. Once the request has been issued by all + * ranks, the subfile at each IOC will be closed and a + * completion ACK will be received.
+ * + * Once the subfiles are closed, we initiate a teardown of + * the IOC and associated thread_pool threads. + * + * Return: Success (0) or Failure (non-zero) + * Errors: If MPI operations fail for some reason. + * + * Programmer: Richard Warren + * 7/17/2020 + * + * Changes: Initial Version/None. + *------------------------------------------------------------------------- + */ +herr_t +H5_close_subfiles(int64_t subfiling_context_id) +{ + subfiling_context_t *sf_context = NULL; + MPI_Request barrier_req = MPI_REQUEST_NULL; +#ifdef H5_SUBFILING_DEBUG + double t0 = 0.0; + double t1 = 0.0; +#endif + int mpi_code; + herr_t ret_value = SUCCEED; + +#ifdef H5_SUBFILING_DEBUG + t0 = MPI_Wtime(); +#endif + + if (NULL == (sf_context = H5_get_subfiling_object(subfiling_context_id))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't get subfiling object from context ID\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + /* We make the subfile close operation collective. + * Otherwise, there may be a race condition between + * our closing the subfiles and the user application + * moving ahead and possibly re-opening a file. + * + * If we can, we utilize an async barrier which gives + * us the opportunity to reduce the CPU load due to + * MPI spinning while waiting for the barrier to + * complete. This is especially important if there + * is heavy thread utilization due to subfiling + * activities, i.e. the thread pool might be + * extremely busy servicing I/O requests from all + * HDF5 application ranks. + */ +#if MPI_VERSION > 3 || (MPI_VERSION == 3 && MPI_SUBVERSION >= 1) + { + int barrier_complete = 0; + + if (MPI_SUCCESS != (mpi_code = MPI_Ibarrier(sf_context->sf_barrier_comm, &barrier_req))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: MPI_Ibarrier failed with rc %d\n", __func__, mpi_code); +#endif + + ret_value = FAIL; + goto done; + } + + while (!barrier_complete) { + useconds_t t_delay = 5; + usleep(t_delay); + + if (MPI_SUCCESS != (mpi_code = MPI_Test(&barrier_req, &barrier_complete, MPI_STATUS_IGNORE))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: MPI_Test failed with rc %d\n", __func__, mpi_code); +#endif + + ret_value = FAIL; + goto done; + } + } + } +#else + if (MPI_SUCCESS != (mpi_code = MPI_Barrier(sf_context->sf_barrier_comm))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: MPI_Barrier failed with rc %d\n", __func__, mpi_code); +#endif + + ret_value = FAIL; + goto done; + } +#endif + + /* The map from FID to subfiling context can now be cleared */ + if (sf_context->h5_file_id != UINT64_MAX) { + clear_fid_map_entry(sf_context->h5_file_id, sf_context->sf_context_id); + } + + if (sf_context->topology->rank_is_ioc) { + if (sf_context->sf_fid >= 0) { + errno = 0; + if (HDclose(sf_context->sf_fid) < 0) { + HDperror("H5_close_subfiles - couldn't close subfile"); + +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't close subfile\n", __func__); +#endif + + ret_value = FAIL; + goto done; + } + + sf_context->sf_fid = -1; + } + +#ifdef H5_SUBFILING_DEBUG + /* FIXME: If we've had multiple files open, our statistics + * will be messed up!
+ */ + if (sf_verbose_flag) { + t1 = MPI_Wtime(); /* end-of-close timestamp for the elapsed-time reports below */ + if (sf_logfile != NULL) { + if (SF_WRITE_OPS > 0) + HDfprintf( + sf_logfile, + "[%d] pwrite perf: wrt_ops=%ld wait=%lf pwrite=%lf IOC_shutdown = %lf seconds\n", + sf_context->sf_group_rank, SF_WRITE_OPS, SF_WRITE_WAIT_TIME, SF_WRITE_TIME, + (t1 - t0)); + if (SF_READ_OPS > 0) + HDfprintf(sf_logfile, + "[%d] pread perf: read_ops=%ld wait=%lf pread=%lf IOC_shutdown = %lf seconds\n", + sf_context->sf_group_rank, SF_READ_OPS, SF_READ_WAIT_TIME, SF_READ_TIME, + (t1 - t0)); + + HDfprintf(sf_logfile, "[%d] Avg queue time=%lf seconds\n", sf_context->sf_group_rank, + SF_QUEUE_DELAYS / (double)(SF_WRITE_OPS + SF_READ_OPS)); + + HDfflush(sf_logfile); + + HDfclose(sf_logfile); + sf_logfile = NULL; + } + } +#endif + } + + /* + * Run another barrier to prevent some ranks from running ahead + * and opening another file before this file is completely closed + * down. + */ +#if MPI_VERSION > 3 || (MPI_VERSION == 3 && MPI_SUBVERSION >= 1) + { + int barrier_complete = 0; + + if (MPI_SUCCESS != (mpi_code = MPI_Ibarrier(sf_context->sf_barrier_comm, &barrier_req))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: MPI_Ibarrier failed with rc %d\n", __func__, mpi_code); +#endif + + ret_value = FAIL; + goto done; + } + + while (!barrier_complete) { + useconds_t t_delay = 5; + usleep(t_delay); + + if (MPI_SUCCESS != (mpi_code = MPI_Test(&barrier_req, &barrier_complete, MPI_STATUS_IGNORE))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: MPI_Test failed with rc %d\n", __func__, mpi_code); +#endif + + ret_value = FAIL; + goto done; + } + } + } +#else + if (MPI_SUCCESS != (mpi_code = MPI_Barrier(sf_context->sf_barrier_comm))) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: MPI_Barrier failed with rc %d\n", __func__, mpi_code); +#endif + + ret_value = FAIL; + goto done; + } +#endif + +done: + if (sf_context && H5_free_subfiling_object_int(sf_context) < 0) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: couldn't free subfiling context object\n", __func__); +#endif + + ret_value = FAIL; + } + + return ret_value; +} + +/*------------------------------------------------------------------------- + * Function: H5_subfile_fid_to_context + * + * Purpose: This is a basic lookup function which returns the subfiling + * context id associated with the specified file->inode. + * + * Return: Non-negative subfiling context ID if the context exists + * Negative on failure or if the subfiling context doesn't + * exist + * + * Programmer: Richard Warren + * 7/17/2020 + * + * Changes: Initial Version/None. + * + *------------------------------------------------------------------------- + */ +int64_t +H5_subfile_fid_to_context(uint64_t sf_fid) +{ + if (!sf_open_file_map) { +#ifdef H5_SUBFILING_DEBUG + HDprintf("%s: open file map is invalid\n", __func__); +#endif + + return -1; + } + + for (int i = 0; i < sf_file_map_size; i++) { + if (sf_open_file_map[i].h5_file_id == sf_fid) { + return sf_open_file_map[i].sf_context_id; + } + } + + return -1; +} /* end H5_subfile_fid_to_context() */ + +#ifdef H5_SUBFILING_DEBUG +void +H5_subfiling_log(int64_t sf_context_id, const char *fmt, ...)
+{ + subfiling_context_t *sf_context = NULL; + va_list log_args; + + va_start(log_args, fmt); + + /* Retrieve the subfiling object for the given context ID */ + if (NULL == (sf_context = H5_get_subfiling_object(sf_context_id))) { + HDprintf("%s: couldn't get subfiling object from context ID\n", __func__); + goto done; + } + + begin_thread_exclusive(); + + if (sf_context->sf_logfile) { + HDvfprintf(sf_context->sf_logfile, fmt, log_args); + HDfputs("\n", sf_context->sf_logfile); + HDfflush(sf_context->sf_logfile); + } + else { + HDvprintf(fmt, log_args); + HDputs(""); + HDfflush(stdout); + } + + end_thread_exclusive(); + +done: + va_end(log_args); + + return; +} +#endif diff --git a/src/H5FDsubfiling/H5subfiling_common.h b/src/H5FDsubfiling/H5subfiling_common.h new file mode 100644 index 0000000..cfcbf4a --- /dev/null +++ b/src/H5FDsubfiling/H5subfiling_common.h @@ -0,0 +1,257 @@ +/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * + * Copyright by The HDF Group. * + * All rights reserved. * + * * + * This file is part of HDF5. The full HDF5 copyright notice, including * + * terms governing use, modification, and redistribution, is contained in * + * the COPYING file, which can be found at the root of the source code * + * distribution tree, or in https://support.hdfgroup.org/ftp/HDF5/releases. * + * If you do not have access to either file, you may request a copy from * + * help@hdfgroup.org. * + * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */ + +/* + * Header file for shared code between the HDF5 Subfiling VFD and IOC VFD + */ + +#ifndef H5_SUBFILING_COMMON_H +#define H5_SUBFILING_COMMON_H + +#include <stdatomic.h> + +#include "H5private.h" +#include "H5Iprivate.h" + +/* TODO: needed for ioc_selection_t, which also needs to be public */ +#include "H5FDioc.h" + +/* + * Some definitions for debugging the Subfiling feature + */ +/* #define H5_SUBFILING_DEBUG */ + +/* + * The following is our basic template for a subfile filename. + * Note that eventually we shouldn't use 0_of_N since we + * intend to use the user-defined HDF5 filename for a + * zeroth subfile as well as for all metadata. + */ +#define SF_FILENAME_TEMPLATE ".subfile_%" PRIu64 "_%0*d_of_%d" + +/* + * The following is our basic template for a subfiling + * configuration filename. + */ +#define SF_CONFIG_FILENAME_TEMPLATE ".subfile_%" PRIu64 ".config" + +/* + * Environment variables interpreted by the HDF5 subfiling feature + */ +#define H5_IOC_SELECTION_CRITERIA "H5_IOC_SELECTION_CRITERIA" +#define H5_IOC_COUNT_PER_NODE "H5_IOC_COUNT_PER_NODE" +#define H5_IOC_STRIPE_SIZE "H5_IOC_STRIPE_SIZE" +#define H5_IOC_SUBFILE_PREFIX "H5_IOC_SUBFILE_PREFIX" + +#define H5FD_DEFAULT_STRIPE_DEPTH (32 * 1024 * 1024) + +/* + * MPI tags are 32 bits; we treat them as unsigned + * to allow the use of the available bits for RPC + * selections, i.e. a message from the VFD read or write functions + * to an IO Concentrator. The messages themselves are in general + * only 3 int64_t values, which define a) the data size to be read + * or written, b) the file offset where the data will be read from + * or stored, and c) the context_id, which allows the IO concentrator to + * locate the IO context for the new IO transaction.
+ * + * 0000 + * 0001 READ_OP (Independent) + * 0010 WRITE_OP (Independent) + * 0011 ///////// + * 0100 CLOSE_OP (Independent) + * ----- + * 1000 + * 1001 COLLECTIVE_READ + * 1010 COLLECTIVE_WRITE + * 1011 ///////// + * 1100 COLLECTIVE_CLOSE + * + * 31 28 24 20 16 12 8 4 0| + * +-------+-------+-------+-------+-------+-------+-------+-------+ + * | | | ACKS | OP | + * +-------+-------+-------+-------+-------+-------+-------+-------+ + * + */ + +/* Bit 3 SET indicates collectives */ +#define COLL_FUNC (0x1 << 3) + +#define ACK_PART (0x01 << 8) +#define DATA_PART (0x02 << 8) +#define READY (0x04 << 8) +#define COMPLETED (0x08 << 8) + +#define INT32_MASK 0x07FFFFFFFFFFFFFFF + +#define READ_INDEP (READ_OP) +#define READ_COLL (COLL_FUNC | READ_OP) +#define WRITE_INDEP (WRITE_OP) +#define WRITE_COLL (COLL_FUNC | WRITE_OP) + +#define GET_EOF_COMPLETED (COMPLETED | GET_EOF_OP) + +#define SET_LOGGING (LOGGING_OP) + +/* MPI tag values for data communicator */ +#define WRITE_INDEP_ACK 0 +#define READ_INDEP_DATA 1 +#define WRITE_TAG_BASE 2 + +/* + * Object type definitions for subfiling objects. + * Used when generating a new subfiling object ID + * or accessing the cache of stored subfiling + * objects. + */ +typedef enum { + SF_BADID = (-1), + SF_TOPOLOGY = 1, + SF_CONTEXT = 2, + SF_NTYPES /* number of subfiling object types, MUST BE LAST */ +} sf_obj_type_t; + +/* The following are the basic 'op codes' used when + * constructing an RPC message for IO Concentrators. + * These are defined in the low 8 bits of the + * message. + */ +typedef enum io_ops { + READ_OP = 1, + WRITE_OP = 2, + OPEN_OP = 3, + CLOSE_OP = 4, + TRUNC_OP = 5, + GET_EOF_OP = 6, + FINI_OP = 8, + LOGGING_OP = 16 +} io_op_t; + +/* Every application rank will record its MPI rank + * and hostid as a structure. These eventually get + * communicated to MPI rank zero (0) and sorted before + * being broadcast. The resulting sorted vector + * provides a basis for determining which MPI ranks + * will host an IO Concentrator (IOC), e.g., for + * default behavior, we choose the first vector entry + * associated with a "new" hostid. + */ +typedef struct { + long rank; + long hostid; +} layout_t; + +/* This typedef defines a fixed process layout which + * can be reused for any number of file open operations + */ +typedef struct app_layout_t { + long hostid; /* value returned by gethostid() */ + layout_t *layout; /* Vector of {rank,hostid} values */ + int * node_ranks; /* ranks extracted from sorted layout */ + int node_count; /* Total nodes (different hostids) */ + int node_index; /* My node: index into node_ranks */ + int local_peers; /* How many local peers on my node */ + int world_rank; /* My MPI rank */ + int world_size; /* Total number of MPI ranks */ +} app_layout_t; + +/* This typedef defines things related to IOC selections */ +typedef struct topology { + app_layout_t * app_layout; /* Pointer to our layout struct */ + bool rank_is_ioc; /* Indicates that we host an IOC */ + int subfile_rank; /* Valid only if rank_is_ioc */ + int n_io_concentrators; /* Number of IO concentrators */ + int * io_concentrators; /* Vector of ranks which are IOCs */ + int * subfile_fd; /* file descriptor (if IOC) */ + ioc_selection_t selection_type; /* Cache our IOC selection criteria */ +} sf_topology_t; + +typedef struct { + int64_t sf_context_id; /* Generated context ID which embeds the cache index */ + uint64_t h5_file_id; /* GUID (basically the inode value) */ + int sf_fid; /* value returned by open(file,..)
*/ + size_t sf_write_count; /* Statistics: write_count */ + size_t sf_read_count; /* Statistics: read_count */ + haddr_t sf_eof; /* File eof */ + int64_t sf_stripe_size; /* Stripe-depth */ + int64_t sf_blocksize_per_stripe; /* Stripe-depth X n_IOCs */ + int64_t sf_base_addr; /* For an IOC, our base address */ + MPI_Comm sf_msg_comm; /* MPI comm used to send RPC msg */ + MPI_Comm sf_data_comm; /* MPI comm used to move data */ + MPI_Comm sf_eof_comm; /* MPI comm used to communicate EOF */ + MPI_Comm sf_barrier_comm; /* MPI comm used for barrier operations */ + MPI_Comm sf_group_comm; /* Not used: for IOC collectives */ + MPI_Comm sf_intercomm; /* Not used: for msgs to all IOC */ + int sf_group_size; /* IOC count (in sf_group_comm) */ + int sf_group_rank; /* IOC rank (in sf_group_comm) */ + int sf_intercomm_root; /* Not used: for IOC comms */ + char * subfile_prefix; /* If subfiles are node-local */ + char * sf_filename; /* A generated subfile name */ + char * h5_filename; /* The user supplied file name */ + void * ioc_data; /* Private data for underlying IOC */ + sf_topology_t *topology; /* pointer to our topology */ + +#ifdef H5_SUBFILING_DEBUG + char sf_logfile_name[PATH_MAX]; + FILE *sf_logfile; +#endif + +} subfiling_context_t; + +/* The following is a somewhat augmented input (by the IOC) which captures + * the basic RPC from a 'source'. The fields are filled out to allow + * an easy gathering of statistics by the IO Concentrator. + */ +typedef struct { + /* {Datasize, Offset, FileID} */ + int64_t header[3]; /* The basic RPC input plus */ + int tag; /* the supplied OPCODE tag */ + int source; /* Rank of who sent the message */ + int subfile_rank; /* The IOC rank */ + int64_t context_id; /* context to be used to complete */ + double start_time; /* the request, + time of receipt */ + /* from which we calc Time(queued) */ + void *buffer; /* for writes, we keep the buffer */ + /* around for a while... */ + volatile int in_progress; /* Not used! */ + volatile int serialize; /* worker thread needs to wait while true */ + volatile int dependents; /* If current work item has dependents */ + int depend_id; /* work queue index of the dependent */ +} sf_work_request_t; + +extern int sf_verbose_flag; + +extern app_layout_t *sf_app_layout; + +#ifdef __cplusplus +extern "C" { +#endif + +H5_DLL herr_t H5_open_subfiles(const char *base_filename, uint64_t h5_file_id, + ioc_selection_t ioc_selection_type, int file_acc_flags, MPI_Comm file_comm, + int64_t *context_id_out); +H5_DLL herr_t H5_close_subfiles(int64_t subfiling_context_id); + +H5_DLL int64_t H5_new_subfiling_object_id(sf_obj_type_t obj_type, int64_t index_val); +H5_DLL void * H5_get_subfiling_object(int64_t object_id); +H5_DLL int64_t H5_subfile_fid_to_context(uint64_t h5_fid); +H5_DLL herr_t H5_free_subfiling_object(int64_t object_id); + +H5_DLL void H5_subfiling_log(int64_t sf_context_id, const char *fmt, ...); + +void set_verbose_flag(int subfile_rank, int new_value); + +#ifdef __cplusplus +} +#endif + +#endif /* H5_SUBFILING_COMMON_H */ diff --git a/src/H5FDsubfiling/H5subfiling_err.h b/src/H5FDsubfiling/H5subfiling_err.h new file mode 100644 index 0000000..a65c425 --- /dev/null +++ b/src/H5FDsubfiling/H5subfiling_err.h @@ -0,0 +1,272 @@ +/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * + * Copyright by The HDF Group. * + * All rights reserved. * + * * + * This file is part of HDF5.
The full HDF5 copyright notice, including * + * terms governing use, modification, and redistribution, is contained in * + * the COPYING file, which can be found at the root of the source code * + * distribution tree, or in https://support.hdfgroup.org/ftp/HDF5/releases. * + * If you do not have access to either file, you may request a copy from * + * help@hdfgroup.org. * + * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */ + +/* + * Error handling for the HDF5 Subfiling feature + */ + +#ifndef H5SUBFILING_ERR_H +#define H5SUBFILING_ERR_H + +#include <errno.h> + +#include "H5Epublic.h" + +extern hid_t H5subfiling_err_stack_g; +extern hid_t H5subfiling_err_class_g; + +#define H5SUBFILING_ERR_CLS_NAME "HDF5 Subfiling" +#define H5SUBFILING_ERR_LIB_NAME "HDF5 Subfiling" +#define H5SUBFILING_ERR_VER "1.0.0" + +/* Error macros */ + +#ifdef H5_NO_DEPRECATED_SYMBOLS + +/* + * Macro to push the current function to the current error stack + * and then goto the "done" label, which should appear inside the + * function. (v2 errors only) + */ +#define H5_SUBFILING_GOTO_ERROR(err_major, err_minor, ret_val, ...) \ + do { \ + H5E_auto2_t err_func; \ + \ + /* Check whether automatic error reporting has been disabled */ \ + (void)H5Eget_auto2(H5E_DEFAULT, &err_func, NULL); \ + if (err_func) { \ + if (H5subfiling_err_stack_g >= 0 && H5subfiling_err_class_g >= 0) { \ + H5Epush2(H5subfiling_err_stack_g, __FILE__, __func__, __LINE__, H5subfiling_err_class_g, \ + err_major, err_minor, __VA_ARGS__); \ + } \ + else { \ + fprintf(stderr, __VA_ARGS__); \ + fprintf(stderr, "\n"); \ + } \ + } \ + \ + ret_value = ret_val; \ + goto done; \ + } while (0) + +/* + * Macro to push the current function to the current error stack + * without calling goto. This is used for handling the case where + * an error occurs during cleanup past the "done" label inside a + * function so that an infinite loop does not occur where goto + * continually branches back to the label. (v2 errors only) + */ +#define H5_SUBFILING_DONE_ERROR(err_major, err_minor, ret_val, ...) \ + do { \ + H5E_auto2_t err_func; \ + \ + /* Check whether automatic error reporting has been disabled */ \ + (void)H5Eget_auto2(H5E_DEFAULT, &err_func, NULL); \ + if (err_func) { \ + if (H5subfiling_err_stack_g >= 0 && H5subfiling_err_class_g >= 0) \ + H5Epush2(H5subfiling_err_stack_g, __FILE__, __func__, __LINE__, H5subfiling_err_class_g, \ + err_major, err_minor, __VA_ARGS__); \ + else { \ + fprintf(stderr, __VA_ARGS__); \ + fprintf(stderr, "\n"); \ + } \ + } \ + \ + ret_value = ret_val; \ + } while (0) + +/* + * Macro to print out the current error stack and then clear it + * for future use. (v2 errors only) + */ +#define PRINT_ERROR_STACK \ + do { \ + H5E_auto2_t err_func; \ + \ + /* Check whether automatic error reporting has been disabled */ \ + (void)H5Eget_auto2(H5E_DEFAULT, &err_func, NULL); \ + if (err_func) { \ + if ((H5subfiling_err_stack_g >= 0) && (H5Eget_num(H5subfiling_err_stack_g) > 0)) { \ + H5Eprint2(H5subfiling_err_stack_g, NULL); \ + H5Eclear2(H5subfiling_err_stack_g); \ + } \ + } \ + } while (0) + +#else /* H5_NO_DEPRECATED_SYMBOLS */ + +/* + * Macro to push the current function to the current error stack + * and then goto the "done" label, which should appear inside the + * function. (compatible with v1 and v2 errors) + */ +#define H5_SUBFILING_GOTO_ERROR(err_major, err_minor, ret_val, ...) 
\ + do { \ + unsigned is_v2_err; \ + union { \ + H5E_auto1_t err_func_v1; \ + H5E_auto2_t err_func_v2; \ + } err_func; \ + \ + /* Determine version of error */ \ + (void)H5Eauto_is_v2(H5E_DEFAULT, &is_v2_err); \ + \ + if (is_v2_err) \ + (void)H5Eget_auto2(H5E_DEFAULT, &err_func.err_func_v2, NULL); \ + else \ + (void)H5Eget_auto1(&err_func.err_func_v1, NULL); \ + \ + /* Check whether automatic error reporting has been disabled */ \ + if ((is_v2_err && err_func.err_func_v2) || (!is_v2_err && err_func.err_func_v1)) { \ + if (H5subfiling_err_stack_g >= 0 && H5subfiling_err_class_g >= 0) { \ + H5Epush2(H5subfiling_err_stack_g, __FILE__, __func__, __LINE__, H5subfiling_err_class_g, \ + err_major, err_minor, __VA_ARGS__); \ + } \ + else { \ + fprintf(stderr, __VA_ARGS__); \ + fprintf(stderr, "\n"); \ + } \ + } \ + \ + ret_value = ret_val; \ + goto done; \ + } while (0) + +/* + * Macro to push the current function to the current error stack + * without calling goto. This is used for handling the case where + * an error occurs during cleanup past the "done" label inside a + * function so that an infinite loop does not occur where goto + * continually branches back to the label. (compatible with v1 + * and v2 errors) + */ +#define H5_SUBFILING_DONE_ERROR(err_major, err_minor, ret_val, ...) \ + do { \ + unsigned is_v2_err; \ + union { \ + H5E_auto1_t err_func_v1; \ + H5E_auto2_t err_func_v2; \ + } err_func; \ + \ + /* Determine version of error */ \ + (void)H5Eauto_is_v2(H5E_DEFAULT, &is_v2_err); \ + \ + if (is_v2_err) \ + (void)H5Eget_auto2(H5E_DEFAULT, &err_func.err_func_v2, NULL); \ + else \ + (void)H5Eget_auto1(&err_func.err_func_v1, NULL); \ + \ + /* Check whether automatic error reporting has been disabled */ \ + if ((is_v2_err && err_func.err_func_v2) || (!is_v2_err && err_func.err_func_v1)) { \ + if (H5subfiling_err_stack_g >= 0 && H5subfiling_err_class_g >= 0) { \ + H5Epush2(H5subfiling_err_stack_g, __FILE__, __func__, __LINE__, H5subfiling_err_class_g, \ + err_major, err_minor, __VA_ARGS__); \ + } \ + else { \ + fprintf(stderr, __VA_ARGS__); \ + fprintf(stderr, "\n"); \ + } \ + } \ + \ + ret_value = ret_val; \ + } while (0) + +/* + * Macro to print out the current error stack and then clear it + * for future use. (compatible with v1 and v2 errors) + */ +#define PRINT_ERROR_STACK \ + do { \ + unsigned is_v2_err; \ + union { \ + H5E_auto1_t err_func_v1; \ + H5E_auto2_t err_func_v2; \ + } err_func; \ + \ + /* Determine version of error */ \ + (void)H5Eauto_is_v2(H5E_DEFAULT, &is_v2_err); \ + \ + if (is_v2_err) \ + (void)H5Eget_auto2(H5E_DEFAULT, &err_func.err_func_v2, NULL); \ + else \ + (void)H5Eget_auto1(&err_func.err_func_v1, NULL); \ + \ + /* Check whether automatic error reporting has been disabled */ \ + if ((is_v2_err && err_func.err_func_v2) || (!is_v2_err && err_func.err_func_v1)) { \ + if ((H5subfiling_err_stack_g >= 0) && (H5Eget_num(H5subfiling_err_stack_g) > 0)) { \ + H5Eprint2(H5subfiling_err_stack_g, NULL); \ + H5Eclear2(H5subfiling_err_stack_g); \ + } \ + } \ + } while (0) + +#endif /* H5_NO_DEPRECATED_SYMBOLS */ + +#define H5_SUBFILING_SYS_GOTO_ERROR(err_major, err_minor, ret_val, str) \ + do { \ + int myerrno = errno; \ + H5_SUBFILING_GOTO_ERROR(err_major, err_minor, ret_val, "%s, errno = %d, error message = '%s'", str, \ + myerrno, strerror(myerrno)); \ + } while (0) + +/* MPI error handling macros. 
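These macros assume a particular function shape: a local ret_value variable and a done cleanup label, the convention used throughout this patch. A minimal sketch (example_operation(), some_alloc() and some_release() are hypothetical):

    static herr_t
    example_operation(void)
    {
        void  *resource  = NULL;
        herr_t ret_value = SUCCEED;

        if (NULL == (resource = some_alloc())) /* hypothetical allocation */
            H5_SUBFILING_GOTO_ERROR(H5E_RESOURCE, H5E_CANTALLOC, FAIL, "allocation failed");

    done:
        /* Past the "done" label, report errors without branching back */
        if (resource && some_release(resource) < 0) /* hypothetical cleanup */
            H5_SUBFILING_DONE_ERROR(H5E_RESOURCE, H5E_CANTFREE, FAIL, "release failed");

        return ret_value;
    }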
*/ + +extern char H5subfiling_mpi_error_str[MPI_MAX_ERROR_STRING]; +extern int H5subfiling_mpi_error_str_len; + +#define H5_SUBFILING_MPI_DONE_ERROR(retcode, str, mpierr) \ + do { \ + MPI_Error_string(mpierr, H5subfiling_mpi_error_str, &H5subfiling_mpi_error_str_len); \ + H5_SUBFILING_DONE_ERROR(H5E_INTERNAL, H5E_MPI, retcode, "%s: MPI error string is '%s'", str, \ + H5subfiling_mpi_error_str); \ + } while (0) +#define H5_SUBFILING_MPI_GOTO_ERROR(retcode, str, mpierr) \ + do { \ + MPI_Error_string(mpierr, H5subfiling_mpi_error_str, &H5subfiling_mpi_error_str_len); \ + H5_SUBFILING_GOTO_ERROR(H5E_INTERNAL, H5E_MPI, retcode, "%s: MPI error string is '%s'", str, \ + H5subfiling_mpi_error_str); \ + } while (0) + +/* + * Macro to simply jump to the "done" label inside the function, + * setting ret_value to the given value. This is often used for + * short circuiting in functions when certain conditions arise. + */ +#define H5_SUBFILING_GOTO_DONE(ret_val) \ + do { \ + ret_value = ret_val; \ + goto done; \ + } while (0) + +/* + * Macro to return from a top-level API function, printing + * out the error stack on the way out. + * It should be ensured that this macro is only called once + * per HDF5 operation. If it is called multiple times per + * operation (e.g. due to calling top-level API functions + * internally), the error stack will be inconsistent/incoherent. + */ +#define H5_SUBFILING_FUNC_LEAVE_API \ + do { \ + PRINT_ERROR_STACK; \ + return ret_value; \ + } while (0) + +/* + * Macro to return from internal functions. + */ +#define H5_SUBFILING_FUNC_LEAVE \ + do { \ + return ret_value; \ + } while (0) + +#endif /* H5SUBFILING_ERR_H */ diff --git a/src/H5FDsubfiling/mercury/LICENSE.txt b/src/H5FDsubfiling/mercury/LICENSE.txt new file mode 100644 index 0000000..d3f4203 --- /dev/null +++ b/src/H5FDsubfiling/mercury/LICENSE.txt @@ -0,0 +1,27 @@ +Copyright (c) 2013-2021, UChicago Argonne, LLC and The HDF Group. +All rights reserved. + +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions are met: + +1. Redistributions of source code must retain the above copyright notice, this + list of conditions and the following disclaimer. + +2. Redistributions in binary form must reproduce the above copyright notice, + this list of conditions and the following disclaimer in the documentation + and/or other materials provided with the distribution. + +3. Neither the name of the copyright holder nor the names of its + contributors may be used to endorse or promote products derived from + this software without specific prior written permission. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" +AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE +DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE +FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR +SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER +CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, +OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
diff --git a/src/H5FDsubfiling/mercury/README.md b/src/H5FDsubfiling/mercury/README.md new file mode 100644 index 0000000..d159548 --- /dev/null +++ b/src/H5FDsubfiling/mercury/README.md @@ -0,0 +1,230 @@ +Mercury +======= +[![Build status][github-ci-svg]][github-ci-link] +[![Latest version][mercury-release-svg]][mercury-release-link] +[![Spack version][spack-release-svg]][spack-release-link] + +Mercury is an RPC framework specifically designed for use in HPC systems +that allows asynchronous transfer of parameters and execution requests, +as well as direct support of large data arguments. The network implementation +is abstracted, allowing easy porting to future systems and efficient use +of existing native transport mechanisms. Mercury's interface is generic +and allows any function call to be serialized. +Mercury is a core component of the [Mochi][mochi-link] ecosystem of +microservices. + +Please see the accompanying LICENSE.txt file for license details. + +Contributions and patches are welcomed but require a Contributor License +Agreement (CLA) to be filled out. Please contact us if you are interested +in contributing to Mercury by subscribing to the +[mailing lists][mailing-lists]. + +Architectures supported +======================= + +Architectures supported by MPI implementations are generally supported by the +network abstraction layer. + +The OFI libfabric plugin as well as the SM plugin +are stable and provide the best performance in most workloads. Libfabric +providers currently supported are: `tcp`, `verbs`, `psm2`, `gni`. + +The UCX plugin is also available as an alternative transport on platforms +for which libfabric is either not available or not recommended to use; +currently supported protocols are tcp and verbs. + +MPI and BMI (tcp) plugins are still supported but are gradually being +deprecated and should therefore only be used as fallback methods. +The CCI plugin is deprecated and no longer supported. + +See the [plugin requirements](#plugin-requirements) section for +plugin requirement details. + +Documentation +============= + +Please see the documentation available on the mercury [website][documentation] +for a quick introduction to Mercury. + +Software requirements +===================== + +Compiling and running Mercury requires up-to-date versions of various +software packages. Beware that using excessively old versions of these +packages can cause indirect errors that are very difficult to track down. + +Plugin requirements +------------------- + +To make use of the OFI libfabric plugin, please refer to the libfabric build +instructions available on this [page][libfabric]. + +To make use of the UCX plugin, please refer to the UCX build +instructions available on this [page][ucx]. + +To make use of the native NA SM (shared-memory) plugin on Linux, +the cross-memory attach (CMA) feature introduced in kernel v3.2 is required. +The yama security module must also be configured to allow remote process memory +to be accessed (see this [page][yama]). On MacOS, code signing with inclusion of +the na_sm.plist file into the binary is currently required to allow process +memory to be accessed.
+ +To make use of the BMI plugin, the most convenient way is to install it through +spack or one can also do: + + git clone https://github.com/radix-io/bmi.git && cd bmi + ./prepare && ./configure --enable-shared --enable-bmi-only + make && make install + +To make use of the MPI plugin, Mercury requires a _well-configured_ MPI +implementation (MPICH2 v1.4.1 or higher / OpenMPI v1.6 or higher) with +`MPI_THREAD_MULTIPLE` available on targets that will accept remote +connections. Processes that are _not_ accepting incoming connections are +_not_ required to have a multithreaded level of execution. + +Optional requirements +--------------------- + +For optional automatic code generation features (which are used for generating +serialization and deserialization routines), the preprocessor subset of the +BOOST library must be included (Boost v1.48 or higher is recommended). +The library itself is therefore not necessary since only the header is used. +Mercury includes those headers if one does not have BOOST installed and +wants to make use of this feature. + +Building +======== + +If you install the full sources, put the tarball in a directory where you +have permissions (e.g., your home directory) and unpack it: + + bzip2 -dc mercury-X.tar.bz2 | tar xvf - + +Replace `'X'` with the version number of the package. + +(Optional) If you checked out the sources using git (without the `--recursive` +option) and want to build the testing suite (which requires the kwsys +submodule) or use checksums (which requires the mchecksum submodule), you need +to issue from the root of the source directory the following command: + + git submodule update --init + +Mercury makes use of the CMake build-system and requires that you do an +out-of-source build. In order to do that, you must create a new build +directory and run the `ccmake` command from it: + + cd mercury-X + mkdir build + cd build + ccmake .. (where ".." is the relative path to the mercury-X directory) + +Type `'c'` multiple times and choose suitable options. Recommended options are: + + BUILD_SHARED_LIBS ON (or OFF if the library you link + against requires static libraries) + BUILD_TESTING ON + Boost_INCLUDE_DIR /path/to/include/directory + CMAKE_INSTALL_PREFIX /path/to/install/directory + MERCURY_ENABLE_DEBUG ON/OFF + MERCURY_ENABLE_PARALLEL_TESTING ON/OFF + MERCURY_USE_BOOST_PP ON + MERCURY_USE_CHECKSUMS ON + MERCURY_USE_SYSTEM_BOOST ON/OFF + MERCURY_USE_SYSTEM_MCHECKSUM ON/OFF + MERCURY_USE_XDR OFF + NA_USE_BMI ON/OFF + NA_USE_MPI ON/OFF + NA_USE_CCI ON/OFF + NA_USE_OFI ON/OFF + NA_USE_SM ON/OFF + NA_USE_UCX ON/OFF + +Setting include directory and library paths may require you to toggle to +the advanced mode by typing `'t'`. Once you are done and do not see any +errors, type `'g'` to generate makefiles. Once you exit the CMake +configuration screen and are ready to build the targets, do: + + make + +(Optional) Verbose compile/build output: + +This is done by inserting `VERBOSE=1` in the `make` command. E.g.: + + make VERBOSE=1 + +Installing +========== + +Assuming that the `CMAKE_INSTALL_PREFIX` has been set (see previous step) +and that you have write permissions to the destination directory, do +from the build directory: + + make install + +Testing +======= + +Tests can be run to check that basic RPC functionality (requests and bulk +data transfers) is properly working. CTest is used to run the tests, +simply run from the build directory: + + ctest . + +(Optional) Verbose testing: + +This is done by inserting `-V` in the `ctest` command. 
E.g.: + + ctest -V . + +Extra verbose information can be displayed by inserting `-VV`. E.g.: + + ctest -VV . + +Some tests run with one server process and X client processes. To change the +number of client processes that are being used, the `MPIEXEC_MAX_NUMPROCS` +variable needs to be modified (toggle to advanced mode if you do not see +it). The default value is automatically detected by CMake based on the number +of cores that are available. +Note that you need to run `make` again after the makefile generation +to use the new value. + +FAQ +=== + +Below is a list of the most common questions. + +- _Q: Why am I getting undefined references to libfabric symbols?_ + + A: In rare occasions, multiple copies of the libfabric library are installed + on the same system. To make sure that you are using the correct copy of the + libfabric library, do: + + ldconfig -p | grep libfabric + + If the library returned is not the one that you would expect, make sure to + either set `LD_LIBRARY_PATH` or add an entry in your `/etc/ld.so.conf.d` + directory. + +- _Q: Is there any logging mechanism?_ + + A: To turn on error/warning/debug logs, the `HG_LOG_LEVEL` environment + variable can be set to either `error`, `warning` or `debug` values. Note that + for debugging output to be printed, the CMake variable `MERCURY_ENABLE_DEBUG` + must also be set at compile time. Specific subsystems can be selected using + the `HG_LOG_SUBSYS` environment variable. + +[mailing-lists]: http://mercury-hpc.github.io/help#mailing-lists +[documentation]: http://mercury-hpc.github.io/documentation/ +[cci]: http://cci-forum.com/?page_id=46 +[libfabric]: https://github.com/ofiwg/libfabric +[ucx]: https://openucx.readthedocs.io/en/master/running.html#ucx-build-and-install +[github-ci-svg]: https://github.com/mercury-hpc/mercury/actions/workflows/ci.yml/badge.svg?branch=master +[github-ci-link]: https://github.com/mercury-hpc/mercury/actions/workflows/ci.yml +[mercury-release-svg]: https://img.shields.io/github/release/mercury-hpc/mercury/all.svg +[mercury-release-link]: https://github.com/mercury-hpc/mercury/releases +[spack-release-svg]: https://img.shields.io/spack/v/mercury.svg +[spack-release-link]: https://spack.readthedocs.io/en/latest/package_list.html#mercury +[yama]: https://www.kernel.org/doc/Documentation/security/Yama.txt +[mochi-link]: https://github.com/mochi-hpc/ + diff --git a/src/H5FDsubfiling/mercury/src/util/mercury_atomic.h b/src/H5FDsubfiling/mercury/src/util/mercury_atomic.h new file mode 100644 index 0000000..54562ad --- /dev/null +++ b/src/H5FDsubfiling/mercury/src/util/mercury_atomic.h @@ -0,0 +1,584 @@ +/** + * Copyright (c) 2013-2021 UChicago Argonne, LLC and The HDF Group. 
+ *
+ * SPDX-License-Identifier: BSD-3-Clause
+ */
+
+#ifndef MERCURY_ATOMIC_H
+#define MERCURY_ATOMIC_H
+
+#include "mercury_util_config.h"
+
+#if defined(_WIN32)
+#define _WINSOCKAPI_
+#include <windows.h>
+typedef struct {
+    volatile LONG value;
+} hg_atomic_int32_t;
+typedef struct {
+    volatile LONGLONG value;
+} hg_atomic_int64_t;
+/* clang-format off */
+#    define HG_ATOMIC_VAR_INIT(x) {(x)}
+/* clang-format on */
+#elif defined(HG_UTIL_HAS_STDATOMIC_H)
+#ifndef __cplusplus
+#include <stdatomic.h>
+typedef atomic_int hg_atomic_int32_t;
+#if (HG_UTIL_ATOMIC_LONG_WIDTH == 8) && !defined(__APPLE__)
+typedef atomic_long hg_atomic_int64_t;
+#else
+typedef atomic_llong hg_atomic_int64_t;
+#endif
+#else
+#include <atomic>
+typedef std::atomic_int hg_atomic_int32_t;
+#if (HG_UTIL_ATOMIC_LONG_WIDTH == 8) && !defined(__APPLE__)
+typedef std::atomic_long hg_atomic_int64_t;
+#else
+typedef std::atomic_llong hg_atomic_int64_t;
+#endif
+using std::atomic_fetch_add_explicit;
+using std::atomic_thread_fence;
+using std::memory_order_acq_rel;
+using std::memory_order_acquire;
+using std::memory_order_release;
+#endif
+#define HG_ATOMIC_VAR_INIT(x) ATOMIC_VAR_INIT(x)
+#elif defined(__APPLE__)
+#include <libkern/OSAtomic.h>
+typedef struct {
+    volatile int32_t value;
+} hg_atomic_int32_t;
+typedef struct {
+    volatile int64_t value;
+} hg_atomic_int64_t;
+/* clang-format off */
+#    define HG_ATOMIC_VAR_INIT(x) {(x)}
+/* clang-format on */
+#else /* GCC 4.7 */
+/* Require GCC >= 4.7 for the __atomic builtins */
+#if !defined(__GNUC__) || (__GNUC__ < 4) || ((__GNUC__ == 4) && (__GNUC_MINOR__ < 7))
+#error "GCC version >= 4.7 required to support built-in atomics."
+#endif
+/* builtins do not require volatile */
+typedef int32_t hg_atomic_int32_t;
+typedef int64_t hg_atomic_int64_t;
+#define HG_ATOMIC_VAR_INIT(x) (x)
+#endif
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Init atomic value (32-bit integer).
+ *
+ * \param ptr [OUT]             pointer to an atomic32 integer
+ * \param value [IN]            value
+ */
+static HG_UTIL_INLINE void hg_atomic_init32(hg_atomic_int32_t *ptr, int32_t value);
+
+/**
+ * Set atomic value (32-bit integer).
+ *
+ * \param ptr [OUT]             pointer to an atomic32 integer
+ * \param value [IN]            value
+ */
+static HG_UTIL_INLINE void hg_atomic_set32(hg_atomic_int32_t *ptr, int32_t value);
+
+/**
+ * Get atomic value (32-bit integer).
+ *
+ * \param ptr [IN]              pointer to an atomic32 integer
+ *
+ * \return Value of the atomic integer
+ */
+static HG_UTIL_INLINE int32_t hg_atomic_get32(hg_atomic_int32_t *ptr);
+
+/**
+ * Increment atomic value (32-bit integer).
+ *
+ * \param ptr [IN/OUT]          pointer to an atomic32 integer
+ *
+ * \return Incremented value
+ */
+static HG_UTIL_INLINE int32_t hg_atomic_incr32(hg_atomic_int32_t *ptr);
+
+/**
+ * Decrement atomic value (32-bit integer).
+ *
+ * \param ptr [IN/OUT]          pointer to an atomic32 integer
+ *
+ * \return Decremented value
+ */
+static HG_UTIL_INLINE int32_t hg_atomic_decr32(hg_atomic_int32_t *ptr);
+
+/**
+ * OR atomic value (32-bit integer).
+ *
+ * \param ptr [IN/OUT]          pointer to an atomic32 integer
+ * \param value [IN]            value to OR with
+ *
+ * \return Original value
+ */
+static HG_UTIL_INLINE int32_t hg_atomic_or32(hg_atomic_int32_t *ptr, int32_t value);
+
+/**
+ * XOR atomic value (32-bit integer).
+ *
+ * \param ptr [IN/OUT]          pointer to an atomic32 integer
+ * \param value [IN]            value to XOR with
+ *
+ * \return Original value
+ */
+static HG_UTIL_INLINE int32_t hg_atomic_xor32(hg_atomic_int32_t *ptr, int32_t value);
+
+/**
+ * AND atomic value (32-bit integer).
+ *
+ * \param ptr [IN/OUT]          pointer to an atomic32 integer
+ * \param value [IN]            value to AND with
+ *
+ * \return Original value
+ */
+static HG_UTIL_INLINE int32_t hg_atomic_and32(hg_atomic_int32_t *ptr, int32_t value);
+
+/**
+ * Compare and swap values (32-bit integer).
+ *
+ * \param ptr [IN/OUT]          pointer to an atomic32 integer
+ * \param compare_value [IN]    value to compare to
+ * \param swap_value [IN]       value to swap with if ptr value is equal to
+ *                              compare value
+ *
+ * \return true if swapped, false otherwise
+ */
+static HG_UTIL_INLINE bool hg_atomic_cas32(hg_atomic_int32_t *ptr, int32_t compare_value, int32_t swap_value);
+
+/**
+ * Init atomic value (64-bit integer).
+ *
+ * \param ptr [OUT]             pointer to an atomic64 integer
+ * \param value [IN]            value
+ */
+static HG_UTIL_INLINE void hg_atomic_init64(hg_atomic_int64_t *ptr, int64_t value);
+
+/**
+ * Set atomic value (64-bit integer).
+ *
+ * \param ptr [OUT]             pointer to an atomic64 integer
+ * \param value [IN]            value
+ */
+static HG_UTIL_INLINE void hg_atomic_set64(hg_atomic_int64_t *ptr, int64_t value);
+
+/**
+ * Get atomic value (64-bit integer).
+ *
+ * \param ptr [IN]              pointer to an atomic64 integer
+ *
+ * \return Value of the atomic integer
+ */
+static HG_UTIL_INLINE int64_t hg_atomic_get64(hg_atomic_int64_t *ptr);
+
+/**
+ * Increment atomic value (64-bit integer).
+ *
+ * \param ptr [IN/OUT]          pointer to an atomic64 integer
+ *
+ * \return Incremented value
+ */
+static HG_UTIL_INLINE int64_t hg_atomic_incr64(hg_atomic_int64_t *ptr);
+
+/**
+ * Decrement atomic value (64-bit integer).
+ *
+ * \param ptr [IN/OUT]          pointer to an atomic64 integer
+ *
+ * \return Decremented value
+ */
+static HG_UTIL_INLINE int64_t hg_atomic_decr64(hg_atomic_int64_t *ptr);
+
+/**
+ * OR atomic value (64-bit integer).
+ *
+ * \param ptr [IN/OUT]          pointer to an atomic64 integer
+ * \param value [IN]            value to OR with
+ *
+ * \return Original value
+ */
+static HG_UTIL_INLINE int64_t hg_atomic_or64(hg_atomic_int64_t *ptr, int64_t value);
+
+/**
+ * XOR atomic value (64-bit integer).
+ *
+ * \param ptr [IN/OUT]          pointer to an atomic64 integer
+ * \param value [IN]            value to XOR with
+ *
+ * \return Original value
+ */
+static HG_UTIL_INLINE int64_t hg_atomic_xor64(hg_atomic_int64_t *ptr, int64_t value);
+
+/**
+ * AND atomic value (64-bit integer).
+ *
+ * \param ptr [IN/OUT]          pointer to an atomic64 integer
+ * \param value [IN]            value to AND with
+ *
+ * \return Original value
+ */
+static HG_UTIL_INLINE int64_t hg_atomic_and64(hg_atomic_int64_t *ptr, int64_t value);
+
+/**
+ * Compare and swap values (64-bit integer).
+ *
+ * \param ptr [IN/OUT]          pointer to an atomic64 integer
+ * \param compare_value [IN]    value to compare to
+ * \param swap_value [IN]       value to swap with if ptr value is equal to
+ *                              compare value
+ *
+ * \return true if swapped, false otherwise
+ */
+static HG_UTIL_INLINE bool hg_atomic_cas64(hg_atomic_int64_t *ptr, int64_t compare_value, int64_t swap_value);
+
+/**
+ * Memory barrier.
+ * + */ +static HG_UTIL_INLINE void hg_atomic_fence(void); + +/*---------------------------------------------------------------------------*/ +static HG_UTIL_INLINE void +hg_atomic_init32(hg_atomic_int32_t *ptr, int32_t value) +{ +#if defined(HG_UTIL_HAS_STDATOMIC_H) + atomic_init(ptr, value); +#else + hg_atomic_set32(ptr, value); +#endif +} + +/*---------------------------------------------------------------------------*/ +static HG_UTIL_INLINE void +hg_atomic_set32(hg_atomic_int32_t *ptr, int32_t value) +{ +#if defined(_WIN32) + ptr->value = value; +#elif defined(HG_UTIL_HAS_STDATOMIC_H) + atomic_store_explicit(ptr, value, memory_order_release); +#elif defined(__APPLE__) + ptr->value = value; +#else + __atomic_store_n(ptr, value, __ATOMIC_RELEASE); +#endif +} + +/*---------------------------------------------------------------------------*/ +static HG_UTIL_INLINE int32_t +hg_atomic_get32(hg_atomic_int32_t *ptr) +{ + int32_t ret; + +#if defined(_WIN32) + ret = ptr->value; +#elif defined(HG_UTIL_HAS_STDATOMIC_H) + ret = atomic_load_explicit(ptr, memory_order_acquire); +#elif defined(__APPLE__) + ret = ptr->value; +#else + ret = __atomic_load_n(ptr, __ATOMIC_ACQUIRE); +#endif + + return ret; +} + +/*---------------------------------------------------------------------------*/ +static HG_UTIL_INLINE int32_t +hg_atomic_incr32(hg_atomic_int32_t *ptr) +{ + int32_t ret; + +#if defined(_WIN32) + ret = InterlockedIncrementNoFence(&ptr->value); +#elif defined(HG_UTIL_HAS_STDATOMIC_H) + ret = atomic_fetch_add_explicit(ptr, 1, memory_order_acq_rel) + 1; +#elif defined(__APPLE__) + ret = OSAtomicIncrement32(&ptr->value); +#else + ret = __atomic_fetch_add(ptr, 1, __ATOMIC_ACQ_REL) + 1; +#endif + + return ret; +} + +/*---------------------------------------------------------------------------*/ +static HG_UTIL_INLINE int32_t +hg_atomic_decr32(hg_atomic_int32_t *ptr) +{ + int32_t ret; + +#if defined(_WIN32) + ret = InterlockedDecrementNoFence(&ptr->value); +#elif defined(HG_UTIL_HAS_STDATOMIC_H) + ret = atomic_fetch_sub_explicit(ptr, 1, memory_order_acq_rel) - 1; +#elif defined(__APPLE__) + ret = OSAtomicDecrement32(&ptr->value); +#else + ret = __atomic_fetch_sub(ptr, 1, __ATOMIC_ACQ_REL) - 1; +#endif + + return ret; +} + +/*---------------------------------------------------------------------------*/ +static HG_UTIL_INLINE int32_t +hg_atomic_or32(hg_atomic_int32_t *ptr, int32_t value) +{ + int32_t ret; + +#if defined(_WIN32) + ret = InterlockedOrNoFence(&ptr->value, value); +#elif defined(HG_UTIL_HAS_STDATOMIC_H) + ret = atomic_fetch_or_explicit(ptr, value, memory_order_acq_rel); +#elif defined(__APPLE__) + ret = OSAtomicOr32Orig((uint32_t)value, (volatile uint32_t *)&ptr->value); +#else + ret = __atomic_fetch_or(ptr, value, __ATOMIC_ACQ_REL); +#endif + + return ret; +} + +/*---------------------------------------------------------------------------*/ +static HG_UTIL_INLINE int32_t +hg_atomic_xor32(hg_atomic_int32_t *ptr, int32_t value) +{ + int32_t ret; + +#if defined(_WIN32) + ret = InterlockedXorNoFence(&ptr->value, value); +#elif defined(HG_UTIL_HAS_STDATOMIC_H) + ret = atomic_fetch_xor_explicit(ptr, value, memory_order_acq_rel); +#elif defined(__APPLE__) + ret = OSAtomicXor32Orig((uint32_t)value, (volatile uint32_t *)&ptr->value); +#else + ret = __atomic_fetch_xor(ptr, value, __ATOMIC_ACQ_REL); +#endif + + return ret; +} + +/*---------------------------------------------------------------------------*/ +static HG_UTIL_INLINE int32_t +hg_atomic_and32(hg_atomic_int32_t *ptr, int32_t value) +{ + 
int32_t ret;
+
+#if defined(_WIN32)
+    ret = InterlockedAndNoFence(&ptr->value, value);
+#elif defined(HG_UTIL_HAS_STDATOMIC_H)
+    ret = atomic_fetch_and_explicit(ptr, value, memory_order_acq_rel);
+#elif defined(__APPLE__)
+    ret = OSAtomicAnd32Orig((uint32_t)value, (volatile uint32_t *)&ptr->value);
+#else
+    ret = __atomic_fetch_and(ptr, value, __ATOMIC_ACQ_REL);
+#endif
+
+    return ret;
+}
+
+/*---------------------------------------------------------------------------*/
+static HG_UTIL_INLINE bool
+hg_atomic_cas32(hg_atomic_int32_t *ptr, int32_t compare_value, int32_t swap_value)
+{
+    bool ret;
+
+#if defined(_WIN32)
+    ret = (compare_value == InterlockedCompareExchangeNoFence(&ptr->value, swap_value, compare_value));
+#elif defined(HG_UTIL_HAS_STDATOMIC_H)
+    ret = atomic_compare_exchange_strong_explicit(ptr, &compare_value, swap_value, memory_order_acq_rel,
+                                                  memory_order_acquire);
+#elif defined(__APPLE__)
+    ret = OSAtomicCompareAndSwap32(compare_value, swap_value, &ptr->value);
+#else
+    ret = __atomic_compare_exchange_n(ptr, &compare_value, swap_value, false, __ATOMIC_ACQ_REL,
+                                      __ATOMIC_ACQUIRE);
+#endif
+
+    return ret;
+}
+
+/*---------------------------------------------------------------------------*/
+static HG_UTIL_INLINE void
+hg_atomic_init64(hg_atomic_int64_t *ptr, int64_t value)
+{
+#if defined(HG_UTIL_HAS_STDATOMIC_H)
+    atomic_init(ptr, value);
+#else
+    hg_atomic_set64(ptr, value);
+#endif
+}
+
+/*---------------------------------------------------------------------------*/
+static HG_UTIL_INLINE void
+hg_atomic_set64(hg_atomic_int64_t *ptr, int64_t value)
+{
+#if defined(_WIN32)
+    ptr->value = value;
+#elif defined(HG_UTIL_HAS_STDATOMIC_H)
+    atomic_store_explicit(ptr, value, memory_order_release);
+#elif defined(__APPLE__)
+    ptr->value = value;
+#else
+    __atomic_store_n(ptr, value, __ATOMIC_RELEASE);
+#endif
+}
+
+/*---------------------------------------------------------------------------*/
+static HG_UTIL_INLINE int64_t
+hg_atomic_get64(hg_atomic_int64_t *ptr)
+{
+    int64_t ret;
+
+#if defined(_WIN32)
+    ret = ptr->value;
+#elif defined(HG_UTIL_HAS_STDATOMIC_H)
+    ret = atomic_load_explicit(ptr, memory_order_acquire);
+#elif defined(__APPLE__)
+    ret = ptr->value;
+#else
+    ret = __atomic_load_n(ptr, __ATOMIC_ACQUIRE);
+#endif
+
+    return ret;
+}
+
+/*---------------------------------------------------------------------------*/
+static HG_UTIL_INLINE int64_t
+hg_atomic_incr64(hg_atomic_int64_t *ptr)
+{
+    int64_t ret;
+
+#if defined(_WIN32)
+    ret = InterlockedIncrementNoFence64(&ptr->value);
+#elif defined(HG_UTIL_HAS_STDATOMIC_H)
+    ret = atomic_fetch_add_explicit(ptr, (int64_t)1, memory_order_acq_rel) + 1;
+#elif defined(__APPLE__)
+    ret = OSAtomicIncrement64(&ptr->value);
+#else
+    ret = __atomic_fetch_add(ptr, (int64_t)1, __ATOMIC_ACQ_REL) + 1;
+#endif
+
+    return ret;
+}
+
+/*---------------------------------------------------------------------------*/
+static HG_UTIL_INLINE int64_t
+hg_atomic_decr64(hg_atomic_int64_t *ptr)
+{
+    int64_t ret;
+
+#if defined(_WIN32)
+    ret = InterlockedDecrementNoFence64(&ptr->value);
+#elif defined(HG_UTIL_HAS_STDATOMIC_H)
+    ret = atomic_fetch_sub_explicit(ptr, (int64_t)1, memory_order_acq_rel) - 1;
+#elif defined(__APPLE__)
+    ret = OSAtomicDecrement64(&ptr->value);
+#else
+    ret = __atomic_fetch_sub(ptr, (int64_t)1, __ATOMIC_ACQ_REL) - 1;
+#endif
+
+    return ret;
+}
+
+/*---------------------------------------------------------------------------*/
+static HG_UTIL_INLINE int64_t
+hg_atomic_or64(hg_atomic_int64_t *ptr, int64_t value)
+{
+    int64_t
ret; + +#if defined(_WIN32) + ret = InterlockedOr64NoFence(&ptr->value, value); +#elif defined(HG_UTIL_HAS_STDATOMIC_H) + ret = atomic_fetch_or_explicit(ptr, value, memory_order_acq_rel); +#else + ret = __atomic_fetch_or(ptr, value, __ATOMIC_ACQ_REL); +#endif + + return ret; +} + +/*---------------------------------------------------------------------------*/ +static HG_UTIL_INLINE int64_t +hg_atomic_xor64(hg_atomic_int64_t *ptr, int64_t value) +{ + int64_t ret; + +#if defined(_WIN32) + ret = InterlockedXor64NoFence(&ptr->value, value); +#elif defined(HG_UTIL_HAS_STDATOMIC_H) + ret = atomic_fetch_xor_explicit(ptr, value, memory_order_acq_rel); +#else + ret = __atomic_fetch_xor(ptr, value, __ATOMIC_ACQ_REL); +#endif + + return ret; +} + +/*---------------------------------------------------------------------------*/ +static HG_UTIL_INLINE int64_t +hg_atomic_and64(hg_atomic_int64_t *ptr, int64_t value) +{ + int64_t ret; + +#if defined(_WIN32) + ret = InterlockedAnd64NoFence(&ptr->value, value); +#elif defined(HG_UTIL_HAS_STDATOMIC_H) + ret = atomic_fetch_and_explicit(ptr, value, memory_order_acq_rel); +#else + ret = __atomic_fetch_and(ptr, value, __ATOMIC_ACQ_REL); +#endif + + return ret; +} + +/*---------------------------------------------------------------------------*/ +static HG_UTIL_INLINE bool +hg_atomic_cas64(hg_atomic_int64_t *ptr, int64_t compare_value, int64_t swap_value) +{ + bool ret; + +#if defined(_WIN32) + ret = (compare_value == InterlockedCompareExchangeNoFence64(&ptr->value, swap_value, compare_value)); +#elif defined(HG_UTIL_HAS_STDATOMIC_H) + ret = atomic_compare_exchange_strong_explicit(ptr, &compare_value, swap_value, memory_order_acq_rel, + memory_order_acquire); +#elif defined(__APPLE__) + ret = OSAtomicCompareAndSwap64(compare_value, swap_value, &ptr->value); +#else + ret = __atomic_compare_exchange_n(ptr, &compare_value, swap_value, false, __ATOMIC_ACQ_REL, + __ATOMIC_ACQUIRE); +#endif + + return ret; +} + +/*---------------------------------------------------------------------------*/ +static HG_UTIL_INLINE void +hg_atomic_fence(void) +{ +#if defined(_WIN32) + MemoryBarrier(); +#elif defined(HG_UTIL_HAS_STDATOMIC_H) + atomic_thread_fence(memory_order_acq_rel); +#elif defined(__APPLE__) + OSMemoryBarrier(); +#else + __atomic_thread_fence(__ATOMIC_ACQ_REL); +#endif +} + +#ifdef __cplusplus +} +#endif + +#endif /* MERCURY_ATOMIC_H */ diff --git a/src/H5FDsubfiling/mercury/src/util/mercury_compiler_attributes.h b/src/H5FDsubfiling/mercury/src/util/mercury_compiler_attributes.h new file mode 100644 index 0000000..2406ba8 --- /dev/null +++ b/src/H5FDsubfiling/mercury/src/util/mercury_compiler_attributes.h @@ -0,0 +1,116 @@ +/** + * Copyright (c) 2013-2021 UChicago Argonne, LLC and The HDF Group. + * + * SPDX-License-Identifier: BSD-3-Clause + */ + +#ifndef MERCURY_COMPILER_ATTRIBUTES_H +#define MERCURY_COMPILER_ATTRIBUTES_H + +/*************************************/ +/* Public Type and Struct Definition */ +/*************************************/ + +/*****************/ +/* Public Macros */ +/*****************/ + +/* + * __has_attribute is supported on gcc >= 5, clang >= 2.9 and icc >= 17. + * In the meantime, to support gcc < 5, we implement __has_attribute + * by hand. 
+ */ +#if !defined(__has_attribute) && defined(__GNUC__) && (__GNUC__ >= 4) +#define __has_attribute(x) __GCC4_has_attribute_##x +#define __GCC4_has_attribute___visibility__ 1 +#define __GCC4_has_attribute___warn_unused_result__ 1 +#define __GCC4_has_attribute___unused__ 1 +#define __GCC4_has_attribute___format__ 1 +#define __GCC4_has_attribute___fallthrough__ 0 +#endif + +/* Visibility of symbols */ +#if defined(_WIN32) +#define HG_ATTR_ABI_IMPORT __declspec(dllimport) +#define HG_ATTR_ABI_EXPORT __declspec(dllexport) +#define HG_ATTR_ABI_HIDDEN +#elif __has_attribute(__visibility__) +#define HG_ATTR_ABI_IMPORT __attribute__((__visibility__("default"))) +#define HG_ATTR_ABI_EXPORT __attribute__((__visibility__("default"))) +#define HG_ATTR_ABI_HIDDEN __attribute__((__visibility__("hidden"))) +#else +#define HG_ATTR_ABI_IMPORT +#define HG_ATTR_ABI_EXPORT +#define HG_ATTR_ABI_HIDDEN +#endif + +/* Unused return values */ +#if defined(_WIN32) +#define HG_ATTR_WARN_UNUSED_RESULT _Check_return_ +#elif __has_attribute(__warn_unused_result__) +#define HG_ATTR_WARN_UNUSED_RESULT __attribute__((__warn_unused_result__)) +#else +#define HG_ATTR_WARN_UNUSED_RESULT +#endif + +/* Remove warnings when plugin does not use callback arguments */ +#if defined(_WIN32) +#define HG_ATTR_UNUSED +#elif __has_attribute(__unused__) +#define HG_ATTR_UNUSED __attribute__((__unused__)) +#else +#define HG_ATTR_UNUSED +#endif + +/* Alignment (not optional) */ +#if defined(_WIN32) +#define HG_ATTR_ALIGNED(x, a) __declspec(align(a)) x +#else +#define HG_ATTR_ALIGNED(x, a) x __attribute__((__aligned__(a))) +#endif + +/* Packed (not optional) */ +#if defined(_WIN32) +#define HG_ATTR_PACKED_PUSH __pragma(pack(push, 1)) +#define HG_ATTR_PACKED_POP __pragma(pack(pop)) +#else +#define HG_ATTR_PACKED_PUSH +#define HG_ATTR_PACKED_POP __attribute__((__packed__)) +#endif +#define HG_ATTR_PACKED(x) HG_ATTR_PACKED_PUSH x HG_ATTR_PACKED_POP + +/* Check format arguments */ +#if defined(_WIN32) +#define HG_ATTR_PRINTF(_fmt, _firstarg) +#elif __has_attribute(__format__) +#define HG_ATTR_PRINTF(_fmt, _firstarg) __attribute__((__format__(printf, _fmt, _firstarg))) +#else +#define HG_ATTR_PRINTF(_fmt, _firstarg) +#endif + +/* Constructor (not optional) */ +#if defined(_WIN32) +#define HG_ATTR_CONSTRUCTOR +#define HG_ATTR_CONSTRUCTOR_PRIORITY(x) +#else +#define HG_ATTR_CONSTRUCTOR __attribute__((__constructor__)) +#define HG_ATTR_CONSTRUCTOR_PRIORITY(x) __attribute__((__constructor__(x))) +#endif + +/* Destructor (not optional) */ +#if defined(_WIN32) +#define HG_ATTR_DESTRUCTOR +#else +#define HG_ATTR_DESTRUCTOR __attribute__((__destructor__)) +#endif + +/* Fallthrough (prevent icc from throwing warnings) */ +#if defined(_WIN32) /* clang-format off */ +# define HG_ATTR_FALLTHROUGH do {} while (0) /* fallthrough */ /* clang-format on */ +#elif __has_attribute(__fallthrough__) && !defined(__INTEL_COMPILER) +#define HG_ATTR_FALLTHROUGH __attribute__((__fallthrough__)) +#else /* clang-format off */ +# define HG_ATTR_FALLTHROUGH do {} while (0) /* fallthrough */ +#endif /* clang-format on */ + +#endif /* MERCURY_COMPILER_ATTRIBUTES_H */ diff --git a/src/H5FDsubfiling/mercury/src/util/mercury_dlog.c b/src/H5FDsubfiling/mercury/src/util/mercury_dlog.c new file mode 100644 index 0000000..7dd5104 --- /dev/null +++ b/src/H5FDsubfiling/mercury/src/util/mercury_dlog.c @@ -0,0 +1,308 @@ +/** + * Copyright (c) 2013-2021 UChicago Argonne, LLC and The HDF Group. 
+ *
+ * SPDX-License-Identifier: BSD-3-Clause
+ */
+
+#include "mercury_dlog.h"
+
+#include <inttypes.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#ifdef _WIN32
+#include <process.h>
+#else
+#include <unistd.h>
+#endif
+
+/****************/
+/* Local Macros */
+/****************/
+
+/************************************/
+/* Local Type and Struct Definition */
+/************************************/
+
+/********************/
+/* Local Prototypes */
+/********************/
+
+/*******************/
+/* Local Variables */
+/*******************/
+
+/*---------------------------------------------------------------------------*/
+struct hg_dlog *
+hg_dlog_alloc(char *name, unsigned int lesize, int leloop)
+{
+    struct hg_dlog_entry *le;
+    struct hg_dlog *      d;
+
+    le = malloc(sizeof(*le) * lesize);
+    if (!le)
+        return NULL;
+
+    d = malloc(sizeof(*d));
+    if (!d) {
+        free(le);
+        return NULL;
+    }
+
+    memset(d, 0, sizeof(*d));
+    snprintf(d->dlog_magic, sizeof(d->dlog_magic), "%s%s", HG_DLOG_STDMAGIC, name);
+    hg_thread_mutex_init(&d->dlock);
+    HG_LIST_INIT(&d->cnts32);
+    HG_LIST_INIT(&d->cnts64);
+    d->le      = le;
+    d->lesize  = lesize;
+    d->leloop  = leloop;
+    d->mallocd = 1;
+
+    return d;
+}
+
+/*---------------------------------------------------------------------------*/
+void
+hg_dlog_free(struct hg_dlog *d)
+{
+    struct hg_dlog_dcount32 *cp32 = HG_LIST_FIRST(&d->cnts32);
+    struct hg_dlog_dcount64 *cp64 = HG_LIST_FIRST(&d->cnts64);
+
+    while (cp32) {
+        struct hg_dlog_dcount32 *cp = cp32;
+        cp32                        = HG_LIST_NEXT(cp, l);
+        free(cp);
+    }
+    HG_LIST_INIT(&d->cnts32);
+
+    while (cp64) {
+        struct hg_dlog_dcount64 *cp = cp64;
+        cp64                        = HG_LIST_NEXT(cp, l);
+        free(cp);
+    }
+    HG_LIST_INIT(&d->cnts64);
+
+    if (d->mallocd) {
+        free(d->le);
+        free(d);
+    }
+}
+
+/*---------------------------------------------------------------------------*/
+void
+hg_dlog_mkcount32(struct hg_dlog *d, hg_atomic_int32_t **cptr, const char *name, const char *descr)
+{
+    struct hg_dlog_dcount32 *dcnt;
+
+    hg_thread_mutex_lock(&d->dlock);
+    if (*cptr == NULL) {
+        dcnt = malloc(sizeof(*dcnt));
+        if (!dcnt) {
+            fprintf(stderr, "hg_dlog_mkcount32: malloc of %s failed!\n", name);
+            abort();
+        }
+        dcnt->name  = name;
+        dcnt->descr = descr;
+        hg_atomic_init32(&dcnt->c, 0);
+        HG_LIST_INSERT_HEAD(&d->cnts32, dcnt, l);
+        *cptr = &dcnt->c; /* set it in caller's variable */
+    }
+    hg_thread_mutex_unlock(&d->dlock);
+}
+
+/*---------------------------------------------------------------------------*/
+void
+hg_dlog_mkcount64(struct hg_dlog *d, hg_atomic_int64_t **cptr, const char *name, const char *descr)
+{
+    struct hg_dlog_dcount64 *dcnt;
+
+    hg_thread_mutex_lock(&d->dlock);
+    if (*cptr == NULL) {
+        dcnt = malloc(sizeof(*dcnt));
+        if (!dcnt) {
+            fprintf(stderr, "hg_dlog_mkcount64: malloc of %s failed!\n", name);
+            abort();
+        }
+        dcnt->name  = name;
+        dcnt->descr = descr;
+        hg_atomic_init64(&dcnt->c, 0);
+        HG_LIST_INSERT_HEAD(&d->cnts64, dcnt, l);
+        *cptr = &dcnt->c; /* set it in caller's variable */
+    }
+    hg_thread_mutex_unlock(&d->dlock);
+}
+
+/*---------------------------------------------------------------------------*/
+void
+hg_dlog_setlogstop(struct hg_dlog *d, int stop)
+{
+    d->lestop = stop; /* no need to lock */
+}
+
+/*---------------------------------------------------------------------------*/
+void
+hg_dlog_resetlog(struct hg_dlog *d)
+{
+    hg_thread_mutex_lock(&d->dlock);
+    d->lefree = 0;
+    d->leadds = 0;
+    hg_thread_mutex_unlock(&d->dlock);
+}
+
+/*---------------------------------------------------------------------------*/
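+/*
+ * Usage sketch (illustrative only, added for this write-up; the names
+ * "app_dlog" and "req_count" are hypothetical): a minimal example of
+ * driving the counter and log-entry calls above from application code.
+ *
+ *    struct hg_dlog *app_dlog = hg_dlog_alloc("app", 128, 1);
+ *    hg_atomic_int32_t *req_count = NULL;
+ *
+ *    hg_dlog_mkcount32(app_dlog, &req_count, "nreqs", "number of requests");
+ *    hg_atomic_incr32(req_count);
+ *    hg_dlog_addlog(app_dlog, __FILE__, __LINE__, __func__, "request", NULL);
+ *
+ *    hg_dlog_dump_file(app_dlog, "app-dlog", 1, 0);
+ *    hg_dlog_free(app_dlog);
+ */
+
+/*---------------------------------------------------------------------------*/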
+void
+hg_dlog_dump(struct hg_dlog *d, int (*log_func)(FILE *, const char *, ...), FILE *stream, int trylock)
+{
+    unsigned int             left, idx;
+    struct hg_dlog_dcount32 *dc32;
+    struct hg_dlog_dcount64 *dc64;
+
+    if (trylock) {
+        int try_ret = hg_thread_mutex_try_lock(&d->dlock);
+        if (try_ret != HG_UTIL_SUCCESS) /* warn them and bail */ {
+            fprintf(stderr, "hg_dlog_dump: WARN - lock failed\n");
+            return;
+        }
+    }
+    else
+        hg_thread_mutex_lock(&d->dlock);
+
+    if (d->leadds > 0) {
+        log_func(stream,
+                 "### ----------------------\n"
+                 "### (%s) debug log summary\n"
+                 "### ----------------------\n",
+                 (d->dlog_magic + strlen(HG_DLOG_STDMAGIC)));
+        if (!HG_LIST_IS_EMPTY(&d->cnts32) || !HG_LIST_IS_EMPTY(&d->cnts64)) {
+            log_func(stream, "# Counters\n");
+            HG_LIST_FOREACH(dc32, &d->cnts32, l)
+            {
+                log_func(stream, "# %s: %" PRId32 " [%s]\n", dc32->name, hg_atomic_get32(&dc32->c),
+                         dc32->descr);
+            }
+            HG_LIST_FOREACH(dc64, &d->cnts64, l)
+            {
+                log_func(stream, "# %s: %" PRId64 " [%s]\n", dc64->name, hg_atomic_get64(&dc64->c),
+                         dc64->descr);
+            }
+            log_func(stream, "# -\n");
+        }
+
+        log_func(stream, "# Number of log entries: %u\n", d->leadds);
+
+        idx  = (d->lefree < d->leadds) ? d->lesize + d->lefree - d->leadds : d->lefree - d->leadds;
+        left = d->leadds;
+        while (left--) {
+            log_func(stream, "# [%lf] %s:%u\n## %s()\n", hg_time_to_double(d->le[idx].time), d->le[idx].file,
+                     d->le[idx].line, d->le[idx].func);
+            idx = (idx + 1) % d->lesize;
+        }
+    }
+
+    hg_thread_mutex_unlock(&d->dlock);
+}
+
+/*---------------------------------------------------------------------------*/
+void
+hg_dlog_dump_counters(struct hg_dlog *d, int (*log_func)(FILE *, const char *, ...), FILE *stream,
+                      int trylock)
+{
+    struct hg_dlog_dcount32 *dc32;
+    struct hg_dlog_dcount64 *dc64;
+
+    if (trylock) {
+        int try_ret = hg_thread_mutex_try_lock(&d->dlock);
+        if (try_ret != HG_UTIL_SUCCESS) /* warn them and bail */ {
+            fprintf(stderr, "hg_dlog_dump_counters: WARN - lock failed\n");
+            return;
+        }
+    }
+    else
+        hg_thread_mutex_lock(&d->dlock);
+
+    if (!HG_LIST_IS_EMPTY(&d->cnts32) || !HG_LIST_IS_EMPTY(&d->cnts64)) {
+        log_func(stream,
+                 "### ----------------------\n"
+                 "### (%s) counter log summary\n"
+                 "### ----------------------\n",
+                 (d->dlog_magic + strlen(HG_DLOG_STDMAGIC)));
+
+        log_func(stream, "# Counters\n");
+        HG_LIST_FOREACH(dc32, &d->cnts32, l)
+        {
+            log_func(stream, "# %s: %" PRId32 " [%s]\n", dc32->name, hg_atomic_get32(&dc32->c), dc32->descr);
+        }
+        HG_LIST_FOREACH(dc64, &d->cnts64, l)
+        {
+            log_func(stream, "# %s: %" PRId64 " [%s]\n", dc64->name, hg_atomic_get64(&dc64->c), dc64->descr);
+        }
+        log_func(stream, "# -\n");
+    }
+
+    hg_thread_mutex_unlock(&d->dlock);
+}
+
+/*---------------------------------------------------------------------------*/
+void
+hg_dlog_dump_file(struct hg_dlog *d, const char *base, int addpid, int trylock)
+{
+    char                     buf[2048];
+    int                      pid;
+    FILE *                   fp = NULL;
+    unsigned int             left, idx;
+    struct hg_dlog_dcount32 *dc32;
+    struct hg_dlog_dcount64 *dc64;
+
+#ifdef _WIN32
+    pid = _getpid();
+#else
+    pid = getpid();
+#endif
+
+    if (addpid)
+        snprintf(buf, sizeof(buf), "%s-%d.log", base, pid);
+    else
+        snprintf(buf, sizeof(buf), "%s.log", base);
+
+    fp = fopen(buf, "w");
+    if (!fp) {
+        perror("fopen");
+        return;
+    }
+
+    if (trylock) {
+        int try_ret = hg_thread_mutex_try_lock(&d->dlock);
+        if (try_ret != HG_UTIL_SUCCESS) /* warn them and bail */ {
+            fprintf(stderr, "hg_dlog_dump_file: WARN - lock failed\n");
+            fclose(fp);
+            return;
+        }
+    }
+    else
+        hg_thread_mutex_lock(&d->dlock);
+
+    fprintf(fp,
"# START COUNTERS\n"); + HG_LIST_FOREACH(dc32, &d->cnts32, l) + { + fprintf(fp, "%s %d %" PRId32 " # %s\n", dc32->name, pid, hg_atomic_get32(&dc32->c), dc32->descr); + } + HG_LIST_FOREACH(dc64, &d->cnts64, l) + { + fprintf(fp, "%s %d %" PRId64 " # %s\n", dc64->name, pid, hg_atomic_get64(&dc64->c), dc64->descr); + } + fprintf(fp, "# END COUNTERS\n\n"); + + fprintf(fp, "# NLOGS %d FOR %d\n", d->leadds, pid); + + idx = (d->lefree < d->leadds) ? d->lesize + d->lefree - d->leadds : d->lefree - d->leadds; + left = d->leadds; + while (left--) { + fprintf(fp, "%lf %d %s %u %s %s %p\n", hg_time_to_double(d->le[idx].time), pid, d->le[idx].file, + d->le[idx].line, d->le[idx].func, d->le[idx].msg, d->le[idx].data); + idx = (idx + 1) % d->lesize; + } + + hg_thread_mutex_unlock(&d->dlock); + fclose(fp); +} diff --git a/src/H5FDsubfiling/mercury/src/util/mercury_dlog.h b/src/H5FDsubfiling/mercury/src/util/mercury_dlog.h new file mode 100644 index 0000000..0027fde --- /dev/null +++ b/src/H5FDsubfiling/mercury/src/util/mercury_dlog.h @@ -0,0 +1,282 @@ +/** + * Copyright (c) 2013-2021 UChicago Argonne, LLC and The HDF Group. + * + * SPDX-License-Identifier: BSD-3-Clause + */ + +#ifndef MERCURY_DLOG_H +#define MERCURY_DLOG_H + +#include "mercury_util_config.h" + +#include "mercury_atomic.h" +#include "mercury_list.h" +#include "mercury_thread_mutex.h" +#include "mercury_time.h" + +#include <stdio.h> + +/*****************/ +/* Public Macros */ +/*****************/ + +/* + * putting a magic number at the front of the dlog allows us to search + * for a dlog in a coredump file after a crash and examine its contents. + */ +#define HG_DLOG_MAGICLEN 16 /* bytes to reserve for magic# */ +#define HG_DLOG_STDMAGIC ">D.LO.G<" /* standard for first 8 bytes */ + +/* + * HG_DLOG_INITIALIZER: initializer for a dlog in a global variable. + * LESIZE is the number of entries in the LE array. 
use it like this: + * + * #define FOO_NENTS 128 + * struct hg_dlog_entry foo_le[FOO_NENTS]; + * struct hg_dlog foo_dlog = HG_DLOG_INITIALIZER("foo", foo_le, FOO_NENTS, 0); + */ +#define HG_DLOG_INITIALIZER(NAME, LE, LESIZE, LELOOP) \ + { \ + HG_DLOG_STDMAGIC NAME, HG_THREAD_MUTEX_INITIALIZER, HG_LIST_HEAD_INITIALIZER(cnts32), \ + HG_LIST_HEAD_INITIALIZER(cnts64), LE, LESIZE, LELOOP, 0, 0, 0, 0 \ + } + +/*************************************/ +/* Public Type and Struct Definition */ +/*************************************/ + +/* + * hg_dlog_entry: an entry in the dlog + */ +struct hg_dlog_entry { + const char * file; /* file name */ + unsigned int line; /* line number */ + const char * func; /* function name */ + const char * msg; /* entry message (optional) */ + const void * data; /* user data (optional) */ + hg_time_t time; /* time added to log */ +}; + +/* + * hg_dlog_dcount32: 32-bit debug counter in the dlog + */ +struct hg_dlog_dcount32 { + const char * name; /* counter name (short) */ + const char * descr; /* description of counter */ + hg_atomic_int32_t c; /* the counter itself */ + HG_LIST_ENTRY(hg_dlog_dcount32) l; /* linkage */ +}; + +/* + * hg_dlog_dcount64: 64-bit debug counter in the dlog + */ +struct hg_dlog_dcount64 { + const char * name; /* counter name (short) */ + const char * descr; /* description of counter */ + hg_atomic_int64_t c; /* the counter itself */ + HG_LIST_ENTRY(hg_dlog_dcount64) l; /* linkage */ +}; + +/* + * hg_dlog: main structure + */ +struct hg_dlog { + char dlog_magic[HG_DLOG_MAGICLEN]; /* magic number + name */ + hg_thread_mutex_t dlock; /* lock for this data struct */ + + /* counter lists */ + HG_LIST_HEAD(hg_dlog_dcount32) cnts32; /* counter list */ + HG_LIST_HEAD(hg_dlog_dcount64) cnts64; /* counter list */ + + /* log */ + struct hg_dlog_entry *le; /* array of log entries */ + unsigned int lesize; /* size of le[] array */ + int leloop; /* circular buffer? */ + unsigned int lefree; /* next free entry in le[] */ + unsigned int leadds; /* #adds done if < lesize */ + int lestop; /* stop taking new logs */ + + int mallocd; /* allocated with malloc? */ +}; + +/*********************/ +/* Public Prototypes */ +/*********************/ + +#ifdef __cplusplus +extern "C" { +#endif + +/** + * malloc and return a new dlog + * + * \param name [IN] name of dlog (truncated past 8 bytes) + * \param lesize [IN] number of entries to allocate for log buffer + * \param leloop [IN] set to make log circular (can overwrite old + * entries) + * + * \return the new dlog or NULL on malloc error + */ +HG_UTIL_PUBLIC struct hg_dlog *hg_dlog_alloc(char *name, unsigned int lesize, int leloop); + +/** + * free anything we malloc'd on a dlog. assumes we have the final + * active reference to dlog and it won't be used anymore after this + * call (so no need to lock it). + * + * \param d [IN] the dlog to finalize + */ +HG_UTIL_PUBLIC void hg_dlog_free(struct hg_dlog *d); + +/** + * make a named atomic32 counter in a dlog and return a pointer to + * it. we use the dlock to ensure a counter under a given name only + * gets created once (makes it easy to share a counter across files). + * aborts if unable to alloc counter. 
use it like this:
+ *
+ *   hg_atomic_int32_t *foo_count;
+ *   static int init = 0;
+ *   if (init == 0) {
+ *      hg_dlog_mkcount32(dlog, &foo_count, "foocount", "counts of foo");
+ *      init = 1;
+ *   }
+ *
+ * \param d [IN]        dlog to create the counter in
+ * \param cptr [IN/OUT] pointer to use for counter (set to NULL to
+ *                      start)
+ * \param name [IN]     short one word name for counter
+ * \param descr [IN]    short description of counter
+ */
+HG_UTIL_PUBLIC void hg_dlog_mkcount32(struct hg_dlog *d, hg_atomic_int32_t **cptr, const char *name,
+                                      const char *descr);
+
+/**
+ * make a named atomic64 counter in a dlog and return a pointer to
+ * it. we use the dlock to ensure a counter under a given name only
+ * gets created once (makes it easy to share a counter across files).
+ * aborts if unable to alloc counter. use it like this:
+ *
+ *   hg_atomic_int64_t *foo_count;
+ *   static int init = 0;
+ *   if (init == 0) {
+ *      hg_dlog_mkcount64(dlog, &foo_count, "foocount", "counts of foo");
+ *      init = 1;
+ *   }
+ *
+ * \param d [IN]        dlog to create the counter in
+ * \param cptr [IN/OUT] pointer to use for counter (set to NULL to
+ *                      start)
+ * \param name [IN]     short one word name for counter
+ * \param descr [IN]    short description of counter
+ */
+HG_UTIL_PUBLIC void hg_dlog_mkcount64(struct hg_dlog *d, hg_atomic_int64_t **cptr, const char *name,
+                                      const char *descr);
+
+/**
+ * attempt to add a log record to a dlog. the file, func, and msg
+ * arguments should point to static strings that are valid throughout
+ * the life of the program (not something that is on the stack).
+ *
+ * \param d [IN]        the dlog to add the log record to
+ * \param file [IN]     file entry
+ * \param line [IN]     line entry
+ * \param func [IN]     func entry
+ * \param msg [IN]      log entry message (optional, NULL ok)
+ * \param data [IN]     user data pointer for record (optional, NULL ok)
+ *
+ * \return 1 if added, 0 otherwise
+ */
+static HG_UTIL_INLINE unsigned int hg_dlog_addlog(struct hg_dlog *d, const char *file, unsigned int line,
+                                                  const char *func, const char *msg, const void *data);
+
+/**
+ * set the value of stop for a dlog (to enable/disable logging)
+ *
+ * \param d [IN]        dlog to set stop in
+ * \param stop [IN]     value of stop to use (1=stop, 0=go)
+ */
+HG_UTIL_PUBLIC void hg_dlog_setlogstop(struct hg_dlog *d, int stop);
+
+/**
+ * reset the log. this does not change the counters (since users
+ * have direct access to the hg_atomic_int64_t's, we don't need
+ * an API to change them here).
+ *
+ * \param d [IN]        dlog to reset
+ */
+HG_UTIL_PUBLIC void hg_dlog_resetlog(struct hg_dlog *d);
+
+/**
+ * dump dlog info to a stream. set trylock if you want to dump even
+ * if it is locked (e.g. you are crashing and you don't care about
+ * locking).
+ *
+ * \param d [IN]        dlog to dump
+ * \param log_func [IN] log function to use (default printf)
+ * \param stream [IN]   stream to use
+ * \param trylock [IN]  just try to lock (warn if it fails)
+ */
+HG_UTIL_PUBLIC void hg_dlog_dump(struct hg_dlog *d, int (*log_func)(FILE *, const char *, ...), FILE *stream,
+                                 int trylock);
+
+/**
+ * dump dlog counters to a stream. set trylock if you want to dump even
+ * if it is locked (e.g. you are crashing and you don't care about
+ * locking).
+ *
+ * \param d [IN]        dlog to dump
+ * \param log_func [IN] log function to use (default printf)
+ * \param stream [IN]   stream to use
+ * \param trylock [IN]  just try to lock (warn if it fails)
+ */
+HG_UTIL_PUBLIC void hg_dlog_dump_counters(struct hg_dlog *d, int (*log_func)(FILE *, const char *, ...),
+                                          FILE *stream, int trylock);
+
+/**
+ * dump dlog info to a file. set trylock if you want to dump even
+ * if it is locked (e.g. you are crashing and you don't care about
+ * locking). the output file is "base.log" or "base-pid.log" depending
+ * on the value of addpid.
+ *
+ * \param d [IN]        dlog to dump
+ * \param base [IN]     output file basename
+ * \param addpid [IN]   add pid to output filename
+ * \param trylock [IN]  just try to lock (warn if it fails)
+ */
+HG_UTIL_PUBLIC void hg_dlog_dump_file(struct hg_dlog *d, const char *base, int addpid, int trylock);
+
+/*---------------------------------------------------------------------------*/
+static HG_UTIL_INLINE unsigned int
+hg_dlog_addlog(struct hg_dlog *d, const char *file, unsigned int line, const char *func, const char *msg,
+               const void *data)
+{
+    unsigned int rv = 0;
+    unsigned int idx;
+
+    hg_thread_mutex_lock(&d->dlock);
+    if (d->lestop)
+        goto done;
+    if (d->leloop == 0 && d->leadds >= d->lesize)
+        goto done;
+    idx       = d->lefree;
+    d->lefree = (d->lefree + 1) % d->lesize;
+    if (d->leadds < d->lesize)
+        d->leadds++;
+    d->le[idx].file = file;
+    d->le[idx].line = line;
+    d->le[idx].func = func;
+    d->le[idx].msg  = msg;
+    d->le[idx].data = data;
+    hg_time_get_current(&d->le[idx].time);
+    rv = 1;
+
+done:
+    hg_thread_mutex_unlock(&d->dlock);
+    return rv;
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* MERCURY_DLOG_H */
diff --git a/src/H5FDsubfiling/mercury/src/util/mercury_list.h b/src/H5FDsubfiling/mercury/src/util/mercury_list.h
new file mode 100644
index 0000000..7b66c23
--- /dev/null
+++ b/src/H5FDsubfiling/mercury/src/util/mercury_list.h
@@ -0,0 +1,113 @@
+/**
+ * Copyright (c) 2013-2021 UChicago Argonne, LLC and The HDF Group.
+ *
+ * SPDX-License-Identifier: BSD-3-Clause
+ */
+
+/* Code below is derived from sys/queue.h which follows the below notice:
+ *
+ * Copyright (c) 1991, 1993
+ *      The Regents of the University of California.  All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of the University nor the names of its contributors
+ *    may be used to endorse or promote products derived from this software
+ *    without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.
IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + * + * @(#)queue.h 8.5 (Berkeley) 8/20/94 + */ + +#ifndef MERCURY_LIST_H +#define MERCURY_LIST_H + +#define HG_LIST_HEAD_INITIALIZER(name) \ + { \ + NULL \ + } + +#define HG_LIST_HEAD_INIT(struct_head_name, var_name) \ + struct struct_head_name var_name = HG_LIST_HEAD_INITIALIZER(var_name) + +#define HG_LIST_HEAD_DECL(struct_head_name, struct_entry_name) \ + struct struct_head_name { \ + struct struct_entry_name *head; \ + } + +#define HG_LIST_HEAD(struct_entry_name) \ + struct { \ + struct struct_entry_name *head; \ + } + +#define HG_LIST_ENTRY(struct_entry_name) \ + struct { \ + struct struct_entry_name * next; \ + struct struct_entry_name **prev; \ + } + +#define HG_LIST_INIT(head_ptr) \ + do { \ + (head_ptr)->head = NULL; \ + } while (/*CONSTCOND*/ 0) + +#define HG_LIST_IS_EMPTY(head_ptr) ((head_ptr)->head == NULL) + +#define HG_LIST_FIRST(head_ptr) ((head_ptr)->head) + +#define HG_LIST_NEXT(entry_ptr, entry_field_name) ((entry_ptr)->entry_field_name.next) + +#define HG_LIST_INSERT_AFTER(list_entry_ptr, entry_ptr, entry_field_name) \ + do { \ + if (((entry_ptr)->entry_field_name.next = (list_entry_ptr)->entry_field_name.next) != NULL) \ + (list_entry_ptr)->entry_field_name.next->entry_field_name.prev = \ + &(entry_ptr)->entry_field_name.next; \ + (list_entry_ptr)->entry_field_name.next = (entry_ptr); \ + (entry_ptr)->entry_field_name.prev = &(list_entry_ptr)->entry_field_name.next; \ + } while (/*CONSTCOND*/ 0) + +#define HG_LIST_INSERT_BEFORE(list_entry_ptr, entry_ptr, entry_field_name) \ + do { \ + (entry_ptr)->entry_field_name.prev = (list_entry_ptr)->entry_field_name.prev; \ + (entry_ptr)->entry_field_name.next = (list_entry_ptr); \ + *(list_entry_ptr)->entry_field_name.prev = (entry_ptr); \ + (list_entry_ptr)->entry_field_name.prev = &(entry_ptr)->entry_field_name.next; \ + } while (/*CONSTCOND*/ 0) + +#define HG_LIST_INSERT_HEAD(head_ptr, entry_ptr, entry_field_name) \ + do { \ + if (((entry_ptr)->entry_field_name.next = (head_ptr)->head) != NULL) \ + (head_ptr)->head->entry_field_name.prev = &(entry_ptr)->entry_field_name.next; \ + (head_ptr)->head = (entry_ptr); \ + (entry_ptr)->entry_field_name.prev = &(head_ptr)->head; \ + } while (/*CONSTCOND*/ 0) + +/* TODO would be nice to not have any condition */ +#define HG_LIST_REMOVE(entry_ptr, entry_field_name) \ + do { \ + if ((entry_ptr)->entry_field_name.next != NULL) \ + (entry_ptr)->entry_field_name.next->entry_field_name.prev = (entry_ptr)->entry_field_name.prev; \ + *(entry_ptr)->entry_field_name.prev = (entry_ptr)->entry_field_name.next; \ + } while (/*CONSTCOND*/ 0) + +#define HG_LIST_FOREACH(var, head_ptr, entry_field_name) \ + for ((var) = ((head_ptr)->head); (var); (var) = ((var)->entry_field_name.next)) + +#endif /* MERCURY_LIST_H */ diff --git a/src/H5FDsubfiling/mercury/src/util/mercury_log.c b/src/H5FDsubfiling/mercury/src/util/mercury_log.c new file mode 100644 index 0000000..def1abe --- /dev/null +++ b/src/H5FDsubfiling/mercury/src/util/mercury_log.c @@ -0,0 +1,498 @@ +/** + * Copyright (c) 
2013-2021 UChicago Argonne, LLC and The HDF Group. + * + * SPDX-License-Identifier: BSD-3-Clause + */ + +#include "mercury_log.h" + +#include <ctype.h> +#include <stdarg.h> +#include <stdlib.h> +#include <string.h> + +/****************/ +/* Local Macros */ +/****************/ + +/* Make sure it executes first */ +#ifdef HG_UTIL_HAS_ATTR_CONSTRUCTOR_PRIORITY +#define HG_UTIL_CONSTRUCTOR_1 HG_ATTR_CONSTRUCTOR_PRIORITY(101) +#else +#define HG_UTIL_CONSTRUCTOR_1 +#endif + +/* Destructor (used to finalize log outlets) */ +#define HG_UTIL_DESTRUCTOR HG_ATTR_DESTRUCTOR + +/* Max number of subsystems that can be tracked */ +#define HG_LOG_SUBSYS_MAX (16) + +/* Max length of subsystem name (without trailing \0) */ +#define HG_LOG_SUBSYS_NAME_MAX (16) + +/* Log buffer size */ +#define HG_LOG_BUF_MAX (256) + +#ifdef HG_UTIL_HAS_LOG_COLOR +#define HG_LOG_ESC "\033" +#define HG_LOG_RESET HG_LOG_ESC "[0m" +#define HG_LOG_REG HG_LOG_ESC "[0;" +#define HG_LOG_BOLD HG_LOG_ESC "[1;" +#define HG_LOG_RED "31m" +#define HG_LOG_GREEN "32m" +#define HG_LOG_YELLOW "33m" +#define HG_LOG_BLUE "34m" +#define HG_LOG_MAGENTA "35m" +#define HG_LOG_CYAN "36m" +#endif + +/********************/ +/* Local Prototypes */ +/********************/ + +/* Init logs */ +static void hg_log_init(void) HG_UTIL_CONSTRUCTOR_1; + +/* Finalize logs */ +static void hg_log_finalize(void) HG_UTIL_DESTRUCTOR; + +/* Init log level */ +static void hg_log_init_level(void); + +/* Init log subsys */ +static void hg_log_init_subsys(void); + +/* Reset all log levels */ +static void hg_log_outlet_reset_all(void); + +/* Free all attached logs */ +static void hg_log_free_dlogs(void); + +/* Is log active */ +static int hg_log_outlet_active(const char *name); + +/* Update log level of outlet */ +static void hg_log_outlet_update_level(struct hg_log_outlet *hg_log_outlet); + +/* Update level of all outlets */ +static void hg_log_outlet_update_all(void); + +/*******************/ +/* Local Variables */ +/*******************/ + +/* Default log outlet */ +HG_LOG_OUTLET_DECL(hg) = HG_LOG_OUTLET_INITIALIZER(hg, HG_LOG_OFF, NULL, NULL); + +/* List of all registered outlets */ +static HG_QUEUE_HEAD(hg_log_outlet) hg_log_outlets_g = HG_QUEUE_HEAD_INITIALIZER(hg_log_outlets_g); + +/* Default 'printf' log function */ +static hg_log_func_t hg_log_func_g = fprintf; + +/* Default log level */ +static enum hg_log_level hg_log_level_g = HG_LOG_LEVEL_ERROR; + +/* Default log subsystems */ +static char hg_log_subsys_g[HG_LOG_SUBSYS_MAX][HG_LOG_SUBSYS_NAME_MAX + 1] = {{"\0"}}; + +/* Log level string table */ +#define X(a, b, c) b, +static const char *const hg_log_level_name_g[] = {HG_LOG_LEVELS}; +#undef X + +/* Standard log streams */ +#define X(a, b, c) c, +static FILE **const hg_log_std_streams_g[] = {HG_LOG_LEVELS}; +#undef X +static FILE *hg_log_streams_g[HG_LOG_LEVEL_MAX] = {NULL}; + +/* Log colors */ +#ifdef HG_UTIL_HAS_LOG_COLOR +static const char *const hg_log_colors_g[] = {"", HG_LOG_RED, HG_LOG_MAGENTA, HG_LOG_BLUE, HG_LOG_BLUE, ""}; +#endif + +/* Init */ +#ifndef HG_UTIL_HAS_ATTR_CONSTRUCTOR_PRIORITY +static bool hg_log_init_g = false; +#endif + +/*---------------------------------------------------------------------------*/ +static void +hg_log_init(void) +{ + hg_log_init_level(); + hg_log_init_subsys(); + + /* Register top outlet */ + hg_log_outlet_register(&HG_LOG_OUTLET(hg)); +} + +/*---------------------------------------------------------------------------*/ +static void +hg_log_finalize(void) +{ + hg_log_free_dlogs(); +} + 
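+/*
+ * Configuration sketch (illustrative only, added for this write-up; the
+ * subsystem names "cls" and "rpc" are hypothetical): the constructor-run
+ * init below only reads the HG_LOG_LEVEL and HG_LOG_SUBSYS environment
+ * variables, but the same effect can be obtained programmatically:
+ *
+ *    hg_log_set_level(hg_log_name_to_level("debug"));  (HG_LOG_LEVEL=debug)
+ *    hg_log_set_subsys("cls,~rpc");        (enable "cls", force "rpc" off)
+ */
+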
+/*---------------------------------------------------------------------------*/ +static void +hg_log_init_level(void) +{ + const char *log_level = getenv("HG_LOG_LEVEL"); + + /* Override default log level */ + if (log_level == NULL) + return; + + hg_log_set_level(hg_log_name_to_level(log_level)); +} + +/*---------------------------------------------------------------------------*/ +static void +hg_log_init_subsys(void) +{ + const char *log_subsys = getenv("HG_LOG_SUBSYS"); + + if (log_subsys == NULL) + return; + + // fprintf(stderr, "subsys: %s\n", log_subsys); + hg_log_set_subsys(log_subsys); +} + +/*---------------------------------------------------------------------------*/ +static void +hg_log_outlet_reset_all(void) +{ + struct hg_log_outlet *outlet; + int i; + + /* Reset levels */ + HG_QUEUE_FOREACH(outlet, &hg_log_outlets_g, entry) + outlet->level = HG_LOG_LEVEL_NONE; + + /* Reset subsys */ + for (i = 0; i < HG_LOG_SUBSYS_MAX; i++) + strcpy(hg_log_subsys_g[i], "\0"); +} + +/*---------------------------------------------------------------------------*/ +static void +hg_log_free_dlogs(void) +{ + struct hg_log_outlet *outlet; + + /* Free logs if any was attached */ + HG_QUEUE_FOREACH(outlet, &hg_log_outlets_g, entry) + { + if (outlet->debug_log && !(outlet->parent && outlet->parent->debug_log)) { + if (outlet->level >= HG_LOG_LEVEL_MIN_DEBUG) { + FILE *stream = hg_log_streams_g[outlet->level] ? hg_log_streams_g[outlet->level] + : *hg_log_std_streams_g[outlet->level]; + hg_dlog_dump_counters(outlet->debug_log, hg_log_func_g, stream, 0); + } + hg_dlog_free(outlet->debug_log); + } + } +} + +/*---------------------------------------------------------------------------*/ +static int +hg_log_outlet_active(const char *name) +{ + int i = 0; + + while (hg_log_subsys_g[i][0] != '\0' && i < HG_LOG_SUBSYS_MAX) { + /* Force a subsystem to be inactive */ + if ((hg_log_subsys_g[i][0] == '~') && (strcmp(&hg_log_subsys_g[i][1], name) == 0)) + return -1; + + if (strcmp(hg_log_subsys_g[i], name) == 0) { + return 1; + } + i++; + } + return 0; +} + +/*---------------------------------------------------------------------------*/ +static void +hg_log_outlet_update_level(struct hg_log_outlet *hg_log_outlet) +{ + int active = hg_log_outlet_active(hg_log_outlet->name); + + if (active > 0 || hg_log_outlet->state == HG_LOG_ON) + hg_log_outlet->level = hg_log_level_g; + else if (!(active < 0) && hg_log_outlet->state == HG_LOG_PASS && hg_log_outlet->parent) + hg_log_outlet->level = hg_log_outlet->parent->level; + else + hg_log_outlet->level = HG_LOG_LEVEL_NONE; +} + +/*---------------------------------------------------------------------------*/ +static void +hg_log_outlet_update_all(void) +{ + struct hg_log_outlet *hg_log_outlet; + + HG_QUEUE_FOREACH(hg_log_outlet, &hg_log_outlets_g, entry) + hg_log_outlet_update_level(hg_log_outlet); +} + +/*---------------------------------------------------------------------------*/ +void +hg_log_set_level(enum hg_log_level log_level) +{ + hg_log_level_g = log_level; + + hg_log_outlet_update_all(); +} + +/*---------------------------------------------------------------------------*/ +enum hg_log_level +hg_log_get_level(void) +{ + return hg_log_level_g; +} + +/*---------------------------------------------------------------------------*/ +void +hg_log_set_subsys(const char *log_subsys) +{ + char *subsys, *current, *next; + int i = 0; + + subsys = strdup(log_subsys); + if (!subsys) + return; + + current = subsys; + + /* Reset all */ + hg_log_outlet_reset_all(); + + /* Enable 
each of the subsys */ + while (strtok_r(current, ",", &next) && i < HG_LOG_SUBSYS_MAX) { + int j, exist = 0; + + /* Skip duplicates */ + for (j = 0; j < i; j++) { + if (strcmp(current, hg_log_subsys_g[j]) == 0) { + exist = 1; + break; + } + } + + if (!exist) { + strncpy(hg_log_subsys_g[i], current, HG_LOG_SUBSYS_NAME_MAX); + i++; + } + current = next; + } + + /* Update outlets */ + hg_log_outlet_update_all(); + + free(subsys); +} + +/*---------------------------------------------------------------------------*/ +const char * +hg_log_get_subsys(void) +{ + static char log_subsys[HG_LOG_SUBSYS_MAX * (HG_LOG_SUBSYS_NAME_MAX + 2)] = "\0"; + char * p = log_subsys; + int i = 0; + + while (hg_log_subsys_g[i][0] != '\0' && i < HG_LOG_SUBSYS_MAX) { + strcpy(p, hg_log_subsys_g[i]); + p += strlen(hg_log_subsys_g[i]); + *p = ','; + p++; + i++; + } + if (i > 0) + *(p - 1) = '\0'; + + return (const char *)log_subsys; +} + +/*---------------------------------------------------------------------------*/ +void +hg_log_set_subsys_level(const char *subsys, enum hg_log_level log_level) +{ + const char *log_subsys = hg_log_get_subsys(); + char * new_subsys = NULL; + const char *new_subsys_ptr; + + if (strcmp(log_subsys, "") != 0) { + new_subsys = malloc(strlen(log_subsys) + strlen(subsys) + 2); + if (!new_subsys) + return; + strcpy(new_subsys, log_subsys); + strcat(new_subsys, ","); + strcat(new_subsys, subsys); + new_subsys_ptr = new_subsys; + } + else + new_subsys_ptr = subsys; + + hg_log_set_level(log_level); + hg_log_set_subsys(new_subsys_ptr); + + free(new_subsys); +} + +/*---------------------------------------------------------------------------*/ +enum hg_log_level +hg_log_name_to_level(const char *log_level) +{ + enum hg_log_level l = 0; + + if (!log_level || strcasecmp("none", log_level) == 0) + return HG_LOG_LEVEL_NONE; + + while (strcasecmp(hg_log_level_name_g[l], log_level) != 0 && l != HG_LOG_LEVEL_MAX) + l++; + + if (l == HG_LOG_LEVEL_MAX) { + fprintf(stderr, "Warning: invalid log level was passed, defaulting to none\n"); + return HG_LOG_LEVEL_NONE; + } + + return l; +} + +/*---------------------------------------------------------------------------*/ +void +hg_log_set_func(hg_log_func_t log_func) +{ + hg_log_func_g = log_func; +} + +/*---------------------------------------------------------------------------*/ +hg_log_func_t +hg_log_get_func(void) +{ + return hg_log_func_g; +} + +/*---------------------------------------------------------------------------*/ +void +hg_log_set_stream_debug(FILE *stream) +{ + hg_log_streams_g[HG_LOG_LEVEL_DEBUG] = stream; +} + +/*---------------------------------------------------------------------------*/ +FILE * +hg_log_get_stream_debug(void) +{ + return hg_log_streams_g[HG_LOG_LEVEL_DEBUG] ? hg_log_streams_g[HG_LOG_LEVEL_DEBUG] + : *hg_log_std_streams_g[HG_LOG_LEVEL_DEBUG]; +} + +/*---------------------------------------------------------------------------*/ +void +hg_log_set_stream_warning(FILE *stream) +{ + hg_log_streams_g[HG_LOG_LEVEL_WARNING] = stream; +} + +/*---------------------------------------------------------------------------*/ +FILE * +hg_log_get_stream_warning(void) +{ + return hg_log_streams_g[HG_LOG_LEVEL_WARNING] ? 
hg_log_streams_g[HG_LOG_LEVEL_WARNING] + : *hg_log_std_streams_g[HG_LOG_LEVEL_WARNING]; +} + +/*---------------------------------------------------------------------------*/ +void +hg_log_set_stream_error(FILE *stream) +{ + hg_log_streams_g[HG_LOG_LEVEL_ERROR] = stream; +} + +/*---------------------------------------------------------------------------*/ +FILE * +hg_log_get_stream_error(void) +{ + return hg_log_streams_g[HG_LOG_LEVEL_ERROR] ? hg_log_streams_g[HG_LOG_LEVEL_ERROR] + : *hg_log_std_streams_g[HG_LOG_LEVEL_ERROR]; +} + +/*---------------------------------------------------------------------------*/ +void +hg_log_outlet_register(struct hg_log_outlet *hg_log_outlet) +{ +#ifndef HG_UTIL_HAS_ATTR_CONSTRUCTOR_PRIORITY + if (!hg_log_init_g) { + /* Set here to prevent infinite loop */ + hg_log_init_g = true; + hg_log_init(); + } +#endif + + hg_log_outlet_update_level(hg_log_outlet); + + /* Inherit debug log if not set and parent has one */ + if (!hg_log_outlet->debug_log && hg_log_outlet->parent && hg_log_outlet->parent->debug_log) + hg_log_outlet->debug_log = hg_log_outlet->parent->debug_log; + + HG_QUEUE_PUSH_TAIL(&hg_log_outlets_g, hg_log_outlet, entry); +} + +/*---------------------------------------------------------------------------*/ +void +hg_log_write(struct hg_log_outlet *hg_log_outlet, enum hg_log_level log_level, const char *file, + unsigned int line, const char *func, const char *format, ...) +{ + char buf[HG_LOG_BUF_MAX]; + FILE * stream = NULL; + const char *level_name = NULL; +#ifdef HG_UTIL_HAS_LOG_COLOR + const char *color = hg_log_colors_g[log_level]; +#endif + hg_time_t tv; + va_list ap; + + if (!(log_level > HG_LOG_LEVEL_NONE && log_level < HG_LOG_LEVEL_MAX)) + return; + + hg_time_get_current(&tv); + level_name = hg_log_level_name_g[log_level]; + stream = hg_log_streams_g[log_level] ? hg_log_streams_g[log_level] : *hg_log_std_streams_g[log_level]; +#ifdef HG_UTIL_HAS_LOG_COLOR + color = hg_log_colors_g[log_level]; +#endif + + va_start(ap, format); + vsnprintf(buf, HG_LOG_BUF_MAX, format, ap); + va_end(ap); + +#ifdef HG_UTIL_HAS_LOG_COLOR + /* Print using logging function */ + hg_log_func_g(stream, + "# %s%s[%lf] %s%s%s->%s%s: %s%s[%s]%s%s %s:%d %s\n" + "## %s%s%s()%s: %s%s%s%s\n", + HG_LOG_REG, HG_LOG_GREEN, hg_time_to_double(tv), HG_LOG_REG, HG_LOG_YELLOW, "mercury", + hg_log_outlet->name, HG_LOG_RESET, HG_LOG_BOLD, color, level_name, HG_LOG_REG, color, file, + line, HG_LOG_RESET, HG_LOG_REG, HG_LOG_YELLOW, func, HG_LOG_RESET, HG_LOG_REG, + log_level != HG_LOG_LEVEL_DEBUG ? color : HG_LOG_RESET, buf, HG_LOG_RESET); +#else + /* Print using logging function */ + hg_log_func_g(stream, + "# [%lf] %s->%s: [%s] %s:%d\n" + " # %s(): %s\n", + hg_time_to_double(tv), "mercury", hg_log_outlet->name, level_name, file, line, func, buf); +#endif + + if (log_level == HG_LOG_LEVEL_ERROR && hg_log_outlet->debug_log && + hg_log_outlet->level >= HG_LOG_LEVEL_MIN_DEBUG) { + hg_dlog_dump(hg_log_outlet->debug_log, hg_log_func_g, stream, 0); + hg_dlog_resetlog(hg_log_outlet->debug_log); + } +} diff --git a/src/H5FDsubfiling/mercury/src/util/mercury_log.h b/src/H5FDsubfiling/mercury/src/util/mercury_log.h new file mode 100644 index 0000000..a550d97 --- /dev/null +++ b/src/H5FDsubfiling/mercury/src/util/mercury_log.h @@ -0,0 +1,333 @@ +/** + * Copyright (c) 2013-2021 UChicago Argonne, LLC and The HDF Group. 
+ * + * SPDX-License-Identifier: BSD-3-Clause + */ + +#ifndef MERCURY_LOG_H +#define MERCURY_LOG_H + +#include "mercury_util_config.h" + +#include "mercury_dlog.h" +#include "mercury_queue.h" + +#include <stdio.h> + +/*****************/ +/* Public Macros */ +/*****************/ + +/* For compatibility */ +#if defined(__STDC_VERSION__) && (__STDC_VERSION__ < 199901L) +#if defined(__GNUC__) && (__GNUC__ >= 2) +#define __func__ __FUNCTION__ +#else +#define __func__ "<unknown>" +#endif +#elif defined(_WIN32) +#define __func__ __FUNCTION__ +#endif + +/* Cat macro */ +#define HG_UTIL_CAT(x, y) x##y + +/* Stringify macro */ +#define HG_UTIL_STRINGIFY(x) #x + +/* Constructor (used to initialize log outlets) */ +#define HG_UTIL_CONSTRUCTOR HG_ATTR_CONSTRUCTOR + +/* Available log levels, additional log levels should be added to that list by + * order of verbosity. Format is: + * - enum type + * - level name + * - default output + * + * error: print error level logs + * warning: print warning level logs + * min_debug: store minimal debug information and defer printing until error + * debug: print debug level logs + */ +#define HG_LOG_LEVELS \ + X(HG_LOG_LEVEL_NONE, "", NULL) /*!< no log */ \ + X(HG_LOG_LEVEL_ERROR, "error", &stderr) /*!< error log type */ \ + X(HG_LOG_LEVEL_WARNING, "warning", &stdout) /*!< warning log type */ \ + X(HG_LOG_LEVEL_MIN_DEBUG, "min_debug", &stdout) /*!< debug log type */ \ + X(HG_LOG_LEVEL_DEBUG, "debug", &stdout) /*!< debug log type */ \ + X(HG_LOG_LEVEL_MAX, "", NULL) + +/* HG_LOG_OUTLET: global variable name of log outlet. */ +#define HG_LOG_OUTLET(name) HG_UTIL_CAT(name, _log_outlet_g) + +/* HG_LOG_OUTLET_DECL: declare an outlet. */ +#define HG_LOG_OUTLET_DECL(name) struct hg_log_outlet HG_LOG_OUTLET(name) + +/* + * HG_LOG_OUTLET_INITIALIZER: initializer for a log in a global variable. + * (parent and debug_log are optional and can be set to NULL) + */ +#define HG_LOG_OUTLET_INITIALIZER(name, state, parent, debug_log) \ + { \ + HG_UTIL_STRINGIFY(name), state, HG_LOG_LEVEL_NONE, parent, debug_log, \ + { \ + NULL \ + } \ + } + +/* HG_LOG_OUTLET_SUBSYS_INITIALIZER: initializer for a sub-system log. */ +#define HG_LOG_OUTLET_SUBSYS_INITIALIZER(name, parent_name) \ + HG_LOG_OUTLET_INITIALIZER(name, HG_LOG_PASS, &HG_LOG_OUTLET(parent_name), NULL) + +/* HG_LOG_OUTLET_SUBSYS_STATE_INITIALIZER: initializer for a sub-system log with + * a defined state. */ +#define HG_LOG_OUTLET_SUBSYS_STATE_INITIALIZER(name, parent_name, state) \ + HG_LOG_OUTLET_INITIALIZER(name, state, &HG_LOG_OUTLET(parent_name), NULL) + +/* HG_LOG_SUBSYS_REGISTER: register a name */ +#define HG_LOG_SUBSYS_REGISTER(name) \ + static void HG_UTIL_CAT(hg_log_outlet_, name)(void) HG_UTIL_CONSTRUCTOR; \ + static void HG_UTIL_CAT(hg_log_outlet_, name)(void) \ + { \ + hg_log_outlet_register(&HG_LOG_OUTLET(name)); \ + } \ + /* Keep unused prototype to use semicolon at end of macro */ \ + void hg_log_outlet_##name##_unused(void) + +/* HG_LOG_SUBSYS_DECL_REGISTER: declare and register a log outlet. */ +#define HG_LOG_SUBSYS_DECL_REGISTER(name, parent_name) \ + struct hg_log_outlet HG_LOG_OUTLET(name) = HG_LOG_OUTLET_SUBSYS_INITIALIZER(name, parent_name); \ + HG_LOG_SUBSYS_REGISTER(name) + +/* HG_LOG_SUBSYS_DECL_STATE_REGISTER: declare and register a log outlet and + * enforce an init state. 
*/ +#define HG_LOG_SUBSYS_DECL_STATE_REGISTER(name, parent_name, state) \ + struct hg_log_outlet HG_LOG_OUTLET(name) = \ + HG_LOG_OUTLET_SUBSYS_STATE_INITIALIZER(name, parent_name, state); \ + HG_LOG_SUBSYS_REGISTER(name) + +/* Log macro */ +#define HG_LOG_WRITE(name, log_level, ...) \ + do { \ + if (log_level == HG_LOG_LEVEL_DEBUG && HG_LOG_OUTLET(name).level >= HG_LOG_LEVEL_MIN_DEBUG && \ + HG_LOG_OUTLET(name).debug_log) \ + hg_dlog_addlog(HG_LOG_OUTLET(name).debug_log, __FILE__, __LINE__, __func__, NULL, NULL); \ + if (HG_LOG_OUTLET(name).level >= log_level) \ + hg_log_write(&HG_LOG_OUTLET(name), log_level, __FILE__, __LINE__, __func__, __VA_ARGS__); \ + } while (0) + +#define HG_LOG_WRITE_DEBUG_EXT(name, header, ...) \ + do { \ + if (HG_LOG_OUTLET(name).level == HG_LOG_LEVEL_DEBUG) { \ + hg_log_func_t log_func = hg_log_get_func(); \ + hg_log_write(&HG_LOG_OUTLET(name), HG_LOG_LEVEL_DEBUG, __FILE__, __LINE__, __func__, header); \ + log_func(hg_log_get_stream_debug(), __VA_ARGS__); \ + log_func(hg_log_get_stream_debug(), "---\n"); \ + } \ + } while (0) + +/** + * Additional macros for debug log support. + */ + +/* HG_LOG_DEBUG_DLOG: global variable name of debug log. */ +#define HG_LOG_DEBUG_DLOG(name) HG_UTIL_CAT(name, _dlog_g) + +/* HG_LOG_DEBUG_LE: global variable name of debug log entries. */ +#define HG_LOG_DEBUG_LE(name) HG_UTIL_CAT(name, _dlog_entries_g) + +/* HG_LOG_DEBUG_DECL_DLOG: declare new debug log. */ +#define HG_LOG_DEBUG_DECL_DLOG(name) struct hg_dlog HG_LOG_DEBUG_DLOG(name) + +/* HG_LOG_DEBUG_DECL_LE: declare array of debug log entries. */ +#define HG_LOG_DEBUG_DECL_LE(name, size) struct hg_dlog_entry HG_LOG_DEBUG_LE(name)[size] + +/* HG_LOG_DLOG_INITIALIZER: initializer for a debug log */ +#define HG_LOG_DLOG_INITIALIZER(name, size) \ + HG_DLOG_INITIALIZER(HG_UTIL_STRINGIFY(name), HG_LOG_DEBUG_LE(name), size, 1) + +/* HG_LOG_OUTLET_SUBSYS_DLOG_INITIALIZER: initializer for a sub-system with + * debug log. */ +#define HG_LOG_OUTLET_SUBSYS_DLOG_INITIALIZER(name, parent_name) \ + HG_LOG_OUTLET_INITIALIZER(name, HG_LOG_PASS, &HG_LOG_OUTLET(parent_name), &HG_LOG_DEBUG_DLOG(name)) + +/* HG_LOG_SUBSYS_DLOG_DECL_REGISTER: declare and register a log outlet with + * debug log. 
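+ *
+ * For illustration, one plausible way the pieces fit together (the name
+ * "my_subsys" and the entry count 256 are placeholders; see mercury_dlog.h
+ * for the dlog types):
+ *
+ * \code
+ * HG_LOG_DEBUG_DECL_LE(my_subsys, 256);
+ * HG_LOG_DEBUG_DECL_DLOG(my_subsys) = HG_LOG_DLOG_INITIALIZER(my_subsys, 256);
+ * HG_LOG_SUBSYS_DLOG_DECL_REGISTER(my_subsys, hg);
+ * \endcode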
*/ +#define HG_LOG_SUBSYS_DLOG_DECL_REGISTER(name, parent_name) \ + struct hg_log_outlet HG_LOG_OUTLET(name) = HG_LOG_OUTLET_SUBSYS_DLOG_INITIALIZER(name, parent_name); \ + HG_LOG_SUBSYS_REGISTER(name) + +/* HG_LOG_ADD_COUNTER32: add 32-bit debug log counter */ +#define HG_LOG_ADD_COUNTER32(name, counter_ptr, counter_name, counter_desc) \ + hg_dlog_mkcount32(HG_LOG_OUTLET(name).debug_log, counter_ptr, counter_name, counter_desc) + +/* HG_LOG_ADD_COUNTER64: add 64-bit debug log counter */ +#define HG_LOG_ADD_COUNTER64(name, counter_ptr, counter_name, counter_desc) \ + hg_dlog_mkcount64(HG_LOG_OUTLET(name).debug_log, counter_ptr, counter_name, counter_desc) + +/*************************************/ +/* Public Type and Struct Definition */ +/*************************************/ + +#define X(a, b, c) a, +/* Log levels */ +enum hg_log_level { HG_LOG_LEVELS }; +#undef X + +/* Log states */ +enum hg_log_state { HG_LOG_PASS, HG_LOG_OFF, HG_LOG_ON }; + +/* Log outlet */ +struct hg_log_outlet { + const char * name; /* Name of outlet */ + enum hg_log_state state; /* Init state of outlet */ + enum hg_log_level level; /* Level of outlet */ + struct hg_log_outlet *parent; /* Parent of outlet */ + struct hg_dlog * debug_log; /* Debug log to use */ + HG_QUEUE_ENTRY(hg_log_outlet) entry; /* List entry */ +}; + +/* Log function */ +typedef int (*hg_log_func_t)(FILE *stream, const char *format, ...); + +/*********************/ +/* Public Prototypes */ +/*********************/ + +#ifdef __cplusplus +extern "C" { +#endif + +/** + * Set the global log level. + * + * \param log_level [IN] enum log level type + */ +HG_UTIL_PUBLIC void hg_log_set_level(enum hg_log_level log_level); + +/** + * Get the global log level. + * + * \return global log_level + */ +HG_UTIL_PUBLIC enum hg_log_level hg_log_get_level(void); + +/** + * Set the log subsystems from a string. Format is: subsys1,subsys2,... + * Subsys can also be forced to be disabled with "~", e.g., ~subsys1 + * + * \param log_level [IN] null terminated string + */ +HG_UTIL_PUBLIC void hg_log_set_subsys(const char *log_subsys); + +/** + * Get the log subsystems as a string. Format is similar to hg_log_set_subsys(). + * Buffer returned is static. + * + * \return string of enabled log subsystems + */ +HG_UTIL_PUBLIC const char *hg_log_get_subsys(void); + +/** + * Set a specific subsystem's log level. + */ +HG_UTIL_PUBLIC void hg_log_set_subsys_level(const char *subsys, enum hg_log_level log_level); + +/** + * Get the log level from a string. + * + * \param log_level [IN] null terminated string + * + * \return log type enum value + */ +HG_UTIL_PUBLIC enum hg_log_level hg_log_name_to_level(const char *log_level); + +/** + * Set the logging function. + * + * \param log_func [IN] pointer to function + */ +HG_UTIL_PUBLIC void hg_log_set_func(hg_log_func_t log_func); + +/** + * Get the logging function. + * + * \return pointer pointer to function + */ +HG_UTIL_PUBLIC hg_log_func_t hg_log_get_func(void); + +/** + * Set the stream for error output. + * + * \param stream [IN/OUT] pointer to stream + */ +HG_UTIL_PUBLIC void hg_log_set_stream_error(FILE *stream); + +/** + * Get the stream for error output. + * + * \return pointer to stream + */ +HG_UTIL_PUBLIC FILE *hg_log_get_stream_error(void); + +/** + * Set the stream for warning output. + * + * \param stream [IN/OUT] pointer to stream + */ +HG_UTIL_PUBLIC void hg_log_set_stream_warning(FILE *stream); + +/** + * Get the stream for warning output. 
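+ *
+ * For illustration, warning output (stdout by default) can be redirected to
+ * a file; the file name below is a placeholder:
+ *
+ * \code
+ * FILE *warn_file = fopen("hg_warnings.log", "w");
+ *
+ * if (warn_file) {
+ *     hg_log_set_stream_warning(warn_file);
+ *     // ... emit warnings ...
+ *     hg_log_set_stream_warning(stdout); // restore before closing the file
+ *     fclose(warn_file);
+ * }
+ * \endcode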
+ * + * \return pointer to stream + */ +HG_UTIL_PUBLIC FILE *hg_log_get_stream_warning(void); + +/** + * Set the stream for debug output. + * + * \param stream [IN/OUT] pointer to stream + */ +HG_UTIL_PUBLIC void hg_log_set_stream_debug(FILE *stream); + +/** + * Get the stream for debug output. + * + * \return pointer to stream + */ +HG_UTIL_PUBLIC FILE *hg_log_get_stream_debug(void); + +/** + * Register log outlet. + * + * \param outlet [IN] log outlet + */ +HG_UTIL_PUBLIC void hg_log_outlet_register(struct hg_log_outlet *outlet); + +/** + * Write log. + * + * \param outlet [IN] log outlet + * \param log_level [IN] log level + * \param file [IN] file name + * \param line [IN] line number + * \param func [IN] function name + * \param format [IN] string format + */ +HG_UTIL_PUBLIC void hg_log_write(struct hg_log_outlet *outlet, enum hg_log_level log_level, const char *file, + unsigned int line, const char *func, const char *format, ...) + HG_UTIL_PRINTF(6, 7); + +/*********************/ +/* Public Variables */ +/*********************/ + +/* Top error outlet */ +extern HG_UTIL_PUBLIC HG_LOG_OUTLET_DECL(hg); + +#ifdef __cplusplus +} +#endif + +#endif /* MERCURY_LOG_H */ diff --git a/src/H5FDsubfiling/mercury/src/util/mercury_queue.h b/src/H5FDsubfiling/mercury/src/util/mercury_queue.h new file mode 100644 index 0000000..07d977f --- /dev/null +++ b/src/H5FDsubfiling/mercury/src/util/mercury_queue.h @@ -0,0 +1,115 @@ +/** + * Copyright (c) 2013-2021 UChicago Argonne, LLC and The HDF Group. + * + * SPDX-License-Identifier: BSD-3-Clause + */ + +/* Code below is derived from sys/queue.h which follows the below notice: + * + * Copyright (c) 1991, 1993 + * The Regents of the University of California. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * 3. Neither the name of the University nor the names of its contributors + * may be used to endorse or promote products derived from this software + * without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. 
+ * + * @(#)queue.h 8.5 (Berkeley) 8/20/94 + */ + +#ifndef MERCURY_QUEUE_H +#define MERCURY_QUEUE_H + +#define HG_QUEUE_HEAD_INITIALIZER(name) \ + { \ + NULL, &(name).head \ + } + +#define HG_QUEUE_HEAD_INIT(struct_head_name, var_name) \ + struct struct_head_name var_name = HG_QUEUE_HEAD_INITIALIZER(var_name) + +#define HG_QUEUE_HEAD_DECL(struct_head_name, struct_entry_name) \ + struct struct_head_name { \ + struct struct_entry_name * head; \ + struct struct_entry_name **tail; \ + } + +#define HG_QUEUE_HEAD(struct_entry_name) \ + struct { \ + struct struct_entry_name * head; \ + struct struct_entry_name **tail; \ + } + +#define HG_QUEUE_ENTRY(struct_entry_name) \ + struct { \ + struct struct_entry_name *next; \ + } + +#define HG_QUEUE_INIT(head_ptr) \ + do { \ + (head_ptr)->head = NULL; \ + (head_ptr)->tail = &(head_ptr)->head; \ + } while (/*CONSTCOND*/ 0) + +#define HG_QUEUE_IS_EMPTY(head_ptr) ((head_ptr)->head == NULL) + +#define HG_QUEUE_FIRST(head_ptr) ((head_ptr)->head) + +#define HG_QUEUE_NEXT(entry_ptr, entry_field_name) ((entry_ptr)->entry_field_name.next) + +#define HG_QUEUE_PUSH_TAIL(head_ptr, entry_ptr, entry_field_name) \ + do { \ + (entry_ptr)->entry_field_name.next = NULL; \ + *(head_ptr)->tail = (entry_ptr); \ + (head_ptr)->tail = &(entry_ptr)->entry_field_name.next; \ + } while (/*CONSTCOND*/ 0) + +/* TODO would be nice to not have any condition */ +#define HG_QUEUE_POP_HEAD(head_ptr, entry_field_name) \ + do { \ + if ((head_ptr)->head && ((head_ptr)->head = (head_ptr)->head->entry_field_name.next) == NULL) \ + (head_ptr)->tail = &(head_ptr)->head; \ + } while (/*CONSTCOND*/ 0) + +#define HG_QUEUE_FOREACH(var, head_ptr, entry_field_name) \ + for ((var) = ((head_ptr)->head); (var); (var) = ((var)->entry_field_name.next)) + +/** + * Avoid using those for performance reasons or use mercury_list.h instead + */ + +#define HG_QUEUE_REMOVE(head_ptr, entry_ptr, type, entry_field_name) \ + do { \ + if ((head_ptr)->head == (entry_ptr)) { \ + HG_QUEUE_POP_HEAD((head_ptr), entry_field_name); \ + } \ + else { \ + struct type *curelm = (head_ptr)->head; \ + while (curelm->entry_field_name.next != (entry_ptr)) \ + curelm = curelm->entry_field_name.next; \ + if ((curelm->entry_field_name.next = curelm->entry_field_name.next->entry_field_name.next) == \ + NULL) \ + (head_ptr)->tail = &(curelm)->entry_field_name.next; \ + } \ + } while (/*CONSTCOND*/ 0) + +#endif /* MERCURY_QUEUE_H */ diff --git a/src/H5FDsubfiling/mercury/src/util/mercury_thread.c b/src/H5FDsubfiling/mercury/src/util/mercury_thread.c new file mode 100644 index 0000000..858434f --- /dev/null +++ b/src/H5FDsubfiling/mercury/src/util/mercury_thread.c @@ -0,0 +1,162 @@ +/** + * Copyright (c) 2013-2021 UChicago Argonne, LLC and The HDF Group. 
+ * + * SPDX-License-Identifier: BSD-3-Clause + */ + +#include "mercury_thread.h" + +#if !defined(_WIN32) && !defined(__APPLE__) +#include <sched.h> +#endif + +/*---------------------------------------------------------------------------*/ +void +hg_thread_init(hg_thread_t *thread) +{ +#ifdef _WIN32 + *thread = NULL; +#else + *thread = 0; +#endif +} + +/*---------------------------------------------------------------------------*/ +int +hg_thread_create(hg_thread_t *thread, hg_thread_func_t f, void *data) +{ +#ifdef _WIN32 + *thread = CreateThread(NULL, 0, f, data, 0, NULL); + if (*thread == NULL) + return HG_UTIL_FAIL; +#else + if (pthread_create(thread, NULL, f, data)) + return HG_UTIL_FAIL; +#endif + + return HG_UTIL_SUCCESS; +} + +/*---------------------------------------------------------------------------*/ +void +hg_thread_exit(hg_thread_ret_t ret) +{ +#ifdef _WIN32 + ExitThread(ret); +#else + pthread_exit(ret); +#endif +} + +/*---------------------------------------------------------------------------*/ +int +hg_thread_join(hg_thread_t thread) +{ +#ifdef _WIN32 + WaitForSingleObject(thread, INFINITE); + CloseHandle(thread); +#else + if (pthread_join(thread, NULL)) + return HG_UTIL_FAIL; +#endif + + return HG_UTIL_SUCCESS; +} + +/*---------------------------------------------------------------------------*/ +int +hg_thread_cancel(hg_thread_t thread) +{ +#ifdef _WIN32 + WaitForSingleObject(thread, 0); + CloseHandle(thread); +#else + if (pthread_cancel(thread)) + return HG_UTIL_FAIL; +#endif + + return HG_UTIL_SUCCESS; +} + +/*---------------------------------------------------------------------------*/ +int +hg_thread_yield(void) +{ +#ifdef _WIN32 + SwitchToThread(); +#elif defined(__APPLE__) + pthread_yield_np(); +#else + sched_yield(); +#endif + + return HG_UTIL_SUCCESS; +} + +/*---------------------------------------------------------------------------*/ +int +hg_thread_key_create(hg_thread_key_t *key) +{ + if (!key) + return HG_UTIL_FAIL; + +#ifdef _WIN32 + if ((*key = TlsAlloc()) == TLS_OUT_OF_INDEXES) + return HG_UTIL_FAIL; +#else + if (pthread_key_create(key, NULL)) + return HG_UTIL_FAIL; +#endif + + return HG_UTIL_SUCCESS; +} + +/*---------------------------------------------------------------------------*/ +int +hg_thread_key_delete(hg_thread_key_t key) +{ +#ifdef _WIN32 + if (!TlsFree(key)) + return HG_UTIL_FAIL; +#else + if (pthread_key_delete(key)) + return HG_UTIL_FAIL; +#endif + + return HG_UTIL_SUCCESS; +} + +/*---------------------------------------------------------------------------*/ +int +hg_thread_getaffinity(hg_thread_t thread, hg_cpu_set_t *cpu_mask) +{ +#if defined(_WIN32) + return HG_UTIL_FAIL; +#elif defined(__APPLE__) + (void)thread; + (void)cpu_mask; + return HG_UTIL_FAIL; +#else + if (pthread_getaffinity_np(thread, sizeof(hg_cpu_set_t), cpu_mask)) + return HG_UTIL_FAIL; + return HG_UTIL_SUCCESS; +#endif +} + +/*---------------------------------------------------------------------------*/ +int +hg_thread_setaffinity(hg_thread_t thread, const hg_cpu_set_t *cpu_mask) +{ +#if defined(_WIN32) + if (!SetThreadAffinityMask(thread, *cpu_mask)) + return HG_UTIL_FAIL; +#elif defined(__APPLE__) + (void)thread; + (void)cpu_mask; + return HG_UTIL_FAIL; +#else + if (pthread_setaffinity_np(thread, sizeof(hg_cpu_set_t), cpu_mask)) + return HG_UTIL_FAIL; + return HG_UTIL_SUCCESS; +#endif +} diff --git a/src/H5FDsubfiling/mercury/src/util/mercury_thread.h b/src/H5FDsubfiling/mercury/src/util/mercury_thread.h new file mode 100644 index 0000000..185d997 --- /dev/null +++ 
b/src/H5FDsubfiling/mercury/src/util/mercury_thread.h
@@ -0,0 +1,225 @@
+/**
+ * Copyright (c) 2013-2021 UChicago Argonne, LLC and The HDF Group.
+ *
+ * SPDX-License-Identifier: BSD-3-Clause
+ */
+
+#ifndef MERCURY_THREAD_H
+#define MERCURY_THREAD_H
+
+#if !defined(_WIN32) && !defined(_GNU_SOURCE)
+#define _GNU_SOURCE
+#endif
+#include "mercury_util_config.h"
+
+#ifdef _WIN32
+#define _WINSOCKAPI_
+#include <windows.h>
+typedef HANDLE                 hg_thread_t;
+typedef LPTHREAD_START_ROUTINE hg_thread_func_t;
+typedef DWORD                  hg_thread_ret_t;
+#define HG_THREAD_RETURN_TYPE hg_thread_ret_t WINAPI
+typedef DWORD     hg_thread_key_t;
+typedef DWORD_PTR hg_cpu_set_t;
+#else
+#include <pthread.h>
+typedef pthread_t hg_thread_t;
+typedef void *(*hg_thread_func_t)(void *);
+typedef void *hg_thread_ret_t;
+#define HG_THREAD_RETURN_TYPE hg_thread_ret_t
+typedef pthread_key_t hg_thread_key_t;
+#ifdef __APPLE__
+/* Size definition for CPU sets. */
+#define HG_CPU_SETSIZE 1024
+#define HG_NCPUBITS    (8 * sizeof(hg_cpu_mask_t))
+/* Type for array elements in 'cpu_set_t'. */
+typedef uint64_t hg_cpu_mask_t;
+typedef struct {
+    hg_cpu_mask_t bits[HG_CPU_SETSIZE / HG_NCPUBITS];
+} hg_cpu_set_t;
+#else
+typedef cpu_set_t hg_cpu_set_t;
+#endif
+#endif
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Initialize the thread.
+ *
+ * \param thread [IN/OUT]       pointer to thread object
+ */
+HG_UTIL_PUBLIC void hg_thread_init(hg_thread_t *thread);
+
+/**
+ * Create a new thread for the given function.
+ *
+ * \param thread [IN/OUT]       pointer to thread object
+ * \param f [IN]                pointer to function
+ * \param data [IN]             pointer to data that will be passed to function f
+ *
+ * \return Non-negative on success or negative on failure
+ */
+HG_UTIL_PUBLIC int hg_thread_create(hg_thread_t *thread, hg_thread_func_t f, void *data);
+
+/**
+ * End the calling thread. This function does not return.
+ *
+ * \param ret [IN]              exit code for the thread
+ */
+HG_UTIL_PUBLIC void hg_thread_exit(hg_thread_ret_t ret);
+
+/**
+ * Wait for thread completion.
+ *
+ * \param thread [IN]           thread object
+ *
+ * \return Non-negative on success or negative on failure
+ */
+HG_UTIL_PUBLIC int hg_thread_join(hg_thread_t thread);
+
+/**
+ * Terminate the thread.
+ *
+ * \param thread [IN]           thread object
+ *
+ * \return Non-negative on success or negative on failure
+ */
+HG_UTIL_PUBLIC int hg_thread_cancel(hg_thread_t thread);
+
+/**
+ * Yield the processor.
+ *
+ * \return Non-negative on success or negative on failure
+ */
+HG_UTIL_PUBLIC int hg_thread_yield(void);
+
+/**
+ * Obtain handle of the calling thread.
+ *
+ * \return Thread object of the calling thread
+ */
+static HG_UTIL_INLINE hg_thread_t hg_thread_self(void);
+
+/**
+ * Compare thread IDs.
+ *
+ * \return Non-zero if equal, zero if not equal
+ */
+static HG_UTIL_INLINE int hg_thread_equal(hg_thread_t t1, hg_thread_t t2);
+
+/**
+ * Create a thread-specific data key visible to all threads in the process.
+ *
+ * \param key [OUT]             pointer to thread key object
+ *
+ * \return Non-negative on success or negative on failure
+ */
+HG_UTIL_PUBLIC int hg_thread_key_create(hg_thread_key_t *key);
+
+/**
+ * Delete a thread-specific data key previously returned by
+ * hg_thread_key_create().
+ *
+ * \param key [IN]              thread key object
+ *
+ * \return Non-negative on success or negative on failure
+ */
+HG_UTIL_PUBLIC int hg_thread_key_delete(hg_thread_key_t key);
+
+/**
+ * Get the value associated with the specified key.
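+ *
+ * For illustration, a typical key lifecycle (the key name and the stored
+ * pointer "thread_ctx" are placeholders):
+ *
+ * \code
+ * hg_thread_key_t ctx_key;
+ * void           *ctx;
+ *
+ * if (hg_thread_key_create(&ctx_key) == HG_UTIL_SUCCESS) {
+ *     hg_thread_setspecific(ctx_key, thread_ctx); // store per-thread pointer
+ *     ctx = hg_thread_getspecific(ctx_key);       // read it back
+ *     hg_thread_key_delete(ctx_key);
+ * }
+ * \endcode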
+ * + * \param key [IN] thread key object + * + * \return Pointer to data associated to the key + */ +static HG_UTIL_INLINE void *hg_thread_getspecific(hg_thread_key_t key); + +/** + * Set value to specified key. + * + * \param key [IN] thread key object + * \param value [IN] pointer to data that will be associated + * + * \return Non-negative on success or negative on failure + */ +static HG_UTIL_INLINE int hg_thread_setspecific(hg_thread_key_t key, const void *value); + +/** + * Get affinity mask. + * + * \param thread [IN] thread object + * \param cpu_mask [IN/OUT] cpu mask + * + * \return Non-negative on success or negative on failure + */ +HG_UTIL_PUBLIC int hg_thread_getaffinity(hg_thread_t thread, hg_cpu_set_t *cpu_mask); + +/** + * Set affinity mask. + * + * \param thread [IN] thread object + * \param cpu_mask [IN] cpu mask + * + * \return Non-negative on success or negative on failure + */ +HG_UTIL_PUBLIC int hg_thread_setaffinity(hg_thread_t thread, const hg_cpu_set_t *cpu_mask); + +/*---------------------------------------------------------------------------*/ +static HG_UTIL_INLINE hg_thread_t +hg_thread_self(void) +{ +#ifdef _WIN32 + return GetCurrentThread(); +#else + return pthread_self(); +#endif +} + +/*---------------------------------------------------------------------------*/ +static HG_UTIL_INLINE int +hg_thread_equal(hg_thread_t t1, hg_thread_t t2) +{ +#ifdef _WIN32 + return GetThreadId(t1) == GetThreadId(t2); +#else + return pthread_equal(t1, t2); +#endif +} + +/*---------------------------------------------------------------------------*/ +static HG_UTIL_INLINE void * +hg_thread_getspecific(hg_thread_key_t key) +{ +#ifdef _WIN32 + return TlsGetValue(key); +#else + return pthread_getspecific(key); +#endif +} + +/*---------------------------------------------------------------------------*/ +static HG_UTIL_INLINE int +hg_thread_setspecific(hg_thread_key_t key, const void *value) +{ +#ifdef _WIN32 + if (!TlsSetValue(key, (LPVOID)value)) + return HG_UTIL_FAIL; +#else + if (pthread_setspecific(key, value)) + return HG_UTIL_FAIL; +#endif + + return HG_UTIL_SUCCESS; +} + +#ifdef __cplusplus +} +#endif + +#endif /* MERCURY_THREAD_H */ diff --git a/src/H5FDsubfiling/mercury/src/util/mercury_thread_annotation.h b/src/H5FDsubfiling/mercury/src/util/mercury_thread_annotation.h new file mode 100644 index 0000000..50056a1 --- /dev/null +++ b/src/H5FDsubfiling/mercury/src/util/mercury_thread_annotation.h @@ -0,0 +1,35 @@ +/** + * Copyright (c) 2013-2021 UChicago Argonne, LLC and The HDF Group. + * + * SPDX-License-Identifier: BSD-3-Clause + */ + +#ifndef MERCURY_THREAD_ANNOTATION_H +#define MERCURY_THREAD_ANNOTATION_H + +/* Enable thread safety attributes only with clang. + * The attributes can be safely erased when compiling with other compilers. */ +#if defined(__clang__) && (__clang_major__ > 3) +#define HG_THREAD_ANNOTATION_ATTRIBUTE__(x) __attribute__((x)) +#else +#define HG_THREAD_ANNOTATION_ATTRIBUTE__(x) // no-op +#endif + +#define HG_LOCK_CAPABILITY(x) HG_THREAD_ANNOTATION_ATTRIBUTE__(capability(x)) + +#define HG_LOCK_ACQUIRE(...) HG_THREAD_ANNOTATION_ATTRIBUTE__(acquire_capability(__VA_ARGS__)) + +#define HG_LOCK_ACQUIRE_SHARED(...) HG_THREAD_ANNOTATION_ATTRIBUTE__(acquire_shared_capability(__VA_ARGS__)) + +#define HG_LOCK_RELEASE(...) HG_THREAD_ANNOTATION_ATTRIBUTE__(release_capability(__VA_ARGS__)) + +#define HG_LOCK_RELEASE_SHARED(...) HG_THREAD_ANNOTATION_ATTRIBUTE__(release_shared_capability(__VA_ARGS__)) + +#define HG_LOCK_TRY_ACQUIRE(...) 
HG_THREAD_ANNOTATION_ATTRIBUTE__(try_acquire_capability(__VA_ARGS__)) + +#define HG_LOCK_TRY_ACQUIRE_SHARED(...) \ + HG_THREAD_ANNOTATION_ATTRIBUTE__(try_acquire_shared_capability(__VA_ARGS__)) + +#define HG_LOCK_NO_THREAD_SAFETY_ANALYSIS HG_THREAD_ANNOTATION_ATTRIBUTE__(no_thread_safety_analysis) + +#endif /* MERCURY_THREAD_ANNOTATION_H */ diff --git a/src/H5FDsubfiling/mercury/src/util/mercury_thread_condition.c b/src/H5FDsubfiling/mercury/src/util/mercury_thread_condition.c new file mode 100644 index 0000000..9eed4c1 --- /dev/null +++ b/src/H5FDsubfiling/mercury/src/util/mercury_thread_condition.c @@ -0,0 +1,42 @@ +/** + * Copyright (c) 2013-2021 UChicago Argonne, LLC and The HDF Group. + * + * SPDX-License-Identifier: BSD-3-Clause + */ + +#include "mercury_thread_condition.h" + +/*---------------------------------------------------------------------------*/ +int +hg_thread_cond_init(hg_thread_cond_t *cond) +{ +#ifdef _WIN32 + InitializeConditionVariable(cond); +#else + pthread_condattr_t attr; + + pthread_condattr_init(&attr); +#if defined(HG_UTIL_HAS_PTHREAD_CONDATTR_SETCLOCK) && defined(HG_UTIL_HAS_CLOCK_MONOTONIC_COARSE) + /* Must set clock ID if using different clock + * (CLOCK_MONOTONIC_COARSE not supported here) */ + pthread_condattr_setclock(&attr, CLOCK_MONOTONIC); +#endif + if (pthread_cond_init(cond, &attr)) + return HG_UTIL_FAIL; + pthread_condattr_destroy(&attr); +#endif + + return HG_UTIL_SUCCESS; +} + +/*---------------------------------------------------------------------------*/ +int +hg_thread_cond_destroy(hg_thread_cond_t *cond) +{ +#ifndef _WIN32 + if (pthread_cond_destroy(cond)) + return HG_UTIL_FAIL; +#endif + + return HG_UTIL_SUCCESS; +} diff --git a/src/H5FDsubfiling/mercury/src/util/mercury_thread_condition.h b/src/H5FDsubfiling/mercury/src/util/mercury_thread_condition.h new file mode 100644 index 0000000..1435667 --- /dev/null +++ b/src/H5FDsubfiling/mercury/src/util/mercury_thread_condition.h @@ -0,0 +1,172 @@ +/** + * Copyright (c) 2013-2021 UChicago Argonne, LLC and The HDF Group. + * + * SPDX-License-Identifier: BSD-3-Clause + */ + +#ifndef MERCURY_THREAD_CONDITION_H +#define MERCURY_THREAD_CONDITION_H + +#include "mercury_thread_mutex.h" + +#ifdef _WIN32 +typedef CONDITION_VARIABLE hg_thread_cond_t; +#else +#if defined(HG_UTIL_HAS_PTHREAD_CONDATTR_SETCLOCK) && defined(HG_UTIL_HAS_CLOCK_MONOTONIC_COARSE) +#include <time.h> +#elif defined(HG_UTIL_HAS_SYSTIME_H) +#include <sys/time.h> +#endif +#include <stdlib.h> +typedef pthread_cond_t hg_thread_cond_t; +#endif + +#ifdef __cplusplus +extern "C" { +#endif + +/** + * Initialize the condition. + * + * \param cond [IN/OUT] pointer to condition object + * + * \return Non-negative on success or negative on failure + */ +HG_UTIL_PUBLIC int hg_thread_cond_init(hg_thread_cond_t *cond); + +/** + * Destroy the condition. + * + * \param cond [IN/OUT] pointer to condition object + * + * \return Non-negative on success or negative on failure + */ +HG_UTIL_PUBLIC int hg_thread_cond_destroy(hg_thread_cond_t *cond); + +/** + * Wake one thread waiting for the condition to change. + * + * \param cond [IN/OUT] pointer to condition object + * + * \return Non-negative on success or negative on failure + */ +static HG_UTIL_INLINE int hg_thread_cond_signal(hg_thread_cond_t *cond); + +/** + * Wake all the threads waiting for the condition to change. 
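+ *
+ * For illustration, the usual pattern has waiters re-check a predicate in a
+ * loop while holding the associated mutex ("ready", "mutex" and "cond" are
+ * placeholders):
+ *
+ * \code
+ * // waiting side
+ * hg_thread_mutex_lock(&mutex);
+ * while (!ready)
+ *     hg_thread_cond_wait(&cond, &mutex);
+ * hg_thread_mutex_unlock(&mutex);
+ *
+ * // signaling side
+ * hg_thread_mutex_lock(&mutex);
+ * ready = 1;
+ * hg_thread_cond_broadcast(&cond);
+ * hg_thread_mutex_unlock(&mutex);
+ * \endcode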
+ * + * \param cond [IN/OUT] pointer to condition object + * + * \return Non-negative on success or negative on failure + */ +static HG_UTIL_INLINE int hg_thread_cond_broadcast(hg_thread_cond_t *cond); + +/** + * Wait for the condition to change. + * + * \param cond [IN/OUT] pointer to condition object + * \param mutex [IN/OUT] pointer to mutex object + * + * \return Non-negative on success or negative on failure + */ +static HG_UTIL_INLINE int hg_thread_cond_wait(hg_thread_cond_t *cond, hg_thread_mutex_t *mutex); + +/** + * Wait timeout ms for the condition to change. + * + * \param cond [IN/OUT] pointer to condition object + * \param mutex [IN/OUT] pointer to mutex object + * \param timeout [IN] timeout (in milliseconds) + * + * \return Non-negative on success or negative on failure + */ +static HG_UTIL_INLINE int hg_thread_cond_timedwait(hg_thread_cond_t *cond, hg_thread_mutex_t *mutex, + unsigned int timeout); + +/*---------------------------------------------------------------------------*/ +static HG_UTIL_INLINE int +hg_thread_cond_signal(hg_thread_cond_t *cond) +{ +#ifdef _WIN32 + WakeConditionVariable(cond); +#else + if (pthread_cond_signal(cond)) + return HG_UTIL_FAIL; +#endif + + return HG_UTIL_SUCCESS; +} + +/*---------------------------------------------------------------------------*/ +static HG_UTIL_INLINE int +hg_thread_cond_broadcast(hg_thread_cond_t *cond) +{ +#ifdef _WIN32 + WakeAllConditionVariable(cond); +#else + if (pthread_cond_broadcast(cond)) + return HG_UTIL_FAIL; +#endif + + return HG_UTIL_SUCCESS; +} + +/*---------------------------------------------------------------------------*/ +static HG_UTIL_INLINE int +hg_thread_cond_wait(hg_thread_cond_t *cond, hg_thread_mutex_t *mutex) +{ +#ifdef _WIN32 + if (!SleepConditionVariableCS(cond, mutex, INFINITE)) + return HG_UTIL_FAIL; +#else + if (pthread_cond_wait(cond, mutex)) + return HG_UTIL_FAIL; +#endif + + return HG_UTIL_SUCCESS; +} + +/*---------------------------------------------------------------------------*/ +static HG_UTIL_INLINE int +hg_thread_cond_timedwait(hg_thread_cond_t *cond, hg_thread_mutex_t *mutex, unsigned int timeout) +{ +#ifdef _WIN32 + if (!SleepConditionVariableCS(cond, mutex, timeout)) + return HG_UTIL_FAIL; +#else +#if defined(HG_UTIL_HAS_PTHREAD_CONDATTR_SETCLOCK) && defined(HG_UTIL_HAS_CLOCK_MONOTONIC_COARSE) + struct timespec now; +#else + struct timeval now; +#endif + struct timespec abs_timeout; + ldiv_t ld; + + /* Need to convert timeout (ms) to absolute time */ +#if defined(HG_UTIL_HAS_PTHREAD_CONDATTR_SETCLOCK) && defined(HG_UTIL_HAS_CLOCK_MONOTONIC_COARSE) + clock_gettime(CLOCK_MONOTONIC_COARSE, &now); + + /* Get sec / nsec */ + ld = ldiv(now.tv_nsec + timeout * 1000000L, 1000000000L); + abs_timeout.tv_nsec = ld.rem; +#elif defined(HG_UTIL_HAS_SYSTIME_H) + gettimeofday(&now, NULL); + + /* Get sec / usec */ + ld = ldiv(now.tv_usec + timeout * 1000L, 1000000L); + abs_timeout.tv_nsec = ld.rem * 1000L; +#endif + abs_timeout.tv_sec = now.tv_sec + ld.quot; + + if (pthread_cond_timedwait(cond, mutex, &abs_timeout)) + return HG_UTIL_FAIL; +#endif + + return HG_UTIL_SUCCESS; +} + +#ifdef __cplusplus +} +#endif + +#endif /* MERCURY_THREAD_CONDITION_H */ diff --git a/src/H5FDsubfiling/mercury/src/util/mercury_thread_mutex.c b/src/H5FDsubfiling/mercury/src/util/mercury_thread_mutex.c new file mode 100644 index 0000000..c60ca94 --- /dev/null +++ b/src/H5FDsubfiling/mercury/src/util/mercury_thread_mutex.c @@ -0,0 +1,92 @@ +/** + * Copyright (c) 2013-2021 UChicago Argonne, LLC and The HDF Group. 
+ * + * SPDX-License-Identifier: BSD-3-Clause + */ + +#include "mercury_thread_mutex.h" + +#include "mercury_util_error.h" + +#include <string.h> + +#ifndef _WIN32 +static int +hg_thread_mutex_init_posix(hg_thread_mutex_t *mutex, int kind) +{ + pthread_mutexattr_t mutex_attr; + int ret = HG_UTIL_SUCCESS; + int rc; + + rc = pthread_mutexattr_init(&mutex_attr); + HG_UTIL_CHECK_ERROR(rc != 0, done, ret, HG_UTIL_FAIL, "pthread_mutexattr_init() failed (%s)", + strerror(rc)); + + /* Keep mutex mode as normal and do not expect error checking */ + rc = pthread_mutexattr_settype(&mutex_attr, kind); + HG_UTIL_CHECK_ERROR(rc != 0, done, ret, HG_UTIL_FAIL, "pthread_mutexattr_settype() failed (%s)", + strerror(rc)); + + rc = pthread_mutex_init(mutex, &mutex_attr); + HG_UTIL_CHECK_ERROR(rc != 0, done, ret, HG_UTIL_FAIL, "pthread_mutex_init() failed (%s)", strerror(rc)); + +done: + rc = pthread_mutexattr_destroy(&mutex_attr); + HG_UTIL_CHECK_ERROR_DONE(rc != 0, "pthread_mutexattr_destroy() failed (%s)", strerror(rc)); + + return ret; +} +#endif + +/*---------------------------------------------------------------------------*/ +int +hg_thread_mutex_init(hg_thread_mutex_t *mutex) +{ + int ret = HG_UTIL_SUCCESS; + +#ifdef _WIN32 + InitializeCriticalSection(mutex); +#else + ret = hg_thread_mutex_init_posix(mutex, PTHREAD_MUTEX_NORMAL); +#endif + + return ret; +} + +/*---------------------------------------------------------------------------*/ +int +hg_thread_mutex_init_fast(hg_thread_mutex_t *mutex) +{ + int ret = HG_UTIL_SUCCESS; + +#if defined(_WIN32) + ret = hg_thread_mutex_init(mutex); +#elif defined(HG_UTIL_HAS_PTHREAD_MUTEX_ADAPTIVE_NP) + /* Set type to PTHREAD_MUTEX_ADAPTIVE_NP to improve performance */ + ret = hg_thread_mutex_init_posix(mutex, PTHREAD_MUTEX_ADAPTIVE_NP); +#else + ret = hg_thread_mutex_init_posix(mutex, PTHREAD_MUTEX_NORMAL); +#endif + + return ret; +} + +/*---------------------------------------------------------------------------*/ +int +hg_thread_mutex_destroy(hg_thread_mutex_t *mutex) +{ + int ret = HG_UTIL_SUCCESS; + +#ifdef _WIN32 + DeleteCriticalSection(mutex); +#else + int rc; + + rc = pthread_mutex_destroy(mutex); + HG_UTIL_CHECK_ERROR(rc != 0, done, ret, HG_UTIL_FAIL, "pthread_mutex_destroy() failed (%s)", + strerror(rc)); + +done: +#endif + return ret; +} diff --git a/src/H5FDsubfiling/mercury/src/util/mercury_thread_mutex.h b/src/H5FDsubfiling/mercury/src/util/mercury_thread_mutex.h new file mode 100644 index 0000000..61d74a3 --- /dev/null +++ b/src/H5FDsubfiling/mercury/src/util/mercury_thread_mutex.h @@ -0,0 +1,121 @@ +/** + * Copyright (c) 2013-2021 UChicago Argonne, LLC and The HDF Group. + * + * SPDX-License-Identifier: BSD-3-Clause + */ + +#ifndef MERCURY_THREAD_MUTEX_H +#define MERCURY_THREAD_MUTEX_H + +#include "mercury_util_config.h" + +#include "mercury_thread_annotation.h" + +#ifdef _WIN32 +#define _WINSOCKAPI_ +#include <windows.h> +#define HG_THREAD_MUTEX_INITIALIZER NULL +typedef CRITICAL_SECTION hg_thread_mutex_t; +#else +#include <pthread.h> +#define HG_THREAD_MUTEX_INITIALIZER PTHREAD_MUTEX_INITIALIZER +typedef pthread_mutex_t HG_LOCK_CAPABILITY("mutex") hg_thread_mutex_t; +#endif + +#ifdef __cplusplus +extern "C" { +#endif + +/** + * Initialize the mutex. + * + * \param mutex [IN/OUT] pointer to mutex object + * + * \return Non-negative on success or negative on failure + */ +HG_UTIL_PUBLIC int hg_thread_mutex_init(hg_thread_mutex_t *mutex); + +/** + * Initialize the mutex, asking for "fast" mutex. 
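+ *
+ * For illustration, the lifecycle is the same as with hg_thread_mutex_init();
+ * only the underlying mutex kind may differ ("lock" is a placeholder):
+ *
+ * \code
+ * hg_thread_mutex_t lock;
+ *
+ * hg_thread_mutex_init_fast(&lock);
+ * hg_thread_mutex_lock(&lock);
+ * // ... critical section ...
+ * hg_thread_mutex_unlock(&lock);
+ * hg_thread_mutex_destroy(&lock);
+ * \endcode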
+ * + * \param mutex [IN/OUT] pointer to mutex object + * + * \return Non-negative on success or negative on failure + */ +HG_UTIL_PUBLIC int hg_thread_mutex_init_fast(hg_thread_mutex_t *mutex); + +/** + * Destroy the mutex. + * + * \param mutex [IN/OUT] pointer to mutex object + * + * \return Non-negative on success or negative on failure + */ +HG_UTIL_PUBLIC int hg_thread_mutex_destroy(hg_thread_mutex_t *mutex); + +/** + * Lock the mutex. + * + * \param mutex [IN/OUT] pointer to mutex object + */ +static HG_UTIL_INLINE void hg_thread_mutex_lock(hg_thread_mutex_t *mutex) HG_LOCK_ACQUIRE(*mutex); + +/** + * Try locking the mutex. + * + * \param mutex [IN/OUT] pointer to mutex object + * + * \return Non-negative on success or negative on failure + */ +static HG_UTIL_INLINE int hg_thread_mutex_try_lock(hg_thread_mutex_t *mutex) + HG_LOCK_TRY_ACQUIRE(HG_UTIL_SUCCESS, *mutex); + +/** + * Unlock the mutex. + * + * \param mutex [IN/OUT] pointer to mutex object + */ +static HG_UTIL_INLINE void hg_thread_mutex_unlock(hg_thread_mutex_t *mutex) HG_LOCK_RELEASE(*mutex); + +/*---------------------------------------------------------------------------*/ +static HG_UTIL_INLINE void +hg_thread_mutex_lock(hg_thread_mutex_t *mutex) HG_LOCK_NO_THREAD_SAFETY_ANALYSIS +{ +#ifdef _WIN32 + EnterCriticalSection(mutex); +#else + (void)pthread_mutex_lock(mutex); +#endif +} + +/*---------------------------------------------------------------------------*/ +static HG_UTIL_INLINE int +hg_thread_mutex_try_lock(hg_thread_mutex_t *mutex) HG_LOCK_NO_THREAD_SAFETY_ANALYSIS +{ +#ifdef _WIN32 + if (!TryEnterCriticalSection(mutex)) + return HG_UTIL_FAIL; +#else + if (pthread_mutex_trylock(mutex)) + return HG_UTIL_FAIL; +#endif + + return HG_UTIL_SUCCESS; +} + +/*---------------------------------------------------------------------------*/ +static HG_UTIL_INLINE void +hg_thread_mutex_unlock(hg_thread_mutex_t *mutex) HG_LOCK_NO_THREAD_SAFETY_ANALYSIS +{ +#ifdef _WIN32 + LeaveCriticalSection(mutex); +#else + (void)pthread_mutex_unlock(mutex); +#endif +} + +#ifdef __cplusplus +} +#endif + +#endif /* MERCURY_THREAD_MUTEX_H */ diff --git a/src/H5FDsubfiling/mercury/src/util/mercury_thread_pool.c b/src/H5FDsubfiling/mercury/src/util/mercury_thread_pool.c new file mode 100644 index 0000000..76248d1 --- /dev/null +++ b/src/H5FDsubfiling/mercury/src/util/mercury_thread_pool.c @@ -0,0 +1,175 @@ +/** + * Copyright (c) 2013-2021 UChicago Argonne, LLC and The HDF Group. 
+ * + * SPDX-License-Identifier: BSD-3-Clause + */ + +#include "mercury_thread_pool.h" + +#include "mercury_util_error.h" + +#include <stdlib.h> + +/****************/ +/* Local Macros */ +/****************/ + +/************************************/ +/* Local Type and Struct Definition */ +/************************************/ + +struct hg_thread_pool_private { + struct hg_thread_pool pool; + unsigned int thread_count; + hg_thread_t * threads; +}; + +/********************/ +/* Local Prototypes */ +/********************/ + +/** + * Worker thread run by the thread pool + */ +static HG_THREAD_RETURN_TYPE hg_thread_pool_worker(void *args); + +/*******************/ +/* Local Variables */ +/*******************/ + +/*---------------------------------------------------------------------------*/ +static HG_THREAD_RETURN_TYPE +hg_thread_pool_worker(void *args) +{ + hg_thread_ret_t ret = 0; + hg_thread_pool_t * pool = (hg_thread_pool_t *)args; + struct hg_thread_work *work; + + while (1) { + hg_thread_mutex_lock(&pool->mutex); + + /* If not shutting down and nothing to do, worker sleeps */ + while (!pool->shutdown && HG_QUEUE_IS_EMPTY(&pool->queue)) { + int rc; + + pool->sleeping_worker_count++; + + rc = hg_thread_cond_wait(&pool->cond, &pool->mutex); + HG_UTIL_CHECK_ERROR_NORET(rc != HG_UTIL_SUCCESS, unlock, + "Thread cannot wait on condition variable"); + + pool->sleeping_worker_count--; + } + + if (pool->shutdown && HG_QUEUE_IS_EMPTY(&pool->queue)) + goto unlock; + + /* Grab our task */ + work = HG_QUEUE_FIRST(&pool->queue); + HG_QUEUE_POP_HEAD(&pool->queue, entry); + + /* Unlock */ + hg_thread_mutex_unlock(&pool->mutex); + + /* Get to work */ + (*work->func)(work->args); + } + +unlock: + hg_thread_mutex_unlock(&pool->mutex); + + return ret; +} + +/*---------------------------------------------------------------------------*/ +int +hg_thread_pool_init(unsigned int thread_count, hg_thread_pool_t **pool_ptr) +{ + int ret = HG_UTIL_SUCCESS, rc; + struct hg_thread_pool_private *priv_pool = NULL; + unsigned int i; + + HG_UTIL_CHECK_ERROR(pool_ptr == NULL, error, ret, HG_UTIL_FAIL, "NULL pointer"); + + priv_pool = (struct hg_thread_pool_private *)malloc(sizeof(struct hg_thread_pool_private)); + HG_UTIL_CHECK_ERROR(priv_pool == NULL, error, ret, HG_UTIL_FAIL, "Could not allocate thread pool"); + + priv_pool->pool.sleeping_worker_count = 0; + priv_pool->thread_count = thread_count; + priv_pool->threads = NULL; + HG_QUEUE_INIT(&priv_pool->pool.queue); + priv_pool->pool.shutdown = 0; + + rc = hg_thread_mutex_init(&priv_pool->pool.mutex); + HG_UTIL_CHECK_ERROR(rc != HG_UTIL_SUCCESS, error, ret, HG_UTIL_FAIL, "Could not initialize mutex"); + + rc = hg_thread_cond_init(&priv_pool->pool.cond); + HG_UTIL_CHECK_ERROR(rc != HG_UTIL_SUCCESS, error, ret, HG_UTIL_FAIL, + "Could not initialize thread condition"); + + priv_pool->threads = (hg_thread_t *)malloc(thread_count * sizeof(hg_thread_t)); + HG_UTIL_CHECK_ERROR(!priv_pool->threads, error, ret, HG_UTIL_FAIL, + "Could not allocate thread pool array"); + + /* Start worker threads */ + for (i = 0; i < thread_count; i++) { + rc = hg_thread_create(&priv_pool->threads[i], hg_thread_pool_worker, (void *)priv_pool); + HG_UTIL_CHECK_ERROR(rc != HG_UTIL_SUCCESS, error, ret, HG_UTIL_FAIL, "Could not create thread"); + } + + *pool_ptr = (struct hg_thread_pool *)priv_pool; + + return ret; + +error: + if (priv_pool) + hg_thread_pool_destroy((struct hg_thread_pool *)priv_pool); + + return ret; +} + +/*---------------------------------------------------------------------------*/ 
+int +hg_thread_pool_destroy(hg_thread_pool_t *pool) +{ + struct hg_thread_pool_private *priv_pool = (struct hg_thread_pool_private *)pool; + int ret = HG_UTIL_SUCCESS, rc; + unsigned int i; + + if (!priv_pool) + goto done; + + if (priv_pool->threads) { + hg_thread_mutex_lock(&priv_pool->pool.mutex); + + priv_pool->pool.shutdown = 1; + + rc = hg_thread_cond_broadcast(&priv_pool->pool.cond); + HG_UTIL_CHECK_ERROR(rc != HG_UTIL_SUCCESS, error, ret, HG_UTIL_FAIL, + "Could not broadcast condition signal"); + + hg_thread_mutex_unlock(&priv_pool->pool.mutex); + + for (i = 0; i < priv_pool->thread_count; i++) { + rc = hg_thread_join(priv_pool->threads[i]); + HG_UTIL_CHECK_ERROR(rc != HG_UTIL_SUCCESS, done, ret, HG_UTIL_FAIL, "Could not join thread"); + } + } + + rc = hg_thread_mutex_destroy(&priv_pool->pool.mutex); + HG_UTIL_CHECK_ERROR(rc != HG_UTIL_SUCCESS, done, ret, HG_UTIL_FAIL, "Could not destroy mutex"); + + rc = hg_thread_cond_destroy(&priv_pool->pool.cond); + HG_UTIL_CHECK_ERROR(rc != HG_UTIL_SUCCESS, done, ret, HG_UTIL_FAIL, "Could not destroy thread condition"); + + free(priv_pool->threads); + free(priv_pool); + +done: + return ret; + +error: + hg_thread_mutex_unlock(&priv_pool->pool.mutex); + + return ret; +} diff --git a/src/H5FDsubfiling/mercury/src/util/mercury_thread_pool.h b/src/H5FDsubfiling/mercury/src/util/mercury_thread_pool.h new file mode 100644 index 0000000..b399f66 --- /dev/null +++ b/src/H5FDsubfiling/mercury/src/util/mercury_thread_pool.h @@ -0,0 +1,114 @@ +/** + * Copyright (c) 2013-2021 UChicago Argonne, LLC and The HDF Group. + * + * SPDX-License-Identifier: BSD-3-Clause + */ + +#ifndef MERCURY_THREAD_POOL_H +#define MERCURY_THREAD_POOL_H + +#include "mercury_queue.h" +#include "mercury_thread.h" +#include "mercury_thread_condition.h" + +/*************************************/ +/* Public Type and Struct Definition */ +/*************************************/ + +typedef struct hg_thread_pool hg_thread_pool_t; + +struct hg_thread_pool { + unsigned int sleeping_worker_count; + HG_QUEUE_HEAD(hg_thread_work) queue; + int shutdown; + hg_thread_mutex_t mutex; + hg_thread_cond_t cond; +}; + +struct hg_thread_work { + hg_thread_func_t func; + void * args; + HG_QUEUE_ENTRY(hg_thread_work) entry; /* Internal */ +}; + +/*****************/ +/* Public Macros */ +/*****************/ + +/*********************/ +/* Public Prototypes */ +/*********************/ + +#ifdef __cplusplus +extern "C" { +#endif + +/** + * Initialize the thread pool. + * + * \param thread_count [IN] number of threads that will be created at + * initialization + * \param pool [OUT] pointer to pool object + * + * \return Non-negative on success or negative on failure + */ +HG_UTIL_PUBLIC int hg_thread_pool_init(unsigned int thread_count, hg_thread_pool_t **pool); + +/** + * Destroy the thread pool. + * + * \param pool [IN/OUT] pointer to pool object + * + * \return Non-negative on success or negative on failure + */ +HG_UTIL_PUBLIC int hg_thread_pool_destroy(hg_thread_pool_t *pool); + +/** + * Post work to the pool. Note that the operation may be queued depending on + * the number of threads and number of tasks already running. 
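+ *
+ * For illustration, a minimal submission sketch; the worker function and its
+ * argument are placeholders, and the work struct must remain valid until the
+ * callback has executed, since it is linked into the pool's queue:
+ *
+ * \code
+ * hg_thread_pool_t     *pool = NULL;
+ * struct hg_thread_work work;
+ *
+ * if (hg_thread_pool_init(4, &pool) == HG_UTIL_SUCCESS) {
+ *     work.func = my_worker; // HG_THREAD_RETURN_TYPE my_worker(void *args)
+ *     work.args = my_args;
+ *     hg_thread_pool_post(pool, &work);
+ *     hg_thread_pool_destroy(pool); // drains the queue, then joins workers
+ * }
+ * \endcode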
+ * + * \param pool [IN/OUT] pointer to pool object + * \param work [IN] pointer to work struct + * + * \return Non-negative on success or negative on failure + */ +static HG_UTIL_INLINE int hg_thread_pool_post(hg_thread_pool_t *pool, struct hg_thread_work *work); + +/*---------------------------------------------------------------------------*/ +static HG_UTIL_INLINE int +hg_thread_pool_post(hg_thread_pool_t *pool, struct hg_thread_work *work) +{ + int ret = HG_UTIL_SUCCESS; + + if (!pool || !work) + return HG_UTIL_FAIL; + + if (!work->func) + return HG_UTIL_FAIL; + + hg_thread_mutex_lock(&pool->mutex); + + /* Are we shutting down ? */ + if (pool->shutdown) { + ret = HG_UTIL_FAIL; + goto unlock; + } + + /* Add task to task queue */ + HG_QUEUE_PUSH_TAIL(&pool->queue, work, entry); + + /* Wake up sleeping worker */ + if (pool->sleeping_worker_count && (hg_thread_cond_signal(&pool->cond) != HG_UTIL_SUCCESS)) + ret = HG_UTIL_FAIL; + +unlock: + hg_thread_mutex_unlock(&pool->mutex); + + return ret; +} + +#ifdef __cplusplus +} +#endif + +#endif /* MERCURY_THREAD_POOL_H */ diff --git a/src/H5FDsubfiling/mercury/src/util/mercury_time.h b/src/H5FDsubfiling/mercury/src/util/mercury_time.h new file mode 100644 index 0000000..ba82a8a --- /dev/null +++ b/src/H5FDsubfiling/mercury/src/util/mercury_time.h @@ -0,0 +1,500 @@ +/** + * Copyright (c) 2013-2021 UChicago Argonne, LLC and The HDF Group. + * + * SPDX-License-Identifier: BSD-3-Clause + */ + +#ifndef MERCURY_TIME_H +#define MERCURY_TIME_H + +#include "mercury_util_config.h" + +#if defined(_WIN32) +#define _WINSOCKAPI_ +#include <windows.h> +#elif defined(HG_UTIL_HAS_TIME_H) && defined(HG_UTIL_HAS_CLOCK_GETTIME) +#include <time.h> +#elif defined(__APPLE__) && defined(HG_UTIL_HAS_SYSTIME_H) +#include <mach/mach_time.h> +#include <sys/time.h> +#else +#include <stdio.h> +#include <unistd.h> +#if defined(HG_UTIL_HAS_SYSTIME_H) +#include <sys/time.h> +#else +#error "Not supported on this platform." +#endif +#endif + +/*************************************/ +/* Public Type and Struct Definition */ +/*************************************/ + +#if defined(HG_UTIL_HAS_TIME_H) && defined(HG_UTIL_HAS_CLOCK_GETTIME) +typedef struct timespec hg_time_t; +#else +typedef struct hg_time hg_time_t; + +struct hg_time { + long tv_sec; + long tv_usec; +}; +#endif + +/*****************/ +/* Public Macros */ +/*****************/ + +/*********************/ +/* Public Prototypes */ +/*********************/ + +#ifdef __cplusplus +extern "C" { +#endif + +/** + * Get an elapsed time on the calling processor. + * + * \param tv [OUT] pointer to returned time structure + * + * \return Non-negative on success or negative on failure + */ +static HG_UTIL_INLINE int hg_time_get_current(hg_time_t *tv); + +/** + * Get an elapsed time on the calling processor (resolution is ms). + * + * \param tv [OUT] pointer to returned time structure + * + * \return Non-negative on success or negative on failure + */ +static HG_UTIL_INLINE int hg_time_get_current_ms(hg_time_t *tv); + +/** + * Convert hg_time_t to double. + * + * \param tv [IN] time structure + * + * \return Converted time in seconds + */ +static HG_UTIL_INLINE double hg_time_to_double(hg_time_t tv); + +/** + * Convert double to hg_time_t. + * + * \param d [IN] time in seconds + * + * \return Converted time structure + */ +static HG_UTIL_INLINE hg_time_t hg_time_from_double(double d); + +/** + * Convert (integer) milliseconds to hg_time_t. 
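+ *
+ * For illustration, a relative timeout can be converted into an absolute
+ * deadline and polled with hg_time_less() (the 100 ms value is a
+ * placeholder):
+ *
+ * \code
+ * hg_time_t now, deadline;
+ *
+ * hg_time_get_current_ms(&now);
+ * deadline = hg_time_add(now, hg_time_from_ms(100));
+ * while (hg_time_less(now, deadline)) {
+ *     // ... poll for progress ...
+ *     hg_time_get_current_ms(&now);
+ * }
+ * \endcode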
+ * + * \param ms [IN] time in milliseconds + * + * \return Converted time structure + */ +static HG_UTIL_INLINE hg_time_t hg_time_from_ms(unsigned int ms); + +/** + * Convert hg_time_t to (integer) milliseconds. + * + * \param tv [IN] time structure + * + * \return Time in milliseconds + */ +static HG_UTIL_INLINE unsigned int hg_time_to_ms(hg_time_t tv); + +/** + * Compare time values. + * + * \param in1 [IN] time structure + * \param in2 [IN] time structure + * + * \return 1 if in1 < in2, 0 otherwise + */ +static HG_UTIL_INLINE int hg_time_less(hg_time_t in1, hg_time_t in2); + +/** + * Diff time values and return the number of seconds elapsed between + * time \in2 and time \in1. + * + * \param in2 [IN] time structure + * \param in1 [IN] time structure + * + * \return Subtracted time + */ +static HG_UTIL_INLINE double hg_time_diff(hg_time_t in2, hg_time_t in1); + +/** + * Add time values. + * + * \param in1 [IN] time structure + * \param in2 [IN] time structure + * + * \return Summed time structure + */ +static HG_UTIL_INLINE hg_time_t hg_time_add(hg_time_t in1, hg_time_t in2); + +/** + * Subtract time values. + * + * \param in1 [IN] time structure + * \param in2 [IN] time structure + * + * \return Subtracted time structure + */ +static HG_UTIL_INLINE hg_time_t hg_time_subtract(hg_time_t in1, hg_time_t in2); + +/** + * Sleep until the time specified in rqt has elapsed. + * + * \param reqt [IN] time structure + * + * \return Non-negative on success or negative on failure + */ +static HG_UTIL_INLINE int hg_time_sleep(const hg_time_t rqt); + +/** + * Get a string containing current time/date stamp. + * + * \return Valid string or NULL on failure + */ +static HG_UTIL_INLINE char *hg_time_stamp(void); + +/*---------------------------------------------------------------------------*/ +#ifdef _WIN32 +static HG_UTIL_INLINE LARGE_INTEGER +get_FILETIME_offset(void) +{ + SYSTEMTIME s; + FILETIME f; + LARGE_INTEGER t; + + s.wYear = 1970; + s.wMonth = 1; + s.wDay = 1; + s.wHour = 0; + s.wMinute = 0; + s.wSecond = 0; + s.wMilliseconds = 0; + SystemTimeToFileTime(&s, &f); + t.QuadPart = f.dwHighDateTime; + t.QuadPart <<= 32; + t.QuadPart |= f.dwLowDateTime; + + return t; +} + +/*---------------------------------------------------------------------------*/ +static HG_UTIL_INLINE int +hg_time_get_current(hg_time_t *tv) +{ + LARGE_INTEGER t; + FILETIME f; + double t_usec; + static LARGE_INTEGER offset; + static double freq_to_usec; + static int initialized = 0; + static BOOL use_perf_counter = 0; + + if (!initialized) { + LARGE_INTEGER perf_freq; + initialized = 1; + use_perf_counter = QueryPerformanceFrequency(&perf_freq); + if (use_perf_counter) { + QueryPerformanceCounter(&offset); + freq_to_usec = (double)perf_freq.QuadPart / 1000000.; + } + else { + offset = get_FILETIME_offset(); + freq_to_usec = 10.; + } + } + if (use_perf_counter) { + QueryPerformanceCounter(&t); + } + else { + GetSystemTimeAsFileTime(&f); + t.QuadPart = f.dwHighDateTime; + t.QuadPart <<= 32; + t.QuadPart |= f.dwLowDateTime; + } + + t.QuadPart -= offset.QuadPart; + t_usec = (double)t.QuadPart / freq_to_usec; + t.QuadPart = (LONGLONG)t_usec; + tv->tv_sec = (long)(t.QuadPart / 1000000); + tv->tv_usec = (long)(t.QuadPart % 1000000); + + return HG_UTIL_SUCCESS; +} + +/*---------------------------------------------------------------------------*/ +static HG_UTIL_INLINE int +hg_time_get_current_ms(hg_time_t *tv) +{ + return hg_time_get_current(tv); +} + +/*---------------------------------------------------------------------------*/ 
+#elif defined(HG_UTIL_HAS_TIME_H) && defined(HG_UTIL_HAS_CLOCK_GETTIME)
+static HG_UTIL_INLINE int
+hg_time_get_current(hg_time_t *tv)
+{
+    clock_gettime(CLOCK_MONOTONIC, tv);
+
+    return HG_UTIL_SUCCESS;
+}
+
+/*---------------------------------------------------------------------------*/
+static HG_UTIL_INLINE int
+hg_time_get_current_ms(hg_time_t *tv)
+{
+/* ppc/32 and ppc/64 do not support CLOCK_MONOTONIC_COARSE in vdso */
+#if defined(__ppc64__) || defined(__ppc__) || defined(__PPC64__) || defined(__PPC__) ||               \
+    !defined(HG_UTIL_HAS_CLOCK_MONOTONIC_COARSE)
+    clock_gettime(CLOCK_MONOTONIC, tv);
+#else
+    /* We don't need fine grain time stamps, _COARSE resolution is 1ms */
+    clock_gettime(CLOCK_MONOTONIC_COARSE, tv);
+#endif
+    return HG_UTIL_SUCCESS;
+}
+
+/*---------------------------------------------------------------------------*/
+#elif defined(__APPLE__) && defined(HG_UTIL_HAS_SYSTIME_H)
+static HG_UTIL_INLINE int
+hg_time_get_current(hg_time_t *tv)
+{
+    static uint64_t monotonic_timebase_factor = 0;
+    uint64_t        monotonic_nsec;
+
+    if (monotonic_timebase_factor == 0) {
+        mach_timebase_info_data_t timebase_info;
+
+        (void)mach_timebase_info(&timebase_info);
+        monotonic_timebase_factor = timebase_info.numer / timebase_info.denom;
+    }
+    monotonic_nsec = (mach_absolute_time() * monotonic_timebase_factor);
+    tv->tv_sec     = (long)(monotonic_nsec / 1000000000);
+    /* Microseconds from the nanoseconds left over after removing whole seconds */
+    tv->tv_usec = (long)((monotonic_nsec - (uint64_t)tv->tv_sec * 1000000000) / 1000);
+
+    return HG_UTIL_SUCCESS;
+}
+
+/*---------------------------------------------------------------------------*/
+static HG_UTIL_INLINE int
+hg_time_get_current_ms(hg_time_t *tv)
+{
+    return hg_time_get_current(tv);
+}
+
+#else
+/*---------------------------------------------------------------------------*/
+static HG_UTIL_INLINE int
+hg_time_get_current(hg_time_t *tv)
+{
+    gettimeofday((struct timeval *)tv, NULL);
+
+    return HG_UTIL_SUCCESS;
+}
+
+/*---------------------------------------------------------------------------*/
+static HG_UTIL_INLINE int
+hg_time_get_current_ms(hg_time_t *tv)
+{
+    return hg_time_get_current(tv);
+}
+
+#endif
+/*---------------------------------------------------------------------------*/
+static HG_UTIL_INLINE double
+hg_time_to_double(hg_time_t tv)
+{
+#if defined(HG_UTIL_HAS_TIME_H) && defined(HG_UTIL_HAS_CLOCK_GETTIME)
+    return (double)tv.tv_sec + (double)(tv.tv_nsec) * 0.000000001;
+#else
+    return (double)tv.tv_sec + (double)(tv.tv_usec) * 0.000001;
+#endif
+}
+
+/*---------------------------------------------------------------------------*/
+static HG_UTIL_INLINE hg_time_t
+hg_time_from_double(double d)
+{
+    hg_time_t tv;
+
+    tv.tv_sec = (long)d;
+#if defined(HG_UTIL_HAS_TIME_H) && defined(HG_UTIL_HAS_CLOCK_GETTIME)
+    tv.tv_nsec = (long)((d - (double)(tv.tv_sec)) * 1000000000);
+#else
+    tv.tv_usec = (long)((d - (double)(tv.tv_sec)) * 1000000);
#endif
+
+    return tv;
+}
+
+/*---------------------------------------------------------------------------*/
+static HG_UTIL_INLINE unsigned int
+hg_time_to_ms(hg_time_t tv)
+{
+#if defined(HG_UTIL_HAS_TIME_H) && defined(HG_UTIL_HAS_CLOCK_GETTIME)
+    return (unsigned int)(tv.tv_sec * 1000 + ((tv.tv_nsec + 999999) / 1000000));
+#else
+    return (unsigned int)(tv.tv_sec * 1000 + ((tv.tv_usec + 999) / 1000));
+#endif
+}
+
+/*---------------------------------------------------------------------------*/
+static HG_UTIL_INLINE hg_time_t
+hg_time_from_ms(unsigned int ms)
+{
+#if defined(HG_UTIL_HAS_TIME_H) && defined(HG_UTIL_HAS_CLOCK_GETTIME)
+    return (hg_time_t){.tv_sec = ms / 1000, .tv_nsec = (ms - (ms / 1000) * 1000) * 1000000};
+#else
+    return (hg_time_t){.tv_sec = ms / 1000, .tv_usec = (ms - (ms / 1000) * 1000) * 1000};
+#endif
+}
+
+/*---------------------------------------------------------------------------*/
+static HG_UTIL_INLINE int
+hg_time_less(hg_time_t in1, hg_time_t in2)
+{
+    return ((in1.tv_sec < in2.tv_sec) || ((in1.tv_sec == in2.tv_sec) &&
+#if defined(HG_UTIL_HAS_TIME_H) && defined(HG_UTIL_HAS_CLOCK_GETTIME)
+                                          (in1.tv_nsec < in2.tv_nsec)));
+#else
+                                          (in1.tv_usec < in2.tv_usec)));
+#endif
+}
+
+/*---------------------------------------------------------------------------*/
+static HG_UTIL_INLINE double
+hg_time_diff(hg_time_t in2, hg_time_t in1)
+{
+#if defined(HG_UTIL_HAS_TIME_H) && defined(HG_UTIL_HAS_CLOCK_GETTIME)
+    return ((double)in2.tv_sec + (double)(in2.tv_nsec) * 0.000000001) -
+           ((double)in1.tv_sec + (double)(in1.tv_nsec) * 0.000000001);
+#else
+    return ((double)in2.tv_sec + (double)(in2.tv_usec) * 0.000001) -
+           ((double)in1.tv_sec + (double)(in1.tv_usec) * 0.000001);
+#endif
+}
+
+/*---------------------------------------------------------------------------*/
+static HG_UTIL_INLINE hg_time_t
+hg_time_add(hg_time_t in1, hg_time_t in2)
+{
+    hg_time_t out;
+
+    out.tv_sec = in1.tv_sec + in2.tv_sec;
+#if defined(HG_UTIL_HAS_TIME_H) && defined(HG_UTIL_HAS_CLOCK_GETTIME)
+    out.tv_nsec = in1.tv_nsec + in2.tv_nsec;
+    /* Carry into seconds as soon as the sum reaches a full second */
+    if (out.tv_nsec >= 1000000000) {
+        out.tv_nsec -= 1000000000;
+        out.tv_sec += 1;
+    }
+#else
+    out.tv_usec = in1.tv_usec + in2.tv_usec;
+    if (out.tv_usec >= 1000000) {
+        out.tv_usec -= 1000000;
+        out.tv_sec += 1;
+    }
+#endif
+
+    return out;
+}
+
+/*---------------------------------------------------------------------------*/
+static HG_UTIL_INLINE hg_time_t
+hg_time_subtract(hg_time_t in1, hg_time_t in2)
+{
+    hg_time_t out;
+
+    out.tv_sec = in1.tv_sec - in2.tv_sec;
+#if defined(HG_UTIL_HAS_TIME_H) && defined(HG_UTIL_HAS_CLOCK_GETTIME)
+    out.tv_nsec = in1.tv_nsec - in2.tv_nsec;
+    if (out.tv_nsec < 0) {
+        out.tv_nsec += 1000000000;
+        out.tv_sec -= 1;
+    }
+#else
+    out.tv_usec = in1.tv_usec - in2.tv_usec;
+    if (out.tv_usec < 0) {
+        out.tv_usec += 1000000;
+        out.tv_sec -= 1;
+    }
+#endif
+
+    return out;
+}
+
+/*---------------------------------------------------------------------------*/
+static HG_UTIL_INLINE int
+hg_time_sleep(const hg_time_t rqt)
+{
+#ifdef _WIN32
+    /* hg_time_to_double() returns seconds; Sleep() expects milliseconds */
+    DWORD dwMilliseconds = (DWORD)(hg_time_to_double(rqt) * 1000);
+
+    Sleep(dwMilliseconds);
+#elif defined(HG_UTIL_HAS_TIME_H) && defined(HG_UTIL_HAS_CLOCK_GETTIME)
+    if (nanosleep(&rqt, NULL))
+        return HG_UTIL_FAIL;
+#else
+    useconds_t usec = (useconds_t)rqt.tv_sec * 1000000 + (useconds_t)rqt.tv_usec;
+
+    if (usleep(usec))
+        return HG_UTIL_FAIL;
+#endif
+
+    return HG_UTIL_SUCCESS;
+}
+
+/*---------------------------------------------------------------------------*/
+#define HG_UTIL_STAMP_MAX 128
+static HG_UTIL_INLINE char *
+hg_time_stamp(void)
+{
+    static char buf[HG_UTIL_STAMP_MAX] = {'\0'};
+
+#if defined(_WIN32)
+    /* TODO not implemented */
+#elif defined(HG_UTIL_HAS_TIME_H) && defined(HG_UTIL_HAS_CLOCK_GETTIME)
+    struct tm *local_time;
+    time_t     t;
+
+    t          = time(NULL);
+    local_time = localtime(&t);
+    if (local_time == NULL)
+        return NULL;
+
+    if (strftime(buf, HG_UTIL_STAMP_MAX, "%a, %d %b %Y %T %Z", local_time) == 0)
+        return NULL;
+#else
+    struct timeval  tv;
+    struct timezone tz;
+    unsigned long   days, hours, minutes, seconds;
+
+    gettimeofday(&tv, &tz);
+    days  = (unsigned long)tv.tv_sec / (3600 * 24);
+    hours = ((unsigned long)tv.tv_sec - days * 24 * 3600) / 3600;
+    minutes =
((unsigned long)tv.tv_sec - days * 24 * 3600 - hours * 3600) / 60; + seconds = (unsigned long)tv.tv_sec - days * 24 * 3600 - hours * 3600 - minutes * 60; + hours -= (unsigned long)tz.tz_minuteswest / 60; + + snprintf(buf, HG_UTIL_STAMP_MAX, "%02lu:%02lu:%02lu (GMT-%d)", hours, minutes, seconds, + tz.tz_minuteswest / 60); +#endif + + return buf; +} + +#ifdef __cplusplus +} +#endif + +#endif /* MERCURY_TIME_H */ diff --git a/src/H5FDsubfiling/mercury/src/util/mercury_util.c b/src/H5FDsubfiling/mercury/src/util/mercury_util.c new file mode 100644 index 0000000..b9c1101 --- /dev/null +++ b/src/H5FDsubfiling/mercury/src/util/mercury_util.c @@ -0,0 +1,47 @@ +/** + * Copyright (c) 2013-2021 UChicago Argonne, LLC and The HDF Group. + * + * SPDX-License-Identifier: BSD-3-Clause + */ + +#include "mercury_util.h" + +#include "mercury_util_error.h" + +#include <stdlib.h> +#include <string.h> + +/****************/ +/* Local Macros */ +/****************/ + +/* Name of this subsystem */ +#define HG_UTIL_SUBSYS_NAME hg_util +#define HG_UTIL_STRINGIFY1(x) HG_UTIL_STRINGIFY(x) +#define HG_UTIL_SUBSYS_NAME_STRING HG_UTIL_STRINGIFY1(HG_UTIL_SUBSYS_NAME) + +/*******************/ +/* Local Variables */ +/*******************/ + +/* Default error log mask */ +HG_LOG_SUBSYS_DECL_REGISTER(HG_UTIL_SUBSYS_NAME, hg); + +/*---------------------------------------------------------------------------*/ +void +HG_Util_version_get(unsigned int *major, unsigned int *minor, unsigned int *patch) +{ + if (major) + *major = HG_UTIL_VERSION_MAJOR; + if (minor) + *minor = HG_UTIL_VERSION_MINOR; + if (patch) + *patch = HG_UTIL_VERSION_PATCH; +} + +/*---------------------------------------------------------------------------*/ +void +HG_Util_set_log_level(const char *level) +{ + hg_log_set_subsys_level(HG_UTIL_SUBSYS_NAME_STRING, hg_log_name_to_level(level)); +} diff --git a/src/H5FDsubfiling/mercury/src/util/mercury_util.h b/src/H5FDsubfiling/mercury/src/util/mercury_util.h new file mode 100644 index 0000000..aad9a11 --- /dev/null +++ b/src/H5FDsubfiling/mercury/src/util/mercury_util.h @@ -0,0 +1,49 @@ +/** + * Copyright (c) 2013-2021 UChicago Argonne, LLC and The HDF Group. + * + * SPDX-License-Identifier: BSD-3-Clause + */ + +#ifndef MERCURY_UTIL_LOG_H +#define MERCURY_UTIL_LOG_H + +#include "mercury_util_config.h" + +/*************************************/ +/* Public Type and Struct Definition */ +/*************************************/ + +/*****************/ +/* Public Macros */ +/*****************/ + +/*********************/ +/* Public Prototypes */ +/*********************/ + +#ifdef __cplusplus +extern "C" { +#endif + +/** + * Get HG util version number. + * + * \param major [OUT] pointer to unsigned integer + * \param minor [OUT] pointer to unsigned integer + * \param patch [OUT] pointer to unsigned integer + */ +HG_UTIL_PUBLIC void HG_Util_version_get(unsigned int *major, unsigned int *minor, unsigned int *patch); + +/** + * Set the log level for HG util. That setting is valid for all HG classes. 
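+ *
+ * For illustration:
+ *
+ * \code
+ * HG_Util_set_log_level("debug"); // most verbose of the listed values
+ * \endcode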
diff --git a/src/H5FDsubfiling/mercury/src/util/mercury_util_config.h b/src/H5FDsubfiling/mercury/src/util/mercury_util_config.h
new file mode 100644
index 0000000..41972df
--- /dev/null
+++ b/src/H5FDsubfiling/mercury/src/util/mercury_util_config.h
@@ -0,0 +1,116 @@
+/**
+ * Copyright (c) 2013-2021 UChicago Argonne, LLC and The HDF Group.
+ *
+ * SPDX-License-Identifier: BSD-3-Clause
+ */
+
+/* Generated file. Only edit mercury_util_config.h.in. */
+
+#ifndef MERCURY_UTIL_CONFIG_H
+#define MERCURY_UTIL_CONFIG_H
+
+/*************************************/
+/* Public Type and Struct Definition */
+/*************************************/
+
+/* Type definitions */
+#include <stddef.h>
+#include <stdint.h>
+#include <stdbool.h>
+
+/*****************/
+/* Public Macros */
+/*****************/
+
+/* Reflects any major or incompatible public API changes */
+#define HG_UTIL_VERSION_MAJOR 3
+/* Reflects any minor backwards compatible API or functionality addition */
+#define HG_UTIL_VERSION_MINOR 0
+/* Reflects any backwards compatible bug fixes */
+#define HG_UTIL_VERSION_PATCH 0
+
+/* Return codes */
+#define HG_UTIL_SUCCESS 0
+#define HG_UTIL_FAIL    -1
+
+#include <mercury_compiler_attributes.h>
+
+/* Inline macro */
+#ifdef _WIN32
+#define HG_UTIL_INLINE __inline
+#else
+#define HG_UTIL_INLINE __inline__
+#endif
+
+/* Alignment */
+#define HG_UTIL_ALIGNED(x, a) HG_ATTR_ALIGNED(x, a)
+
+/* Check format arguments */
+#define HG_UTIL_PRINTF(_fmt, _firstarg) HG_ATTR_PRINTF(_fmt, _firstarg)
+
+/* Shared libraries */
+/* #undef HG_UTIL_BUILD_SHARED_LIBS */
+#ifdef HG_UTIL_BUILD_SHARED_LIBS
+#ifdef mercury_util_EXPORTS
+#define HG_UTIL_PUBLIC HG_ATTR_ABI_EXPORT
+#else
+#define HG_UTIL_PUBLIC HG_ATTR_ABI_IMPORT
+#endif
+#define HG_UTIL_PRIVATE HG_ATTR_ABI_HIDDEN
+#else
+#define HG_UTIL_PUBLIC
+#define HG_UTIL_PRIVATE
+#endif
+
+/* Define if has __attribute__((constructor(priority))) */
+#define HG_UTIL_HAS_ATTR_CONSTRUCTOR_PRIORITY
+
+/* Define if has 'clock_gettime()' */
+#define HG_UTIL_HAS_CLOCK_GETTIME
+
+/* Define if has CLOCK_MONOTONIC_COARSE */
+#define HG_UTIL_HAS_CLOCK_MONOTONIC_COARSE
+
+/* Define if has debug */
+/* #undef HG_UTIL_HAS_DEBUG */
+
+/* Define if has eventfd_t type */
+#define HG_UTIL_HAS_EVENTFD_T
+
+/* Define if has colored output */
+/* #undef HG_UTIL_HAS_LOG_COLOR */
+
+/* Define if has 'pthread_condattr_setclock()' */
+#define HG_UTIL_HAS_PTHREAD_CONDATTR_SETCLOCK
+
+/* Define if has PTHREAD_MUTEX_ADAPTIVE_NP */
+#define HG_UTIL_HAS_PTHREAD_MUTEX_ADAPTIVE_NP
+
+/* Define if has pthread_spinlock_t type */
+#define HG_UTIL_HAS_PTHREAD_SPINLOCK_T
+
+/* Define if has <stdatomic.h> */
+#define HG_UTIL_HAS_STDATOMIC_H
+
+/* Define type size of atomic_long */
+#define HG_UTIL_ATOMIC_LONG_WIDTH 8
+
+/* Define if has <sys/epoll.h> */
+#define HG_UTIL_HAS_SYSEPOLL_H
+
+/* Define if has <sys/event.h> */
+/* #undef HG_UTIL_HAS_SYSEVENT_H */
+
+/* Define if has <sys/eventfd.h> */
+#define HG_UTIL_HAS_SYSEVENTFD_H
+
+/* Define if has <sys/param.h> */
+#define HG_UTIL_HAS_SYSPARAM_H
+
+/* Define if has <sys/time.h> */
+#define HG_UTIL_HAS_SYSTIME_H
+
+/* Define if has <time.h> */
+#define HG_UTIL_HAS_TIME_H
+
+#endif /* MERCURY_UTIL_CONFIG_H */
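Because these feature macros end up as plain #define/#undef results, consumers can branch on them at compile time. A sketch (not part of the diff) of how code built against this header might gate on them:

    #include "mercury_util_config.h"

    #if defined(HG_UTIL_HAS_TIME_H) && defined(HG_UTIL_HAS_CLOCK_GETTIME)
    /* mercury_time.h above takes the struct timespec / clock_gettime() path */
    #else
    /* it falls back to gettimeofday() and microsecond resolution */
    #endif

    #if HG_UTIL_VERSION_MAJOR >= 3
    /* gate any code that needs the 3.x util API */
    #endif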
diff --git a/src/H5FDsubfiling/mercury/src/util/mercury_util_config.h.in b/src/H5FDsubfiling/mercury/src/util/mercury_util_config.h.in
new file mode 100644
index 0000000..d20e0e6
--- /dev/null
+++ b/src/H5FDsubfiling/mercury/src/util/mercury_util_config.h.in
@@ -0,0 +1,116 @@
+/**
+ * Copyright (c) 2013-2021 UChicago Argonne, LLC and The HDF Group.
+ *
+ * SPDX-License-Identifier: BSD-3-Clause
+ */
+
+/* Generated file. Only edit mercury_util_config.h.in. */
+
+#ifndef MERCURY_UTIL_CONFIG_H
+#define MERCURY_UTIL_CONFIG_H
+
+/*************************************/
+/* Public Type and Struct Definition */
+/*************************************/
+
+/* Type definitions */
+#include <stddef.h>
+#include <stdint.h>
+#include <stdbool.h>
+
+/*****************/
+/* Public Macros */
+/*****************/
+
+/* Reflects any major or incompatible public API changes */
+#define HG_UTIL_VERSION_MAJOR @MERCURY_UTIL_VERSION_MAJOR@
+/* Reflects any minor backwards compatible API or functionality addition */
+#define HG_UTIL_VERSION_MINOR @MERCURY_UTIL_VERSION_MINOR@
+/* Reflects any backwards compatible bug fixes */
+#define HG_UTIL_VERSION_PATCH @MERCURY_UTIL_VERSION_PATCH@
+
+/* Return codes */
+#define HG_UTIL_SUCCESS 0
+#define HG_UTIL_FAIL    -1
+
+#include <mercury_compiler_attributes.h>
+
+/* Inline macro */
+#ifdef _WIN32
+#    define HG_UTIL_INLINE __inline
+#else
+#    define HG_UTIL_INLINE __inline__
+#endif
+
+/* Alignment */
+#define HG_UTIL_ALIGNED(x, a) HG_ATTR_ALIGNED(x, a)
+
+/* Check format arguments */
+#define HG_UTIL_PRINTF(_fmt, _firstarg) HG_ATTR_PRINTF(_fmt, _firstarg)
+
+/* Shared libraries */
+#cmakedefine HG_UTIL_BUILD_SHARED_LIBS
+#ifdef HG_UTIL_BUILD_SHARED_LIBS
+#    ifdef mercury_util_EXPORTS
+#        define HG_UTIL_PUBLIC HG_ATTR_ABI_EXPORT
+#    else
+#        define HG_UTIL_PUBLIC HG_ATTR_ABI_IMPORT
+#    endif
+#    define HG_UTIL_PRIVATE HG_ATTR_ABI_HIDDEN
+#else
+#    define HG_UTIL_PUBLIC
+#    define HG_UTIL_PRIVATE
+#endif
+
+/* Define if has __attribute__((constructor(priority))) */
+#cmakedefine HG_UTIL_HAS_ATTR_CONSTRUCTOR_PRIORITY
+
+/* Define if has 'clock_gettime()' */
+#cmakedefine HG_UTIL_HAS_CLOCK_GETTIME
+
+/* Define if has CLOCK_MONOTONIC_COARSE */
+#cmakedefine HG_UTIL_HAS_CLOCK_MONOTONIC_COARSE
+
+/* Define if has debug */
+#cmakedefine HG_UTIL_HAS_DEBUG
+
+/* Define if has eventfd_t type */
+#cmakedefine HG_UTIL_HAS_EVENTFD_T
+
+/* Define if has colored output */
+#cmakedefine HG_UTIL_HAS_LOG_COLOR
+
+/* Define if has 'pthread_condattr_setclock()' */
+#cmakedefine HG_UTIL_HAS_PTHREAD_CONDATTR_SETCLOCK
+
+/* Define if has PTHREAD_MUTEX_ADAPTIVE_NP */
+#cmakedefine HG_UTIL_HAS_PTHREAD_MUTEX_ADAPTIVE_NP
+
+/* Define if has pthread_spinlock_t type */
+#cmakedefine HG_UTIL_HAS_PTHREAD_SPINLOCK_T
+
+/* Define if has <stdatomic.h> */
+#cmakedefine HG_UTIL_HAS_STDATOMIC_H
+
+/* Define type size of atomic_long */
+#cmakedefine HG_UTIL_ATOMIC_LONG_WIDTH @HG_UTIL_ATOMIC_LONG_WIDTH@
+
+/* Define if has <sys/epoll.h> */
+#cmakedefine HG_UTIL_HAS_SYSEPOLL_H
+
+/* Define if has <sys/event.h> */
+#cmakedefine HG_UTIL_HAS_SYSEVENT_H
+
+/* Define if has <sys/eventfd.h> */
+#cmakedefine HG_UTIL_HAS_SYSEVENTFD_H
+
+/* Define if has <sys/param.h> */
+#cmakedefine HG_UTIL_HAS_SYSPARAM_H
+
+/* Define if has <sys/time.h> */
+#cmakedefine HG_UTIL_HAS_SYSTIME_H
+
+/* Define if has <time.h> */
+#cmakedefine HG_UTIL_HAS_TIME_H
+
+#endif /* MERCURY_UTIL_CONFIG_H */
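For readers unfamiliar with the template syntax: CMake's configure_file() turns each #cmakedefine line into either a real #define or a commented-out #undef, which is exactly the difference between this .h.in template and the pre-generated mercury_util_config.h above. Illustration:

    // template (mercury_util_config.h.in):   #cmakedefine HG_UTIL_HAS_DEBUG
    // when the CMake variable is set:        #define HG_UTIL_HAS_DEBUG
    // when it is unset:                      /* #undef HG_UTIL_HAS_DEBUG */
    // @VAR@ substitutions such as @MERCURY_UTIL_VERSION_MAJOR@ are replaced
    // with the variable's value, as in the generated header above.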
diff --git a/src/H5FDsubfiling/mercury/src/util/mercury_util_error.h b/src/H5FDsubfiling/mercury/src/util/mercury_util_error.h
new file mode 100644
index 0000000..9004c5a
--- /dev/null
+++ b/src/H5FDsubfiling/mercury/src/util/mercury_util_error.h
@@ -0,0 +1,79 @@
+/**
+ * Copyright (c) 2013-2021 UChicago Argonne, LLC and The HDF Group.
+ *
+ * SPDX-License-Identifier: BSD-3-Clause
+ */
+
+#ifndef MERCURY_UTIL_ERROR_H
+#define MERCURY_UTIL_ERROR_H
+
+#include "mercury_util_config.h"
+
+/* Default error macro */
+#include <mercury_log.h>
+extern HG_UTIL_PRIVATE HG_LOG_OUTLET_DECL(hg_util);
+#define HG_UTIL_LOG_ERROR(...)   HG_LOG_WRITE(hg_util, HG_LOG_LEVEL_ERROR, __VA_ARGS__)
+#define HG_UTIL_LOG_WARNING(...) HG_LOG_WRITE(hg_util, HG_LOG_LEVEL_WARNING, __VA_ARGS__)
+#ifdef HG_UTIL_HAS_DEBUG
+#define HG_UTIL_LOG_DEBUG(...) HG_LOG_WRITE(hg_util, HG_LOG_LEVEL_DEBUG, __VA_ARGS__)
+#else
+#define HG_UTIL_LOG_DEBUG(...) (void)0
+#endif
+
+/* Branch predictor hints */
+#ifndef _WIN32
+#define likely(x)   __builtin_expect(!!(x), 1)
+#define unlikely(x) __builtin_expect(!!(x), 0)
+#else
+#define likely(x)   (x)
+#define unlikely(x) (x)
+#endif
+
+/* Error macros */
+#define HG_UTIL_GOTO_DONE(label, ret, ret_val)                    \
+    do {                                                          \
+        ret = ret_val;                                            \
+        goto label;                                               \
+    } while (0)
+
+#define HG_UTIL_GOTO_ERROR(label, ret, err_val, ...)              \
+    do {                                                          \
+        HG_UTIL_LOG_ERROR(__VA_ARGS__);                           \
+        ret = err_val;                                            \
+        goto label;                                               \
+    } while (0)
+
+/* Check for cond, set ret to err_val and goto label */
+#define HG_UTIL_CHECK_ERROR(cond, label, ret, err_val, ...)       \
+    do {                                                          \
+        if (unlikely(cond)) {                                     \
+            HG_UTIL_LOG_ERROR(__VA_ARGS__);                       \
+            ret = err_val;                                        \
+            goto label;                                           \
+        }                                                         \
+    } while (0)
+
+#define HG_UTIL_CHECK_ERROR_NORET(cond, label, ...)               \
+    do {                                                          \
+        if (unlikely(cond)) {                                     \
+            HG_UTIL_LOG_ERROR(__VA_ARGS__);                       \
+            goto label;                                           \
+        }                                                         \
+    } while (0)
+
+#define HG_UTIL_CHECK_ERROR_DONE(cond, ...)                       \
+    do {                                                          \
+        if (unlikely(cond)) {                                     \
+            HG_UTIL_LOG_ERROR(__VA_ARGS__);                       \
+        }                                                         \
+    } while (0)
+
+/* Check for cond and print warning */
+#define HG_UTIL_CHECK_WARNING(cond, ...)                          \
+    do {                                                          \
+        if (unlikely(cond)) {                                     \
+            HG_UTIL_LOG_WARNING(__VA_ARGS__);                     \
+        }                                                         \
+    } while (0)
+
+#endif /* MERCURY_UTIL_ERROR_H */
diff --git a/src/H5FDsubfiling/mercury/src/util/version.txt b/src/H5FDsubfiling/mercury/src/util/version.txt
new file mode 100644
index 0000000..4a36342
--- /dev/null
+++ b/src/H5FDsubfiling/mercury/src/util/version.txt
@@ -0,0 +1 @@
+3.0.0
diff --git a/src/H5FDsubfiling/mercury/version.txt b/src/H5FDsubfiling/mercury/version.txt
new file mode 100644
index 0000000..676a2fb
--- /dev/null
+++ b/src/H5FDsubfiling/mercury/version.txt
@@ -0,0 +1 @@
+2.2.0rc6
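Finally, the check-and-goto macros from mercury_util_error.h above follow a common C error-handling idiom: validate, log on failure, and jump to a single cleanup label. A sketch of typical use (not part of the diff; the function and its arguments are hypothetical):

    #include "mercury_util_error.h"
    #include <stdlib.h>

    static int
    alloc_buffer(void **buf_p, size_t size)
    {
        int   ret = HG_UTIL_SUCCESS;
        void *buf = NULL;

        /* On failure: log the message, set ret, and jump to the error label */
        HG_UTIL_CHECK_ERROR(size == 0, error, ret, HG_UTIL_FAIL, "zero-size allocation requested");

        buf = malloc(size);
        HG_UTIL_CHECK_ERROR(buf == NULL, error, ret, HG_UTIL_FAIL, "could not allocate %zu bytes", size);

        *buf_p = buf;

    error:
        return ret;
    }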