author | Albert Cheng <acheng@hdfgroup.org> | 2010-08-11 18:12:21 (GMT) |
---|---|---|
committer | Albert Cheng <acheng@hdfgroup.org> | 2010-08-11 18:12:21 (GMT) |
commit | 5db5169813134ed668fb452f25a8f5f69248b4a3 (patch) | |
tree | 7d119e448ffb8514797360db4cefa554c6bce5a0 | |
parent | 051cf48d0bab6990fa27301cb43355f2a251d503 (diff) | |
[svn-r19229] Reset alarm_seconds back to 20 minutes.
Description:
The v1.8 parallel test failed on honest3. It got stuck in the same
testpar/testphdf5 subtest (cbhsssdrpio). This is an old problem.
Upon closer inspection, testphdf5 had clocked up 1 hr 9 min 46 sec of
wall-clock time when it was terminated. The Honest1 system also sent a
message that an MPI process had used 30+ CPU minutes, which exceeded the
login node CPU time limit, so the process was killed. I also ran
testphdf5 by hand. All subtests before cbhsssdrpio completed in a few minutes.
Therefore, it is safe to say that most of the 70 minutes of wall-clock
time was spent in the subtest cbhsssdrpio. It also used a lot of CPU
time. cbhsssdrpio is likely stuck in an infinite loop.
Since MPI applications are prone to infinite looping due to message deadlock,
testphdf5 has built-in protection that gives each subtest at most 20 minutes
of wall-clock time to run. When the 20 minutes of wall-clock time are exceeded,
testphdf5 will attempt to terminate itself. This prevents unnecessary
CPU time consumption in infinite loops.
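
The protection described above can be sketched as follows. This is only an illustration, assuming that HDalarm maps to the POSIX alarm() call and that the default SIGALRM action (process termination) is left in place. The names run_with_alarm, quick_subtest, and SKETCH_ALARM_SECONDS are hypothetical and are not part of the HDF5 test code, which uses the ALARM_ON and ALARM_OFF macros instead.

```c
#include <unistd.h>

/* Hypothetical constant mirroring the 20-minute default in this commit. */
#define SKETCH_ALARM_SECONDS 1200u

/* Counter used only to show that the sample subtest actually ran. */
static int subtests_completed = 0;

/* Hypothetical stand-in for a real subtest; it returns immediately. */
static void
quick_subtest(void)
{
    subtests_completed++;
}

/*
 * Arm a wall-clock alarm, run one subtest, then cancel the alarm.
 * If the subtest hangs past `seconds`, SIGALRM is delivered and, with
 * the default signal action, the process is terminated.
 * Returns the number of seconds that were left on the alarm.
 */
static unsigned
run_with_alarm(void (*subtest)(void), unsigned seconds)
{
    alarm(seconds);  /* roughly what ALARM_ON does, under the assumption above */
    subtest();
    return alarm(0); /* roughly what ALARM_OFF does: cancel the pending alarm */
}
```

A caller would invoke run_with_alarm(quick_subtest, SKETCH_ALARM_SECONDS) once per subtest, so that each subtest gets its own fresh time budget.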
But that clock limit was changed to 30 and then 60 minutes. I should have
noticed the change mentioned by Quincey, but failed to. IMO, 20 minutes of
wall-clock time is more than sufficient for each subtest of testphdf5 to complete.
If a subtest takes longer than 20 minutes, it is likely looping infinitely.
Giving it more time will not help.
If a subtest of testphdf5 takes more than 20 minutes, it should be broken
down into smaller tests that finish well under 20 minutes, so that it is
much easier to see progress and identify any deadlock problems.
In view of this, I am changing the testphdf5 time limit back to 20 minutes.
This will at least stop runs from exceeding the CPU time limits and annoying
the system administrators.
Maybe there could be a provision, such as an environment variable like
$HDF5_ALARM_SECOND, to modify the alarm duration for an individual execution.
Even so, that should be used only temporarily, to see if an execution just
needs a little more time.
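
One way such a provision might look is sketched below. It is not part of this commit: the variable name HDF5_ALARM_SECOND is taken from the suggestion above, and the helpers parse_alarm_seconds and alarm_seconds_from_env are hypothetical names introduced only for illustration. Any missing, malformed, or non-positive value falls back to the 20-minute default.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* 20 minutes, matching the default restored by this commit. */
#define DEFAULT_ALARM_SECONDS 1200u

/*
 * Hypothetical helper: parse an override string into a number of seconds.
 * Falls back to default_seconds when the string is NULL, empty, not a
 * plain decimal integer, zero, negative, or out of range.
 */
static unsigned
parse_alarm_seconds(const char *text, unsigned default_seconds)
{
    char *end = NULL;
    long  value;

    if (text == NULL || *text == '\0')
        return default_seconds;

    errno = 0;
    value = strtol(text, &end, 10);
    if (errno != 0 || *end != '\0' || value <= 0 ||
        (unsigned long)value > UINT_MAX)
        return default_seconds;

    return (unsigned)value;
}

/* Hypothetical helper: read the override from the environment, if set. */
static unsigned
alarm_seconds_from_env(void)
{
    return parse_alarm_seconds(getenv("HDF5_ALARM_SECOND"),
                               DEFAULT_ALARM_SECONDS);
}
```

Keeping the parsing separate from the getenv() call makes the fallback rules easy to check without touching the environment.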
Tested: just eyeballed, as the change is trivial.
-rw-r--r-- | test/h5test.h | 2 |
1 files changed, 1 insertions, 1 deletions
diff --git a/test/h5test.h b/test/h5test.h
index 6587793..ff065bc 100644
--- a/test/h5test.h
+++ b/test/h5test.h
@@ -127,7 +127,7 @@ extern MPI_Info h5_io_info_g; /* MPI INFO object for IO */
 /*
  * Alarm definitions to wait up (terminate) a test that runs too long.
  */
-#define alarm_seconds 3600 /* default is 60 minutes */
+#define alarm_seconds 1200 /* default is 20 minutes */
 #define ALARM_ON HDalarm(alarm_seconds)
 #define ALARM_OFF HDalarm(0)
 /* set alarms to N seconds if N > 0, else use default alarm_seconds. */