From 5db5169813134ed668fb452f25a8f5f69248b4a3 Mon Sep 17 00:00:00 2001
From: Albert Cheng
Date: Wed, 11 Aug 2010 13:12:21 -0500
Subject: [svn-r19229] Reset alarm_seconds back to 20 minutes.

Description:
honest3 v1.8 failed in the parallel test. It got stuck in the same
testpar/testphdf5 subtest (cbhsssdrpio). This is an old problem.

Upon closer inspection, testphdf5, when terminated, had clocked up
1 hr 9 min 46 sec of wall-clock time. The Honest1 system also sent a message
that an MPI process had used 30+ CPU minutes, which exceeded the login node
CPU time limit, and the process was killed.

I also did a hand-run of testphdf5. All subtests before cbhsssdrpio completed
in a few minutes. It is therefore safe to say that the majority of the
70 minutes of wall-clock time was spent in the subtest cbhsssdrpio. It also
used a lot of CPU time. cbhsssdrpio is likely looping infinitely.

Since MPI applications are prone to infinite looping due to message deadlock,
testphdf5 has a built-in protection that gives each subtest at most 20 minutes
of wall-clock time to run. When the 20 minutes of wall-clock time is exceeded,
testphdf5 attempts to terminate itself. This prevents unnecessary CPU time
consumption in an infinite loop. But that clock limit was changed to 30 and
then 60 minutes. I should have noticed the change mentioned by Quincey, but
failed to.

IMO, 20 minutes of wall-clock time is more than sufficient for each subtest of
testphdf5 to complete. If a subtest takes longer than 20 minutes, it is likely
looping infinitely, and giving it more time will not help. If a subtest of
testphdf5 legitimately takes more than 20 minutes, it should be broken down
into smaller tests that finish well under 20 minutes, so that it is much
easier to see progress and identify any deadlock problems.

In view of this, I am changing the testphdf5 time limit back to 20 minutes.
This will at least stop the runs from exceeding the CPU time limits and
annoying the system administrators.
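
For illustration only, the per-subtest protection described above can be
sketched with the POSIX alarm() call. This is a minimal sketch, not the HDF5
implementation: the names watchdog_on, watchdog_off and DEFAULT_ALARM_SECONDS
are hypothetical, and it assumes the default SIGALRM disposition (which
terminates the process) is left in place.

```c
#include <unistd.h>

/* Hypothetical default: 20 minutes, the value this commit restores. */
#define DEFAULT_ALARM_SECONDS 1200u

/*
 * Arm the watchdog before a subtest starts.
 * Uses `seconds` if it is > 0, otherwise the default.
 * If the subtest is still running when the timer expires, the kernel
 * delivers SIGALRM; with the default disposition the process terminates.
 * Returns the seconds that were left on any previously armed alarm.
 */
static unsigned watchdog_on(unsigned seconds)
{
    return alarm(seconds > 0 ? seconds : DEFAULT_ALARM_SECONDS);
}

/*
 * Disarm the watchdog after a subtest finishes in time.
 * Returns the seconds that were still left on the timer.
 */
static unsigned watchdog_off(void)
{
    return alarm(0);
}
```

A test driver would call watchdog_on(0) before each subtest and watchdog_off()
after it returns, so a deadlocked subtest is killed after 20 minutes instead
of consuming CPU time indefinitely.
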
Maybe there could be a provision, such as an environment variable like
$HDF5_ALARM_SECOND, to modify the alarm duration for an individual execution.
Even so, it should be used only temporarily, to see whether an execution just
needs a little more time.

Tested:
Just eyeballed, as the change is trivial.
---
 test/h5test.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/test/h5test.h b/test/h5test.h
index 6587793..ff065bc 100644
--- a/test/h5test.h
+++ b/test/h5test.h
@@ -127,7 +127,7 @@ extern MPI_Info h5_io_info_g; /* MPI INFO object for IO */
 /*
  * Alarm definitions to wait up (terminate) a test that runs too long.
  */
-#define alarm_seconds 3600 /* default is 60 minutes */
+#define alarm_seconds 1200 /* default is 20 minutes */
 #define ALARM_ON HDalarm(alarm_seconds)
 #define ALARM_OFF HDalarm(0)
 /* set alarms to N seconds if N > 0, else use default alarm_seconds. */
-- 
cgit v0.12
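
The environment-variable provision suggested in the description is not part of
this change. A minimal sketch of how it might look follows; the helper name
alarm_seconds_from_env and the DEFAULT_ALARM_SECONDS macro are hypothetical,
and $HDF5_ALARM_SECOND is only the name floated in the commit message, not an
existing setting.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical default: 20 minutes, matching the restored alarm_seconds. */
#define DEFAULT_ALARM_SECONDS 1200u

/*
 * Return the alarm duration in seconds.
 * Reads the environment variable `name` (e.g. "HDF5_ALARM_SECOND", an
 * assumed name) and uses its value if it is a positive decimal integer.
 * Falls back to the default when the variable is unset, empty, malformed,
 * non-positive, or out of range.
 */
static unsigned alarm_seconds_from_env(const char *name)
{
    const char *text;
    char *end = NULL;
    long value;

    assert(name != NULL);

    text = getenv(name);
    if (text == NULL || *text == '\0')
        return DEFAULT_ALARM_SECONDS;

    errno = 0;
    value = strtol(text, &end, 10);
    if (errno != 0 || *end != '\0' || value <= 0 || value > INT_MAX)
        return DEFAULT_ALARM_SECONDS;

    return (unsigned)value;
}
```

Falling back to the default on any bad input keeps the 20-minute protection in
force unless someone deliberately overrides it for a single run.
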