Bug alert: Beware scheduler profiling in SGE 6.2

Posted by chris Thu, 04 Sep 2008 21:31:50 GMT

The command "qconf -tsm", when run as the root user, is a nice (but historically under-documented) tool for SGE admins. When it works, it does a one-time dump of scheduler information and writes it to $SGE_ROOT/$SGE_CELL/common/schedd_runlog.

Props to DanT for discovering an interesting bug in Grid Engine 6.2: if you invoke "qconf -tsm", the process does not stop after the first dump. It repeats the dump every scheduling interval, growing the schedd_runlog file each time.

This is not a huge bug but it does have two negative consequences:

  • Scheduler profiling is non-trivial; doing it every scheduling interval may place additional load on your qmaster
  • Most SGE admins do not rotate or otherwise track the size of the schedd_runlog file the way they do for other SGE files that grow over time, such as "accounting". Left unchecked on a busy cluster, this file may grow and cause space issues on the $SGE_ROOT filesystem
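
Until a fix is available, one way to keep an eye on the second problem is a small size check run from cron. The following is only a rough sketch in Python, not anything shipped with SGE: the function names and the 100 MB limit are made up for illustration, and the path assumes the file sits in the "common" directory of your cell, so adjust it if your installation differs.

```python
import os

# Hypothetical threshold; pick whatever suits your $SGE_ROOT filesystem.
LIMIT_BYTES = 100 * 1024 * 1024


def runlog_path(sge_root, sge_cell="default"):
    """Build the assumed location of schedd_runlog for a given cell."""
    return os.path.join(sge_root, sge_cell, "common", "schedd_runlog")


def file_size(path):
    """Return the file size in bytes, or 0 if the file is missing or unreadable."""
    try:
        return os.path.getsize(path)
    except OSError:
        return 0


def exceeds_limit(path, limit_bytes=LIMIT_BYTES):
    """True when the file at `path` is larger than `limit_bytes`."""
    return file_size(path) > limit_bytes


if __name__ == "__main__":
    root = os.environ.get("SGE_ROOT")
    if root is None:
        print("SGE_ROOT is not set")
    else:
        path = runlog_path(root, os.environ.get("SGE_CELL", "default"))
        status = "OVER LIMIT" if exceeds_limit(path) else "ok"
        print(path, file_size(path), "bytes", status)
```

The script only reports the size; it deliberately does not truncate or rotate the file, since what is safe to do with schedd_runlog while the scheduler is writing to it is not something this post can confirm.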

A really interesting facet of this bug is that restarting SGE and/or the scheduler does not stop the recurring profile dump. This is likely why the issue was rated with a higher-than-normal severity level. Expect a patch or fix to be issued shortly.