Feedback needed: Obsolete options and parameters considered for removal
Grid Engine developers posted a list today of SGE configuration parameters and client arguments that are being considered for removal from the product because they are either obsolete or they duplicate settings found elsewhere.
The developers are seeking feedback and comments on their plans. If you have any, please drop a line to the users@gridengine.sunsource.net mailing list. The current roadmap calls for these methods to be marked as 'deprecated' in the SGE 6.2 release, with total removal planned for a future post-6.2 release.
The message can be found here:
http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=25045
The full list of items being considered for removal can also be found after the jump ...
The parameters planned to obsolete are:

host_conf(5)
- processors: obsolete, same as num_proc from the complex list

sched_conf(5)
- algorithm: just default is allowed, no additional algorithms are planned
- params JC_FILTER: huge performance impact plus may lead to wrong scheduling decisions

sge_conf(5)
- reprioritize: redundant because hard bound to reprioritize_interval in sched_conf(5)
- shell_start_mode: obsolete, value from queue_conf(5) is used
- set_token_cmd: no known AFS support
- pag_cmd: no known AFS support
- token_extend_time: no known AFS support
- qmaster_params DISABLE_AUTO_RESCHEDULING: equivalent to default reschedule_unknown=0:0:0
- qmaster_params: merge ACCT_RESERVED_USAGE and SHARETREE_RESERVED_USAGE. We can't imagine a use case to have these values separated
- finished_jobs: qstat -j does not work with successful finished jobs. Code seems to work only with jobs going into error state.

user(5)
- delete_time: change to internal, not changeable/visible field. Implicitly set by auto_user_delete_time

qconf(1)
- -sep option: obsolete, same as num_proc
- -ks option: obsolete, same as -kt scheduler

qmod(1)
- -c option: deprecated, use -cj or -cq
- -r option: deprecated, use -rj or -rq
- -s option: deprecated, use -sj or -sq
- -us option: deprecated, use -usj or -usq
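If you have wrapper scripts that call qmod, it may be worth checking them against the list before 6.2 lands. Here is a minimal sketch of such a check. The option mapping comes straight from the developers' list; the function name and the idea of scanning command lines are my own assumptions, and nothing here talks to SGE.

```python
import shlex

# Deprecated qmod options and their suggested replacements, as listed in the
# developers' message. Everything else in this sketch is illustrative.
DEPRECATED_QMOD = {
    "-c": ("-cj", "-cq"),
    "-r": ("-rj", "-rq"),
    "-s": ("-sj", "-sq"),
    "-us": ("-usj", "-usq"),
}


def deprecated_qmod_options(command_line):
    """Return (option, replacements) pairs for deprecated options in a qmod call."""
    tokens = shlex.split(command_line)
    # Only look at command lines whose first word is qmod (with or without a path).
    if not tokens or not tokens[0].endswith("qmod"):
        return []
    return [(tok, DEPRECATED_QMOD[tok]) for tok in tokens[1:] if tok in DEPRECATED_QMOD]


if __name__ == "__main__":
    for option, replacements in deprecated_qmod_options("qmod -s all.q"):
        print(f"{option} is deprecated; use {' or '.join(replacements)}")
```

The check is deliberately naive: it only matches options written as separate words, so treat any hit as a prompt to read the script, not as a verdict.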
SGE 6.2 beta binaries are available for testing
I'm not going to waste time copying the release announcement into a blog post. The full announcement can be read here:
http://gridengine.sunsource.net/servlets/ReadMsg?list=announce&msgNo=94
Lots of significant changes in the product itself. I also love the migration of manuals and docs to the new http://wikis.sun.com/display/GridEngine site.
Please remember that the reason for this beta release is to allow you to test 6.2 before it officially goes out the door in final form. The more people we have working on and stress-testing 6.2 the less chance there will be an inconvenient or unexpected upgrade issue, bug or glitch. The developers have good testbed environments and testsuites but they can't simulate all the different ways and methods that we use (and abuse!) SGE to get work done. Help make the 6.2 release a big success by testing now and providing feedback.
SGE XML output getting some needed attention
For people like myself who are interested in (or, say, dependent on) the XML output features of Grid Engine it's been a lonely time. This area of Grid Engine was not really getting much love, attention or bug fixes until recently.
Happy to report that this seems to have changed. If you are at all interested in using SGE data in XML form then you may want to:
- Pay attention to this mailing list thread
- Watch this SGE Wiki page
Kudos to Michael Pospisil from the Sun Microsystems SGE developer team in Prague for soliciting and listening to community input -- looks like the change may be bigger than simple bug fixes and output normalization. There is some talk about making XML output more usable to the end-users instead of the current design where XML output is largely a straight representation of internal SGE Cull lists and data structures.
Roland: things that affect job deletion time
In this interesting users-list thread, Roland provides some nice comments on the various things that can affect the time it takes to delete a Grid Engine job.
Specifically mentioned is a new hash implementation slated for the upcoming 6.2 release that dramatically improves things.
From Roland's post:
...for GE 6.2 I've analyzed the hotspots deleting jobs and what I've found is:
1) the time deleting a job increases with the amount of pending jobs in the cluster and the amount of queue instances. The reason for this is the messages list for schedd_job_info. Every message in the qstat -j output is one list element and below this element are the job id references stored inheriting this message. At job deletion time qmaster has to loop over the whole list of messages and loop over all references to remove the right one. As a matter of fact this does not scale, and for 6.2 I've added a hash access to the reference id that decreased the job deletion time in large clusters heavily. Sadly I don't remember the exact numbers.
To verify this you can disable schedd_job_info in the scheduler config and then delete your jobs.
2) The job script and the job itself needs to be removed from the database. This time depends if you use berkeleydb or classic spooling and if you spool on local storage or on a NFS share. As faster your access to the storage is as faster you can delete the jobs.
If disabling schedd_job_info doesn't help in your case you might be hit by this point.
3) With 6.1u3 we've introduced the parameters gdi_timeout and gdi_retries to tune this behaviour. But that's anyway more a workaround than a real solution.
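To make Roland's first point a bit more concrete, here is a toy model of the difference between scanning every message and looking a job up through a hash. This is only a sketch of the general idea: the class, its fields and the "visited" counter are my own invention and do not mirror the real qmaster data structures.

```python
from collections import defaultdict


class MessageList:
    """Toy model: each scheduler message keeps the ids of the jobs it applies to."""

    def __init__(self):
        self.refs = {}                   # message text -> set of job ids
        self.by_job = defaultdict(set)   # job id -> messages that reference it

    def add(self, message, job_id):
        self.refs.setdefault(message, set()).add(job_id)
        self.by_job[job_id].add(message)

    def delete_scan(self, job_id):
        """Visit every message, whether or not it mentions the job."""
        visited = 0
        for job_ids in self.refs.values():
            visited += 1
            job_ids.discard(job_id)
        self.by_job.pop(job_id, None)
        return visited

    def delete_indexed(self, job_id):
        """Use the per-job index to visit only the messages that mention the job."""
        visited = 0
        for message in self.by_job.pop(job_id, ()):
            visited += 1
            self.refs[message].discard(job_id)
        return visited


if __name__ == "__main__":
    messages = MessageList()
    for i in range(1000):
        messages.add(f"message {i}", i)
    messages.add("message 0", 5000)
    print("indexed delete visited", messages.delete_indexed(5000), "message(s)")
    messages.add("message 0", 5001)
    print("scan delete visited", messages.delete_scan(5001), "message(s)")
```

The scan cost grows with the total number of messages, while the indexed cost only grows with the number of messages that actually reference the job being deleted.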
Keeping single slot jobs off of certain nodes
In this thread, Paul asks:
"I'm looking at finding a way to either limit single-slot jobs, or requiring all jobs in a given queue to be running in a pe. Specifically, I have some SMP nodes, that I'd rather not waste on single thread, and also keep the single thread jobs off of the infiniband connected nodes. I have gigE small cpu count nodes for this task."
Dan replied with another example of clever use of the new SGE Resource Quota syntax within SGE 6.1 and later:
You can use resource quota sets to restrict non-PE jobs to certain queues/hosts.
limit pes !* hosts @smp to slots=0
Slick!
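For anyone who wants to try Dan's rule, it has to live inside a named resource quota set. Below is a rough sketch that renders one as text. The layout is what I recall from sge_resource_quota(5), and the set name and description are made up for illustration, so compare the output with what qconf -srqs prints on your own cluster before loading anything.

```python
def render_rqs(name, description, limits, enabled=True):
    """Render a resource quota set as text (layout assumed, not verified against SGE)."""
    lines = [
        "{",
        f"   name         {name}",
        f'   description  "{description}"',
        f"   enabled      {'TRUE' if enabled else 'FALSE'}",
    ]
    # One "limit" line per rule; rules are passed through untouched.
    lines += [f"   limit        {rule}" for rule in limits]
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "no_serial_on_smp" is a hypothetical name; the rule itself is Dan's.
    print(render_rqs(
        name="no_serial_on_smp",
        description="Keep non-PE jobs off the SMP nodes",
        limits=["pes !* hosts @smp to slots=0"],
    ), end="")
```

The rule relies on a host group called @smp already existing, as in Dan's example.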
SGE 6.2 goes beta next week (your help needed)
SGE 6.2 is being released in Beta form next week and the developers are asking for people to make some time if possible to fully test out the beta snapshot of the latest major SGE point release.
Andy's full note can be found here (well worth reading in full ...):
http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=24426
In my mind, I'm most excited about the following:
- Advance Reservations & array job inter-dependencies
- The scheduler is now a thread within the qmaster!
- The JVM running within the qmaster
- SGE moving all docs into wiki form!
SGE testbeds: Simulate mass numbers of exec hosts
Interesting message on the developers list recently, posted as a comment attached to Issue 2364. In it, Andreas explains the use of the SIMULATE_EXECDS=true parameter, which allows unrestricted execution host creation (by suppressing unknown host errors).
I can see this as being very useful for testing SGE scheduler and policy configuration settings before implementing them on production systems.
From the comment:
This is a short HOWTO for the use of the cluster simulator:
(1) Start with installing a new SGE cluster as used, but install not more than the qmaster itself
(2) After successful installation use qconf -mconf to set SIMULATE_EXECDS=true in qmaster_params section of sge_conf(5). This causes the suppression of the 'unknown' queue states.
(3) Make sure the "all.q" and any other queue that you configure does not use any 'load_thresholds'. Cluster simulator has no means to anyhow emulate load values. As a result there will be no load values. For that reason load_thresholds may not be used as it would cause load alarm queue states that prevent scheduler from dispatching jobs into your queues.
(4) Use qconf -ae|-Ae to create arbitrary number of simulated execution hosts. The hosts needs not exist as qmaster anyways won't try to send anything to it, but the hostname must be resolvable.
Optionally:
(5) If you care for scheduler runtimes set PROFILE=true in the params section of sched_conf(5) using qconf -msconf.
Now your simulated cluster is ready. You can send in arbitrary numbers of jobs. Due to (2) and (3) scheduler will dispatch them and send corresponding orders to qmaster. Qmaster will behave as if it would start the jobs, but it raise timers to ensure job state transitions are passed as used. What won't work is interactive jobs (i.e. qrsh, qsh etc.) and parallel jobs with control_slaves set to true in sge_pe(5). Jobs' runtime can be controlled via the first job argument. That means when
# qsub -b y /bin/sleep 5
is submitted, the job will finish after five seconds.
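Step (4) gets tedious by hand once you want hundreds of hosts, so here is a small sketch that generates one host definition per simulated node. The field list is what I remember qconf -se printing on a 6.x cluster and should be treated as an assumption, as should the "sim" hostname prefix; dump a real host with qconf -se first and adjust the template to match.

```python
# Template for an execution host definition. The field names are assumed from
# memory of 6.x "qconf -se" output; verify them against your own installation.
HOST_TEMPLATE = """hostname              {hostname}
load_scaling          NONE
complex_values        NONE
user_lists            NONE
xuser_lists           NONE
projects              NONE
xprojects             NONE
usage_scaling         NONE
report_variables      NONE
"""


def simulated_host_configs(count, prefix="sim", width=4):
    """Return {hostname: config text} for `count` simulated execution hosts."""
    configs = {}
    for i in range(1, count + 1):
        hostname = f"{prefix}{i:0{width}d}"
        configs[hostname] = HOST_TEMPLATE.format(hostname=hostname)
    return configs


if __name__ == "__main__":
    for hostname, text in simulated_host_configs(3).items():
        print(f"--- {hostname} ---")
        print(text, end="")
```

Each block would be written to its own file and handed to qconf -Ae. As the HOWTO points out, the hostnames still have to be resolvable, so the names generated here need matching entries in DNS or /etc/hosts.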
RHEL5.2/Centos5 kernel update may cause problems
This is a heads up for RedHat Enterprise Linux (RHEL) users as well as for users (like myself) of the various Centos variants.
There is a recent patch for RHEL that changes the inode data structure exposed to NFS clients from 32 bits to 64 bits in size. The basic summary of this issue is that many applications may not handle this change gracefully (such as one report with the SGE linux binaries.)
RHEL and modern Centos users should probably pay attention to this issue (for example, by subscribing as CC: contacts):
http://gridengine.sunsource.net/issues/show_bug.cgi?id=2543
A RedHat bug report discussing the issue in more detail is here:
"Large inode number patch breaks applications"
https://bugzilla.redhat.com/show_bug.cgi?id=241348
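If you want a quick feel for whether your own NFS mounts already hand out inode numbers that no longer fit in 32 bits, something like the following sketch could help. It simply walks a directory tree and reports such files; the function names are mine, and it says nothing about whether a given application actually mishandles the larger values.

```python
import os

UINT32_MAX = 0xFFFFFFFF  # largest value that fits in an unsigned 32-bit field


def exceeds_32_bits(inode):
    """True if the inode number cannot be stored in 32 bits."""
    return inode > UINT32_MAX


def files_with_large_inodes(root):
    """Yield (path, inode) for files under root whose inode needs more than 32 bits."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                inode = os.lstat(path).st_ino
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if exceeds_32_bits(inode):
                yield path, inode


if __name__ == "__main__":
    import sys
    for path, inode in files_with_large_inodes(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(f"{inode}\t{path}")
```

Point it at a directory on the NFS share in question; an empty result means none of the files seen under that directory has an inode number above the 32-bit limit.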
Clever job prioritization tip
Grid Engine has a built-in priority mechanism that is useful for allowing end users to sort and prioritize their own personal pending tasks -- this gives the users the ability to submit many jobs but still dictate which of those jobs need to be run more urgently than the rest.
In practice, though, this is actually fairly clunky to implement. By default the following conditions exist:
- SGE will accept a priority range of -1023 to 1024
- By default all jobs get assigned a value of 0
- Only SGE managers can assign priority values higher than 0
- Normal users can only assign negative priority values
This is, ummmm, awkward to say the least and works in a way that is 100% opposite from what a sensible user or SGE Admin would expect. Users can only decrease the relative priority of their job in the default environment.
A recent mailing list post from Jeff highlights a nice little workaround. Jeff describes creating an entry in the sge_request file that automatically assigns a value of -p -100 to all submitted jobs that don't override the default with their own use of the -p switch.
This is a nice approach because by default it harms nobody (as all jobs have -p -100). Yet it gives headroom for a non-privileged user to use the priority range -99 to 0 to designate some of her jobs as more personally important than others.
Background reference: manpage for sge_request.
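The rules above, combined with Jeff's site-wide default, can be written down as a tiny model. This is only a sketch of the behaviour described in this post, not of how SGE implements it; the function name is hypothetical and the default of -100 comes from the sge_request entry Jeff describes.

```python
MIN_PRIORITY = -1023
MAX_PRIORITY = 1024
SITE_DEFAULT = -100  # applied when sge_request contains "-p -100"


def effective_priority(requested=None, is_manager=False):
    """Return the priority a job would end up with under the rules in this post."""
    priority = SITE_DEFAULT if requested is None else requested
    if not MIN_PRIORITY <= priority <= MAX_PRIORITY:
        raise ValueError(f"priority {priority} is outside {MIN_PRIORITY}..{MAX_PRIORITY}")
    if priority > 0 and not is_manager:
        raise PermissionError("only SGE managers may request a priority above 0")
    return priority


if __name__ == "__main__":
    print("no -p given:", effective_priority())
    print("user asks for -10:", effective_priority(-10))
    print("manager asks for 50:", effective_priority(50, is_manager=True))
```

With the default moved down to -100, an ordinary user has the whole -99 to 0 range available to push selected jobs ahead of the rest of her own work.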
Reducing scheduler memory usage with libhoard
It’s pretty interesting subscribing to the SGE Issues mailing list. This comment on Issue 2464 came across the wire today:
… I installed libhoard.so (http://www.hoard.org/) and started sge_schedd with it (changing the sge_schedd starting line in sgemaster to "LD_PRELOAD=/opt/hoard-3.7.1/lib64/libhoard.so sge_schedd").
There seems to be some problems with malloc and threads not freeing memory (or something similar, Andreas could explain this the right way) which could be affecting sge_schedd.
Since restarting sge_schedd using hoard I didn’t have any memory problems anymore, but this just happened one day ago.
If anyone else tries this method I’d appreciate feedback and comments.
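If you would rather experiment from a wrapper than edit the sgemaster script, the same LD_PRELOAD trick can be sketched as below. The helper only builds an environment; the library path is the one from the comment above and will differ on your system, and the commented-out launch line is an assumption about how you might start the daemon.

```python
import os


def with_preload(library, env=None):
    """Return a copy of the environment with `library` placed first in LD_PRELOAD."""
    new_env = dict(os.environ if env is None else env)
    existing = new_env.get("LD_PRELOAD", "")
    # Keep anything already preloaded, but let the new allocator come first.
    new_env["LD_PRELOAD"] = f"{library}:{existing}" if existing else library
    return new_env


if __name__ == "__main__":
    env = with_preload("/opt/hoard-3.7.1/lib64/libhoard.so")
    print("LD_PRELOAD =", env["LD_PRELOAD"])
    # To actually launch the scheduler with this environment (untested sketch):
    #   import subprocess
    #   subprocess.Popen(["sge_schedd"], env=env)
```

Because the helper returns a copy, the preload only applies to processes started with that environment and leaves the calling shell untouched.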
Olesen FLEXlm integration tools updated
Mark has posted a significant update to his most excellent FLEXlm license management integration tools. Key changes include:
- XML configuration files
- XML status output
- XSLT stylesheets to transform monitoring information into web pages
- Ability to integrate with xml-qstat
Mark further explains the updates and new Wiki-based documentation in his post to the mailing list.
DRMAA memory leak found & fixed
Most casual SGE users and admins probably find little cause to monitor the Grid Engine developer mailing list. A nice little success story has played out on the list recently with a user assisting the SGE dev team in quickly discovering, isolating and fixing a memory leak that has been in the codebase since the DRMAA 1.0 API release.
A user posted this message to the developer list, showing what appears to be a memory leak in drmaa_run_job(). Andreas then replied asking if it was possible for the user to recreate the issue while running under the valgrind instrumentation framework.
In this follow-up thread, the user-provided valgrind data allowed Andreas to pinpoint the problem, file Issue #2497 with the bug tracking database and then post a preliminary patch that fixes the problem.
The patch still needs to undergo code review before it makes it officially into the Grid Engine codebase. Overall this is a nice little success story where a user was able to go the extra mile (by instrumenting under valgrind) in order to provide the developers exactly what they needed to quickly identify and fix things.
Kudos to James & Andreas.
