Installing on Mac OS X
Over at this link:
http://blog.bioteam.net/2010/02/07/grid-engine-6-2-on-mac-os-x/
... I've posted an article and an accompanying 7-minute screencast showing how to manually install SGE 6.2u5 on a Mac OS X Server system. The test system in the video was running 10.5.8, but the same method is known to work on Snow Leopard systems as well.
Windows 2008 R2, SAMBA PDC and "HOST_NOT_RESOLVABLE"
This is a quick mailing list hit to mention that for Windows users experiencing HOST_NOT_RESOLVABLE errors due to domain binding issues, the Windows registry key:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\Tcpip\Parameters\NVDomain
... might be a route to resolving the issue.
JSV example for rewriting parallel environment requests
Job Submission Verifiers are expected to be a huge win for Grid Engine users and administrators, but the feature is new enough that there are not many best practices or much working code "in the wild" that the community can copy and learn from ...
In this mailing list thread, however, we get an actual JSV code snippet showing how one might intercept user "-pe " requests and seamlessly alter the parallel environment request to one that makes use of the wildcard '*' selector:
In the latest SGE, you can use the JSV(1) mechanism to do arbitrary re-writes of the qsub options. I don't remember seeing real examples of this posted, so one that re-writes something like `-pe openmpi' to `-pe openmpi-*' to hide the fact that there are multiple PEs for nodes with different core counts, and you normally don't want the parallel job scheduled across such node groups.
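#!/bin/sh
# Called once when the JSV starts; nothing to set up here.
jsv_on_start() {
    return
}
# Called for every submitted job.
jsv_on_verify() {
    pe=$(jsv_get_param pe_name)
    case "$pe" in
        openmpi | fluent)
            # Rewrite the PE request to its wildcard form, e.g. openmpi -> openmpi-*
            jsv_set_param pe_name "$pe-*"
            jsv_correct "Job was modified"
            # Report exactly one result per job: stop here after a correction.
            return
            ;;
    esac
    jsv_accept "Job OK"
    return
}
# Load the JSV helper functions shipped with SGE, then enter the main loop.
. "${SGE_ROOT}/util/resources/jsv/jsv_include.sh"
jsv_main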
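If you want to try the rewrite rule on its own before wiring it into a live cluster, the case statement can be pulled out into a small function. This is only a sketch: rewrite_pe is a hypothetical helper name, not part of the JSV library, and the PE names are the ones from the snippet above.

```shell
#!/bin/sh
# rewrite_pe (hypothetical helper): print the PE name the JSV above would set
# for a given requested PE name. Runs without Grid Engine installed.
rewrite_pe() {
    case "$1" in
        openmpi | fluent) printf '%s\n' "$1-*" ;;   # hide the per-core-count PEs
        *)                printf '%s\n' "$1" ;;     # leave everything else alone
    esac
}

# Show what happens to a few sample requests ("smp" is just an example name).
for pe in openmpi fluent smp; do
    printf '%s -> %s\n' "$pe" "$(rewrite_pe "$pe")"
done
```

A client-side JSV script is normally handed to qsub with the -jsv option, so the real script can be exercised on a single test submission before anyone makes it a cluster-wide default.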
Throttling execution of array job tasks
I've long found that SGE users are perfectly willing to do the right thing when it comes to sharing a computing infrastructure among multiple competing workgroups. What has often been lacking are SGE features, accessible to non-admin users, that give those users more control over how their jobs run and are prioritized.
A very common example of this is a situation where a user will say:
"I need to submit 100,000 jobs but I don't want to totally take over the cluster and upset my coworkers - can I limit how many of my jobs run at any given time so that resources are left free for others?"
As a Grid Engine consultant, trainer and administrator, I've personally felt that working with people wanting to be "good citizens" has sometimes been a challenge. Most of the common SGE methods for limiting or controlling job execution and policies are available only to users with SGE Administrator privileges. As nice as it is to handle one-off cluster resource allocation situations, these sorts of requests can consume lots of admin time and can occasionally cause problems if people make SGE quota or scheduler changes without tight coordination and planning.
Well, it was undocumented in the initial release but ever since SGE version 6.2u4 people have had the ability to limit concurrent execution of tasks within array jobs that they submit. The syntax looks like:
$ qsub -t 1-20 -tc 5 test.sh
... where the "-tc" argument is new. The example above shows a 20-task array job being submitted with a request to run no more than 5 at any one time.
This feature is now documented as of SGE 6.2u5:
-tc max_running_tasks
allow users to limit concurrent array job task execution.
Parameter max_running_tasks specifies maximum number of simultaneously
running tasks. For example we have running SGE with 10 free slots. We
call qsub -t 1-100 -tc 2 jobscript. Then only 2 tasks will be
scheduled to run even when 8 slots are free.
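For anyone who has not written an array job before, here is a rough sketch of what a test.sh for the example above might look like. The input file naming is made up for illustration; SGE_TASK_ID is the variable Grid Engine sets for each task, and the fallback to 1 is only there so the script also runs outside a cluster.

```shell
#!/bin/sh
# Embedded qsub options: the same request as "qsub -t 1-20 -tc 5 test.sh".
#$ -S /bin/sh
#$ -cwd
#$ -t 1-20
#$ -tc 5

# Grid Engine sets SGE_TASK_ID per task; default to 1 when run by hand.
task=${SGE_TASK_ID:-1}

# Hypothetical naming scheme: one input file per task.
input="input.$task.dat"

echo "task $task would process $input"
```

With the options embedded like this, a plain "qsub test.sh" carries the limit along; options given on the command line can still be used to adjust the request for a particular submission.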
This is a very welcome addition to Grid Engine; I suspect it will be popular and well received by the user community.
Adding memory requirement awareness to the scheduler
In our SGE cluster, we have 2 nodes each of 4 CPU's and we are using "fill up host" scheduler configuration for job submission.
In this scheduler configuration, assume one parallel job (Job1) with 2 CPU's is running on nodeA and user submits another parallel job (Job2) of 2 CPU then SGE submit this job2 on nodeA.
Consider if the Job1 is utilizing higher memory on nodeA then job2 fails due to memory unavailability.
Is there a way to avoid this using SGE configuration?
As usual, Reuti comes through with a great answer:
... you will need to request the estimated amount of memory which the job might need. There are two ways to do it. Make:
a) h_vmem
or b) virtual_free
consumable in the complex definition (qconf -sc) and define a default consumption there. Then attach a feasible value to each node (qconf -me) for the installed memory. Use the one you defined in your qsub command by requesting it with the -l option (it's per slot, hence multiplied for parallel jobs unless you use special settings in the complex definition). The difference between the two ways is, that h_vmem will be enforced and kill the job when it needs one byte more, while b) is more a hint for SGE for the job distribution.
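To make the "per slot, hence multiplied" point concrete, here is a small sketch of the arithmetic for a 2-slot parallel job. The numbers, the PE name "smp" and the script name job.sh are all illustrative assumptions, and the configuration lines in the comments show the general shape of such a setup rather than anyone's actual cluster.

```shell
#!/bin/sh
# Illustrative setup (values are assumptions, adjust to your hardware):
#   qconf -mc    columns: name shortcut type relop requestable consumable default urgency
#                h_vmem   h_vmem   MEMORY   <=   YES   YES   1G   0
#   qconf -me <host>
#                complex_values   h_vmem=16G

slots=2            # slots requested from the parallel environment
per_slot_mb=2048   # memory requested per slot, in megabytes

# A consumable requested with -l is debited once per slot.
total_mb=$((slots * per_slot_mb))

echo "qsub -pe smp $slots -l h_vmem=${per_slot_mb}M job.sh"
echo "debits ${total_mb}M from the host's h_vmem"
```

With a setup along these lines the scheduler stops packing jobs onto a host once the remaining h_vmem there is smaller than the next job's request, which is the behaviour the original question was after.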
More background on Grid Engine and consumable resources is available at this Wiki doc link. That page concentrates on GUI based methods but also discusses the command-line methods that Reuti shows.
SGE 6.2u4 update is out today
This is a bugfix/maintenance release; read the full announcement here.
As always, checking the list of fixed bugs and issues is a good way to start deciding if an upgrade is needed and how urgent it may be.
SGE utilities from Duke SCSC
https://wiki.duke.edu/display/SCSC/SGE+Tools


Hat tip: Ed L. from MRL Boston who first pointed me towards the 'jobpar' link on the web. Big thanks also to John Pormann from Duke who took the time to make these utilities available to the community under an MIT open source license.
Control-C, applications and qrsh
Quick hit from the mailing list - in this thread, a user coming from a Platform LSF environment is having trouble with an application (NCSim) that allows execution to be suspended/resumed via the control-C command.
The short answer apparently is to invoke 'qrsh' with the '-pty yes' argument.
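In practice that looks something like the sketch below. run_interactive is a hypothetical wrapper name and "ncsim" stands in for whatever command the user actually runs; the fallback branch just prints the command so the snippet can be tried on a machine without Grid Engine.

```shell
#!/bin/sh
# run_interactive (hypothetical wrapper): start a command under qrsh with a
# pseudo-terminal so that control-C reaches the application.
run_interactive() {
    if command -v qrsh >/dev/null 2>&1; then
        qrsh -pty yes "$@"
    else
        # No qrsh on this machine: show what would have been run.
        echo "qrsh -pty yes $*"
    fi
}

run_interactive ncsim
```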
Tracking & rollback of SGE config changes
Ed Dale has a great article showing how he uses a Subversion ("SVN") repository in conjunction with the SGE-supplied 'save_sge_config.sh' script to provide versioning and rollback capabilities for a Grid Engine installation.
The full writeup is here and is well worth a read:
http://scompt.com/blog/archives/2009/10/13/versioned-grid-engine-configuration
Ed's work is a perfect companion to this recent mailing list thread where we discussed the need for comments and log messages that accompany SGE queue instance disablement and other state changes. The end result of that is renewed focus on the following open SGE enhancement requests:
- http://gridengine.sunsource.net/issues/show_bug.cgi?id=1539
- http://gridengine.sunsource.net/issues/show_bug.cgi?id=2179
- http://gridengine.sunsource.net/issues/show_bug.cgi?id=3161
If you agree with the above RFEs, please use your CollabNet votes to express your opinion.
Key FlexLM license integration tools updated
Mark has updated his code for making Grid Engine aware of FlexLM license servers. Read the full announcement here:
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=37&dsMessageId=221361
Without a doubt this is currently the industry best practice way of dealing with SGE/FlexLM integration issues. Kudos to Mark O. for open-sourcing his work.