Listing idle execution hosts
In response to a query on the SGE users mailing list, Dave Love posted a short shell script that parses the output of "qhost -j" in order to list out hosts that are active in Grid Engine yet not running any jobs.
The post (with script added as an attachment) can be found here:
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=94053
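For readers who cannot reach the attachment, here is a minimal sketch of the same idea. It is not Dave Love's script. It assumes the usual "qhost -j" layout, in which host lines start in column one and job lines are indented beneath their host, and it assumes that an unreachable host reports "-" as its load. The function name list_idle_hosts is made up for illustration.

```shell
#!/bin/sh
# list_idle_hosts: read "qhost -j" output on stdin and print every host
# that reports a load value but has no job lines listed beneath it.
# Assumptions (not taken from the original post): host lines start in
# column one, job lines are indented and begin with a numeric job ID,
# and a host that cannot be contacted shows "-" in the LOAD column.
list_idle_hosts() {
    awk '
        # Print the previous host if it was reachable and had no jobs.
        function emit() { if (host != "" && jobs == 0) print host }

        # Skip the column header, the separator and the "global" pseudo-host.
        $1 == "HOSTNAME" || /^-+$/ || $1 == "global" { next }

        # A line starting in column one begins a new host record.
        /^[^ \t]/ {
            emit()
            host = ($4 == "-") ? "" : $1   # ignore hosts with no load report
            jobs = 0
            next
        }

        # An indented line that starts with a number is a job entry.
        /^[ \t]+[0-9]+[ \t]/ { jobs++ }

        END { emit() }
    '
}
```

On a live cluster this would be run as: qhost -j | list_idle_hosts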
Fedora 10 will ship with SGE 6.2
I'm late in catching up with Grid Engine mailing list traffic but this one from Orion Poplawski caught my eye:
F-10 will ship with 6.2-3. I'll be pushing a 6.2-4 (or later) 0-day update as well:
* Tue Nov 11 2008 - Orion Poplawski
- 6.2-4
- Add note to README about localhost line in /etc/hosts
- Cleanup setting.sh some, no more MAN stuff
- Add conditional build support for EL
- Use system db_* utils in bdb_checkpoint script
I've got the src.rpm here:
http://www.cora.nwra.com/~orion/fedora/gridengine-6.2-4.fc11.src.rpm
This should build on EL-5, F-9, and F-8 with sun java 1.6.0 installed.
These rpms are geared for minimal NFS type installs. install_* scripts
should work, though install_execd should not be needed for standard
"default" installs. Bugs to https://bugzilla.redhat.com/.
Reuti: Tight integration with Intel MPI 3.1 or MPICH2
Via this thread ...
Reuti has updated his methods and information for achieving tight integration in MPICH2 environments. An updated set of files for mpd integration for MPICH(2) is now at http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-60.tgz
The thread discusses Intel MPI 3.1, with the suggestion that the above methods for MPICH2 may also work with the Intel product. The basic issue is that tight integration with Grid Engine has always been difficult to achieve with the standard "mpdboot" startup method.
Fixing a berkeley db spool database
Per this thread on the users list, a recipe for rebuilding and re-verifying a Berkeley DB-based binary SGE spool:
service sgemaster stop    # on failover server
service sgemaster stop    # on master server
cd $SGE_ROOT/default/spool
cp -a spooldb spooldb.bak
cd spooldb
$SGE_ROOT/utilbin/lx24-amd64/db_verify sge
$SGE_ROOT/utilbin/lx24-amd64/db_recover
$SGE_ROOT/utilbin/lx24-amd64/db_dump -f sge.out sge
mv sge sge.old
$SGE_ROOT/utilbin/lx24-amd64/db_load -f sge.out sge
$SGE_ROOT/utilbin/lx24-amd64/db_verify sge
service sgemaster start   # on master server
service sgemaster start   # on failover server
Grid Engine, workflows & virtualization
Another discussion happening recently on the SGE user list concerns how best to handle virtualization. That thread can be browsed here.
In a followup, Andreas is soliciting feedback from the wider community on how you want to see this area handled in future revisions of Grid Engine. Time to speak up if you have an opinion!
Read Andreas's request for feedback here.
Grid Engine & power saving
I'd guess that most people don't follow the SGE developer list all that closely. Sometimes the developer discussions cross over into areas that all users may be interested in.
There has been an interesting discussion of ways to let SGE either directly trigger or otherwise interact with systems that switch nodes into lower power states, or even power them down and up completely as needed (Project Hedeby / SDM, etc.).
Automatic methods for powering up and down portions of clusters based on workload have been used for years now but the topic seems to be getting more interest and more backing. A few years ago I saw a neat solution that some people at Cornell Medical College had done -- they used PBS/Torque and had various IPMI scripts that powered nodes on or off depending on the size of the pending job list.
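The pending-list idea can be sketched in a few lines of shell. This is only an illustration of the decision logic, not the Cornell scripts (which targeted PBS/Torque). The function names and the simple on/off policy are assumptions, and the actual power switching, for example through an IPMI tool, is deliberately left out.

```shell
#!/bin/sh
# count_pending: number of lines in the pending job list.
# Assumes "qstat -s p -u '*'" prints two header lines before the jobs
# and prints nothing at all when no jobs are pending.
count_pending() {
    qstat -s p -u '*' | awk 'NR > 2 { n++ } END { print n + 0 }'
}

# decide_power_action PENDING_JOBS IDLE_NODES
# Hypothetical policy, for illustration only:
#   on   - jobs are waiting and no powered-on node is idle
#   off  - nothing is waiting and at least one node is idle
#   none - otherwise, leave the cluster as it is
decide_power_action() {
    pending=$1
    idle=$2
    if [ "$pending" -gt 0 ] && [ "$idle" -eq 0 ]; then
        echo on
    elif [ "$pending" -eq 0 ] && [ "$idle" -gt 0 ]; then
        echo off
    else
        echo none
    fi
}
```

A cron job could call decide_power_action "$(count_pending)" with the current number of idle nodes and act on the answer. A real deployment would also need some damping so that nodes are not cycled on every short gap in the queue.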
The developer thread (via MarkMail) is here. The CollabNet "Forum View" is here.
Help make grid engine better
The Grid Engine development team can't develop the product in a vacuum -- feedback, suggestions and input from real-world production users of Grid Engine are always critical.
If you are a serious Grid Engine user and care about the future direction of software development efforts, take the time to read this proposal for a new "Job Submission Verifier" enhancement. If the subject is of interest to how you use SGE, the developers would welcome comments, suggestions and feedback before Friday.
"Job Submission Verifier Specifications:"
http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=25999
Follow the email thread via MarkMail if you are interested in what others are saying about this.
Fixing SGE email issues on Apple OS X
Are you in the following situation?
- /usr/bin/mail works perfectly from the command line
- /usr/bin/mail configured as the SGE mailer produces no email
- substituting a wrapper with extra logging also produces no logs or email
The only clue is in the spool logs:
09/10/2008 16:22:07|execd|xxx-fs01|E|mailer had timeout - killing
09/10/2008 16:22:07|execd|xxx-fs01|E|mailer exited with exit status= 1
09/10/2008 16:22:19|execd|xxx-fs01|E|mailer had timeout - killing
09/10/2008 16:22:19|execd|xxx-fs01|E|mailer exited with exit status= 1
Thanks to Valerio Luccio we have a workaround. The issue is apparently that one of the SGE-supplied libraries interferes with the mail MTA on OS X when SGE tries to invoke it. The fix is a trivial wrapper script that overrides the DYLD_LIBRARY_PATH environment variable:
#!/bin/sh
export DYLD_LIBRARY_PATH=/usr/lib
/usr/bin/mail -s "$2" $3
This solved a problem that had been bothering me for days. Thanks, Valerio - I owe you a beer if we ever end up at the same meeting or conference!
MarkMail: Mine the grid engine maillist archives
MarkMail has just imported all of the Grid Engine mailing lists from http://gridengine.sunsource.net into their archive, search, index and database system. Initial results are pretty impressive based on a few minutes of searching and experimentation -- seems like a great way to search the mailing lists for answers and info.
Click on the image above and you'll be taken to a search on the term 'rqs'. Leave a comment with your impressions if you are so inclined.
Bugfix madness

Wow, this is what my inbox looked like this morning -- a massive influx of resolved SGE issues via the SGE issues mailing list. Go Jana!
