Listing idle execution hosts

Posted by chris Tue, 23 Dec 2008 13:22:03 GMT

In response to a query on the SGE users mailing list, Dave Love posted a short shell script that parses the output of "qhost -j" in order to list out hosts that are active in Grid Engine yet not running any jobs.

The post (with script added as an attachment) can be found here:
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=94053
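
For a rough idea of the approach, here is a minimal sketch. It is not Dave's script (follow the link above for that), just an illustration under some assumptions about the "qhost -j" layout: two header lines, host lines starting in column 0, and job lines indented beneath the host they run on. The function name list_idle_hosts is made up for this example, and treating a load value of "-" as "host is down" is also an assumption.

```shell
#!/bin/sh
# Sketch only: NOT the script from the mailing list post.
# Assumes the usual "qhost -j" layout: two header lines, host lines
# starting in column 0, job lines indented under their host.

# list_idle_hosts: read "qhost -j" output on stdin, print hosts with no jobs.
list_idle_hosts() {
    awk '
        NR <= 2 { next }                  # skip column header and dashes
        /^[^ \t]/ {                       # a host line starts in column 0
            if (host != "" && jobs == 0) print host
            host = ""; jobs = 0
            # Skip the "global" pseudo-host and hosts reporting no load
            # ("-"), assumed here to be down or unreachable.
            if ($1 != "global" && $4 != "-") host = $1
            next
        }
        /^[ \t]+[0-9]/ { jobs++ }         # indented line starting with a job ID
        END { if (host != "" && jobs == 0) print host }
    '
}

# Typical use on a submit or admin host:
#   qhost -j | list_idle_hosts
```

Reading from stdin keeps the parsing separate from the qhost call, so the function can be tried against saved output before pointing it at a live cluster.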

Fedora 10 will ship with SGE 6.2

Posted by chris Mon, 17 Nov 2008 23:36:30 GMT

I'm late in catching up with Grid Engine mailing list traffic but this one from Orion Poplawski caught my eye:

F-10 will ship with 6.2-3. I'll be pushing a 6.2-4 (or later) 0-day update as well:

* Tue Nov 11 2008 - Orion Poplawski
- 6.2-4
- Add note to README about localhost line in /etc/hosts
- Cleanup setting.sh some, no more MAN stuff
- Add conditional build support for EL
- Use system db_* utils in bdb_checkpoint script

I've got the src.rpm here:

http://www.cora.nwra.com/~orion/fedora/gridengine-6.2-4.fc11.src.rpm

This should build on EL-5, F-9, and F-8 with sun java 1.6.0 installed.
These rpms are geared for minimal NFS type installs. install_* scripts should work, though install_execd should not be needed for standard "default" installs. Bugs to https://bugzilla.redhat.com/.

Reuti: Tight integration with Intel MPI 3.1 or MPICH2

Posted by chris Mon, 17 Nov 2008 22:14:46 GMT

Via this thread ...

Reuti has updated his methods and information for achieving tight integration in MPICH2 environments. An updated set of files for mpd integration for MPICH(2) is now at http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-60.tgz

The thread discusses Intel MPI 3.1, with the suggestion that the above methods for MPICH2 may also work with the Intel product. The basic issue is that the standard "mpdboot" startup method has always made tight integration with Grid Engine difficult to achieve.

Fixing a berkeley db spool database

Posted by chris Tue, 11 Nov 2008 17:43:00 GMT

Per this thread on the users list, a recipe for rebuilding and re-verifying a Berkeley DB based binary SGE spool:

service sgemaster stop # on failover server
service sgemaster stop # on master server
cd $SGE_ROOT/default/spool
cp -a spooldb spooldb.bak
cd spooldb
$SGE_ROOT/utilbin/lx24-amd64/db_verify sge
$SGE_ROOT/utilbin/lx24-amd64/db_recover
$SGE_ROOT/utilbin/lx24-amd64/db_dump -f sge.out sge
mv sge sge.old
$SGE_ROOT/utilbin/lx24-amd64/db_load -f sge.out sge
$SGE_ROOT/utilbin/lx24-amd64/db_verify sge
service sgemaster start # on master server
service sgemaster start # on failover server
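
If you would rather not paste those commands one at a time, the same sequence can be wrapped so that it stops at the first failing step. This is only a sketch: the function name rebuild_spooldb and its two arguments are invented for this example, it assumes sgemaster has already been stopped on both the master and failover servers, and it treats a failure of the first db_verify as a warning (on the assumption that a damaged database is the reason you are running it at all).

```shell
#!/bin/sh
# Sketch only: same db_* sequence as the recipe, stopping on the first error.
# Assumes sgemaster is already stopped on the master and failover servers.

# rebuild_spooldb UTILDIR SPOOLDIR
#   UTILDIR  - directory holding db_verify, db_recover, db_dump, db_load
#              (for example $SGE_ROOT/utilbin/lx24-amd64)
#   SPOOLDIR - directory containing spooldb
#              (for example $SGE_ROOT/default/spool)
rebuild_spooldb() (
    utildir=$1
    cd "$2" || exit 1
    cp -a spooldb spooldb.bak || exit 1     # keep an untouched copy first
    cd spooldb || exit 1
    # A failing verify is expected on a damaged database, so only warn here.
    "$utildir/db_verify" sge || echo "db_verify reported problems, continuing" >&2
    "$utildir/db_recover" &&
    "$utildir/db_dump" -f sge.out sge &&    # dump the database to a flat file
    mv sge sge.old &&
    "$utildir/db_load" -f sge.out sge &&    # reload into a fresh database file
    "$utildir/db_verify" sge                # final status is the function result
)

# Example call, then restart sgemaster on the master and failover servers:
#   rebuild_spooldb "$SGE_ROOT/utilbin/lx24-amd64" "$SGE_ROOT/default/spool"
```

The body runs in a subshell, so the cd calls do not change the directory of the calling shell, and a non-zero return tells you not to restart the master yet.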

Grid Engine, workflows & virtualization

Posted by chris Fri, 07 Nov 2008 18:04:10 GMT

Another discussion happening recently on the SGE user list concerns how best to handle virtualization. That thread can be browsed here.

In a followup, Andreas is soliciting feedback from the wider community on how you want to see this area handled in future revisions of Grid Engine. Time to speak up if you have an opinion!

Read Andreas's request for feedback here.

Grid Engine & power saving

Posted by chris Fri, 07 Nov 2008 17:02:17 GMT

I'd guess that most people don't follow the SGE developer list all that closely. Sometimes the developer discussions cross over into areas that all users may be interested in.

There has been an interesting discussion on various ways to give SGE the ability to either directly trigger or otherwise interact with various systems that either switch nodes down into lower power states or even completely power them down/up as needed (Project Hedeby / SDM, etc.)

Automatic methods for powering up and down portions of clusters based on workload have been used for years now but the topic seems to be getting more interest and more backing. A few years ago I saw a neat solution that some people at Cornell Medical College had done -- they used PBS/Torque and had various IPMI scripts that powered nodes on or off depending on the size of the pending job list.

The developer thread (via MarkMail) is here. The CollabNet "Forum View" is here.

Help make grid engine better

Posted by chris Tue, 23 Sep 2008 15:33:29 GMT

The Grid Engine development team can't develop the product in a vacuum -- feedback, suggestions and input from real-world production users of Grid Engine are always critical.

If you are a serious Grid Engine user and care about the future direction of software development efforts, take the time to read this proposal for a new "Job Submission Verifier" enhancement. If the subject is relevant to how you use SGE, the developers would welcome comments, suggestions and feedback before Friday.

"Job Submission Verifier Specifications:"
http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=25999

Follow the email thread via MarkMail if you are interested in what others are saying about this.

Fixing SGE email issues on Apple OS X

Posted by chris Tue, 23 Sep 2008 14:57:31 GMT

Are you in the following situation?

  1. /usr/bin/mail works perfectly from the command line
  2. /usr/bin/mail configured as the SGE mailer produces no email
  3. substituting a wrapper with extra logging also produces no logs or email

The only clue is in the spool logs:

09/10/2008 16:22:07|execd|xxx-fs01|E|mailer had timeout - killing
09/10/2008 16:22:07|execd|xxx-fs01|E|mailer exited with exit status= 1
09/10/2008 16:22:19|execd|xxx-fs01|E|mailer had timeout - killing
09/10/2008 16:22:19|execd|xxx-fs01|E|mailer exited with exit status= 1

Thanks to Valerio Luccio we have a workaround. The issue is apparently that one of the SGE-supplied libraries conflicts with the mail MTA on OS X when SGE tries to invoke it. A trivial wrapper script that overrides the DYLD_LIBRARY_PATH environment variable is the fix:

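#!/bin/sh
# SGE calls the mailer as: mailer -s <subject> <recipient> (see sge_conf(5)),
# so $2 is the subject and $3 the recipient; the message body arrives on stdin.
export DYLD_LIBRARY_PATH=/usr/lib
/usr/bin/mail -s "$2" "$3"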

This solved a problem that had been bothering me for days, thanks Valerio - I owe you a beer if we ever end up at the same meeting or conference!

MarkMail: Mine the grid engine maillist archives

Posted by chris Wed, 17 Sep 2008 11:48:31 GMT

MarkMail has just imported all of the Grid Engine mailing lists from http://gridengine.sunsource.net into their archive, search, index and database system. Initial results are pretty impressive based on a few minutes of searching and experimentation -- seems like a great way to search the mailing lists for answers and info.

Click on the image above and you'll be taken to a search on the term 'rqs'. Leave a comment with your impressions if you are so inclined.

Bugfix madness

Posted by chris Fri, 12 Sep 2008 11:51:46 GMT

[Image: jana-bugfix-1.png]

Wow, this is what my inbox looked like this morning -- a massive influx of resolved SGE issues via the SGE issues mailing list. Go Jana!
