Why upgrade? DanT explains SGE from 5.x through 6.2 and beyond
Dan has posted a great overview of how Grid Engine has changed since the version 5.x days, couched in the context of answering the "Why should I upgrade SGE?" questions that often come up.
I won't even excerpt it; the full article is well worth a read:
http://blogs.sun.com/templedf/entry/why_upgrade
Feedback needed: Obsolete options and parameters considered for removal
Grid Engine developers posted a list today of SGE configuration parameters and client arguments that are being considered for removal from the product because they are either obsolete or they duplicate settings found elsewhere.
The developers are seeking feedback and comments on their plans - if you have any, please drop a line to the users@gridengine.sunsource.net mailing list. The current roadmap calls for these items to be marked as 'deprecated' in the SGE 6.2 release, with total removal planned for a future post-6.2 release.
The message can be found here:
http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=25045
The full list of items being considered for removal can also be found after the jump ...
The parameters planned to be made obsolete are:
host_conf(5)
- processors: obsolete, same as num_proc from the complex list
sched_conf(5)
- algorithm: just default is allowed, no additional algorithms are planned
- params JC_FILTER: huge performance impact plus may lead to wrong scheduling decisions
sge_conf(5)
- reprioritize: redundant because hard bound to reprioritize_interval in sched_conf(5)
- shell_start_mode: obsolete, value from queue_conf(5) is used
- set_token_cmd: no known AFS support
- pag_cmd: no known AFS support
- token_extend_time: no known AFS support
- qmaster_params DISABLE_AUTO_RESCHEDULING: equivalent to default reschedule_unknown=0:0:0
- qmaster_params: merge ACCT_RESERVED_USAGE and SHARETREE_RESERVED_USAGE; we can't imagine a use case for keeping these values separate
- finished_jobs: qstat -j does not work with successfully finished jobs. Code seems to work only with jobs going into error state.
user(5)
- delete_time: change to internal, not changeable/visible field. Implicitly set by auto_user_delete_time
qconf(1)
- sep option: obsolete, same as num_proc
- ks option: obsolete, same as -kt scheduler
qmod(1)
- c option: deprecated, use -cj or -cq
- r option: deprecated, use -rj or -rq
- s option: deprecated, use -sj or -sq
- us option: deprecated, use -usj or -usq
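For anyone who wants to check their own wrapper scripts before the qmod options above go away, here is a rough sketch that greps for the old-style forms. The function name and the regular expression are illustrative assumptions, not anything shipped with SGE, and it only catches the simple case where the option directly follows `qmod`.

```shell
# Hypothetical helper: list lines in the given files that call qmod
# with one of the options slated for deprecation (-c, -r, -s, -us).
# The job/queue specific forms (-cj, -cq, -rj, ...) are not matched.
find_deprecated_qmod() {
    grep -nE 'qmod[[:space:]]+-(c|r|s|us)([[:space:]]|$)' "$@"
}
```

Calling `find_deprecated_qmod my_script.sh` prints each matching line with its line number. Like grep itself, it returns a non-zero status when nothing matches.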
SGE 6.2 beta 2 is out
6.2b2 came out yesterday:
http://gridengine.sunsource.net/news/GE62beta2-announce.html
The list of bug fixes made since SGE 6.2 Beta 1 is online at http://gridengine.sunsource.net/project/gridengine/62patches.txt.
This is the latest beta release of SGE 6.2 and we really need more eyeballs and testers on this release to flush out any remaining issues before 6.2 goes officially out the door.
There are some differences in 6.2, in both the install procedure and the daemons (sge_schedd is gone! -- It's now a thread within sge_qmaster). I posted a screencast recording of the SGE 6.2 Beta 1 installation a while back: http://gridengine.info/articles/2008/05/16/screencast-live-install-of-sge6-2-beta for those that may be interested in watching what the new install process looks like.
June 2008 SGE Workshops
Consider this post a plug for the upcoming June 2008 SGE User and SGE Admin workshops that are being held in the Boston, MA USA area.
More details here:
http://blog.bioteam.net/2008/03/22/sge-training/
SGE 6.2 beta binaries are available for testing
I'm not going to waste time copying the release announcement into a blog post. The full announcement can be read here:
http://gridengine.sunsource.net/servlets/ReadMsg?list=announce&msgNo=94
Lots of significant changes in the product itself. I also love the migration of manuals and docs to the new http://wikis.sun.com/display/GridEngine site.
Please remember that the reason for this beta release is to allow you to test 6.2 before it officially goes out the door in final form. The more people we have working on and stress-testing 6.2 the less chance there will be an inconvenient or unexpected upgrade issue, bug or glitch. The developers have good testbed environments and testsuites but they can't simulate all the different ways and methods that we use (and abuse!) SGE to get work done. Help make the 6.2 release a big success by testing now and providing feedback.
SGE 6.2 goes beta next week (your help needed)
SGE 6.2 is being released in Beta form next week and the developers are asking for people to make some time if possible to fully test out the beta snapshot of the latest major SGE point release.
Andy's full note can be found here (well worth reading in full ...):
http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=24426
In my mind, I'm most excited about the following:
- Advance Reservations & array job inter-dependencies
- The scheduler is now a thread within the qmaster!
- The JVM running within the qmaster
- SGE moving all docs into wiki form!
RHEL5.2/Centos5 kernel update may cause problems
This is a heads up for RedHat Enterprise Linux (RHEL) users as well as for users (like myself) of the various Centos variants.
There is a recent patch for RHEL that changes the inode data structure exposed to NFS clients from 32 bits to 64 bits in size. The basic summary is that many applications may not handle this change gracefully (there is already one such report involving the SGE Linux binaries).
RHEL and modern Centos users should probably keep an eye on this issue (by subscribing as CC: contacts):
http://gridengine.sunsource.net/issues/show_bug.cgi?id=2543
A RedHat bug report discussing the issue in more detail is here:
"Large inode number patch breaks applications"
https://bugzilla.redhat.com/show_bug.cgi?id=241348
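A rough way to see whether a filesystem is actually handing out inode numbers beyond the 32-bit range is sketched below. The function names are made up for illustration, and the only assumption about the tools is the standard `ls -di` output (inode number followed by path).

```shell
# Largest value that fits in an unsigned 32-bit inode field.
MAX_INO32=4294967295

# Succeeds (exit 0) when the given inode number needs more than 32 bits.
inode_needs_64bit() {
    [ "$1" -gt "$MAX_INO32" ]
}

# Print files under a directory whose inode number exceeds 32 bits.
# "ls -di" prints the inode number followed by the path.
list_large_inodes() {
    find "$1" -xdev -exec ls -di {} + | while read -r ino path; do
        if inode_needs_64bit "$ino"; then
            printf '%s %s\n' "$ino" "$path"
        fi
    done
}
```

Running `list_large_inodes /path/to/nfs/mount` on a client prints nothing if every inode number still fits in 32 bits.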
6.1 leak found; schedd_job_info is not your friend
Anyone interested in the memory leak that has been bothering some 6.1 users should check out the comments associated with Issue #2464:
http://gridengine.sunsource.net/issues/show_bug.cgi?id=2464
Among the interesting things you'll see are:
- A great example of motivated SGE users and developers working together to track down a hard to find problem
- Interesting comments on the potentially "unfixable" (my words) nature of the schedd_job_info messages
- A really cool workaround for getting job scheduler messages with schedd_job_info=FALSE
In a nutshell, there is a problem in the schedd_job_info framework that can cause massive resource utilization on the qmaster machine. This happens in particular on larger systems or places with large numbers of queue instances. This can also pop up on systems with jobs that are pending due to un-fulfillable resource requests. This explains why I saw the memory leak on my small testbed cluster -- I have a number of "pend forever" jobs in the queue for demonstration purposes.
The fix is to disable schedd_job_info. This is potentially problematic though, as that feature is pretty much my go-to first action for troubleshooting job dispatch problems.
However, in a recent update comment to this issue, Andreas added a possible tip for getting scheduling messages about a job in a way that puts far less load on the system AND does not require schedd_job_info=TRUE:
qalter -w v
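As a minimal sketch of how that tip might be used day to day, the helpers below wrap the command for a single job ID and show the current schedd_job_info setting. The function names are hypothetical; the underlying calls are just `qalter -w v <job_id>` and `qconf -ssconf`.

```shell
# Hypothetical wrapper around the tip above: ask qalter to verify a
# pending job and print the resulting scheduling messages.
why_pending() {
    if [ -z "$1" ]; then
        echo "usage: why_pending <job_id>" >&2
        return 2
    fi
    qalter -w v "$1"
}

# Show the current schedd_job_info line from the scheduler configuration.
show_schedd_job_info() {
    qconf -ssconf | grep schedd_job_info
}
```

For example, `why_pending 4711` would check job 4711 (a made-up job ID) without needing schedd_job_info=TRUE.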
Remember though that comments found in a bug report are not "gospel" so don't read this as news that schedd_job_info is forever broken or going away. Expect to see this and other issues discussed as part of the SGE Roadmap. You are attending the May 2008 SGE Workshop, right?
Release 6.1u4 is out
Congratulations to the SGE developer team!
Big news today -- 6.1u4 was just announced; hopefully addressing some persistent issues people have been having with the previous releases. The plaintext list of fixed issues can be found here:
http://gridengine.sunsource.net/project/gridengine/61patches.txt
The full announcement is here:
http://gridengine.sunsource.net/news/GE61u4-announce.html
I've been unable to keep 6.1u3 running consistently on a small test system, probably due to the same memory leak others have been reporting. There is a chance that a subtle leak still exists or at least has not been fully tracked down in 6.1u4 but multiple people are working diligently on this. Best bet is to monitor the users mailing list to see the feedback.
CFP: Open Source Grid & Cluster Conference 2008
Reminder: Call for Participation closes Friday, March 21
OPEN SOURCE GRID & CLUSTER CONFERENCE 2008
Featuring: GlobusWorld, Grid Engine Workshop, Rocks Cluster Workshop
May 13 - 15, 2008 in Oakland, California
http://www.OpenSourceGridCluster.org
DEADLINE FOR ABSTRACT SUBMISSIONS: March 21, 2008
Whether you are a Grid or Cluster expert with technical advice to share, or a leader with visions for the future of open source Grid and Cluster computing in research or industry, the Open Source Grid & Cluster Conference is the premier event for delivering your message to the Grid and Cluster community. In past years, hundreds of Grid and Cluster professionals from research and industry have attended individual events such as GlobusWorld, the Grid Engine Workshop, and Rocks-a-Palooza to discuss Grid and Cluster adoption issues, to receive training and exchange information related to these widely used Grid and Cluster software systems. This year the Globus, Grid Engine, and Rocks communities are joining forces to create the most comprehensive event on open source Grid and Cluster computing to date.
The Open Source Grid & Cluster Conference program will offer a wide variety of conference sessions, mini-symposiums, panel discussions, workshops, and tutorials. Speaking opportunities range from highly technical research, development, and deployment presentations to targeted panels on commercial and research adoption considerations. The Open Source Grid & Cluster Conference will run parallel tracks, some focused on Globus, Grid Engine, and Rocks community-specific topics, and others focused on cross-cutting and other open source Grid and Cluster software technologies and uses.
KEY DATES AND DEADLINES
Abstract submission deadline - March 21, 2008
Acceptance notification - April 15, 2008
Presentation Slides Due - April 30, 2008
SPEAKING TOPICS
Submissions should be centered on the theme of uses and implementation
of Open Source Software for Grid and Cluster Computing.
All proposals should be submitted online at http://www.OpenSourceGridCluster.org/CFP.html
Questions should be sent to program@OpenSourceGridCluster.org
SUBMISSION GUIDELINES
ABSTRACT GUIDELINES
All submissions must include an abstract of no more than 500 words,
and a brief bio for each presenter. Abstracts should be written so as
to be self-contained and to provide the technical substance required
for the program committee to evaluate the session's contribution to
the Open Source Grid and Cluster community. Please indicate whether
the proposed session is specific to just one of Globus, Grid Engine,
or Rocks. If the presentation was given at another conference, then
the name, date, and location of the event must be noted in the
submission. Abstracts should be submitted in plain text format either
as an attachment or in the main body of the e-mail. Abstracts and bios
for accepted submissions will be published on the Open Source Grid &
Cluster Conference website and in other conference material as the
description of the session. Presentation slides may be published on
the Conference website and distributed with conference material.
PRESENTATIONS
Presentation proposals may be submitted for individual time slots of
thirty minutes. Please be sure to allow ten minutes for Q&A within
this allotted time. Individual presentations will be grouped with
similar topic presentations to fill an entire session.
BUILD YOUR OWN SESSION
Participants are invited to organize their own, complete,
ninety-minute session, including but not limited to the following
categories. The submission must include an agenda, and the names and
associations of all participants.
Panel Session / Mini-Symposium: These sessions will enable conference attendees to learn from a group of experts on a particular topic. The session organizer may deliver an opening talk to set the context for the remainder of the session. Panelists will then give presentations designed to stimulate audience participation, on their preferably diverse opinions, experiences or expertise regarding the theme of the session. At least ten minutes should be reserved at the end for questions from the audience.
Birds-of-a-Feather (BOF) Sessions: These sessions will allow conference attendees to discuss focused subject areas. The session may include presentations and open discussion. Session organizers will be responsible for moderating these sessions and reporting on their outcomes.
WORKSHOPS AND TUTORIALS
Ample room is available for half-day and full-day pre-conference
(Monday) and post-conference (Friday) workshops and
tutorials. Workshops may include topical meetings with open
registration or community/group meetings with restricted attendance.
Tutorials may be on any topic related to the Open Source Grid and
Cluster theme of the conference. Submissions must include preferred
and minimum acceptable room size, and preferred and acceptable
times. An extra nominal fee may be required of attendees or the
organizer to cover additional costs such as A/V and food.
All proposals should be submitted online at http://www.OpenSourceGridCluster.org/CFP.html
Questions should be sent to program@OpenSourceGridCluster.org
New home for gridengine.info
Some behind-the-scenes info to report:
- We have a new server home and fresh OS: Centos 5 Linux running virtually under XenEnterprise
- The XenEnterprise infrastructure is running on top of some really sweet storage and server hardware from Silicon Mechanics
- xml-qstat.org (SVN repository, local demo and the website) are now hosted on the same system as gridengine.info
- Now using Ruby 1.8.6 and the latest stable RAILS environment to host the blog
- The OS refresh and RAILS environment update finally allowed us to upgrade to the latest Typo 5 based blogging engine
- wiki.gridengine.info also coexists on this system along with everything else
The only glitch so far is Typo (or RAILS, not sure …) and its inability to handle article tags that contain the "." character. This means all of the SGE version based tags like "6.1" currently fail to work. Even the "tag cloud" sidebar can’t handle "." characters so that has temporarily been moved off the blog. This will either be fixed or we’ll hack the database to replace the "." with something else.
Expect access to this site to possibly be flaky as new DNS information propagates outward.
Comments welcome – currently the blog is proxied behind Apache and is just using the basic built-in webserver that comes with the RAILS environment instead of the Lighttpd-with-FastCGI that I had hand built on the old server. I’d be interested in hearing comments on how fast or slow this new server is operating. If needed we’ll put the blog under Lighttpd/FastCGI again.
Grid Engine and Apple OS X Launchd
This is a follow-up post relating to the new Apple framework for starting, stopping and managing persistent daemons and services called "launchd". The issue of Grid Engine interoperability with the launchd framework has already been covered in a gridengine.info Wiki article.
The news to report is that my coworker Bill Van Etten stumbled upon the SGE environment variable "SGE_ND" and realized that it could be useful for Apple launchd integration, because launchd really hates daemons that fork off ASAP upon startup. By setting the "SGE_ND" variable to true, the daemons don't fork and can be better managed by launchd.
The new launchd scripts are discussed and available for download here:
http://blog.bioteam.net/2008/03/04/apple-os-x-105-launchd-scripts-for-grid-engine/
Feel free to use these scripts or simply refer to them when customizing your own. As always, feedback and comments would be appreciated. BioTeam remains committed to making sure SGE remains an excellent choice for use on OS X based systems.
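To illustrate the SGE_ND idea described above (this is not a copy of the BioTeam scripts), here is a minimal sketch of a foreground launcher. It assumes SGE_ROOT points at the install and that `$SGE_ROOT/util/arch` prints the architecture directory name used under `$SGE_ROOT/bin`; the function name is made up.

```shell
# Hypothetical launcher: start an SGE daemon without letting it fork
# into the background, so a supervisor such as launchd can manage it.
run_sge_foreground() {
    daemon=$1                      # e.g. sge_execd or sge_qmaster
    if [ -z "$daemon" ] || [ -z "$SGE_ROOT" ]; then
        echo "usage: SGE_ROOT=/path run_sge_foreground <daemon>" >&2
        return 2
    fi
    # Ask the SGE install which architecture directory to use.
    arch=$("$SGE_ROOT/util/arch") || return 1
    # SGE_ND=true tells the daemon not to daemonize itself.
    SGE_ND=true "$SGE_ROOT/bin/$arch/$daemon"
}
```

A real launchd wrapper would probably `exec` the daemon as its last step so that launchd tracks the daemon process directly. The sketch calls it normally only to keep the function easy to test.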
DRMAA memory leak found & fixed
Most casual SGE users and admins probably find little cause to monitor the Grid Engine developer mailing list. A nice little success story has played out on the list recently with a user assisting the SGE dev team in quickly discovering, isolating and fixing a memory leak that has been in the codebase since the DRMAA 1.0 API release.
A user posted this message to the developer list, showing what appears to be a memory leak in drmaa_run_job(). Andreas then replied asking if it was possible for the user to recreate the issue while running under the valgrind instrumentation framework.
In this follow-up thread, the user-provided valgrind data allowed Andreas to pinpoint the problem, file Issue #2497 with the bug tracking database and then post a preliminary patch that fixes the problem.
The patch still needs to undergo code review before it makes it officially into the Grid Engine codebase. Overall this is a nice little success story where a user was able to go the extra mile (by instrumenting under valgrind) in order to provide the developers exactly what they needed to quickly identify and fix things.
Kudos to James & Andreas.
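For readers who have never run a program under valgrind, the general shape of such a run is sketched below. The helper name, log file and program name are placeholders, not the commands used in the thread; `--leak-check=full` and `--log-file=` are standard valgrind options.

```shell
# Hypothetical helper: run a program under valgrind's leak checker and
# write the report to a log file that can be attached to a bug report.
leak_check() {
    if [ $# -lt 2 ]; then
        echo "usage: leak_check <logfile> <program> [args...]" >&2
        return 2
    fi
    log=$1
    shift
    valgrind --leak-check=full --log-file="$log" "$@"
}
```

Something like `leak_check drmaa-leak.log ./my_drmaa_app` would produce a report the developers can read directly.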
Updated Quick Reference Guide
Thanks to significant assistance from Mark Olesen, there are new versions of the SGE Quick Reference Guide posted over at my employer blog site. The major changes are:
- Errors, typos and mistakes fixed
- A new layout, formatted for A4 paper sizes, has been created
http://blog.bioteam.net/2008/02/06/grid-engine-quick-reference-guide/



