Why upgrade? DanT explains SGE from 5.x through 6.2 and beyond

Posted by chris Fri, 18 Jul 2008 18:58:52 GMT

Dan has posted a great overview of how Grid Engine has changed since the version 5.x days, couched in the context of answering the "Why should I upgrade SGE?" questions that often come up.

I won't even excerpt it; the full article is well worth a read:
http://blogs.sun.com/templedf/entry/why_upgrade

Feedback needed: Obsolete options and parameters considered for removal

Posted by chris Tue, 24 Jun 2008 12:22:41 GMT

Grid Engine developers posted a list today of SGE configuration parameters and client arguments that are being considered for removal from the product because they are either obsolete or they duplicate settings found elsewhere.

The developers are seeking feedback and comments on their plans. If you have any, please drop a line to the users@gridengine.sunsource.net mailing list. The current roadmap calls for these parameters and options to be marked as 'deprecated' in the SGE 6.2 release, with total removal planned for a future post-6.2 release.

The message can be found here:
http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=25045

The full list of items being considered for removal can also be found after the jump ...

The parameters planned for removal are:

host_conf(5)
- processors
   obsolete, same as num_proc from the complex list

sched_conf(5)
- algorithm
   only the default is allowed, no additional algorithms are planned
- params JC_FILTER
   huge performance impact plus may lead to wrong scheduling decisions

sge_conf(5)
- reprioritize
   redundant because hard bound to reprioritize_interval in sched_conf(5)
- shell_start_mode
   obsolete, value from queue_conf(5) is used
- set_token_cmd
   no known AFS support
- pag_cmd
   no known AFS support
- token_extend_time
   no known AFS support
- qmaster_params DISABLE_AUTO_RESCHEDULING
   equivalent to default reschedule_unknown=0:0:0
- qmaster_params merge ACCT_RESERVED_USAGE and SHARETREE_RESERVED_USAGE
   We can't imagine a use case for keeping these values separate
- finished_jobs
   qstat -j does not work with successfully finished jobs. Code seems
   to work only with jobs going into error state.

user(5)
- delete_time
   change to an internal field, neither changeable nor visible
   Implicitly set by auto_user_delete_time

qconf(1)
- sep option
   obsolete, same as num_proc
- ks option
   obsolete, same as -kt scheduler

qmod(1)
- c option
   deprecated, use -cj or -cq
- r option
   deprecated, use -rj or -rq
- s option
   deprecated, use -sj or -sq
- us option
   deprecated, use -usj or -usq
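
If you want to see whether your own wrapper scripts lean on the qmod(1) options listed above, a small checker is easy to put together. What follows is only a rough sketch: the mapping is copied from the list above, while the function name and the example command line (including the queue name all.q) are made up for illustration and are not part of SGE.

```python
import shlex

# Mapping copied from the qmod(1) entries in the list above:
# deprecated option -> suggested replacements.
DEPRECATED_QMOD = {
    "-c": ("-cj", "-cq"),
    "-r": ("-rj", "-rq"),
    "-s": ("-sj", "-sq"),
    "-us": ("-usj", "-usq"),
}


def find_deprecated_qmod_options(argv):
    """Return (option, replacements) pairs for each deprecated option in argv.

    Hypothetical helper for illustration only; it does not call qmod.
    """
    return [(arg, DEPRECATED_QMOD[arg]) for arg in argv if arg in DEPRECATED_QMOD]


if __name__ == "__main__":
    # Example command line; "all.q" is just a placeholder queue name.
    command = shlex.split("qmod -s all.q")
    for option, replacements in find_deprecated_qmod_options(command):
        print(f"{option} is deprecated; use {' or '.join(replacements)}")
```

The checker only reports the two candidate replacements. Whether the job form (-sj) or the queue form (-sq) is the right one depends on what the original call was targeting, so that decision stays with you.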

SGE 6.2 beta 2 is out

Posted by chris Thu, 19 Jun 2008 11:04:12 GMT

6.2b2 came out yesterday:

http://gridengine.sunsource.net/news/GE62beta2-announce.html

The list of bug fixes made since SGE 6.2 Beta 1 is online at http://gridengine.sunsource.net/project/gridengine/62patches.txt.

This is the latest beta release of SGE 6.2 and we really need more eyeballs and testers on this release to flush out any remaining issues before 6.2 goes officially out the door.

There are some differences in 6.2 in both the install procedure and the daemons (sge_schedd is gone! -- It's now a thread within sge_qmaster). I posted a screencast recording of the SGE 6.2 Beta 1 installation a while back: http://gridengine.info/articles/2008/05/16/screencast-live-install-of-sge6-2-beta for those that may be interested in watching what the new install process looks like.

June 2008 SGE Workshops

Posted by chris Fri, 23 May 2008 13:53:49 GMT

Consider this post a plug for the upcoming June 2008 SGE User and SGE Admin workshops that are being held in the Boston, MA USA area.

More details here:
http://blog.bioteam.net/2008/03/22/sge-training/

SGE 6.2 beta binaries are available for testing

Posted by chris Tue, 13 May 2008 14:24:12 GMT

I'm not going to waste time copying the release announcement into a blog post. The full announcement can be read here:

http://gridengine.sunsource.net/servlets/ReadMsg?list=announce&msgNo=94

Lots of significant changes in the product itself. I also love the migration of manuals and docs to the new http://wikis.sun.com/display/GridEngine site.

Please remember that the reason for this beta release is to allow you to test 6.2 before it officially goes out the door in final form. The more people we have working on and stress-testing 6.2, the less chance there is of an inconvenient or unexpected upgrade issue, bug or glitch. The developers have good testbed environments and testsuites but they can't simulate all the different ways and methods that we use (and abuse!) SGE to get work done. Help make the 6.2 release a big success by testing now and providing feedback.

SGE 6.2 goes beta next week (your help needed)

Posted by chris Mon, 05 May 2008 14:00:43 GMT

SGE 6.2 is being released in Beta form next week and the developers are asking for people to make some time if possible to fully test out the beta snapshot of the latest major SGE point release.

Andy's full note can be found here (well worth reading in full ...):
http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=24426

In my mind, I'm most excited about the following:

  • Advance Reservations & array job inter-dependencies
  • The scheduler is now a thread within the qmaster!
  • The JVM running within the qmaster
  • SGE moving all docs into wiki form!

RHEL5.2/Centos5 kernel update may cause problems

Posted by chris Mon, 21 Apr 2008 16:20:25 GMT

This is a heads up for RedHat Enterprise Linux (RHEL) users as well as for users (like myself) of the various Centos variants.

There is a recent patch for RHEL that changes the inode data structure exposed to NFS clients from 32 bits to 64 bits in size. In short, many applications may not handle this change gracefully; one such report involves the SGE Linux binaries.

RHEL and modern Centos users should probably keep an eye on this issue (by subscribing as CC: contacts):
http://gridengine.sunsource.net/issues/show_bug.cgi?id=2543

A RedHat bug report discussing the issue in more detail is here:
"Large inode number patch breaks applications"
https://bugzilla.redhat.com/show_bug.cgi?id=241348

6.1 leak found; schedd_job_info is not your friend

Posted by chris Thu, 10 Apr 2008 15:07:00 GMT

Anyone interested in the memory leak that has been bothering some 6.1 users should check out the comments associated with Issue #2464:
http://gridengine.sunsource.net/issues/show_bug.cgi?id=2464

Among the interesting things you'll see are:

  • A great example of motivated SGE users and developers working together to track down a hard-to-find problem
  • Interesting comments on the potential "unfixable" (my words) nature of the schedd_job_info messages
  • A really cool workaround for getting job scheduler messages with schedd_job_info=FALSE

In a nutshell, there is a problem in the schedd_job_info framework that can cause massive resource utilization on the qmaster machine. This happens in particular on larger systems or places with large numbers of queue instances. This can also pop up on systems with jobs that are pending due to un-fulfillable resource requests. This explains why I saw the memory leak on my small testbed cluster -- I have a number of "pend forever" jobs in the queue for demonstration purposes.

The fix is to disable schedd_job_info. This is potentially problematic though, as that feature is pretty much my go-to first action for troubleshooting job dispatch problems.

However, in a recent comment on this issue, andreas added a possible tip for getting scheduling messages about a job in a way that puts far less load on the system AND does not require schedd_job_info=TRUE:

qalter -w v  
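
For anyone who wants to fold that check into a troubleshooting script, here is a minimal sketch. It assumes the command takes the ID of the pending job as its final argument; the two function names are my own inventions, and nothing here comes from the bug report itself.

```python
import shutil
import subprocess


def verify_command(job_id):
    """Build the argv for the verification-only qalter call.

    Assumption: the job ID is passed as the last argument.
    """
    return ["qalter", "-w", "v", str(job_id)]


def explain_pending(job_id):
    """Run the check and return its text output.

    Returns None when qalter is not on PATH (for example, on a machine
    without the SGE client tools installed).
    """
    if shutil.which("qalter") is None:
        return None
    result = subprocess.run(
        verify_command(job_id), capture_output=True, text=True
    )
    # Return both streams, since messages may arrive on either one.
    return result.stdout + result.stderr
```

Calling explain_pending() with the ID of one of your own pending jobs should hand back whatever qalter prints for it. I have not tried to parse that output here because its exact format is not something I want to guess at.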

Remember though that comments found in a bug report are not "gospel" so don't read this as news that schedd_job_info is forever broken or going away. Expect to see this and other issues discussed as part of the SGE Roadmap. You are attending the May 2008 SGE Workshop, right?

Release 6.1u4 is out

Posted by chris Fri, 04 Apr 2008 14:06:11 GMT

Congratulations to the SGE developer team!

Big news today -- 6.1u4 was just announced, hopefully addressing some persistent issues people have been having with the previous releases. The plaintext list of fixed issues can be found here:
http://gridengine.sunsource.net/project/gridengine/61patches.txt

The full announcement is here:
http://gridengine.sunsource.net/news/GE61u4-announce.html

I've been unable to keep 6.1u3 running consistently on a small test system, probably due to the same memory leak others have been reporting. There is a chance that a subtle leak still exists or at least has not been fully tracked down in 6.1u4 but multiple people are working diligently on this. Best bet is to monitor the users mailing list to see the feedback.

CFP: Open Source Grid & Cluster Conference 2008

Posted by chris Wed, 19 Mar 2008 14:15:28 GMT

Reminder: Call for Participation closes Friday, March 21

OPEN SOURCE GRID & CLUSTER CONFERENCE 2008

Featuring: GlobusWorld, Grid Engine Workshop, Rocks Cluster Workshop

May 13 - 15, 2008 in Oakland, California
http://www.OpenSourceGridCluster.org

DEADLINE FOR ABSTRACT SUBMISSIONS: March 21, 2008

Whether you are a Grid or Cluster expert with technical advice to share, or a leader with visions for the future of open source Grid and Cluster computing in research or industry, the Open Source Grid & Cluster Conference is the premier event for delivering your message to the Grid and Cluster community. In past years, hundreds of Grid and Cluster professionals from research and industry have attended individual events such as GlobusWorld, the Grid Engine Workshop, and Rocks-a-Palooza to discuss Grid and Cluster adoption issues, to receive training and exchange information related to these widely used Grid and Cluster software systems. This year the Globus, Grid Engine, and Rocks communities are joining forces to create the most comprehensive event on open source Grid and Cluster computing to date.

The Open Source Grid & Cluster Conference program will offer a wide variety of conference sessions, mini-symposiums, panel discussions, workshops, and tutorials. Speaking opportunities range from highly technical research, development, and deployment presentations to targeted panels on commercial and research adoption considerations. The Open Source Grid & Cluster Conference will run parallel tracks, some focused on Globus, Grid Engine, and Rocks community-specific topics, and others focused on cross-cutting and other open source Grid and Cluster software technologies and uses.

KEY DATES AND DEADLINES
Abstract submission deadline - March 21, 2008
Acceptance notification - April 15, 2008
Presentation Slides Due - April 30, 2008

SPEAKING TOPICS
Submissions should be centered on the theme of uses and implementation of Open Source Software for Grid and Cluster Computing.

All proposals should be submitted online at http://www.OpenSourceGridCluster.org/CFP.html

Click on through for the submission guidelines ...

Questions should be sent to program@OpenSourceGridCluster.org

SUBMISSION GUIDELINES ---------------------

ABSTRACT GUIDELINES
All submissions must include an abstract of no more than 500 words, and a brief bio for each presenter. Abstracts should be written so as to be self-contained and to provide the technical substance required for the program committee to evaluate the session's contribution to the Open Source Grid and Cluster community. Please indicate whether the proposed session is specific to just one of Globus, Grid Engine, or Rocks. If the presentation was given at another conference, then the name, date, and location of the event must be noted in the submission. Abstracts should be submitted in plain text format either as an attachment or in the main body of the e-mail. Abstracts and bios for accepted submissions will be published on the Open Source Grid & Cluster Conference website and in other conference material as the description of the session. Presentation slides may be published on the Conference website and distributed with conference material.

PRESENTATIONS
Presentation proposals may be submitted for individual time slots of thirty minutes. Please be sure to allow ten minutes for Q&A within this allotted time. Individual presentations will be grouped with similar topic presentations to fill an entire session.

BUILD YOUR OWN SESSION
Participants are invited to organize their own, complete, ninety-minute session, including but not limited to the following categories. The submission must include an agenda, and the names and associations of all participants.

Panel Session / Mini-Symposium: These sessions will enable conference attendees to learn from a group of experts on a particular topic. The session organizer may deliver an opening talk to set the context for the remainder of the session. Panelists will then give presentations designed to stimulate audience participation, on their preferably diverse opinions, experiences or expertise regarding the theme of the session. At least ten minutes should be reserved at the end for questions from the audience.

Birds-of-a-Feather (BOF) Sessions: These sessions will allow conference attendees to discuss focused subject areas. The session may include presentations and open discussion. Session organizers will be responsible for moderating these sessions and reporting on their outcomes.

WORKSHOPS AND TUTORIALS
Ample room is available for half-day and full-day pre-conference (Monday) and post-conference (Friday) workshops and tutorials. Workshops may include topical meetings with open registration or community/group meetings with restricted attendance. Tutorials may be on any topic related to the Open Source Grid and Cluster theme of the conference. Submissions must include preferred and minimum acceptable room size, and preferred and acceptable times. An extra nominal fee may be required of attendees or the organizer to cover additional costs such as A/V and food.

All proposals should be submitted online at http://www.OpenSourceGridCluster.org/CFP.html

Questions should be sent to program@OpenSourceGridCluster.org

New home for gridengine.info

Posted by chris Wed, 05 Mar 2008 23:47:00 GMT

Some behind-the-scenes info to report:

  • We have a new server home and fresh OS: Centos 5 Linux running virtually under XenEnterprise
  • The XenEnterprise infrastructure is running on top of some really sweet storage and server hardware from Silicon Mechanics
  • xml-qstat.org (SVN repository, local demo and the website) are now hosted on the same system as gridengine.info
  • Now using Ruby 1.8.6 and the latest stable RAILS environment to host the blog
  • The OS refresh and RAILS environment update finally allowed us to upgrade to the latest Typo 5 based blogging engine
  • wiki.gridengine.info also coexists on this system along with everything else

The only glitch so far is Typo (or RAILS, not sure …) and its inability to handle article tags that contain the "." character. This means all of the SGE version based tags like "6.1" currently fail to work. Even the "tag cloud" sidebar can’t handle "." characters so that has temporarily been moved off the blog. This will either be fixed or we’ll hack the database to replace the "." with something else.

Expect access to this site to possibly be flaky as new DNS information propagates outward.

Comments welcome – currently the blog is proxied behind Apache and is just using the basic built-in webserver that comes with the RAILS environment instead of the Lighttpd-with-FastCGI that I had hand built on the old server. I’d be interested in hearing comments on how fast or slow this new server is operating. If needed we’ll put the blog under Lighttpd/FastCGI again.

Grid Engine and Apple OS X Launchd

Posted by chris Tue, 04 Mar 2008 16:20:42 GMT

This is a follow-up post relating to the new Apple framework for starting, stopping and managing persistent daemons and services called "launchd". The issue of Grid Engine interoperability with the launchd framework has already been covered in a gridengine.info Wiki article.

The new news to report is that my coworker Bill Van Etten stumbled upon the SGE environment variable "SGE_ND" and realized that it could be useful for Apple launchd integration because launchd really hates daemons that fork off ASAP upon startup. By setting the "SGE_ND" variable to true, the daemons don't fork and can be better managed by launchd.

The new launchd scripts are discussed and available for download here:
http://blog.bioteam.net/2008/03/04/apple-os-x-105-launchd-scripts-for-grid-engine/

Feel free to use these scripts or simply refer to them when customizing your own. As always, feedback and comments would be appreciated. BioTeam remains committed to making sure SGE remains an excellent choice for use on OS X based systems.

DRMAA memory leak found & fixed

Posted by chris Tue, 19 Feb 2008 18:31:49 GMT

Most casual SGE users and admins probably find little cause to monitor the Grid Engine developer mailing list. A nice little success story has played out on the list recently with a user assisting the SGE dev team in quickly discovering, isolating and fixing a memory leak that has been in the codebase since the DRMAA 1.0 API release.

A user posted this message to the developer list, showing what appears to be a memory leak in drmaa_run_job(). Andreas then replied asking if it was possible for the user to recreate the issue while running under the valgrind instrumentation framework.

In this follow-up thread, the user-provided valgrind data allowed Andreas to pinpoint the problem, file Issue #2497 with the bug tracking database and then post a preliminary patch that fixes the problem.

The patch still needs to undergo code review before it makes it officially into the Grid Engine codebase. Overall this is a nice little success story where a user was able to go the extra mile (by instrumenting under valgrind) in order to provide the developers exactly what they needed to quickly identify and fix things.

Kudos to James & Andreas.

Updated Quick Reference Guide

Posted by chris Mon, 18 Feb 2008 19:27:33 GMT

Thanks to significant assistance from Mark Olesen, there are new versions of the SGE Quick Reference Guide posted over at my employer blog site. The major changes are:

  • Errors, typos and mistakes fixed
  • A new layout, formatted for A4 paper, has been created

The version 3.0 PDFs can be found here:
http://blog.bioteam.net/2008/02/06/grid-engine-quick-reference-guide/
