Grid Engine 6.1 Advance Reservation Snapshot 3

Posted by chris Thu, 08 Nov 2007 20:51:19 GMT

From the announcement:

... This snapshot is based on the Grid Engine 6.1 Update 2 release enhanced with the Advance Reservation (AR) feature. AR will be included in the next major update release of Grid Engine. The functionality allows users or administrators to reserve particular resources for future use. These reserved resources are only available for special jobs and the scheduler ensures the availability of the resource when the start time is reached.

The design of the Grid Engine Advance Reservation can be found here: AdvanceReservationSpecification.html

Anyone interested in testing this snapshot should note that it requires a clean install: there is no upgrade procedure, and snapshot 3 is incompatible with previous snapshot releases.

Also, it's called Advance Reservation, not Advanced Reservation, as 2007 Workshop attendees are well aware. DanT has some interesting things to say about this snapshot over on his blog, including details on what other features went into this release.

Help shape Advance Reservation functionality for SGE-6.2

Posted by chris Fri, 12 Jan 2007 22:18:57 GMT

If you are at all interested in the topic of Advance Reservation scheduling within Grid Engine, then please take the time to look at (and comment upon) the following draft functional specification document:

Functional Specification Document for 6.2 Advance Reservation

Comments and feedback should be sent to the Developer mailing list. A thread has already been started.

Advance Reservation plugin for Grid Engine

Posted by chris Wed, 25 Oct 2006 21:54:22 GMT

Yoshio Tanaka posts the following:

... We are pleased to announce that advance-reservation plugin module
called PluS version 1.0.0 RC 1 is now available for download at the
PluS home page at:
  http://www.g-lambda.net/plus/ .

PluS (Plug-in Advance Reservation Manager for Torque and Grid Engine)
adds an advance-reservation function to Torque and Grid Engine.
For SGE, one of the following operations will be performed based on
the startup option.

(1) SGE queue base version
  - The SGE scheduler is not replaced, and the reservation function is
    realized simply by managing the reservation queues.

(2) SGE self scheduling version
  - The original SGE scheduler is replaced by the PluS SGE scheduler
    which realizes the reservation management function and the job
    scheduling function.

...

The package is released under the Apache 2 License. It appears to have been developed and tested mainly on the following configuration: Linux 2.6.x, Intel x86, glibc 2.3.3, SGE 6.0u8.

The HTML version of the PluS Manual is online here:
http://www.g-lambda.net/plus/wp-content/uploads/2006/10/manual.html.

The http://www.g-lambda.net/plus/ site contains a link to a PDF of an IEEE conference paper covering the system in more technical detail.

Resource Reservation vs Backfilling

Posted by chris Mon, 24 Jul 2006 20:53:00 GMT

A list message posted by Andreas back in June has a link to an overlooked yet quite interesting Grid Engine Design document. It includes the following definition of terms:

   Resource Reservation 
      A job-specific reservation created by the scheduler for pending 
      jobs. During the reservation the resources are blocked for lower 
      priority jobs.

   Backfilling
      The process of starting jobs of the job priority list despite of 
      higher priority pending jobs that might own a future reservation 
      with the same resource. Thus backfilling has a meaning only in the 
      context of Resource Reservation or Advance Reservation.

   Advance Reservation
      A reservation (possibly independent of a particular job) that can 
      be requested by a user or administrator and gets created by the 
      scheduler. The reservation causes the associated resources be blocked
      for other jobs.

   Preemption
      The process of interrupting job executions in order to free resources
      for particular jobs.

… good terms to know, especially when reading through the SGE docs and mailing list messages. The entire document makes for interesting reading.

Resource reservation prevents parallel job starvation

Posted by chris Wed, 31 May 2006 13:20:00 GMT

In a recent mailing list post, Rui Ramos describes a commonly encountered resource allocation problem:

… I’m making some tests and if i have queue that’s full and have this list of jobs waiting

jobA 4 slots
jobB 1 slot
jobB.1 1 slot
jobB.2 1 slot

Let’s say that the jobs of type B are very quick and a user submits 2000 of them. On the other hand, we have a job that requires 4 slots. But each time we have a free slot it starts a job of type B. following this the jobA only executes when all jobB are finished. Unless the GridEngine can make some kind of slot reservation for jobs with higher priority ? Is this native in the N1GE scheduler, do we need to set it up ?

For people with clusters that run a mix of serial and parallel jobs, this can be a common problem. The serial jobs zip in and out of the execution slots fast enough that there are never enough free slots at any given scheduling interval to satisfy the demands of pending parallel jobs that need multiple slots in order to execute.

The end result is that the larger parallel jobs languish or “starve” in the pending list for very long periods of time.

The mailing list thread contains some useful replies:

Reuti provides a solution:

what you need is “resource reservation”. Just turn on the reservation in the scheduler “qconf -msconf” by setting “max_reservation 20” or an appropriate value and submit the parallel job with “-R y”.
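
For anyone submitting from a script, Reuti's advice might translate into something like the sketch below. It only assembles the qsub argument list and prints it; nothing is submitted. The parallel environment name "mpi" and the script name "jobA.sh" are made up for illustration, and the scheduler still needs max_reservation set by an administrator through "qconf -msconf" before "-R y" has any effect.

```python
import shlex


def reserved_qsub_command(script, pe_name, slots):
    """Build a qsub argument list that asks the scheduler for a reservation.

    "-R y" requests a reservation for this job; "-pe NAME N" requests N slots
    in a parallel environment. pe_name and script are whatever your site uses.
    """
    return ["qsub", "-R", "y", "-pe", pe_name, str(slots), script]


# Hypothetical values: a 4-slot job in a parallel environment called "mpi".
cmd = reserved_qsub_command("jobA.sh", "mpi", 4)
print(shlex.join(cmd))  # prints: qsub -R y -pe mpi 4 jobA.sh
```

Passing the list to subprocess.run() on a submit host would do the actual submission; that step is left out here on purpose.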

Andreas adds a link to the resource reservation specification document, which covers Rui’s problem in more detail under the heading “large parallel job starvation problem”:

   ... Resource reservation can be used to guarantee resources are dedicated 
   to jobs in jobs priority order. A good example which helps to comprehend 
   the problem solved with resource reservation/backfilling is the so-called 
   "large parallel job starvation problem". In this scenario there is one 
   high priority pending job (possibly parallel) A that requires a larger quota 
   of a particular resource and a stream of smaller and lower priority jobs B(i) 
   requiring a smaller quota of the same resource.
 
   Without resource reservation an assignment for A can not be guaranteed
   assumed the stream of B(i) jobs does not stop - even if job A actually
   has higher priority than the B(i) jobs:

        A      
        |                     
    +---+----+--------+--------+--------+--------+--------+   +----------+
    |  B(0)  | B(2)   | B(4)   | B(6)   | B(8)   | B(10)  |   |          |
    +---+----+---+----+---+----+---+----+---+----+---+----+---+    A     |
        | B(1)   | B(3)   | B(5)   | B(7)   | B(9)   | B(11)  |          |
        +--------+--------+--------+--------+--------+--------+----------+-->
        
    
   With resource reservation job A gets a reservation that blocks lower 
   priority B(i) jobs and thus guarantees resources will be available for
   A as soon as possible:

        A
        |                     
    +---+----+----------+--------+
    |  B(0)  |          |  B(2)  |   ...
    +---+----+    A     +--------+--------+
        |    |          |  B(1)  |  B(3)  |  ...
        +----+----------+--------+--------+------------------------------->
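
The two timelines above can be reproduced with a toy simulation. This is a rough sketch and not how the Grid Engine scheduler is implemented: time advances in whole ticks, the queue starts full of B jobs that finish one tick apart, and the slot count, job length and number of B jobs are arbitrary values picked for illustration. With reserve=False every freed slot goes straight to the next B job. With reserve=True job A gets a reserved start time, and a B job may only be backfilled into a free slot if it finishes by then.

```python
def simulate(reserve, slots=4, a_slots=4, b_len=2, b_count=40):
    """Return the tick at which the high priority job A starts.

    All numbers are illustrative. Slot i is initially busy until tick i + 1,
    so the running B jobs never finish at the same moment.
    """
    busy_until = [i + 1 for i in range(slots)]
    b_pending = b_count
    t = 0
    while True:
        free = [i for i in range(slots) if busy_until[i] <= t]
        if len(free) >= a_slots:
            return t  # enough slots are free at once, so A starts
        if reserve:
            # Earliest tick at which a_slots slots will be free if nothing
            # new delays them: this is A's reserved start time.
            a_start = sorted(busy_until)[a_slots - 1]
        for i in free:
            if b_pending == 0:
                break
            if reserve and t + b_len > a_start:
                break  # B would run into A's reservation: keep the slot idle
            busy_until[i] = t + b_len  # start (or backfill) one B job
            b_pending -= 1
        t += 1


print("without reservation, A starts at tick", simulate(reserve=False))
print("with reservation,    A starts at tick", simulate(reserve=True))
```

With these defaults A starts at tick 23 without reservation (only after the whole stream of B jobs has drained) and at tick 4 with it. Raising b_count pushes the first number out further while the second stays put, which is the starvation effect Rui describes.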