Meet (and thank) Hin-Tak Leung

Posted by chris Tue, 20 Feb 2007 19:54:06 GMT

People who don't follow the development mailing list or monitor activity around the Issues Database may not be aware of the recent significant contributions of Hin-Tak Leung from the Wellcome Trust Case Control Consortium (WTCCC). People working in genomics and life science informatics have long recognized the world-class stature of the programs and people funded by the Wellcome Trust charity.

On December 2nd, 2006 Hin-Tak Leung posted a lengthy and detailed message to the development list outlining various patches, fixes and enhancements that had been implemented locally at the WTCCC. All of these user- and usage-driven changes were offered back to the SGE community. The email message supplemented patches and supporting information already uploaded into various items within the SGE Issue Tracking database.

An article on this site has already covered two of the major enhancements made to the Qmon binary -- variable width columns for the qmon job control pane as well as the addition of "qhost-like" status details within the cluster queue pane. Here are some details on the other fixes and enhancements made:

Hin-Tak was kind enough to respond to an unsolicited email message requesting some additional background and professional information.

Read on to learn more about one of the newer open-source contributors to Grid Engine ...

Hin-Tak says:

"... the work is intended to be public domain - the project is funded primarily by the Wellcome Trust (http://www.wellcome.ac.uk/), a charitable foundation. There is just a small timing issue as we have just reached a good milestone of having received all the genotyping samples back from Affymetrix and all the individual disease PIs are eager to get their hands on the preliminary results, and I cannot talk about specific scientific findings.

The Case-Control Consortium is formed by researchers from about 8 disease groups. About two years ago some UK researchers thought of exploiting the increasingly affordable genotyping technology to do whole-genome disease-association studies. To do it cost-effectively, they banded together (to use the same control groups for comparison - the national blood donor samples and the 1958 cohort - a government statistical survey of a random selection of people born in a particular week that year) to form the consortium, to share some of the logistic and technical expertise for the task. My boss, Professor David Clayton and our department head, Professor John Todd (http://www-gene.cimr.cam.ac.uk/todd/DIL.shtml) are two of the strongest driving forces of the project.

My "real" office is in the Diabetes and Inflammation Laboratory in Cambridge, and officially I belong to the WTCCC (statistical) analysis group, and have a special interest in identifying the genetic causes of Type 1 diabetes, among the 8 diseases.

The biological phase (sample collection/preparation) happened mostly before I joined the project. I was trained as a research physicist and had a few interesting years of academic research (in Cambridge), before I went off to work as an IT contractor in the commercial telecom industry, and then into driver and management software development for optical storage devices/jukeboxes for data archiving purposes (e.g. in banks, medical institutes, law enforcement, etc).

Just over a year ago, the project was recruiting for expertise required for the statistical analysis phase of the project. I talked myself into the 2-year post I am currently in, despite not having a statistics background, just based on my having a reasonable research-level mathematics background and programming skills not usually found in scientific researchers.

It has been an interesting year; I haven't managed to do as much statistics as I should be doing and I get side-tracked by computing issues easily; but on the other hand, I have been able to do some interesting and unusual things, like some low-level C code in the snpMatrix package (http://www-gene.cimr.cam.ac.uk/clayton/software/), which we have been writing over the last year for analysing genotype data, and the grid engine improvements.

In the open-source world, I am known to hang out with the ghostscript folks and the linuxprinting.org folks, but the qmon GUI change is probably more related to CXterm http://sourceforge.net/projects/cxterm/, an orphaned piece of software I "adopted" with two other people, and some commercial Java programming background in the telecom area, and having played with haploview (http://www.broad.mit.edu/mpg/haploview/) helped too."

Scheduler Policies for Job Prioritization

Posted by chris Thu, 20 Oct 2005 17:20:11 GMT

Charu has written a great 23-page Sun BluePrint™ Document entitled "Scheduler Policies for Job Prioritization in N1 Grid Engine".

Excerpts are below. If you like the blueprint, be sure to leave positive feedback on the Sun site.

... This article describes the tools and techniques for resource management that are available in the N1 Grid Engine 6 software, and explains how to use them effectively. It discusses the prioritization policies in the N1 Grid Engine 6 software, describes how they fit with the new resource aggregation methods, and makes recommendations for how to map real-life resource allocation schemes to N1 Grid configurations.

Pretty pictures explain Functional vs Sharetree scheduling

Posted by chris Fri, 30 Sep 2005 21:21:00 GMT

I saw versions of these images in Charu’s presentation slide deck a long time ago. They did a good job visually explaining the scheduling behavior differences in Grid Engine Sharetree vs Functional share policies. Now that they appear in a publicly accessible PDF file [1] I can shamelessly excerpt them:

[1] Source: http://www.sun.com/products-n-solutions/edu/whitepapers/pdf/web_services_for_HPC.pdf

Sharetree behavior

The key bit of information here is to note how the entitlement shares allowed to Project B actually dip BELOW the 50% threshold in the later stages of the time series. This is because the SGE Scheduler “remembers” past usage (see earlier in the graphic where Project B is using WAY MORE than 50% of available cluster resources) and is compensating Project A for the previous excess usage of Project B. Over time, as the graph shows, the SGE Scheduler works to bring harmony to the assigned 50-50 split of cluster resources between two projects.

Functional policy behavior

The key bit of information here concerning the functional share policy is that there is no “memory” of past usage by Project B. Early on in the time series, Project B is allowed to take advantage of “extra” available idle resources. As soon as Project A starts wanting to do work again, the Grid Engine scheduler starts enforcing the 50-50 entitlement split. Project A never gets “compensated” for letting Project B use more than its allocated share because the Grid Engine scheduler does not consider past usage within the Functional policy.

Summary

The Sharetree Policy “remembers” past usage and works to enforce the configured resource allocation entitlements as averaged over time. This may include compensating some users/groups/projects temporarily with “extra” entitlements to make up for times when other users/groups/projects were using more than their configured entitlements.

The Functional Policy will also allow “extra” entitlements if cluster resources are idle or otherwise available. It will not, however, penalize or compensate anyone for prior usage. When things are busy, the scheduler will attempt to enforce its allocation policies exactly as they have been configured.
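To make the difference concrete, here is a toy simulation in plain shell. It is only a sketch and not the real Grid Engine algorithm: the `simulate` function, the 10% floor and the "divide the excess by 10" compensation rule are all invented for illustration, and the real sharetree policy works from decayed usage rather than a simple running total. Projects A and B are each entitled to 50%; B always has work queued, while A is idle for the first few ticks.

```shell
#!/bin/sh
# Toy model, NOT the real SGE algorithm: projects A and B are each entitled to 50%.
# B always has work queued; A is idle for the first IDLE ticks, then always has work.

# simulate POLICY IDLE TICKS: print "tick share_of_B_in_percent" for every tick
simulate() {
  policy=$1
  idle=$2
  ticks=$3
  used_a=0
  used_b=0
  t=1
  while [ "$t" -le "$ticks" ]; do
    if [ "$t" -le "$idle" ]; then
      share_b=100                      # A is idle, so B may use everything
    elif [ "$policy" = functional ]; then
      share_b=50                       # no memory: the 50-50 split applies at once
    else
      # sharetree-like: shrink B's share in proportion to its accumulated excess
      share_b=$(( 50 - (used_b - used_a) / 10 ))
      if [ "$share_b" -lt 10 ]; then
        share_b=10                     # arbitrary floor for this toy model
      fi
    fi
    used_a=$(( used_a + 100 - share_b ))
    used_b=$(( used_b + share_b ))
    echo "$t $share_b"
    t=$(( t + 1 ))
  done
}

echo "functional:"
simulate functional 3 8
echo "sharetree-like:"
simulate sharetree 3 8
```

In the functional run, B drops straight to 50 once A becomes active. In the sharetree-like run, B falls below 50 and then climbs back toward it as A is compensated for B's earlier excess.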

Related article

I wrote a mini-Howto showing how to do percentage based resource allocation between different Department groups on a Grid Engine cluster. You can find it online at http://bioteam.net/dag/sge6-funct-share-dept.html. There is some additional information there about the different scheduling policies that may or may not be of some use.

FlexLM licensing and grid engine - a new HowTo draft

Posted by chris Fri, 30 Sep 2005 14:02:00 GMT

A thread about managing licenses with grid engine got Mark to post a draft version of a new HowTo document he is working on (along with a teaser mention of some other code he has in the works).

The draft document is in perl POD format. Translated html and text versions can be found here:

http://gridengine.info/files/Mark_Olesen-HowTo-Licenses-n1ge.html

http://gridengine.info/files/Mark_Olesen-HowTo-Licenses-n1ge.txt

Allowing user jobs to take over entire nodes

Posted by chris Thu, 22 Sep 2005 20:34:00 GMT

A fairly common use case in life science clustering is the following situation:

  • User has a cluster of 2-processor Apple or Linux machines

  • The default “all.q” cluster queue is in use, including the standard practice of setting “slots=2” so that no more than 2 jobs can run at any one time on a dual processor node.

The default configuration works fine but occasionally end-users want to run multithreaded applications capable of efficiently using more than one CPU.

Since these applications are often memory-, CPU- and IO-intensive, the logical request from the end-user is:

“How do I guarantee my job will get sole access to a compute node so it does not have to compete with another running job for resources?”



Update 9/28/2005

A cleaner method…

Sean Dilda posted an easier method than what I describe below. His method involves simply creating a parallel environment (PE) object called “threaded” that people can invoke when they want access to more than one CPU. The PE configuration looks like this:
$ qconf -sp threaded
pe_name           threaded
slots             1024
user_lists        NONE
xuser_lists       NONE
start_proc_args   /bin/true
stop_proc_args    /bin/true
allocation_rule   $pe_slots
control_slaves    FALSE
job_is_first_task TRUE
urgency_slots     min

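If this PE is associated with a cluster queue, users can pass “-pe threaded 2” to qsub/qrsh. Because the $pe_slots allocation rule places all requested slots on a single host, a 2-slot request fills a dual-processor node and so gives the job exclusive access to that machine.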

This approach has the advantage of being (a) easier to set up and (b) conceptually easier for people to understand.
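As a rough illustration, a job script using this PE might look like the sketch below. The job name, the OMP_NUM_THREADS export (only meaningful for OpenMP programs) and the commented-out application are assumptions, not part of Sean's post. NSLOTS is the variable Grid Engine sets to the number of slots granted to the job.

```shell
#!/bin/sh
# Sketch of a job script for the "threaded" PE; submit it with: qsub threaded_job.sh
#$ -N threaded_example
#$ -cwd
#$ -pe threaded 2

# NSLOTS is set by Grid Engine for the running job; fall back to 1 so the
# script can also be run by hand outside the scheduler.
THREADS="${NSLOTS:-1}"

# Assumption: the application is OpenMP-based and honours OMP_NUM_THREADS.
OMP_NUM_THREADS="$THREADS"
export OMP_NUM_THREADS

echo "Running with $THREADS thread(s) on $(uname -n)"
# ./my_threaded_app    # hypothetical application, replace with the real program
```

The PE still has to be attached to a cluster queue first, for example with something like “qconf -aattr queue pe_list threaded all.q”; check the qconf man page for your Grid Engine version before relying on that exact form.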




The original method is described below.



One method of doing this involves editing the Grid Engine complex to create a new User-defined resource that is both requestable and consumable.

Allowing the resource to be requestable simply means that users can request it at job submission time.

Allowing the resource to be consumable (rather than a fixed, static value) means that as jobs requesting the resource are dispatched for execution, the grid engine scheduler will decrease the count of remaining available units of that resource. When zero units of the resource are available, jobs requesting that resource can’t run and will wait in the pending list until the resource becomes available again.

In a nutshell

We are going to create a user requestable consumable resource called “greedy”. The value of greedy will be set to the integer value of “2” which equals the CPU and slot count in our compute nodes.

The “greedy” resource will be associated with the “all.q” cluster queue (rather than being a host-specific or global resource).

Users can then submit jobs using the additional syntax “-hard -l greedy=2” during job submission. The end result is that when the job is dispatched to a compute node, the value of the greedy resource on that queue instance will drop to 0. Because jobs that do not request greedy explicitly still consume its default value of 1 (the “default” column of the complex definition shown under Implementation), this blocks any additional jobs from flowing to that machine even though a job slot is available.
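The bookkeeping can be pictured with a small shell sketch. This is a toy model of a single queue instance, not Grid Engine code: the `try_dispatch` function and both counters are made up for illustration, and the request of 1 for normal jobs mirrors the default value in the complex definition.

```shell
#!/bin/sh
# Toy bookkeeping for ONE queue instance; illustrative only, not SGE code.
greedy_left=2    # complex_values greedy=2
slots_left=2     # slots=2

# try_dispatch REQUEST: REQUEST is the job's greedy request (1 is the default)
try_dispatch() {
  if [ "$slots_left" -ge 1 ] && [ "$1" -le "$greedy_left" ]; then
    slots_left=$(( slots_left - 1 ))
    greedy_left=$(( greedy_left - $1 ))
    echo "dispatched: greedy=$1 (slots left $slots_left, greedy left $greedy_left)"
  else
    echo "pending:    greedy=$1 (slots left $slots_left, greedy left $greedy_left)"
    return 1
  fi
}

try_dispatch 2            # greedy job: uses 1 slot, greedy drops to 0
try_dispatch 1 || true    # normal job: stays pending although a slot is free
```

The second call stays pending because no greedy units remain, which is the behavior that keeps the node reserved for the first job.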

Implementation

First we have to create the new user-defined resource within the Grid Engine complex:

(1) Run the command “qconf -mc”; when the editor opens insert a new line into the complex record that looks like this:

greedy gr INT <= YES YES 1 0

(2) Now verify that the new resource exists via the “qconf -sc” command:

#name               shortcut   type        relop requestable consumable default  urgency 
#----------------------------------------------------------------------------------------
arch                a          RESTRING    ==    YES         NO         NONE     0
calendar            c          RESTRING    ==    YES         NO         NONE     0
cpu                 cpu        DOUBLE      >=    YES         NO         0        0
greedy              gr         INT         <=    YES         YES        1        0
...

(3) Edit the configuration for the cluster queue (“all.q” in this example) by issuing the command “qconf -mq all.q”. When the editor opens, find the line that says “complex_values NONE” and replace it with “complex_values greedy=2”. Verify that this change has been accepted by running the command “qconf -sq all.q” and observing the newly made change. You can also verify that this resource is now associated with all the queue instances of the cluster queue by running the command “qstat -f -F | grep greedy”.
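For anyone who prefers to script these steps instead of using the editor, the sketch below shows one possible non-interactive route. It is an assumption-laden outline rather than a tested recipe: the helper function names are made up, it relies on “qconf -Mc” (load the complex from a file) and “qconf -mattr” (modify a single attribute), and you should check the qconf man page for how -mattr treats any complex_values entries already present on your queue. Nothing is changed if qconf is not found.

```shell
#!/bin/sh
# Sketch only: scripted version of steps (1)-(3). Review before running as an SGE admin.

# The complex entry from step (1), as a single line
greedy_line='greedy gr INT <= YES YES 1 0'

# add_greedy_complex: dump the complex, append the entry if missing, load it back
add_greedy_complex() {
  tmp=$(mktemp) || return 1
  qconf -sc > "$tmp" || return 1
  if ! grep -q '^greedy[[:space:]]' "$tmp"; then
    printf '%s\n' "$greedy_line" >> "$tmp"
    qconf -Mc "$tmp"
  fi
  rm -f "$tmp"
}

# attach_greedy QUEUE: set greedy=2 in complex_values of a cluster queue (step 3)
attach_greedy() {
  qconf -mattr queue complex_values greedy=2 "$1"
}

if command -v qconf >/dev/null 2>&1; then
  add_greedy_complex && attach_greedy all.q
  qconf -sc | grep '^greedy'      # step (2): confirm the complex entry
  qstat -f -F | grep greedy       # confirm the per-queue-instance values
else
  echo "qconf not found; nothing changed" >&2
fi
```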

Testing

The example below shows that jobs submitted with “qsub -hard -l greedy=2” have exclusive access to the compute nodes. Even though there are job slots available, the scheduler is still holding jobs in the pending list while the greedy jobs are running:

inquiry:~ root# qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@inquiry.bioteam.net      BIP   1/2       0.31     darwin        
    488 0.55500 Sleeper    www          r     09/22/2005 10:34:38     1        
----------------------------------------------------------------------------
all.q@node001.cluster.private  BIP   1/2       0.09     darwin        
    487 0.55500 Sleeper    www          r     09/22/2005 10:34:38     1        
----------------------------------------------------------------------------
all.q@node002.cluster.private  BIP   0/2       -NA-     darwin        au
----------------------------------------------------------------------------
all.q@node003.cluster.private  BIP   0/2       -NA-     darwin        au

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
    489 0.55500 Sleeper    www          qw    09/22/2005 10:32:35     1        
    490 0.55500 Sleeper    www          qw    09/22/2005 10:32:36     1        

As a further check, run “qstat -j” on one of the pending jobs:

...
scheduling info:
queue instance "all.q@node002.cluster.private" dropped because it is temporarily not available
queue instance "all.q@node003.cluster.private" dropped because it is temporarily not available
(-l greedy=2) cannot run in queue instance "all.q@node001.cluster.private" because it offers only qc:greedy=0.000000
(-l greedy=2) cannot run in queue instance "all.q@inquiry.bioteam.net" because it offers only qc:greedy=0.000000
...

Submitting and scheduling normal jobs (those that do not request the greedy resource) happens as expected. Here a bunch of regularly submitted sleeper jobs occupy all available job slots:

inquiry:~/greedtest www$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@inquiry.bioteam.net      BIP   2/2       0.01     darwin        
    491 0.55500 Sleeper    www          r     09/22/2005 10:59:02     1        
    493 0.55500 Sleeper    www          r     09/22/2005 10:59:04     1        
----------------------------------------------------------------------------
all.q@node001.cluster.private  BIP   2/2       0.11     darwin        
    492 0.55500 Sleeper    www          r     09/22/2005 10:59:02     1        
    494 0.55500 Sleeper    www          r     09/22/2005 10:59:04     1        
----------------------------------------------------------------------------
all.q@node002.cluster.private  BIP   0/2       -NA-     darwin        au
----------------------------------------------------------------------------
all.q@node003.cluster.private  BIP   0/2       -NA-     darwin        au

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
    495 0.55500 Sleeper    www          qw    09/22/2005 10:59:02     1        
    496 0.55500 Sleeper    www          qw    09/22/2005 10:59:03     1