New user contributed accounting script

Posted by chris Wed, 16 May 2007 02:05:35 GMT

A new "pull statistics from the SGE accounting log file" script has been posted to the SGE community. Olivier Blondel took Joe Landman's "usage.pl" script and modified it to suit his own needs. The script can be found embedded inline with Olivier's post to the users mailing list.

Simple perl reporting tool for SGE accounting data

Posted by chris Wed, 11 Oct 2006 12:55:19 GMT

Joe at Scalable Informatics is offering up a "quick -n- simple" reporting script for Grid Engine accounting and usage data.

Usage examples:

[landman@minicc ~]$ ./usage.pl
Total usage: (in units of second(s))
        wallclock  :       46733.000 second(s)
        user time  :        1600.000 second(s) [3.42%]
        system time:          17.000 second(s) [0.04%]
        cpu time   :       70379.000 second(s) [150.60%]

user            wallclock       user time       system time     cpu time        memory          percent of total time
landman         46733.000       1600.000        17.000          70379.000       0.000           100.000
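
For readers who just want the gist of what such a report computes, here is a minimal shell/awk sketch (not Joe's usage.pl or Olivier's variant) that totals wallclock, user, system and cpu time per user. It assumes the colon-separated SGE 6.x accounting(5) record layout, with the owner in field 4, ru_wallclock/ru_utime/ru_stime in fields 14-16 and cpu in field 37; check those positions against the accounting(5) man page for your release. The function name sge_usage_summary is hypothetical.

```shell
#!/bin/sh
# Minimal per-user usage summary from an SGE accounting file.
# ASSUMPTION: colon-separated SGE 6.x accounting(5) records, where
#   $4 = owner, $14 = ru_wallclock, $15 = ru_utime,
#   $16 = ru_stime, $37 = cpu (verify for your release).
sge_usage_summary() {
  # $1: path to the accounting file
  awk -F: '
    /^#/ || NF < 37 { next }          # skip comment lines and short records
    {
      wall[$4] += $14; utime[$4] += $15
      stime[$4] += $16; cpu[$4] += $37
      total_wall += $14
    }
    END {
      printf "%-12s %12s %12s %12s %12s %8s\n",
             "user", "wallclock", "user time", "system time", "cpu time", "%wall"
      for (u in wall)                 # user order is arbitrary
        printf "%-12s %12.3f %12.3f %12.3f %12.3f %8.2f\n",
               u, wall[u], utime[u], stime[u], cpu[u],
               (total_wall > 0 ? 100 * wall[u] / total_wall : 0)
    }' "$1"
}
```

For the default cell this would be run as sge_usage_summary $SGE_ROOT/default/common/accounting. The last column is each user's share of the total wallclock time.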

"job dropped because of user limitations"

Posted by chris Thu, 09 Mar 2006 19:03:58 GMT

Consider this snippet more search engine fodder for people web searching on particular error messages.

Recently a user asked on the mailing list about encountering job submission rejection messages that say:

job dropped because of user limitation

That particular rejection message is tied to 2 different parameters that can be set in the Grid Engine global cluster configuration (edited with 'qconf -mconf'):

max_u_jobs
The number of active (not finished) jobs which each Grid Engine user can have in the system simultaneously is controlled by this parameter. A value greater than 0 defines the limit. The default value 0 means "unlimited". If the max_u_jobs limit is exceeded by a job submission then the submission command exits with exit status 25 and an appropriate error message.
max_jobs
The number of active (not finished) jobs simultaneously allowed in Grid Engine is controlled by this parameter. A value greater than 0 defines the limit. The default value 0 means "unlimited". If the max_jobs limit is exceeded by a job submission then the submission command exits with exit status 25 and an appropriate error message.
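
Since both limits show up to the submitter as exit status 25, a submission script can detect that case and back off instead of failing outright. A minimal sketch, assuming a POSIX shell; submit_retry is a hypothetical helper, and the retry count and delay are arbitrary choices left to the caller:

```shell
# Hypothetical helper: retry a submission while it is rejected with
# exit status 25 (the status documented for max_u_jobs / max_jobs).
# Usage: submit_retry MAX_TRIES DELAY_SECONDS command [args...]
submit_retry() {
  max_tries=$1 delay=$2
  shift 2
  try=1
  while :; do
    "$@"
    status=$?
    # Success, any other error, or running out of tries ends the loop.
    if [ "$status" -ne 25 ] || [ "$try" -ge "$max_tries" ]; then
      return "$status"
    fi
    echo "attempt $try of $max_tries rejected (exit 25); retrying in ${delay}s" >&2
    try=$((try + 1))
    sleep "$delay"
  done
}
```

A call would look like submit_retry 10 60 qsub job.sh, where job.sh stands for whatever script you normally submit. A rejected submission never entered the system, so simply retrying it later does not create duplicate jobs.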

Commentary: There are certainly use cases for which these parameters are the best solution but ... before using either of them, consider if one of the SGE resource allocation policy mechanisms can accomplish the same goals. Hard coding global constraints on jobs can negatively affect flexibility and overall system utilization.

Reuti on Gaussian G03-D.01 Integration

Posted by chris Thu, 16 Feb 2006 22:00:41 GMT

This is another one of those short blog posts that serve mostly as index fodder for search engines. Hopefully this will be a shortcut for someone searching on Gaussian SGE integration. The SGE mailing lists are generally not indexed well by the various net crawlers.

Reuti has posted some comments and code snippets on Gaussian G03-D.01 Integration. The full message can be read here:

http://gridengine.sunsource.net/servlets/ReadMsg?listName=users&msgNo=14600

grouping jobs to nodes via wildcard PEs

Posted by chris Wed, 15 Feb 2006 02:06:08 GMT

Grid Engine 6 introduced a better resource request syntax, including use of the wildcard "*" character. Some people on the SGE mailing list have reported using wildcard selectors on Parallel Environments to enforce some really interesting grouping behavior within the Grid Engine job scheduler. In effect, this method lets one control which hostgroups parallel jobs of different sizes will be dispatched to.

Take this mailing list question as an example...

...We have a cluster composed of several "subclusters". Each subcluster has
8 nodes and is connected over a first switch to the master switch.


        subcluster 1                         subcluster 2         ...
n11 n12 n13 n14 n15 n16 n17 n18      n21 n22 n23 n24 n25 n26 n27 n28
 |   |   |   |   |   |   |   |        |   |   |   |   |   |   |   |
 |   |   |   |   |   |   |   |        |   |   |   |   |   |   |   |
-------------------------------      -------------------------------
        switch 1                             switch 2
-------------------------------      -------------------------------
           |                                    |
           |                                    |
          ----------------------------------------
                        master switch
          ----------------------------------------
                              |
                              |
                       -------------
                        master node
                       -------------

One of the applications running on the cluster needs 8 nodes. We want to
configure the queue (queues?) to allocate only a full subcluster to a
job and not to spawn over to another subcluster.

Reuti provides a really slick solution ...

  1. Create a hostgroup for each subcluster
  2. Create a PE for each subcluster ('mpi_a' and 'mpi_b')
  3. Create 2 queues, each associated with one subcluster hostgroup and one of the newly created PEs
  4. Submit jobs via: 'qsub -pe "mpi*" 8'
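
Step 1 can be scripted. The sketch below prints a hostgroup definition in the group_name/hostlist format that qconf reads from files. It assumes the n<subcluster><node> naming from the diagram and 8 nodes per subcluster; subcluster_hostgroup and the @subclusterN group names are hypothetical.

```shell
# Hypothetical helper: print a hostgroup definition for one subcluster,
# assuming the node naming from the diagram (n<subcluster><node>, 8 nodes).
subcluster_hostgroup() {
  sc=$1                          # subcluster number, e.g. 1 -> n11 ... n18
  hosts=""
  node=1
  while [ "$node" -le 8 ]; do
    hosts="$hosts n$sc$node"
    node=$((node + 1))
  done
  # ${hosts# } drops the leading space left by the loop
  printf 'group_name @subcluster%s\nhostlist %s\n' "$sc" "${hosts# }"
}
```

Usage would be along the lines of subcluster_hostgroup 1 > subcluster1.hgrp followed by qconf -Ahgrp subcluster1.hgrp. When submitting, quoting the wildcard as in qsub -pe 'mpi*' 8 job.sh keeps the local shell from trying to expand mpi* against file names in the current directory.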

The end result is that parallel jobs will only land within one particular subcluster, keeping all network communication within a single switch (presumably the reason for the subcluster grouping in the first place).

Reuti goes on to explain how this can be used for grouping non-parallel jobs -- some reconfiguration of the queue sorting mechanism and sequence numbers will allow one subcluster to be "filled" with serial jobs before job slots are used from the other subcluster (a wise move, since this keeps the 2nd subcluster free for larger parallel jobs).
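
The fill order described above relies on the scheduler sorting queues by sequence number (queue_sort_method set to seqno via 'qconf -msconf') rather than by load. As a dry-run sketch, the hypothetical helper below prints one 'qconf -mattr' command per queue, giving queues listed earlier a lower seq_no so they are considered first. The queue names used in the usage line are placeholders.

```shell
# Hypothetical dry-run helper: print qconf commands that give each queue
# an increasing seq_no, in the order the queues should be filled.
# Only meaningful when the scheduler's queue_sort_method is "seqno";
# with that setting a lower seq_no is considered first.
seqno_commands() {
  seq=10
  for q in "$@"; do
    echo "qconf -mattr queue seq_no $seq $q"
    seq=$((seq + 10))
  done
}
```

For example, seqno_commands subcluster1.q subcluster2.q prints the two commands; review them and pipe the output to sh to apply. Steps of 10 leave room to slot further queues in between later.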

sorting qstat output

Posted by chris Sun, 12 Feb 2006 21:07:42 GMT

A user recently asked the mailing list for suggestions on sorting the full output of qstat by job start time.

Reuti replied back with a link to his most excellent script, a bash script called "status" that makes heavy use of awk under the hood. The script works with both SGE 5.3 and 6.x versions of qstat.

The script is hosted on the download section of the SGE project website:
http://gridengine.sunsource.net/servlets/ProjectDocumentView?documentID=8&showInfo=true

After downloading the script, usage is trivial. To sort output by job start time one would do:

 ./status -s time -a
Running jobs:
job-ID  # name                      owner      start time          running in
-----------------------------------------------------------------------------
   561  1 Job7458                   www        01/08/2006 18:59:05 all.q      (stalled)
   653  1 A11510113941883           www        02/08/2006 09:13:58 all.q      
   657  1 A11541113941889           www        02/08/2006 09:14:54 all.q      

Waiting jobs:
job-ID  # name                      owner      submit time        
------------------------------------------------------------------
   562  1 Job7458.cleanup           www        01/08/2006 17:38:14 (hold)
   654  1 btpymol                   www        02/08/2006 09:13:59 (Error)
   654  1 btpymol                   www        02/08/2006 09:13:59 (Error)
   655  1 merge                     www        02/08/2006 09:13:59 (hold)
   656  1 cleanup                   www        02/08/2006 09:13:59 (hold)
   658  1 btrasmol                  www        02/08/2006 09:14:55 (Error)
   658  1 btrasmol                  www        02/08/2006 09:14:55 (Error)
   658  1 btrasmol                  www        02/08/2006 09:14:55 (Error)
   659  1 merge                     www        02/08/2006 09:14:55 (hold)
   660  1 cleanup                   www        02/08/2006 09:14:55 (hold)
   407  1 impossibleJob             www        11/28/2005 09:58:42 
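
If the script is not at hand, plain qstat output can be sorted with a few lines of awk. This is a rough sketch, not Reuti's script. It assumes the SGE 6.x default qstat layout, where fields 6 and 7 hold the MM/DD/YYYY date and HH:MM:SS time of the "submit/start at" column and the first two lines are headers. That column shows the start time for running jobs but the submit time for pending ones, so the two are interleaved. The function name is hypothetical.

```shell
# Rough alternative to Reuti's "status" script (not his code): sort plain
# qstat output by its date/time column.
# ASSUMPTION: field 6 = MM/DD/YYYY, field 7 = HH:MM:SS, two header lines.
sort_qstat_by_time() {
  awk 'NR > 2 && NF >= 7 {
         split($6, d, "/")                # d[1]=MM, d[2]=DD, d[3]=YYYY
         print d[3] d[1] d[2] $7 "\t" $0  # prepend a sortable YYYYMMDD + time key
       }' | LC_ALL=C sort | cut -f2-      # sort on the key, then strip it
}
```

Typical use would be qstat | sort_qstat_by_time. Reordering the date into YYYYMMDD makes a plain lexical sort chronological, which is why no date parsing is needed.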

Easy setup of equal user fairshare policy

Posted by chris Tue, 17 Jan 2006 15:20:00 GMT

Reuti posted this link again on the users list and it really caught my eye. It really does represent the fastest/easiest way for an SGE admin to set up a basic resource allocation policy that shares resources equally (and automatically) among all users.

The link:
http://gridengine.sunsource.net/servlets/ReadMsg?listName=users&msgNo=8319

It boils down to 2 simple configuration actions:

  1. Make 2 changes in the main SGE configuration ('qconf -mconf'):
    • enforce_user auto
    • auto_user_fshare 100

  2. Make 1 change in the SGE scheduler configuration ('qconf -msconf'):
    • weight_tickets_functional 10000
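
The same settings can be applied without an interactive editor by dumping a configuration to a file, editing it, and loading it back. A minimal sketch of the editing step; set_conf_param is a hypothetical helper that replaces (or appends) a "parameter value" line in such a dump:

```shell
# Hypothetical helper: set "param value" in a text dump of a Grid Engine
# configuration, replacing an existing line or appending a new one.
set_conf_param() {
  file=$1 param=$2 value=$3
  tmp=$(mktemp) || return 1
  awk -v p="$param" -v v="$value" '
    $1 == p { print p, v; found = 1; next }   # replace the existing setting
    { print }                                 # keep every other line as-is
    END { if (!found) print p, v }            # append the parameter if missing
  ' "$file" > "$tmp" && mv "$tmp" "$file"
}
```

For the scheduler part this would be qconf -ssconf > sched.conf, then set_conf_param sched.conf weight_tickets_functional 10000, then qconf -Msconf sched.conf. The global configuration can be handled similarly with qconf -sconf and qconf -Mconf, but check qconf(1) for your release first, since the file-based configuration options have their own rules about file naming.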

Setting queue level nice values

Posted by chris Tue, 01 Nov 2005 00:45:00 GMT

A user recently asked:

... I know it is possible to submit the jobs with nice in each submit sge script but is it possible to fix a nice value by queue ?

Reuti replied with the quick & simple procedure:

  1. Use "qconf -mq" with the name of the cluster queue to set that queue's "priority" parameter to the nice value you wish to use
  2. Use "qconf -msconf" to make sure "reprioritize_interval=0:0:0"
  3. Use "qconf -mconf" to make sure "reprioritize=0"

A quick test verified the procedure: with priority=15 set on the queue, the test.sh job script was indeed running at an altered nice level:

USER  PID   TIME    UID  PPID CPU   NI  COMMAND
dag   6843  7:12PM  501  6679  0    31  - sge_shepherd-5 -bg
dag   6844  7:12PM  501  6843  0    16  -sh /opt/sge/default/spool/chrisdag/job_scripts/5
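
Step 1 can also be done non-interactively with 'qconf -mattr'. The dry-run sketch below (queue_priority_cmd is a hypothetical name) checks that the value is an integer within the -20 to +20 range that queue_conf(5) describes for the priority attribute, then prints the command to run rather than running it:

```shell
# Hypothetical dry-run helper: validate a nice value for a cluster queue's
# "priority" attribute and print the qconf command that would set it.
queue_priority_cmd() {
  queue=$1 prio=$2
  digits=${prio#-}                   # allow a single leading minus sign
  case $digits in
    ''|*[!0-9]*) echo "not an integer: $prio" >&2; return 1 ;;
  esac
  if [ "$prio" -lt -20 ] || [ "$prio" -gt 20 ]; then
    echo "priority must be between -20 and 20: $prio" >&2
    return 1
  fi
  echo "qconf -mattr queue priority $prio $queue"
}
```

For example, queue_priority_cmd all.q 15 | sh would apply the setting used in the test above; while a job runs, ps -o user,pid,ni,comm on the execution host shows whether the nice value took effect.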

Removing empty job output/error files automatically

Posted by chris Wed, 19 Oct 2005 22:03:00 GMT

In a thread dealing with some DRMAA issues, Reuti posted a quick little shell script that can be used as an epilog. Grid Engine supports "prolog" and "epilog" actions at the cluster queue level. These hooks are used to run scripts or perform an action before ('prolog') or after ('epilog') a job is run.

The shell script checks the Grid Engine standard output (STDOUT) and standard error (STDERR) output files and deletes any that are zero in size (empty). This reduces clutter in job output directories while also preserving any STDOUT/STDERR files that actually contain information.


#!/bin/sh

## Delete the STDOUT and STDERR files (.o and .e) if they are empty
##  ( we do not want to delete non-empty files, they may contain useful
##    troubleshooting or debug information ... )
##

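[ -r "$SGE_STDOUT_PATH" ] && [ -f "$SGE_STDOUT_PATH" ] && [ ! -s "$SGE_STDOUT_PATH" ] && rm -f "$SGE_STDOUT_PATH"
[ -r "$SGE_STDERR_PATH" ] && [ -f "$SGE_STDERR_PATH" ] && [ ! -s "$SGE_STDERR_PATH" ] && rm -f "$SGE_STDERR_PATH"

## Exit 0 explicitly; otherwise a failed test above (e.g. a non-empty .e file)
## becomes the script's exit status, which Grid Engine may treat as an
## epilog failure.
exit 0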

In action ...

After saving this script and adding it to the epilog parameter of a cluster queue configuration, the $SGE_ROOT/examples/jobs/simple.sh script was submitted (all it does is print a datestamp to STDOUT before and after sleeping for 20 seconds), and the following was observed:

While the job was running:

bioadmin@b7:~/test> ls -l
total 8
-rwxr-xr-x  1 bioadmin bioadmin 1529 2005-10-19 17:37 simple.sh
-rw-r--r--  1 bioadmin bioadmin    0 2005-10-19 17:37 simple.sh.e2
-rw-r--r--  1 bioadmin bioadmin   29 2005-10-19 17:37 simple.sh.o2
bioadmin@b7:~/test> qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@b7.training.bioteam.net  BIP   1/4       0.01     lx24-x86      
      2 0.55500 simple.sh  bioadmin     r     10/19/2005 17:37:29     1

And after the job completed:

bioadmin@b7:~/test> ls -l
total 8
-rwxr-xr-x  1 bioadmin bioadmin 1529 2005-10-19 17:37 simple.sh
-rw-r--r--  1 bioadmin bioadmin   58 2005-10-19 17:37 simple.sh.o2

No muss, no fuss. The empty .e STDERR file was blown away automatically after the job completed. Any wiki-fiddlers reading this post may want to add this code to the Snippets section of the wiki.
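
To try the logic before installing it as an epilog, it can be wrapped in a function that takes file paths as arguments and falls back to $SGE_STDOUT_PATH and $SGE_STDERR_PATH when called without any. This is a sketch of a variant, not Reuti's original; remove_empty_outputs is a hypothetical name.

```shell
#!/bin/sh
# Variant of the epilog above, written as a function so the logic can be
# exercised outside Grid Engine. With no arguments it checks the paths SGE
# exports to the epilog; otherwise it checks the paths given.
remove_empty_outputs() {
  if [ "$#" -eq 0 ]; then
    set -- "$SGE_STDOUT_PATH" "$SGE_STDERR_PATH"
  fi
  for f in "$@"; do
    # Only readable, regular, zero-size files are removed; files with
    # content, unset paths and already-removed files are left alone.
    if [ -n "$f" ] && [ -r "$f" ] && [ -f "$f" ] && [ ! -s "$f" ]; then
      rm -f "$f"
    fi
  done
}

# When this file is used as the epilog itself, uncomment the call:
# remove_empty_outputs
# exit 0
```

Once saved (say as /opt/sge/scripts/rm_empty_output.sh, a hypothetical path) with the final call uncommented, it could be attached to a queue with qconf -mattr queue epilog /opt/sge/scripts/rm_empty_output.sh all.q. The loop also copes with jobs submitted with "-j y", where both variables point at the same file.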