grouping jobs to nodes via wildcard PEs

Posted by chris Wed, 15 Feb 2006 02:06:08 GMT

Grid Engine 6 introduced a better resource request syntax, including use of the wildcard "*" character. Some people on the SGE mailing list have reported using wildcard selectors on Parallel Environments (PEs) to enforce some really interesting grouping behavior within the Grid Engine job scheduler. In effect, this method lets you control which hostgroups parallel jobs of different sizes are dispatched to.

Take this mailing list question as an example...

...We have a cluster composed of several "subclusters". Each subcluster has
8 nodes and is connected over a first switch to the master switch.


        subcluster 1                         subcluster 2         ...
n11 n12 n13 n14 n15 n16 n17 n18      n21 n22 n23 n24 n25 n26 n27 n28
 |   |   |   |   |   |   |   |        |   |   |   |   |   |   |   |
 |   |   |   |   |   |   |   |        |   |   |   |   |   |   |   |
-------------------------------      -------------------------------
        switch 1                             switch 2
-------------------------------      -------------------------------
           |                                    |
           |                                    |
          ----------------------------------------
                        master switch
          ----------------------------------------
                              |
                              |
                       -------------
                        master node
                       -------------

One of the applications running on the cluster needs 8 nodes. We want to
configure the queue (queues?) to allocate only a full subcluster to a
job and not to spawn over to another subcluster.

Reuti provides a really slick solution ...

  1. Create a hostgroup for each subcluster
  2. Create a PE for each subcluster ('mpi_a' and 'mpi_b')
  3. Create 2 queues, each associated with one subcluster hostgroup and the matching newly created PE
  4. Submit jobs via: 'qsub -pe "mpi*" 8'

The end result is that parallel jobs will only land within one particular subcluster, keeping all network communication within a single switch (presumably the reason for the subcluster grouping in the first place).
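
The four steps can be sketched as a shell script. This is a rough outline under stated assumptions, not Reuti's actual configuration: the hostgroup names (@subcluster_a, @subcluster_b), the queue names (queue_a, queue_b), the job script (job.sh), one slot per node, and the '$fill_up' allocation rule are all made up for illustration. The PE field list follows the Grid Engine 6.0 layout and may differ on other versions, so compare it with the output of 'qconf -sp' for an existing PE before loading it.

```shell
#!/bin/sh
# Dry run by default: each command is printed instead of executed.
# Set QCONF=qconf and QSUB=qsub to run against a real cluster.
set -e
QCONF=${QCONF:-echo qconf}
QSUB=${QSUB:-echo qsub}

workdir=$(mktemp -d)

# Step 1: one hostgroup per subcluster (host names come from the diagram;
# the group names are hypothetical).
printf 'group_name @subcluster_a\nhostlist n11 n12 n13 n14 n15 n16 n17 n18\n' > "$workdir/hgrp_a"
printf 'group_name @subcluster_b\nhostlist n21 n22 n23 n24 n25 n26 n27 n28\n' > "$workdir/hgrp_b"
$QCONF -Ahgrp "$workdir/hgrp_a"
$QCONF -Ahgrp "$workdir/hgrp_b"

# Step 2: one PE per subcluster. The names share the "mpi" prefix so that
# the wildcard "mpi*" matches both. slots=8 assumes one slot per node.
for suffix in a b; do
    cat > "$workdir/pe_$suffix" <<EOF
pe_name            mpi_$suffix
slots              8
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    \$fill_up
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
EOF
    $QCONF -Ap "$workdir/pe_$suffix"
done

# Step 3: bind each queue to exactly one hostgroup and one PE.
# Assumes queue_a and queue_b already exist, for example created
# interactively with "qconf -aq queue_a".
for suffix in a b; do
    $QCONF -rattr queue hostlist "@subcluster_$suffix" "queue_$suffix"
    $QCONF -rattr queue pe_list "mpi_$suffix" "queue_$suffix"
done

# Step 4: request 8 slots from any PE whose name matches "mpi*".
# The pattern is quoted so the shell does not expand it as a file glob.
$QSUB -pe 'mpi*' 8 job.sh
```

Because each PE is attached to only one queue, and each queue to only one hostgroup, whichever PE the scheduler picks for the wildcard request confines the job to a single subcluster.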

Reuti goes on to explain how this can be used for grouping non-parallel jobs -- some reconfiguration of the queue sorting mechanism and sequence numbers will allow one subcluster to be "filled" with serial jobs before job slots are used from the other subcluster (a wise move, since this keeps the 2nd subcluster free for larger parallel jobs).
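
A minimal sketch of that serial-job ordering, assuming the same hypothetical queue names (queue_a, queue_b) and an arbitrary numbering step of 10. The helper function set_fill_order is invented for this example; the exact settings Reuti recommended are not reproduced here.

```shell
#!/bin/sh
# Dry run by default: each command is printed instead of executed.
# Set QCONF=qconf to run against a real cluster.
set -e
QCONF=${QCONF:-echo qconf}

# Assign increasing seq_no values to the queues in the order they
# should be filled (hypothetical helper; lower numbers are tried first).
set_fill_order() {
    n=10
    for q in "$@"; do
        $QCONF -mattr queue seq_no "$n" "$q"
        n=$((n + 10))
    done
}

set_fill_order queue_a queue_b

# The sequence numbers only take effect once the scheduler sorts queues
# by them: set "queue_sort_method seqno" in the scheduler configuration,
# which "qconf -msconf" opens in an editor.
```

With this ordering, serial jobs should land in queue_a until its slots run out, and only then spill into queue_b.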