Berkeley spooling vs. "classic" spooling

Posted by chris Wed, 08 Apr 2009 13:33:26 GMT

If you administer a large SGE system, or a small one with very high job throughput, then you already know how important the spooling subsystem is and how much rides on the choice of spooling method. I've long been a public fan of classic spooling on clusters with fewer than 100 nodes, and classic tends to be the standard spooling method we choose for new cluster deployments in our industry.

An interesting recent thread on the SGE mailing list covers some important news about spooling, including tests that the SGE team has run internally and a published benchmark kit so you can do your own testing.

Excerpt:

A note on BDB spooling results. Strictly spoken we are comparing apples and oranges when comparing the SGE classic spooling performance with BDB spooling. While BDB opens the database with the O_DSYNC flag to ensure maximum data integrity in case of outages we don't use that flag with SGE classic spooling. A quick check has shown that you even would not want to wait for the end of the first test when the SGE classic spooling code would use the O_DSYNC flag for file operations.

The key messages from the tests results below are:

  • severe bottlenecks with "classic" NFS spooling
  • impressive performance improvements with new NFS server and client systems running Solaris 10 and ZFS
  • local classic spooling on a UFS filesystem is a no-go option (a critical UFS bug fix last year caused a performance break-in with the classic spooling)
  • moderate performance improvements with NFSv4 vs. NFSv3
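
To get a rough feel for the O_DSYNC point in the excerpt, here is a minimal Python sketch that times a burst of small writes with and without the flag. This is not the SGE benchmark kit and not how SGE or BDB spool internally; the function name timed_writes, the record size and the record count are all made up for illustration, and on platforms where Python does not expose os.O_DSYNC the sketch silently falls back to ordinary buffered writes.

```python
import os
import tempfile
import time


def timed_writes(path, count=200, payload=b"x" * 512, sync=False):
    """Write `count` small records to `path` and return the elapsed seconds.

    Hypothetical helper for illustration only; not part of SGE.
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    if sync:
        # With O_DSYNC each write() returns only once the data has reached
        # stable storage. Assumption: fall back to 0 (no extra flag) on
        # platforms where Python does not define os.O_DSYNC.
        flags |= getattr(os, "O_DSYNC", 0)
    fd = os.open(path, flags, 0o600)
    start = time.perf_counter()
    try:
        for _ in range(count):
            os.write(fd, payload)
    finally:
        os.close(fd)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Point `dir=` at the filesystem you actually spool to (local disk, NFS
    # mount, ...); the default temp directory may be a RAM-backed tmpfs.
    with tempfile.TemporaryDirectory() as spool_dir:
        plain = timed_writes(os.path.join(spool_dir, "plain.dat"))
        synced = timed_writes(os.path.join(spool_dir, "dsync.dat"), sync=True)
        print(f"buffered writes: {plain:.4f}s")
        print(f"O_DSYNC writes:  {synced:.4f}s")
```

The numbers depend entirely on the filesystem and hardware underneath, so treat them as a sanity check on the apples-and-oranges caveat rather than a substitute for the published benchmark.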

Read the thread and download the benchmark files here: http://gridengine.sunsource.net/files/documents/7/196/test_spooling_performance.tar.gz. Please help the SGE team make the product better and better -- the more "real world" feedback that is captured, the better!