<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="/stylesheets/rss.css"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
  <channel>
    <title>gridengine.info : Tag spooling, everything about spooling</title>
    <link>http://gridengine.info/tag/spooling.rss</link>
    <language>en-us</language>
    <ttl>40</ttl>
    <description>tracking Grid Engine news, bugs, howtos and best practices</description>
    <item>
      <title>Fixing a berkeley db spool database</title>
      <description>&lt;p&gt;Per this &lt;a href="http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&amp;dsMessageId=75815"&gt;thread&lt;/a&gt; on the users list, a recipe for rebuilding and re-verifying a Berkeley DB-based binary SGE spool:
&lt;/p&gt;
&lt;p&gt;&lt;pre&gt;&lt;blockquote&gt;
service sgemaster stop # on failover server
service sgemaster stop # on master server

cd $SGE_ROOT/default/spool
cp -a spooldb spooldb.bak

cd spooldb
$SGE_ROOT/utilbin/lx24-amd64/db_verify sge
$SGE_ROOT/utilbin/lx24-amd64/db_recover
$SGE_ROOT/utilbin/lx24-amd64/db_dump -f sge.out sge
mv sge sge.old
$SGE_ROOT/utilbin/lx24-amd64/db_load -f sge.out sge
$SGE_ROOT/utilbin/lx24-amd64/db_verify sge


service sgemaster start # on master server
service sgemaster start # on failover server
&lt;/blockquote&gt;&lt;/pre&gt;
&lt;/p&gt;
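&lt;p&gt;If you would rather not paste those commands one at a time, here is a minimal sketch of the same sequence wrapped in a shell function that stops at the first failed step. The function name and its two arguments are my own assumptions for illustration, not part of SGE or of the original thread. Run it only while sge_qmaster is stopped on both the master and the failover server.&lt;/p&gt;

```shell
#!/bin/sh
# Sketch only: same steps as the recipe above, with basic error handling.
# rebuild_spooldb and its arguments are hypothetical names for illustration.

rebuild_spooldb() {
    spooldb_dir=$1    # e.g. $SGE_ROOT/default/spool/spooldb
    bdb_bin=$2        # e.g. $SGE_ROOT/utilbin/lx24-amd64

    # Refuse to overwrite an earlier backup.
    if [ -e "$spooldb_dir.bak" ]; then
        echo "backup $spooldb_dir.bak already exists, aborting"
        return 1
    fi
    cp -R -p "$spooldb_dir" "$spooldb_dir.bak" || return 1

    (
        cd "$spooldb_dir" || exit 1
        # A damaged database may fail this first check; keep going regardless.
        "$bdb_bin/db_verify" sge || echo "db_verify reported problems, continuing"
        "$bdb_bin/db_recover" || exit 1
        "$bdb_bin/db_dump" -f sge.out sge || exit 1
        mv sge sge.old || exit 1
        "$bdb_bin/db_load" -f sge.out sge || exit 1
        # The rebuilt database must verify cleanly.
        "$bdb_bin/db_verify" sge || exit 1
    )
}

# Example call (paths are assumptions, adjust for your cell and architecture):
# rebuild_spooldb "$SGE_ROOT/default/spool/spooldb" "$SGE_ROOT/utilbin/lx24-amd64"
```

&lt;p&gt;A non-zero return means one of the steps failed; the untouched copy is still in spooldb.bak in that case.&lt;/p&gt;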


</description>
      <pubDate>Tue, 11 Nov 2008 12:43:00 -0500</pubDate>
      <guid isPermaLink="false">urn:uuid:5d5fe409-1d8d-4b89-ab09-f5ade5ae4b57</guid>
      <author>dag@sonsorol.org (chris)</author>
      <comments>http://gridengine.info/2008/11/11/fixing-a-berkeley-db-spool-database#comments</comments>
      <category>Administration</category>
      <category>MailList Bits</category>
      <category>berkeleydb</category>
      <category>spooling</category>
      <link>http://gridengine.info/2008/11/11/fixing-a-berkeley-db-spool-database</link>
    </item>
    <item>
      <title>Why I love classic spooling</title>
      <description>&lt;p&gt;
&lt;img src="http://gridengine.info/misc/corrupt-sge-5.png" /&gt;
&lt;/p&gt;

&lt;p&gt;I had a fascinating SGE troubleshooting situation this morning. At first it started off as a normal "&lt;em&gt;why does SGE refuse to start?&lt;/em&gt;" issue after a system OS update. The initial errors are very similar to the standard sorts of errors one sees when firewalls, DNS or hostname issues are breaking things:
&lt;pre&gt;
mbgxsrv1:~ root# /common/sge/default/common/sgemaster start
   starting sge_qmaster
   starting sge_schedd
error: commlib error: got read error (closing  
"mbgxsrv1.xxx.xxx.xxx/qmaster/1")
error: commlib error: can't connect to service (Connection refused)
error: getting configuration: unable to contact qmaster using port  
701 on host "mbgxsrv1.xxx.xxx.xxx"
error: can't get configuration from qmaster -- backgrounding
mbgxsrv1:~ root#
&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;It turns out the root cause was much more interesting. A couple of critical SGE spool files had been turned into binary gibberish, possibly caused by a SAN reboot but we are not quite sure. On startup, the qmaster was unable to read in critical configuration data and would bomb out with errors.&lt;/p&gt;

&lt;p&gt;This is where the use of &lt;strong&gt;classic spooling&lt;/strong&gt; by the organization saved the day. Working deep within the SGE spool directory, I was able to manually fix, replace and repair a couple of files including the "qmaster/cqueues/all.q" file and the "qmaster/hostgroups/@allhosts" file. The fix took a few minutes to effect and SGE started up instantly and without error.
&lt;/p&gt;
&lt;p&gt;What does classic spooling have to do with this? Glad you asked! Had this site been running SGE in the default "berkeley spooling" mode, the files that I was able to quickly find and fix in place would have been locked inside a binary BDB-formatted database: inaccessible and unfixable without deep knowledge of the Berkeley DB command-line and troubleshooting tools. Had this been a berkeley-based spooling system, it would have been faster to simply wipe the SGE install and perform a new one from scratch.&lt;/p&gt;
&lt;p&gt;
This is why I'm a strong proponent of classic mode spooling. When berkeley-db spooling is used, you are giving up the beauty, utility and accessibility of ASCII text formatted state and spool files in exchange for "performance" that most users will never notice (those of you who run tens of thousands of jobs per day will disagree, but I'm talking about averages here ...).
&lt;/p&gt;
&lt;p&gt;My general rule now is to use classic mode spooling by default on clusters smaller than 32 nodes in size and on any cluster where I know the daily job throughput is not going to be extremely high. In general I think most users should start with classic mode spooling and only move to Berkeley-DB based spooling when they are comfortable enough with the system to (a) handle a reinstall and (b) actually gain from the performance that berkeley-db spooling offers.&lt;/p&gt;

&lt;p&gt;Read on for more details on this particular incident ... &lt;/p&gt;






&lt;p&gt;&lt;strong&gt;Gory Details&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the qmaster messages file entry that clued us into a possible configuration state problem:
&lt;pre&gt;
01/24/2008 09:50:20|qmaster|mbgxsrv1|W|conf_version not found on  
reading spool file
01/24/2008 09:50:20|qmaster|mbgxsrv1|W|only a single value is allowed  
for configuration attribute "Your"
01/24/2008 09:50:21|qmaster|mbgxsrv1|W|conf_version not found on  
reading spool file
01/24/2008 09:50:21|qmaster|mbgxsrv1|W|only a single value is allowed  
for configuration attribute "qtype"
01/24/2008 09:50:22|qmaster|mbgxsrv1|E|missing configuration  
attribute "group_name"
01/24/2008 09:50:22|qmaster|mbgxsrv1|C|!!!!!!!!!! lGetHost(): got  
NULL element for HGRP_name !!!!!!!!!!
&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;This is what the spool file "qmaster/cqueues/all.q" looked like when I tried to view it in my terminal window:&lt;br/&gt;
&lt;img src="http://gridengine.info/misc/corrupt-sge-3.png"/&gt;
&lt;/p&gt;
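&lt;p&gt;Because classic spool files are plain text, a file full of binary junk is easy to spot mechanically. Here is a rough sketch that lists files under the two spool subdirectories mentioned above if they contain any byte that is neither printable ASCII nor whitespace. The function name is hypothetical, and the check is only a heuristic: it will not flag a file that is empty or truncated but still textual.&lt;/p&gt;

```shell
#!/bin/sh
# Sketch only: flag classic-spool files that contain non-text bytes.
# find_corrupt_spool_files is a hypothetical helper, not an SGE tool.

find_corrupt_spool_files() {
    qmaster_spool=$1    # the qmaster spool directory, e.g. .../spool/qmaster
    for f in "$qmaster_spool"/cqueues/* "$qmaster_spool"/hostgroups/*; do
        [ -f "$f" ] || continue
        # Delete every printable or whitespace byte, then count what is left.
        bad=$(cat "$f" | LC_ALL=C tr -d '[:print:][:space:]' | wc -c | tr -d ' ')
        if [ "$bad" -gt 0 ]; then
            echo "$f"
        fi
    done
}

# Example call (path is an assumption based on the prompt shown earlier):
# find_corrupt_spool_files /common/sge/default/spool/qmaster
```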

&lt;p&gt;This is what the queue configuration should have looked like:
&lt;pre&gt;
qname              all.q
hostlist           @computeNodes
seq_no             0
load_thresholds    np_load_avg=1.75
suspend_thresholds NONE
nsuspend           1
suspend_interval   00:05:00
priority           0
min_cpu_interval   00:05:00
processors         UNDEFINED
qtype              BATCH INTERACTIVE
ckpt_list          NONE
pe_list            make mpich
rerun              FALSE
slots              2
tmpdir             /tmp
shell              /bin/csh
prolog             NONE

 ... snip ...

&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;The "fix" consisted of these steps:
&lt;ol&gt;
&lt;li&gt;Replace the corrupt @allhosts file by copying the known-good @testNodes file in its place&lt;/li&gt;
&lt;li&gt;Manually edit the "new" @allhosts file to properly set the name and group members&lt;/li&gt;
&lt;li&gt;Replace the corrupt all.q cqueues file by overwriting it with a copy of the test.q file which was not corrupted&lt;/li&gt;
&lt;li&gt;Manually edit the new "all.q" file to properly set the qname and other parameters&lt;/li&gt;
&lt;/ol&gt;
After those minor hand-edits using the Unix copy command and a text editor, Grid Engine was able to start up fine. Overall the fix took about 10 minutes to implement once we identified the two corrupt files.
&lt;/p&gt;
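&lt;p&gt;For the curious, the copy-and-rename part of those steps can be sketched as a small shell function. This is not what I actually typed that morning; the helper name and its arguments are made up for illustration, and it only rewrites the one line that names the object (group_name for a hostgroup, qname for a queue). The host list, slots and other parameters still need a hand edit. The corrupt file is moved to a directory outside the spool tree, on the assumption that stray files left inside cqueues or hostgroups could be mistaken for spool objects.&lt;/p&gt;

```shell
#!/bin/sh
# Sketch only: recreate a corrupt classic-spool file from a known-good sibling.
# clone_spool_file is a hypothetical helper, not an SGE tool.

clone_spool_file() {
    good=$1          # known-good file, e.g. hostgroups/@testNodes
    target=$2        # corrupt file to replace, e.g. hostgroups/@allhosts
    backup_dir=$3    # directory OUTSIDE the spool tree for the corrupt copy
    key=$4           # attribute naming the object: group_name or qname
    value=$5         # new name, e.g. @allhosts or all.q

    # Keep the corrupt original for later inspection.
    mv "$target" "$backup_dir/" || return 1
    # Copy the good file, rewriting only the line that names the object.
    sed "s|^${key}[[:space:]].*|${key} ${value}|" "$good" > "$target"
}

# Example calls, run from the qmaster spool directory with qmaster stopped
# (the backup directory is an assumption):
# clone_spool_file hostgroups/@testNodes hostgroups/@allhosts /root/sge-corrupt group_name @allhosts
# clone_spool_file cqueues/test.q cqueues/all.q /root/sge-corrupt qname all.q
```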



</description>
      <pubDate>Thu, 24 Jan 2008 12:33:14 -0500</pubDate>
      <guid isPermaLink="false">urn:uuid:29e900e9-c6c2-4c3a-941b-d6f6edf9f6f7</guid>
      <author>dag@sonsorol.org (chris)</author>
      <comments>http://gridengine.info/2008/01/24/why-i-love-classic-spooling#comments</comments>
      <category>Administration</category>
      <category>spooling</category>
      <category>classic spooling</category>
      <category>rants</category>
      <link>http://gridengine.info/2008/01/24/why-i-love-classic-spooling</link>
    </item>
  </channel>
</rss>
