Understanding queue error state 'E'
At my day job I usually handle SGE related questions from our customers and clients. This morning, after responding to a support request concerning an SGE queue in state "E", I got curious and tried to find out how often we had been asked about this. It turns out that I've probably sent ~25-30 unique responses on this specific subject, and each time my written response was different. This post is an attempt to create a single article that I can point people at as needed.
Seeing "E" in the state column of qstat?
E state errors usually mean that an attempt to start a job failed in a spectacular manner and the Grid Engine qmaster decided to close off the queue instance to new jobs.
This is an important Grid Engine protective measure designed to keep your remaining pending jobs from a "black hole" draining effect in which they all successively get dispatched to the "bad" node and die instantly with errors.
There are different causes of state E. In most cases the root cause is some large, systemic hardware or OS level error or misconfiguration. Typical examples include:
- The username of the job submitter does not exist on the execution host (extremely common)
- Shared filesystem failure
- Parallel jobs: syntax errors or bad commands in "start_proc_args" or "stop_proc_args" as defined within the parallel environment (PE)
- Serial jobs: syntax errors in a "prolog" or "epilog" script, or a "prolog" or "epilog" script that does not exit with status code 0
- Serious path or path_alias problems (paths that exist on the submit host are different on the remote execution host or have been improperly aliased)
- Network, routing or DNS errors that are interfering with LDAP, NIS or DNS
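For the prolog/epilog case in the list above, one quick check is to run the script by hand on the execution host, as the job owner, and look at its exit status. This is only a rough sketch: hook_exit_status is a made-up helper name and the script path in the usage comment is a placeholder. A hand run also does not reproduce the environment SGE sets up for a real job, so a clean result here does not fully clear the script.

```shell
# Run a command and report its exit status (hypothetical helper).
# Returns the same status so it can be used in conditionals.
hook_exit_status() {
    status=0
    "$@" || status=$?
    echo "exit status: $status"
    return "$status"
}

# Usage on the execution host (placeholder path, not run here):
#   hook_exit_status /path/to/prolog.sh
```

Anything other than "exit status: 0" is worth chasing down before clearing the queue.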
I have seen a few cases of actual jobs crashing and causing queue instance state "E". Usually this seems to occur when the job itself has crashed and taken out its parent process (the 'sge_shepherd' daemon). If your job is bombing badly enough to wipe out the parent sge_shepherd process then SGE will usually toggle the queue instance into "E" state. This is still a fairly rare occurrence, so if you are trying to debug this situation I'd recommend first looking at hardware and OS level issues before looking too closely at the job as a root cause.
State "E" does not go away automatically
One big message to impart is that E states are persistent and never go away on their own (unlike many SGE queue and job states which clear automatically). State "E" will persist through hardware reboots and Grid Engine restart efforts. The state has to be cleared manually by a Grid Engine administrator. Again, the reason for this is that SGE wants a human to investigate the root cause first in case there is potential for the "black hole" effect mentioned above.
If you think this was a transient problem you can clear the queues and see what happens with your pending jobs. The command is "qmod -c (queue instance)".
To globally clear all E states in your SGE cluster:
qmod -c '*'
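The global clear also resets nodes that may still be broken, so it can be handy to list just the queue instances that are in state E and clear them one at a time. Here is a minimal sketch of that idea. The helper name sge_error_queues is made up, and it assumes that "qstat -f" prints each queue instance as queue@host in the first column with the state flags in the sixth whitespace-separated column. Check that against the output of your own SGE version before relying on it.

```shell
# Print queue instances whose states column contains "E".
# Assumption: queue instance lines look like
#   all.q@node03  BIP  0/0/4  0.02  lx24-amd64  E
# and the states column is simply absent when no flag is set.
sge_error_queues() {
    awk '$1 ~ /@/ && NF >= 6 && $6 ~ /E/ { print $1 }'
}

# Usage on a live cluster (not run here):
#   qstat -f | sge_error_queues
#   qstat -f | sge_error_queues | while read -r qi; do qmod -c "$qi"; done
```

Reading from stdin keeps the filter easy to try against a saved copy of the qstat output before pointing it at qmod.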
Troubleshooting and Diagnosing
- qstat -explain E
- Examine the node itself and OS logs with an eye towards entries relating to permissions, failures or access errors
- Try to log in to the node in question using a username associated with a failed job. This will help diagnose any username, authentication or access issues
- Look in the job output directory if it is available. Output from failed jobs can be extremely useful, especially if there is a path, ENV or permission problem
- Examine the SGE logs with particular focus on the messages file created by the sge_execd on the execution host in question
- If all else fails, SGE daemons will write log files to /tmp when they can't write to their normal spool location. Seeing recent SGE event data in /tmp instead of your normal spool location is a good indication of filesystem or permission errors
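Two of the checks above are easy to script when you are logged in to the suspect execution host. This is a rough sketch with made-up helper names (user_resolves, recent_files). It assumes a find that supports -maxdepth and -mmin (GNU and BSD find do), and it does not assume any particular SGE log file name, so you still have to eyeball the listing for SGE output.

```shell
# Succeeds if the account resolves on this host (local files, NIS or LDAP).
user_resolves() {
    id -u "$1" >/dev/null 2>&1
}

# List regular files directly inside a directory that were modified
# within the last N minutes.
recent_files() {
    dir=$1
    minutes=$2
    find "$dir" -maxdepth 1 -type f -mmin "-$minutes"
}

# Usage (placeholder username, not run here):
#   user_resolves alice || echo "alice is unknown on $(hostname)"
#   recent_files /tmp 60
```

If the job owner does not resolve, or fresh SGE messages show up in /tmp, you most likely have your root cause and can fix it before clearing the E state.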
I'll try to keep this page updated in the future with new information and troubleshooting hints
