Grid Engine User Stories ::: DNA Productions
Chris Dagdigian
chris@bioteam.net
The open source Grid Engine project does not have marketing staff or a PR budget. This is the first of what will hopefully become a series of profiles concentrating on users and organizations doing interesting things with SGE. Please send me feedback, comments and suggestions for future profiles!
Making Movies With Grid Engine 6
Company Background
DNA Productions, Inc. has created award-winning 2D and 3D animated projects in Dallas, Texas since 1987. Examples of DNA’s projects include the Oscar-nominated feature film “Jimmy Neutron: Boy Genius” for Paramount and Nickelodeon and the television series “The Adventures of Jimmy Neutron, Boy Genius”, which currently airs on Nickelodeon. DNA also animated the Emmy-nominated holiday special “Olive, the Other Reindeer” for the Curiosity Company and Fox, and wrote, directed and produced the 3D Christmas special “Santa vs. the Snowman” for O Entertainment, which runs in IMAX theaters during the holiday season.
DNA is currently in development on a new animated feature film called “The Ant Bully”, scheduled for an early August 2006 theatrical and 3D IMAX release. In production for over two years, the project has made novel and interesting use of render farm and workflow systems built on top of open-source Grid Engine 6.
The Ant Bully (2006)
Quick Facts
- Company: DNA Productions, Dallas TX
- Business: 3D animation projects and features
- CPUs managed by Grid Engine: 1,400
- Storage: 73TB Isilon clustered storage
- Average # SGE jobs per day: 150,000
- Average # “shots” per week: 100
- Seriously Cool: Grid Engine prolog and epilog scripts report job parameters, start times, job array index position, usage data and exit status information to a central SQL database. The data is piped to web dashboards that provide producers and animators with “percentage shot complete” and “percentage frame complete” information. The density of job related information retained in the SQL database is enough to allow for 100% accurate re-execution of any prior job for any reason.
Animation Workload and Grid Engine Workflow
The central work unit on the render farm is a “shot”. The entire feature film can be described and broken down into a long series of required shots. A shot contains a number of frames, and each frame has roughly 10 different layers reflecting characteristics such as lighting, shadow, texture maps and more. The number of frames per shot varies from dozens to thousands. For this particular feature there are 1600 shots with an average length of 86 frames.
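To give a feel for the scale these figures imply, the arithmetic can be worked through directly. This is only a back-of-the-envelope sketch: it uses the shot count, average shot length and "roughly 10" layers quoted above, and it ignores re-renders, so real totals would be higher.

```python
# Figures quoted in the text; the layer count is approximate ("roughly 10").
shots = 1600
avg_frames_per_shot = 86
layers_per_frame = 10

# One pass over the whole film, with no re-renders.
total_frames = shots * avg_frames_per_shot
total_layer_renders = total_frames * layers_per_frame

print(total_frames)         # 137600
print(total_layer_renders)  # 1376000
```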
Each shot with all included frames and layers can be rendered independently and during the course of production a shot may be rendered and re-rendered many times. It is also possible to simply revisit/redo a layer or frame within a shot group.
Shots are rigidly named (example: ’xy_1_100_030_00_v005’) and the naming scheme is 100% consistent and enforced across all groups and departments. This extends to the naming of the Grid Engine jobs and their output directories as well as to the physical layout of the multi-terabyte Isilon clustered storage system.
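A minimal sketch of how a strict naming scheme lets tools derive storage locations from a shot name alone. Everything specific here is an assumption: the pattern is inferred from the single example name above, the meaning of each field is not documented in this article, and the root path and one-directory-per-field layout are hypothetical rather than DNA's actual structure.

```python
import re
from pathlib import PurePosixPath

# Pattern inferred from the one example 'xy_1_100_030_00_v005' (assumption):
# a lowercase prefix, several numeric fields, and a trailing version tag.
SHOT_NAME = re.compile(
    r"^(?P<prefix>[a-z]+)_(?P<fields>\d+(?:_\d+)*)_v(?P<version>\d+)$"
)

def shot_directory(root, shot_name):
    """Map a shot name to a directory under root (hypothetical layout)."""
    match = SHOT_NAME.match(shot_name)
    if match is None:
        # Rejecting malformed names is what keeps the scheme consistent.
        raise ValueError(f"not a valid shot name: {shot_name!r}")
    parts = [match["prefix"], *match["fields"].split("_"), "v" + match["version"]]
    return PurePosixPath(root).joinpath(*parts)

# '/mnt/shots' is a made-up mount point used only for illustration.
print(shot_directory("/mnt/shots", "xy_1_100_030_00_v005"))
```

Because the mapping is a pure function of the name, independent tools can agree on file locations without talking to each other, which is the property the article describes.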
The storage layout is similar to what this author has seen at scientific organizations involved with massive genome sequencing efforts. At a sequencing facility, knowing the name of the contig, experiment or clone ID would allow someone to efficiently traverse the shared filesystem to find or load the relevant raw and derived data. At DNA Productions, the Isilon NAS directory structure is laid out to allow someone knowing only a shot name to find exactly the files, textures, metadata or media required. This allows for multiple, disconnected workflow and production systems to know *exactly* where to read and write files.
The render farm nodes started out with Fedora Core 2 Linux and a modern 2.6.x kernel, although newly acquired 64-bit AMD systems are running Fedora Core 4. The workstations and render farm nodes all have gigabit connections to the core network and shared storage. There is no routing, DNS or network topology difference between an animator’s Linux workstation and a cluster node, allowing production workstations to join the rendering grid as needed.
The production applications are nearly all commercial in nature. Some are FLEXlm licensed and others have proprietary licensing server systems. Example commercial applications include: Maya, Houdini, Massive and the Pixar RenderMan tools. All of the tools are well wrapped for submission to Grid Engine. Licenses are not exclusive to the cluster as many of the same applications are also run on workstations.
Grid Engine Configuration
When it was first set up, the Grid Engine 6.0u4 installation was effectively a default install with FIFO scheduling. The only major changes from built-in defaults were the use of classic spooling, a “job_load_adjustments np_load_avg=1.00” scheduling parameter tweak and a hard-coded “maxujobs=40” constraint.
Over time, the configuration has been subjected to various rounds of optimization and enhancement effort. Currently there are 10 cluster queues defined, with 14 host groups and 6 custom defined complex resources. The primary resource allocation implementation is based on the Functional Policy mechanism. A maxujobs limit is still in place, set to 100 during the day, but it can automatically grow up to 250 during the night based on system load. The POSIX Priority policy is used at job submission time to help rank jobs by importance. Some jobs are submitted automatically with hold conditions that prevent execution until evening hours.
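The day/night maxujobs behavior described above could be driven by a small policy function like the following. This is a sketch under stated assumptions: only the limits of 100 and 250 come from the article, while the night-time window, the load threshold and the linear scaling are invented for illustration, and how DNA actually computes or applies the value is not described.

```python
def pick_maxujobs(hour, np_load_avg, day_limit=100, night_limit=250,
                  night_start=20, night_end=6, busy_load=1.0):
    """Choose a per-user job cap from the hour (0-23) and normalized load.

    night_start, night_end and busy_load are hypothetical tuning knobs.
    """
    is_night = hour >= night_start or hour < night_end
    if not is_night:
        return day_limit
    # At night, open up more slots per user as the farm gets less loaded.
    headroom = max(0.0, min(1.0, 1.0 - np_load_avg / busy_load))
    return day_limit + int((night_limit - day_limit) * headroom)

print(pick_maxujobs(hour=12, np_load_avg=0.2))  # 100
print(pick_maxujobs(hour=23, np_load_avg=0.0))  # 250
print(pick_maxujobs(hour=2, np_load_avg=0.5))   # 175
```

A scheduled task could evaluate such a function periodically and write the result into the scheduler configuration, where the maxujobs parameter lives.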
Workflow
There are two very interesting systems that DNA Productions has integrated with Grid Engine. The first is a set of graphical Grid Engine wrapper tools that automate the process of submitting Grid Engine jobs. These wrapper tools completely hide the standard SGE binaries (“qsub”, “qrsh”) from the users, allowing a production member to (for instance) take a scene description file and send it out for rendering.
The GUI wrapper automatically creates shell scripts containing the embedded qsub and job array commands and submits them to the cluster. Because the qsub commands are programmatically generated without user interaction, they can rigidly enforce standard job naming and output location conventions as well as contain embedded resource requests for things such as software license tokens. The wrappers provide two key advantages: they hide a significant amount of cluster related complexity from the end user while also ensuring that jobs are submitted in a uniform way that is consistent with production guidelines and workflow requirements.
The characteristics of the jobs themselves are very interesting. Typical job runtimes range from several minutes to at most a few hours, except for special cases such as simulation runs and other special FX or testing efforts. The average job array contains roughly 10 tasks.
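The kind of script generation described above might look roughly like this. It is a sketch, not DNA's wrapper: the function name, the paths, the `render_frame.sh` command and the `maya_license` resource name are all hypothetical. The embedded `#$` directives use standard qsub options (`-N` job name, `-t` array range, `-o` output path, `-j y` merge stderr, `-S` shell, `-l` resource request), and `SGE_TASK_ID` is the array index Grid Engine exports to each task.

```python
def build_job_script(shot_name, first_frame, last_frame, render_command,
                     output_dir, license_resource=None):
    """Return the text of a job-array script with embedded qsub options."""
    lines = [
        "#!/bin/sh",
        f"#$ -N {shot_name}",                 # job name follows the shot name
        f"#$ -t {first_frame}-{last_frame}",  # one array task per frame
        f"#$ -o {output_dir}",                # enforced log location
        "#$ -j y",                            # merge stderr into stdout
        "#$ -S /bin/sh",
    ]
    if license_resource:
        # Request one token of a site-defined license complex (name assumed).
        lines.append(f"#$ -l {license_resource}=1")
    # Each task renders the frame matching its array index.
    lines.append(f'{render_command} "$SGE_TASK_ID"')
    return "\n".join(lines) + "\n"

script = build_job_script(
    "xy_1_100_030_00_v005", 1, 86,
    "/hypothetical/bin/render_frame.sh",
    "/hypothetical/logs/xy_1_100_030_00_v005",
    license_resource="maya_license",
)
print(script)
```

Generating the script text in one place is what lets the wrapper guarantee that every job carries the same naming, output and license conventions, whatever the user clicked in the GUI.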
What this means from a Grid Engine perspective is that there is enough "churn" within the system (active jobs completing and draining from the system) to allow the configured resource allocation policies to work rapidly. To put it another way - producers or department heads can easily pick out and prioritize pending Grid Engine jobs that are associated with critical shots.
The second integrated system is very impressive and is as yet unnamed.
Integrated with Grid Engine via Prolog and Epilog scripts is the ability to pipe job submission and execution related information into a SQL database both before any SGE job or array task starts and after it completes. The prolog script will connect to the SQL database during job dispatch and persistently store specific information related to job submission (Job_ID, start time, host, task_id, and logpath). As each task completes, the epilog script will store the Job_ID, end_time, and array task_id. Augmenting the prolog and epilog scripts are database entries made by the DNA “farmwrapper” job submission engine. The end result is that enough information is captured in the database to enable any job to be re-created exactly and resubmitted to the render farm.
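The prolog/epilog bookkeeping can be sketched as follows. This is an illustration under assumptions, not DNA's system: it uses an in-memory SQLite database as a stand-in for the central SQL server, the table layout and function names are invented, and it stores only the fields the article lists. `JOB_ID`, `SGE_TASK_ID` and `HOSTNAME` are environment variables Grid Engine sets for a job; the environment is passed in as a plain dictionary here so the example runs without a cluster.

```python
import sqlite3
import time

# Hypothetical schema holding the fields named in the article.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tasks (
    job_id     INTEGER NOT NULL,
    task_id    INTEGER NOT NULL,
    host       TEXT,
    logpath    TEXT,
    start_time REAL,
    end_time   REAL,
    PRIMARY KEY (job_id, task_id)
)
"""

def _task_id(env):
    # Non-array jobs carry no numeric task index; store 0 for them (assumption).
    value = env.get("SGE_TASK_ID", "")
    return int(value) if value.isdigit() else 0

def record_start(db, env, logpath, now=None):
    """What a prolog script might store when a task is dispatched."""
    started = time.time() if now is None else now
    db.execute(
        "INSERT OR REPLACE INTO tasks "
        "(job_id, task_id, host, logpath, start_time) VALUES (?, ?, ?, ?, ?)",
        (int(env["JOB_ID"]), _task_id(env), env.get("HOSTNAME"), logpath, started),
    )
    db.commit()

def record_end(db, env, now=None):
    """What an epilog script might store when a task finishes."""
    ended = time.time() if now is None else now
    db.execute(
        "UPDATE tasks SET end_time = ? WHERE job_id = ? AND task_id = ?",
        (ended, int(env["JOB_ID"]), _task_id(env)),
    )
    db.commit()

def percent_complete(db, job_id, total_tasks):
    """Dashboard-style figure: share of a job's tasks that have finished."""
    (done,) = db.execute(
        "SELECT COUNT(*) FROM tasks WHERE job_id = ? AND end_time IS NOT NULL",
        (job_id,),
    ).fetchone()
    return 100.0 * done / total_tasks

db = sqlite3.connect(":memory:")
db.execute(SCHEMA)
for task in range(1, 5):
    env = {"JOB_ID": "42", "SGE_TASK_ID": str(task), "HOSTNAME": "node01"}
    record_start(db, env, logpath=f"/hypothetical/logs/42.{task}", now=1000.0)
record_end(db, {"JOB_ID": "42", "SGE_TASK_ID": "1"}, now=1600.0)
print(percent_complete(db, 42, total_tasks=4))  # 25.0
```

In a real prolog or epilog the dictionary would be the process environment, and the total task count would come from the submission record written by the wrapper.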
Screenshots
The first image shows a screen capture from “DNA FarmTV”, a dashboard status application that displays real time information about Grid Engine usage, users, jobs and errors. The second image shows a screenshot from “RUSTY”, one of the internally developed Grid Engine wrappers that can automatically submit jobs to the render farm.
::: Conversation with Andre Thomas, DNA Head of Rendering :::
Andre Thomas from DNA Productions was kind enough to respond to a series of emailed questions. His replies are below.
Tell us about yourself, your background, how you got into animation and any interesting projects you have done in the past
I’ve been in this business for over 12 years now and have worked on movies like Men in Black, Independence Day, Con Air, Tomorrow Never Dies and Valiant. I’ve always been very good in math and had a strong interest in computers since I was a teenager, although I didn’t get a chance to turn my passion into a profession until after having become a Toolmaker and studying Hotel & Restaurant management. It was a bit of luck when a friend asked me if I knew how to use computers, because a colleague of his needed a digital artist, and thus I started in a small boutique shop over 12 years ago.
What do you do at DNA Productions?
I am the supervising Shading TD [Technical Director] and the Head of Rendering, supervising 12 TDs and wranglers. I am responsible for ensuring that the farm is available and operating efficiently and that any problems get resolved very quickly. That includes my team proactively identifying problems with renders and solving them, which entails providing support for all departments that are using the farm. We have developed an extensive training program since the members of our team have to be able to resolve issues from any department. This requires extensive knowledge of all tools and the full production pipeline.
What kind of software is used on this project?
Houdini, Maya, PRman, Nuke, and Massive are the main commercial applications. This is complemented by a slew of software/tools written in house using a variety of languages: Perl, Python, C/C++, Tcl/Tk and PHP.
Does DNA develop rendering software or plugins internally?
The rendering software used is Pixar’s PRman, however we have developed numerous plugins in house.
The film is being produced for simultaneous release in traditional film format as well as for 3D projection in IMAX theaters, how does that affect the workflow and rendering process?
By doing the film in IMAX 3D and on traditional film, we basically increased the amount of rendering to be done by at least 50%, and in the more difficult cases doubling it or going even higher. We also required a dedicated team of artists to deal with the challenges of producing the stereoscopic images for 3D IMAX projection, plus of course, by delivering both formats at the same time we needed to ensure that everybody was able to get their frames rendered on time, putting more strain on the grid, which handled it very easily.
How is Grid Engine doing?
We started out with 400 processors on the grid and perhaps 10% utilization. By the end we had to increase twice, once by 600 CPUs and again by 400 CPUs, reaching an average utilization of almost 80%.
Any problems with Grid Engine?
We had one day where a user managed to submit over 200k jobs in 10 minutes, which brought the grid to its knees. Other than that, all of our problems are either hardware/software related issues or power problems.
On your busiest day, about how many jobs were processed?
We pushed 185k jobs through the grid one day without many problems.
Can you discuss any resource allocation policies you may be using within the render farm?
We are using a functional policy that allocates tickets by department. If a shot/job needs to get a higher priority, we assign override tickets to push it through. By default every user gets a priority of -1000, with higher priority jobs (as determined by management) receiving a priority of 0.
Is Grid Engine completely wrapped up or abstracted by your workflow and shot rendering tools? Do any members of the project use grid engine at the command-line?
Nobody uses it from the command line. The grid is completely wrapped for the end user, giving us better control.
DNA Productions seems to have built an impressive animation production and workflow system on top of Grid Engine. How was this done? Did DNA write the tools in house or obtain them externally?
We use several commercially available packages such as Houdini, PRman, Maya and Massive, which are all extended with tools built in house. We have very talented, experienced TDs (technical directors) and programmers who wrote various scripts/tools/plugins to complement these packages and tie them all together.
How do the animators, FX people and artists use the system?
Depending on what department a user is in and which software he/she is using, we have custom built submission tools, making it very easy to set up final render settings and submit to the grid. Once a job has been submitted, a user can go to an internal web page to see the status of his/her job plus view the log files. On top of that, we have built a system that automatically scans log files from each job, allowing us to detect errors and problems immediately and thus resolve them. Having nearly finished the production, we have thus far solved more than 30,000 problems. On average we receive about 100-200 tickets (problem notifications) per day. We’re also scanning the grid for jobs which have been running for more than 3 hours, resulting in another ‘ticket’ for us to look at. Since we average job processing times of 10-20 minutes, a 3+ hour run time indicates a potential problem and has to be investigated.
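The long-running job check mentioned in this answer comes down to a simple comparison of start times against a threshold. The following is a minimal sketch, assuming the start times have already been collected into a dictionary (for example from a job database); the function name and data shape are hypothetical, and only the 3 hour limit comes from the interview.

```python
from datetime import datetime, timedelta

RUNTIME_LIMIT = timedelta(hours=3)  # threshold quoted in the interview

def overdue_jobs(running_jobs, now, limit=RUNTIME_LIMIT):
    """Return the sorted ids of jobs that have run longer than limit.

    running_jobs maps a job id to the datetime at which the job started.
    """
    return sorted(
        job_id for job_id, started in running_jobs.items()
        if now - started > limit
    )

now = datetime(2006, 1, 1, 12, 0)
running = {
    101: datetime(2006, 1, 1, 11, 45),  # 15 minutes, normal
    102: datetime(2006, 1, 1, 8, 30),   # 3.5 hours, should raise a ticket
    103: datetime(2006, 1, 1, 9, 0),    # exactly 3 hours, not yet over
}
print(overdue_jobs(running, now))  # [102]
```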
What sorts of tools do you have for producers and other people who need to know big-picture information about shot and render farm status?
We have an internal web site called Farm TV [see screenshots] that allows anybody to view in real time the state of the farm, how many jobs, what type of jobs, current average age of job, max jobs per user, top users and tickets generated. In addition, we have several web pages that display statistical information about the Grid, such as usage patterns.
With the film release date only months away, how do you feel about the system DNA has built? Would you use it again? What changes or improvements are you thinking about for upcoming efforts?
Having used different systems in the past, I’ve been very impressed with SGE, especially its stability and versatility. I would definitely use it again. As for improvements, I would develop a tighter integration into web pages/database whereby we have a real time interface (like qmon) to the grid for end-users, or alternatively re-write qmon, since it does become very sluggish as the queue gets more loaded and it doesn’t allow for a great deal of customization.
Anything else you’d like to add?
When first presented with SGE it seemed very daunting and a huge task to tackle to understand all its intricacies. Even today I do not believe I know every feature/nuance of the grid. I can only recommend that anyone stick with it and even get some experts in at the beginning who can consult for you and help you make sense of it all. We did hire experts from the BioTeam to consult for us, and having only done so for a few days, we were up and running in no time at all.
On top of that using Grid Engine allowed us to build a very efficient system, which enabled us to significantly help other departments by lighting shots, and doing FX which wouldn’t normally fall under our realm.
