aug2003.tar

Beowulf Batch Processors and Job Schedulers

Edward L. Haletky and Patrick Lampert

Beowulf and grid technology provide an attractive mechanism to build powerful compute clusters with inexpensive off-the-shelf components. Yet this technology also introduces new and complex scheduling problems. How do you distribute your work over the cluster? Can you schedule your work for times of lower activity? These issues are addressed by a variety of job-scheduling and load-balancing software tools that we will examine in this article. We review seven different queuing engines that can manage your resources, schedule jobs, and even interlock runs based on execution dependencies. These seven systems range from the simplest cron-related tools to grid engines.

We will examine ease of installation, configuration, creation of a single queue, the steps required to submit an extremely simple job, as well as the steps to dispatch jobs based on time of day without using cron to enable and disable queues. Furthermore, we will comment upon availability of multi-node capability, support, and security, then present a simple chart to assist you in picking your Job Scheduling/Queuing Software.

The systems discussed are: at(1)/bbq(1), Clusterware/Load Sharing Facility (LSF), Condor, Generic Network Queuing System (GNQS), GNU Queue, Open Portable Batch System (OpenPBS), and the Sun Grid Engine (SGE). Although two of these systems are primarily grid engines (Condor and SGE), they all provide a way to queue up jobs for execution as the resources allow.

Each system was installed upon a Scyld Beowulf cluster with six slave nodes comprising off-the-shelf spare parts. Installation of the queuing agents took place on the master node leaving the six slave nodes as computational nodes. To test that the queues were running, we used a simple uptime script that follows. This allowed us to test that everything was run as required by viewing the various output and log files:

#!/bin/sh
date
bpsh -p -a /usr/bin/uptime | /usr/bin/sort -n

The date in the script allows us to verify when the job submittal occurred. Since each queuing agent is different, the research was originally performed as a learning exercise to determine what is different between them and to determine which is best for each type of job. We discovered that, although these systems differ, each contains a mechanism to submit a job, get the status of a job, and to kill a job. However, the means to configure a queue, the installation, and the names of the commands are remarkably inconsistent. We also found that the default Scyld installation had to be modified to allow the queuing agents to be installed on all nodes of the cluster. Any comments on the ability to run queuing agents on each node of the cluster are purely from the documentation provided; we were not able to readily verify this functionality.

At(1) bbq(1)

At(1) has a dual distinction. Not only is it the oldest queuing agent, it is also the one that will be available on the most operating systems. It follows the "keep it simple..." rules attributed to older Unix programs and, because of that, spans hardware and time itself. At(1) manages time very easily, you can specify when a job is to be run in a myriad of ways, and it allows jobs to be queued into named queues of a single letter. By default there is a "do it now" queue and a batch queue. The batch command will process jobs as system load permits. While in a cluster this means the load of the master node, it is still very useful if you use your master node as the only interactive node and mainly for job scheduling. At(1) cannot be used across nodes to manage resources.

Installation

At(1) should be installed by default and atd should be running. However, you can use /sbin/chkconfig and /etc/service to verify and start/stop atd; it is part of the at-3.1.8-24_Scyld RPM.

Use

Run testscript at or during a specific time range:

at -f ./testscript now (at a specific time)
No method to specify a time range

Run testscript as load allows (< 0.8):

batch -f ./testscript

Run testscript using a specific queue:

batch -q [A-z] -f ./testscript

To remove a job:

atrm jobId To list queues:

atq
bbq

or use the Scyld Webmin interface for bbq.

Support

In general, you would go to man pages or the Web for support of at(1)/bbq(1).

Notes

At(1) has basic queuing mechanisms for firing off jobs at specific times (time of day test) and based on load average. Although you cannot combine them, you can either fire off jobs at time of day or based on load average. You can also specify queues with a single character from a-z or A-Z. The niceness of the run goes up as the queue designator goes up. So submittal in a "Z" queue will have more CPU than a run in an "a" queue. At(1) by default uses the "a" queue, and batch by default uses the "b" queue. There is no way to specify configurations for a queue. Uppercase queues are treated as batched at the time of "now". If users are competing for resources, you may want to limit them to only use one queue. However, there is no way to enforce this behavior.

Security

At(1) has a basic security concept. You can specify whether a user can use at/batch to submit jobs but you cannot specify which queue they can use. You use the files /etc/at.allow and /etc/at.deny to specify which users have access to at/batch. If either file is empty or non-existent, then all users can access the commands; however, if you specify even one user, you must specify them all. If your username is not in /etc/at.allow, then it is automatically denied and vice versa for /etc/at.deny.

Clusterware/LSF

Clusterware and LSF are for fee products of Platform Computing (http://www.platform.com). Clusterware is a subset of the full LSF package. While Clusterware is specific to Beowulf, usage of the queuing agent is so similar to LSF that we will cover both abilities and differences in this section. Because this is a fee product, you will need to purchase a license. Once you have that, you are ready to begin.

Installation

Download and unpack lsf5.0_lsfinstall.tar.Z, and download the media package but do not unpack it. We recommend that you review the PDF file that comes with the installation. You will need to substitute Clusterware for wherever you see LSF, when installing Clusterware. You will need to write a startup script for the license manager (lmgrd) if you do not already have one, and edit the license file to represent the directory containing the program lsf_ld. If you install LSF on the slave nodes, then they can reference the master server. A default queue is automatically created.

Use

Run testscript at or during a specific time range:

bsub [-b mm:dd:hh:mm] ./testscript

To run job during a specific time range:

Edit lsb.queues and modify the queue by adding RUN_WINDOW = 9:00,11:00 to the queue definition and then reconfigure the system using badmin reconfig. This queue will only run jobs from 9 to 11 AM:

bsub -q timedq ./testscript

Run testscript using a specific queue:

bsub [-q queuename] ./testscript

To remove a job:

bkill jobId

To list queues:

bqueues (lists queues)
bjobs -a (lists jobs in queues)

Notes

Instead of "RUN_WINDOW" you may want to use "DISPATCH_WINDOW" for the timed queue, as RUN_WINDOW will suspend jobs that do not finish within the time specified. LSF also has configuration options to operate with NQS and documentation on interoperability with condor and specific Beowulf items (Clusterware). LSF will also run jobs dependent upon the completion of other jobs.

Support

Platform Computing's support is wonderful. We had several problems regarding the license file, which they answered in a timely fashion. They also called us back and asked us to undo a change that was not beneficial to the system and later on could cause problems. This type of proactive support is highly appreciated.

Security

There is host, user, and group-level access controls, as well as time of day. Additionally, there are some security options that are dependent on other options and can change the run behavior. Tie these standard options to the timed queue options and you have a fairly complete security picture.

Condor

Condor is a grid engine available from:

http://www.cs.wisc.edu/condor/

All grid engines include a way to schedule and submit jobs to the many systems connected to a grid. Although this is a grid tool, it is included here for its abilities in submitting jobs. Its multi-architecture approach and job control language (JCL) are extremely powerful if you have multiple types of CPUs within your cluster, which is our case. Condor has limited queues, called Universes, that cannot be modified much, and it is mainly used for matching up machines with specific types of jobs. You can run Condor daemons on every node of your cluster. Condor will then choose the best hosts on which to run jobs based on a classified ad (class ad) used to advertise machine capabilities. The Condor testscript JCL is:

Universe = vanilla
Executable = /home/elh/testscript
Output = /home/elh/out.$(Process)
Error = /home/elh/err.$(Process)
Log = /home/elh/log.$(Process)
Arguments =
Queue

Installation

We installed Condor as a single machine entity by running condor_install after creating a Condor user and adding a /etc/hosts entry for the master node (which is required for Scyld Beowulf) but not for Condor. Next, we copied the default central manager version of condor_config.local to the appropriate directory and modified the file. We modified the condor_config file to NOT "glidein" (join) a grid, to change the CONDOR_STARTD options to start all the appropriate daemons for a local configuration, and to always allow processes to run. You can install Condor on each node of the cluster.

Use

Run testscript at or during a specific time:

RunHours = (ClockMin > 600 && ClockMin < 660)
START = $(RunHours)

condor_submit condor_test

Run testscript using a specific queue:

condor_submit condor_test (condor_test contains the universe specification.)

To remove a job:

condor_rm

To list queues:

condor_q

Notes

The condor_test JCL was not difficult to understand, but the language can be very dense and, for complex jobs, a little daunting. Because Condor is used for a grid, it must include such things as architecture and operating system. There was a hole in the cluster at test time (node 3 was down), and the complete test output was never returned to us. Condor has multiple universes but not multiple queues within a universe. Each universe is strictly defined based on the type of job submitted (Shell scripts, MPI, PVM, and other programs). Note that Condor will only allow one job to run at a time, no matter which universe is in use. Items queued up to run during the allowed run window were properly run.

Support

There is quite a large online documentation repository for Condor, but some of it is incomplete. Mail to "condor_admin" was queued and responded to within 24 hours. The total time to answer our security-related question was roughly a week because of delays in email and initial understanding of the problem. The configuration file is fairly well documented.

Security

Condor has host-level access control to the pool of machines available via the Condor daemon and its universes. There is a way to only allow certain users administrative access to the system but no way to block a universe by user except with an extremely complex START variable involving the RemoteOwner macro. Condor will "call home" to cs.wisc.edu and send information about your Condor install unless you set "CONDOR_DEVELOPERS_COLLECTOR" to "NONE", which is an option inside condor_config.

Generic Network Queuing System (GNQS)

GNQS is the granddaddy of all queuing tools and shows its age, as there have been no improvements in quite a while. Even so, GNQS will compile and run on a surprising number of systems. It is available from:

http://www.gnqs.org/

Installation

Installation is very straightforward and is self-contained in the SETUP script that you run to do everything. Additionally, the script contains all the necessary commands for setting up a default batch queue, executing the server, and other tasks that you must do by hand to complete the install. We did have to create our own daemon startup file, however. To run NQS on every node requires NQS to be installed on every node and all daemons to be running. You would specify redirect queues so that the scheduling would take place on the master node. Setting up a redirect queue is rather complex, yet there is a small example from which to extrapolate a suitable solution.

Use

Run testscript at or during a specific time:

Not possible.

Run testscript using a specific queue:

qsub -q [queuename] ./testscript

To remove a job:

qdel jobId

To list queues:

 
qstat [-a]

Notes

The qmgr program can be used to manage the complete system, and there are many options. Most of these options guard system resources.

Support

Support for GNQS is purely what is available in the package and on the Web. The manual pages are very good.

Security

You can specify user- and group-level permissions for each queue created, but not host-level permissions.

GNU Queue

GNU Queue is a relatively new item in the list of queuing agents; however, this open source project has had no active development in more than a year, as can be seen by the need to fix the source to allow it to run on Red Hat 7.3 based Scyld Beowulf. After fixing two source issues (one missing file, and incorrect programming for Red Hat 7.3 based Scyld Beowulf), GNU Queue finally executed. GNU Queue is primarily a load balancing tool.

Installation

We modified define.h to add #include <time.h> as the very first line and then ensured that the setrlimit code in queued.c (which is already conditional for Solaris) be conditional for Red Hat 7.3 based Scyld Beowulf. We used the configure option -enable-root=YES to allow a root install of the program so more than one user could use it. Also, we created our own startup file. You would install GNU Queue (queued) on each node of the cluster and let it manage your MPI, or other jobs based on load average. Additionally, you will need to create a GNU Queue host file to tell queued whom to contact.

Use

Run testscript at or during a specific time:

cd /var/queue; cp -r wait timedq

Edit:

timedq/profile

Modify timesched line and make it read:

timesched Any,1100-1200

Restart GNU Queue daemon.

Run testscript:

queue -q -d timedq -- /home/elh/testscript

Run testscript using a specific queue:

queue <-i|-q -d queuename> -- /home/elh/testscript

To remove a job:

task_control -k jobId (if compiled)

To list queues:

View the queuestat file in queue directory

Notes

Using a time of Any or Tu (tuesday) worked for the timed Q test; yet, whenever we add a 24-hour clock time for a low and a high value, it ceased to work. There is no documentation for a timed queue, but you can read through the source code. Any, Wk, Evening, Night, Sa, Su, Mo, Tu, We, Th, Fr, and low-high 24-hour times are supported as arguments to the timesched and timestop keywords. Each element on the line is comma-separated so you can create very complex timings. You may want to redirect the output per "man queue" in order to not stall the job if there is output. Jobs submitted prior to the timesched are not run, yet will let jobs be submitted between the times specified.

Support

Support is limited on the Web site (http://www.gnuqueue.org) and to understand all the options you must read the code.

Security

GNU Queue has no mechanism to limit or control the use of the system. If you can issue the queue command, you can use the tool.

Open Portable Batch System (OpenPBS)

OpenPBS, found at http://www.openpbs.org, provides a very good cluster batch processing system. OpenPBS has a for fee version named PBSpro that is not reviewed here. OpenPBS uses the Tcl/Tk scripting language, allowing other schedulers to be added to the program suite quite easily. This is also the case for the Maui Scheduler for OpenPBS.

Installation

Installation of OpenPBS is quite simple and requires that two RPMS files be installed on the system. To get xpbs (the graphical interface) to work, we had to run the program, exit, and then edit the resultant .xpbsrc file to change all instances of the server to be the name chosen for the machine. By default, the workq is created. We chose the server name by changing the files /usr/spool/PBS/server_name and /usr/spool/PBS/mom_priv/config as appropriate. Other documents describe how to run pbs_mom on each cluster node in order to get the most out of OpenPBS resource management.

Use

Run testscript at or during a specific time:

qsub -a datetime [-q queue] ./testscript (at specific time)
No way to specify a time range

Run testscript using a specific queue:

qsub [-q queue] ./testscript

To remove a job:

qdel jobID

To list queues:

qstat -a
xpbs

Notes

The command syntax for OpenPBS is similar to GNQS, which is its ancestor. There is even a tool to convert job scripts from GNQS to PBS. Additionally, OpenPBS has a lot of controls governing job dependencies (e.g., My job will not run until these other jobs have run). These controls can be intricate and are an area of strength for OpenPBS.

Support

Support is via the Web site unless you want to purchase PBSpro, then there is support via Veridian Systems. There is also an extremely active mailing list that will happily give you answers to your OpenPBS questions. Additionally, monitoring the list gives a good feel for the OpenPBS power configurations.

Security

Security is available in the form of host-, group-, and user-level access to a queue providing a comprehensive access control for the OpenPBS system. There are also mechanisms to force the use of routing queues to further protect and choose the appropriate queue for a job.

Sun Grid Engine (SGE)

SGE can be found at http://gridengine.sunsource.net and provides another grid computing facility. Unlike Condor, it has no limitations as a queuing agent.

Installation

Installation was straightforward as we created the sgeadmin account, updated /etc/services to reference the SGE daemons, then unpacked all the files into the installation directory. Please note that we did unpack the documentation tarball first to get the instructions. We then stepped through these instructions and had to provide arguments to setfileperm.sh because we ran it as root, which is undocumented. Once we ran install_qmaster, selected all the defaults (including the GID range that was suggested), and ran install_execd, the tool was ready to use. We also created an appropriate startup file for SGE. For each slave node, you need to run install_execd to install the execution environment.

Use

Run testscript at or during a specific time:

Submit job at specific time:
qsub -a datetime [-q queue] ./testscript (at specific time)
Submit job during specific time range:
Define calendar file according to calendar_conf(5):

# cat domy
calendar_name domy
year 1.1.2000=on
week 1-24=off 12-13=on

Run qconf -mq courageous.q and modify the calendar line to read calendar domy instead of calendar NONE. This allows a job to be run on between 12 and 1pm:

qsub [-q queue] ./testscript

Run testscript using a specific queue:

qsub [-q queue] ./testscript

To remove a job:

qdel jobID

To list queues:

qstat -f

Notes

SGE provides a wide range of grid and cluster control facilities that exceed the abilities of Condor. Its calendar functionality is well documented yet still confusing because any time entered is denied by default. It took some trial and error to get this correct. A job queued before the runtime window of the queue will run at the appropriate time.

Support

The documentation for SGE is quite complete and answered all our questions. There is further documentation on the SGE Web site where you can purchase the Enterprise Edition and get a few more features and enterprise-level support.

Security

SGE will support user, group, and host access control lists to a queue. Groups are set using the user-level access list control by specifying the group with the @ symbol preceding the name. SGE has a complete set of queue controls. You can also specify projects using the Enterprise version of SGE. Projects appear to be a broader scope than groups.

Matrix of Programs

Grading

Each system was graded using a 5-point scale with 5 being the highest. A 0 score implies that the feature was not available at all.

Installation

The measurement is based on the complexity of the install (e.g., were any manual modifications required), code changes, new scripts, and any other changes that were required to get a basic queue to work. In this case, the use of an RPM command to do everything was considered to be a 5. Modification of code was considered to be a 1.

SimpleQ

How difficult was it to set up and run the commands to submit a simple job to a queue? We looked at general job management within a queue. A 1 would be that the job could never be submitted. A 5 was considered to have many job management capabilities.

TimedQ

We have two options under TimedQ. The first was the ability to submit a job at a given time and the other was the ability to schedule jobs to dispatch and run during a specific time. A 0 would imply that there is no way to do either, while a 5 would imply that both mechanisms are available and the jobs would be suspended if the job ran outside the runtime window. Note that we were looking for facilities for timed base queues inside the tool, not the use of cron to make this work.

MultiNode

Can you monitor the load and other resources on multiple nodes for submittal of jobs via the system? A 0 implies there is no mechanism, while a 5 implies that there is and that the steps to do so were not exceedingly complex (e.g., no queue-specific setups to make it work). We did not verify these steps and only comment on what the documentation states.

Support

How good is the support for the tool? A 1 implies we had to read the source to figure things out, and a 5 implies that questions were answered immediately and the support was also proactive.

Security

What queue security was available as a part of the tool? A 1 implies no security at the queue level, while a 5 implies that host, user, group, and resource access controls are available. No queuing agent will run an arbitrary job as anything but the user who submitted the job.

Score

We added up the numbers and divided by 6 to get an unweighted score from 0-5. (See Table 1.)

Conclusion

We were not surprised by the results except for GNU Queue. GNU Queue is a very simple queuing agent that runs jobs based on system load. What surprised us was how easy the code was to read and understand. Although the code is not the best documentation for any system, in this case it did assist us in getting our tests to run. GNU Queue would never be a good choice for someone who needs a well-supported and vetted utility. Furthermore, we understand that just about all the functionality for each of these tests can be performed with a little shell scripting, with the possible exception of security. For that, you need to be a little more creative and hide programs, and deny access based on user and groups. However, that's not what we wanted to know about each tool. We wanted to know whether it was a self-contained unit and whether it could do it all.

Clusterware/LSF, OpenPBS, and SGE all have numerous features and, again, we were not surprised by the scores. We did happen to find that Clusterware/LSF has a more easily understood dispatch/run timed queue capability and much better support. Also, we found support in LSF for working with NQS, and documentation within Condor for using LSF to perform a wider range of features. OpenPBS mentions a way to convert NQS scripts to OpenPBS, while SGE does not mention any other package. Features of interoperability will help in the larger world where you have multiple clusters using slightly different tools.

In our tests, we learned much about these queuing agents and we think the choice of one over another depends on your own capabilities and needs. Each tool meets a need, and the specific needs for your cluster should be your first consideration when choosing a tool. Although at(1) may be sufficient for some systems, others may need the resources, job interlock checking, or queue security of another tool.

Edward L. Haletky graduated from Purdue University in 1988 with a degree in Aeronautical and Astronautical Engineering. Since then, he has worked programming graphics and other lower-level libraries on various Unix platforms. He currently works for Hewlett-Packard in the High Performance Technical Computing team and as a security consultant for the virtual office community.

Patrick M. Lampert graduated from the University of Massachusetts in 1981 with a degree in Applied Mathematics. Since then, he has worked in software development and support with various platforms. He currently works for Hewlett Packard Corporation in the High Performance Computing Expert Team, supporting development tools, and Tru64 UNIX and Linux Clustering technologies.