Job Submit Templates

I modified the job submit template on OpenLava Web to enable custom job submission forms. They are simple to implement: subclass OLWSubmit and include only the fields you actually need.

Job Submission Template for Consume Resources

If you want custom fields, you can create them, but you must override _get_args() and return a dictionary of arguments to pass to Submit. This is generally useful when you need to build a complex command for Submit from the values of several fields.

The form will automatically show up in the Submit drop-down list; define friendly_name as a class attribute to set the text displayed in the drop-down. The form's submit() method is passed any JSON data that was posted and can be overridden if you need to manipulate it. If you don't, the default for a JSON request is to pass the data directly to the submit function.
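
Putting that together, a minimal sketch of a custom form might look something like the following. Only OLWSubmit, friendly_name, and _get_args() come from the description above; the import paths, the Django-style field definitions, and the keyword names passed to Submit are assumptions for illustration.

# Hypothetical sketch: import paths, form fields, and Submit keyword names are
# assumptions; OLWSubmit, friendly_name and _get_args() are described above.
from django import forms                        # assuming Django-style fields
from olweb.forms import OLWSubmit               # import path is an assumption

class ConsumeResourcesSubmit(OLWSubmit):
    # Text shown in the Submit drop-down list
    friendly_name = "Consume Resources"

    # Hypothetical fields specific to this job type
    memory_mb = forms.IntegerField(label="Memory (MB)")
    num_cpus = forms.IntegerField(label="CPUs")

    def _get_args(self):
        # Build the dictionary of arguments passed to Submit, assembling a
        # more complex command line from the individual form fields.
        command = "consume --mem %d --cpus %d" % (
            self.cleaned_data["memory_mb"],
            self.cleaned_data["num_cpus"],
        )
        return {
            "command": command,
            "num_processors": self.cleaned_data["num_cpus"],
        }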

Web Interface for OpenLava

OpenLava is a GPL fork of LSF 4, and as such is a pretty fully featured job scheduler.  I use LSF frequently in my day job, and use OpenLava as a substitute when I just need a familiar environment to test things out.  I felt OpenLava lacked a good scripting interface, as (like LSF) it provides only a C API for job management, so I wrote some Python bindings for it.

To test the bindings, I wrote a web interface to OpenLava that lets users view detailed information on jobs, queues, users, and hosts, and lets administrators perform management activities such as opening and closing queues and hosts.  Users can also submit new jobs using a web form.

In addition to the web interface, there is a RESTful API that can be used to request information programmatically.  To exercise the API, I wrote a client tools package that hides the complexity of dealing with the remote cluster, along with a set of command line tools that mimic the behavior of the OpenLava CLI, enabling remote submission and management of the cluster.
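
As a rough illustration, fetching job information over the API from Python might look like this; the server address, endpoint path, response structure, and field names are assumptions for illustration, not the documented API.

# Illustrative only: URL, endpoint path and field names are assumptions, and
# the endpoint is assumed to return a JSON list of job objects.
import requests

base_url = "http://cluster.example.com:8080"          # hypothetical server
r = requests.get(base_url + "/olweb/jobs/",           # hypothetical endpoint
                 headers={"Accept": "application/json"})
r.raise_for_status()

for job in r.json():
    # Field names are placeholders for whatever the API actually returns.
    print job.get("job_id"), job.get("status")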

You can view a demo of the web interface, view the documentation, or download the server or client tools from GitHub.

LSF Queue Statistics

Something that is rarely discussed when talking about cluster efficiency is the amount of CPU time that is wasted.  By wasted, I mean a job that runs but does not run to completion.  CPU time can be wasted in a number of ways: as an end user, I can let my job run for a week before realizing that my input file had the wrong value; I cancel the job, but I've essentially wasted a week of CPU without producing anything useful.

As an administrator, I can kill a job, either by accident or out of necessity, but assuming the job cannot checkpoint, that is still CPU time that was unproductive.

Hardware failures can also waste huge amounts of resources.  The aim of the game may be commodity computing, but a node failing three weeks into a 128-core job can really dent efficiency: that is roughly 2,700 core-days of compute lost.  To highlight this I wrote a script using the LSF accounting library.  It produces stats that show, per queue and per user, how much time jobs have run for based on their exit status.  The output looks like this:

Name: batchq
 Total Jobs:      20000
 Failed Jobs:     297
 Total Wait Time: 719 days, 8:56:32
 Total Wall Time: 407 days, 21:27:31
 Total CPU Time:  407 days, 21:27:31
 Total Terminated CPU Time: 3 days, 13:51:15
Name: joe
 Total Jobs:      4144
 Failed Jobs:     294
 Total Wait Time: 687 days, 20:27:17
 Total Wall Time: 388 days, 7:29:02
 Total CPU Time:  388 days, 7:29:02
 Total Terminated CPU Time: 3 days, 13:51:15
Name: fred
 Total Jobs:      3
 Failed Jobs:     2
 Total Wait Time: 5:46:34
 Total Wall Time: 0:10:56
 Total CPU Time:  0:10:56
 Total Terminated CPU Time: 0:00:00
Name: barney
 Total Jobs:      4
 Failed Jobs:     0
 Total Wait Time: 0:13:41
 Total Wall Time: 0:14:31
 Total CPU Time:  0:14:31
 Total Terminated CPU Time: 0:00:00

It works quite simply: it iterates through the accounting file, adding up the wall time, CPU time and so on, and checks the termination status of each job; if that status is not zero, the job is added to the failed job count.

import datetime

# AcctFile and acctf (the LSF accounting file, typically lsb.acct) come from
# the LSF Python tools referenced at the end of this post.

# Dictionary containing an entry for each queue, which is in itself a dictionary
# containing the stats for the queue
qs={}
# Equivalent dictionary for the per-user stats
us={}

for i in AcctFile(acctf):
    # If the queue does not have an entry in the dictionary, then create
    # one now.
    if not i.queue in qs:
        qs[i.queue]={
                'name':i.queue,
                'numJobs':0,
                'numFJobs':0,
                'waitTime':datetime.timedelta(0),
                'runTime':datetime.timedelta(0),
                'wallTime':datetime.timedelta(0),
                'wasteTime':datetime.timedelta(0),
                }
    # Based on the queue, increment the timers and counters accordingly
    # increment the number of jobs
    qs[i.queue]['numJobs']+=1
    # Add the time the job had to wait before it was started
    qs[i.queue]['waitTime']+=i.waitTime
    # Work out the CPU time, this is the wall clock time multiplied by the
    # number of slots.
    qs[i.queue]['runTime']+=(i.numProcessors*i.runTime)
    # Add the wall clock time
    qs[i.queue]['wallTime']+=i.runTime
    # If the terminfo number is >0, then it was not a normal exit status.  Add
    # the cpu time to the wasted time.
    if i.termInfo.number>0:
        qs[i.queue]['wasteTime']+=(i.numProcessors*i.runTime)
        qs[i.queue]['numFJobs']+=1
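
The per-user figures in the sample output are accumulated in the us dictionary in the same way.  That part is not shown above, but the equivalent bookkeeping inside the same loop would look something like the sketch below, assuming the accounting record exposes the submitting user as i.user (the attribute name is an assumption).

    # Sketch: mirror the per-queue stats for each user.  i.user is an assumed
    # attribute name for the job's submitting user.
    if not i.user in us:
        us[i.user]={
                'name':i.user,
                'numJobs':0,
                'numFJobs':0,
                'waitTime':datetime.timedelta(0),
                'runTime':datetime.timedelta(0),
                'wallTime':datetime.timedelta(0),
                'wasteTime':datetime.timedelta(0),
                }
    us[i.user]['numJobs']+=1
    us[i.user]['waitTime']+=i.waitTime
    us[i.user]['runTime']+=(i.numProcessors*i.runTime)
    us[i.user]['wallTime']+=i.runTime
    if i.termInfo.number>0:
        us[i.user]['wasteTime']+=(i.numProcessors*i.runTime)
        us[i.user]['numFJobs']+=1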

Once the stats are gathered, it's just a case of pretty-printing them:

# Print out a summary per queue.
for q in qs.values():
    print "Name: %s" % q['name']
    print " Total Jobs:      %d" % q['numJobs']
    print " Failed Jobs:     %d" % q['numFJobs']
    print " Total Wait Time: %s" % q['waitTime']
    print " Total Wall Time: %s" % q['wallTime']
    print " Total CPU Time:  %s" % q['runTime']
    print " Total Terminated CPU Time: %s" %q['wasteTime']

You can download jobStats.py directly, and you will also need the LSF Python tools.