LSF Queue Statistics

Something that is rarely discussed when talking about cluster efficiency is the amount of CPU time that is wasted.  By wasted, I mean CPU time consumed by a job that starts running but does not run to completion.  CPU time can be wasted in a number of ways.  As an end user, I can let my job run for a week before realizing that my input file had the wrong value.  I cancel the job, but I’ve essentially wasted a week of CPU without producing anything useful.

As an administrator, I can kill a job, either by accident or out of necessity, but unless the job can checkpoint, that is still CPU time that was unproductive.

Hardware failures can also waste huge amounts of resources.  The aim of the game may be commodity computing, but a node failing 3 weeks into a 128-core job can really dent efficiency.  To highlight this, I wrote a script using the LSF accounting library.  It produces some stats that show, per queue and per user, how much time jobs have run for, based on their exit status.

Name: batchq
 Total Jobs:      20000
 Failed Jobs:     297
 Total Wait Time: 719 days, 8:56:32
 Total Wall Time: 407 days, 21:27:31
 Total CPU Time:  407 days, 21:27:31
 Total Terminated CPU Time: 3 days, 13:51:15
Name: joe
 Total Jobs:      4144
 Failed Jobs:     294
 Total Wait Time: 687 days, 20:27:17
 Total Wall Time: 388 days, 7:29:02
 Total CPU Time:  388 days, 7:29:02
 Total Terminated CPU Time: 3 days, 13:51:15
Name: fred
 Total Jobs:      3
 Failed Jobs:     2
 Total Wait Time: 5:46:34
 Total Wall Time: 0:10:56
 Total CPU Time:  0:10:56
 Total Terminated CPU Time: 0:00:00
Name: barney
 Total Jobs:      4
 Failed Jobs:     0
 Total Wait Time: 0:13:41
 Total Wall Time: 0:14:31
 Total CPU Time:  0:14:31
 Total Terminated CPU Time: 0:00:00

The script works quite simply: it iterates through the accounting file, adding up the wait time, wall time and CPU time of each job, and checks the job's termination status.  If the status is not zero, the job is added to the failed job count and its CPU time is added to the terminated total.

# Dictionary containing an entry for each queue, each of which is itself a
# dictionary containing the stats for that queue
qs={}
us={}

for i in AcctFile(acctf):
    # If the queue does not have an entry in the dictionary, then create
    # one now.
    if i.queue not in qs:
        qs[i.queue]={
                'name':i.queue,
                'numJobs':0,
                'numFJobs':0,
                'waitTime':datetime.timedelta(0),
                'runTime':datetime.timedelta(0),
                'wallTime':datetime.timedelta(0),
                'wasteTime':datetime.timedelta(0),
                }
    # Based on the queue, increment the timers and counters accordingly
    # increment the number of jobs
    qs[i.queue]['numJobs']+=1
    # Add the time the job had to wait before it was started
    qs[i.queue]['waitTime']+=i.waitTime
    # Approximate the CPU time as the wall clock time multiplied by the
    # number of slots.
    qs[i.queue]['runTime']+=(i.numProcessors*i.runTime)
    # Add the wall clock time
    qs[i.queue]['wallTime']+=i.runTime
    # If the terminfo number is >0, then it was not a normal exit status.  Add
    # the cpu time to the wasted time.
    if i.termInfo.number>0:
        qs[i.queue]['wasteTime']+=(i.numProcessors*i.runTime)
        qs[i.queue]['numFJobs']+=1
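
The sample output also lists per-user totals, which the loop above does not show. As a rough, self-contained sketch of how the same accumulation could be keyed by user: the `Job` record, its `user` and `termNumber` fields, and the helper names below are hypothetical stand-ins for illustration, not the real `AcctFile` record attributes.

```python
import datetime
from collections import namedtuple

# Hypothetical stand-in for an accounting record. The real AcctFile
# records may name these fields differently.
Job = namedtuple('Job', 'user numProcessors waitTime runTime termNumber')


def new_stats(name):
    """Return an empty stats dictionary with the same keys as the queue stats."""
    zero = datetime.timedelta(0)
    return {'name': name, 'numJobs': 0, 'numFJobs': 0,
            'waitTime': zero, 'runTime': zero,
            'wallTime': zero, 'wasteTime': zero}


def summarise_by_user(jobs):
    """Accumulate the counters and timers per user."""
    us = {}
    for j in jobs:
        if j.user not in us:
            us[j.user] = new_stats(j.user)
        s = us[j.user]
        cpu = j.numProcessors * j.runTime   # slots multiplied by wall clock time
        s['numJobs'] += 1
        s['waitTime'] += j.waitTime
        s['wallTime'] += j.runTime
        s['runTime'] += cpu
        if j.termNumber > 0:                # not a normal exit
            s['wasteTime'] += cpu
            s['numFJobs'] += 1
    return us


def minutes(m):
    return datetime.timedelta(minutes=m)


# Made-up records, purely to exercise the function.
jobs = [
    Job('joe', 4, minutes(5), minutes(10), 0),
    Job('joe', 2, minutes(1), minutes(20), 1),
    Job('fred', 1, minutes(2), minutes(3), 0),
]
us = summarise_by_user(jobs)
```

Because the stats dictionaries have the same keys as the queue ones, the same printing loop would work on `us.values()` as well.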

Once the stats are gathered, it's just a case of pretty printing them out:

# Print out a summary per queue.
for q in qs.values():
    print("Name: %s" % q['name'])
    print(" Total Jobs:      %d" % q['numJobs'])
    print(" Failed Jobs:     %d" % q['numFJobs'])
    print(" Total Wait Time: %s" % q['waitTime'])
    print(" Total Wall Time: %s" % q['wallTime'])
    print(" Total CPU Time:  %s" % q['runTime'])
    print(" Total Terminated CPU Time: %s" % q['wasteTime'])
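
The raw totals are easier to compare across queues as a percentage. A small sketch of that arithmetic, using the batchq figures from the sample output above; `wasted_fraction` is a hypothetical helper and not part of jobStats.py.

```python
import datetime


def wasted_fraction(waste_time, cpu_time):
    """Fraction of CPU time spent in jobs that did not exit normally."""
    cpu_seconds = cpu_time.total_seconds()
    if cpu_seconds == 0:
        return 0.0   # avoid dividing by zero when nothing has run
    return waste_time.total_seconds() / cpu_seconds


# Totals copied from the batchq entry in the sample output.
cpu = datetime.timedelta(days=407, hours=21, minutes=27, seconds=31)
waste = datetime.timedelta(days=3, hours=13, minutes=51, seconds=15)

pct = 100.0 * wasted_fraction(waste, cpu)
print("batchq wasted CPU: %.2f%%" % pct)
```

For batchq this comes out at a little under 1% of the total CPU time.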

You can download jobStats.py directly, and you will also need the LSF Python tools.