Openlava is a GPL fork of LSF 4, and as such is a fairly fully featured job scheduler.  I frequently use LSF in my day job, and use Openlava as a substitute when I just need a familiar environment to test things out.  I felt Openlava lacked a good scripting interface, as (like LSF) it provides only a C API for job management, so I wrote some Python bindings for Openlava.

To test it, I wrote a web interface to Openlava that enables users to view detailed information on jobs, queues, users, and hosts, and enables administrators to perform management activities such as opening and closing queues and hosts.  Users can also submit new jobs using a web form.

In addition to the web interface, there is a RESTful API that can be used to request information programmatically.  To test the API, I wrote a client tools package that hides the complexity of dealing with the remote cluster, and a set of command line tools that mimic the behavior of the Openlava CLI, enabling remote submission and management of the cluster.
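
To give a flavor of what using the API from a script looks like, here is a minimal sketch.  The host name, endpoint path, and field names are purely illustrative (the real URL scheme and payload layout are covered in the documentation); the point is that it is just plain HTTP and JSON:

import json
import urllib2

# Endpoint and field names here are illustrative, not the real API;
# see the documentation for the actual URL scheme and payload layout.
url = "http://cluster.example.com/api/jobs/"
for job in json.load(urllib2.urlopen(url)):
    print job["job_id"], job["status"]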

You can view a demo of the web interface, view the documentation, or download the server or client tools from Github.

I have a pretty accurate drawing of the Seven, and from the drawings I've created models in OpenFOAM.  I want to do the same for the Fury; however, I don't have drawings of the bodywork.  Drawing it freehand in Solidworks would take me forever, so I decided to try scanning the car using an Xbox Kinect and the OpenKinect libraries.

The first thing you need to do is get a point cloud from the Xbox Kinect, and this is not quite as straightforward as it seems. First, sync_get_depth() returns a disparity image, not cartesian points, so some conversion is required.  The OpenKinect wiki has most of the background information required.  The calibkinect library can do most of the calibration work, and the result can be formatted into whatever format suits you best.  I chose Wavefront's OBJ format as it stores the point normals and can be read by both MeshLab and ParaView.
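
To show what is involved, here is a minimal sketch of the conversion.  The calibration constants are the commonly quoted ones from the OpenKinect wiki rather than values calibrated for my device, and in practice calibkinect does this work for you:

import numpy as np
import freenect

# Commonly quoted calibration constants from the OpenKinect wiki; values
# calibrated for your own device (e.g. with calibkinect) will differ slightly.
FX, FY = 594.21, 591.04
CX, CY = 339.5, 242.7

def depth_to_points(raw):
    # Approximate conversion from the 11-bit disparity values to metres.
    z = 1.0 / (raw * -0.0030711016 + 3.3309495161)
    # Project each pixel into cartesian space using a pinhole camera model.
    v, u = np.mgrid[0:480, 0:640]
    x = (u - CX) * z / FX
    y = (v - CY) * z / FY
    # Drop pixels with no return; a raw value of 2047 means "no reading".
    valid = (raw < 2047) & (z > 0)
    return np.dstack((x, y, z))[valid]

raw, _ = freenect.sync_get_depth()
points = depth_to_points(raw)

# Dump the cloud as Wavefront vertices, readable by MeshLab and ParaView
# (normals omitted for brevity).
out = open("scan.obj", "w")
for px, py, pz in points:
    out.write("v %f %f %f\n" % (px, py, pz))
out.close()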

Getting a single disparity image is quite easy; stitching many together, however, is more challenging.  Fundamentally, in order to scan an object you need to know the following:

  • Where the camera is, relative to a fixed point in space.
  • Which direction the camera is pointing.

Assuming you know these two things, you can apply a fairly simple matrix transformation to move the points into the appropriate location, as shown below. Initially I tried to find a clever way to triangulate the position and direction of the Kinect, as it would be great to be able to use it like a wand, but this requires significant accuracy and the only sensor on board is a tilt sensor.  GPS won't cut it.  I had some ideas about using a bunch of cameras or ultrasonic sensors, but in the end I shelved that idea for now.
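
The transform itself is nothing exotic.  Here is a minimal sketch, assuming the Kinect is kept level on the tripod so each shot only differs by a yaw rotation and a measured translation:

import numpy as np

def transform_points(points, yaw_deg, translation):
    # Rotation about the vertical (y) axis; assumes the camera stays level.
    t = np.radians(yaw_deg)
    R = np.array([[ np.cos(t), 0.0, np.sin(t)],
                  [ 0.0,       1.0, 0.0      ],
                  [-np.sin(t), 0.0, np.cos(t)]])
    # Rotate each point, then shift by the camera's measured position.
    return points.dot(R.T) + np.asarray(translation)

# e.g. a shot taken one metre further along the side of the car,
# facing the same way as the first:
# cloud = transform_points(cloud, 0.0, [1.0, 0.0, 0.0])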

Instead I decided the easiest way to get a usable scan was to do things the old fashioned way.  I only need to scan half the car, as it's symmetrical, so I marked out a box around the car, which allowed me to position the camera at known locations.  I used a tripod to hold the Kinect, and took images at one meter intervals from about one and a half meters away from the car.

I then cleaned each mesh in MeshLab so that I had just the car and no background.  I then loaded each one into ParaView and applied transformations to each file until they were lined up, based on the measurements I'd taken earlier.  This worked pretty well, but required some manual adjustment for the first image in each direction.

The end result is good enough for a first attempt.  I’ll load the file into Solidworks and start drawing.

OpenKinect is an open source suite for interfacing with the Xbox Kinect.  The instructions for building on OS X were a little vague, and the Python bindings didn't build out of the box, so this is how I built it.  Building libfreenect is mostly the same as in the documentation; some paths differ, but it's fairly obvious what is required.

  1. Download and install Xcode. First you need to purchase Xcode from the App Store; it will then download, and in the Applications folder you will have an app called Install Xcode.  Run the installer and it will install the toolchain, compilers, IDE, etc.  You need this so you can compile packages.
  2. Install MacPorts; details on how to do so can be found on the MacPorts web site.
  3. Use port to install Cython and NumPy:
    sudo port install py27-cython py27-numpy
  4. Use port to install git, libusb, libtool and cmake:
    sudo port install git-core
    sudo port install libusb-devel
    sudo port install libtool
    sudo port install cmake
  5. Download libfreenect using git.  This will create a new directory called libfreenect:
    git clone git://github.com/OpenKinect/libfreenect.git
  6. Change into this directory, create a new directory called build, and change into it.
    cd libfreenect
    mkdir build
    cd build
  7. Run ccmake.  This will pop up a curses-based menu; hit c to configure, change any paths that are required, then hit g to generate.
    ccmake ..
  8. Generate the makefile, then build and install libfreenect.
    cmake ..
    make
    sudo make install

Now that libfreenect is installed, it's time to build the Python bindings.  Change into the libfreenect/wrappers/python directory and build it like you would any other Python package; you do, however, need to make sure your library and include paths are set correctly.

  1. Change into the python wrappers dir.
    cd ../wrappers/python
  2. Modify setup.py as follows: add /opt/local/lib to the runtime_library_dirs array, and add /opt/local/include and /opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/numpy/core/include to the extra_compile_args array (see the sketch after this list).
  3. Run python setup.py install as normal
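
For reference, the relevant fragment of my modified setup.py looked roughly like this.  Treat it as a sketch: the stock file's contents vary between libfreenect versions, and only the /opt/local paths are the MacPorts-specific additions described above.

# Sketch of the modified Extension; the stock setup.py may differ in detail.
freenect = Extension('freenect',
        sources=['freenect.pyx'],
        libraries=['usb-1.0', 'freenect', 'freenect_sync'],
        # MacPorts puts its libraries under /opt/local/lib:
        runtime_library_dirs=['/usr/local/lib', '/opt/local/lib'],
        extra_compile_args=['-fPIC',
            # MacPorts headers, plus numpy's headers from the py27-numpy port:
            '-I/opt/local/include',
            '-I/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/numpy/core/include'])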

Test the installation by starting python and importing freenect.
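
With the Kinect plugged in, grabbing a depth frame makes a quick smoke test; sync_get_depth() returns the frame and a timestamp:

>>> import freenect
>>> depth, timestamp = freenect.sync_get_depth()
>>> depth.shape
(480, 640)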

Something that is rarely discussed when talking about cluster efficiency is the amount of CPU time that is wasted.  By wasted, I mean a job that runs but does not run to completion.  CPU time can be wasted in a number of ways: as an end user, I can let my job run for a week before realizing that my input file had the wrong value; I cancel the job, but I've essentially wasted a week of CPU without producing anything useful.

As an administrator, I can kill a job, either by accident or out of necessity, but assuming the job cannot checkpoint, that is still CPU time that was unproductive.

Hardware failures can also waste huge amounts of resources.  The aim of the game may be commodity computing, but a node failing three weeks into a 128-core job can really dent efficiency.  To highlight this I wrote a script using the LSF accounting library; it produces stats that show, per queue and per user, how long jobs have run for based on their exit status.

Name: batchq
 Total Jobs:      20000
 Failed Jobs:     297
 Total Wait Time: 719 days, 8:56:32
 Total Wall Time: 407 days, 21:27:31
 Total CPU Time:  407 days, 21:27:31
 Total Terminated CPU Time: 3 days, 13:51:15
Name: joe
 Total Jobs:      4144
 Failed Jobs:     294
 Total Wait Time: 687 days, 20:27:17
 Total Wall Time: 388 days, 7:29:02
 Total CPU Time:  388 days, 7:29:02
 Total Terminated CPU Time: 3 days, 13:51:15
Name: fred
 Total Jobs:      3
 Failed Jobs:     2
 Total Wait Time: 5:46:34
 Total Wall Time: 0:10:56
 Total CPU Time:  0:10:56
 Total Terminated CPU Time: 0:00:00
Name: barney
 Total Jobs:      4
 Failed Jobs:     0
 Total Wait Time: 0:13:41
 Total Wall Time: 0:14:31
 Total CPU Time:  0:14:31
 Total Terminated CPU Time: 0:00:00

It works quite simply: it iterates through the accounting file, adding up the wait, wall and CPU times, and checks the termination status of each job; if it is not zero, the job counts as failed and its CPU time is added to the wasted total.

import datetime

# AcctFile comes from the LSF Python tools linked at the end of this post;
# acctf is the path to the cluster's lsb.acct accounting file.

# Dictionaries with an entry per queue (qs) and per user (us), each of which
# is itself a dictionary containing the stats for that queue or user.  Only
# the queue half is shown below; us is filled in the same way, keyed on the
# job's user name.
qs={}
us={}

for i in AcctFile(acctf):
    # If the queue does not have an entry in the dictionary, then create
    # one now.
    if i.queue not in qs:
        qs[i.queue]={
                'name':i.queue,
                'numJobs':0,
                'numFJobs':0,
                'waitTime':datetime.timedelta(0),
                'runTime':datetime.timedelta(0),
                'wallTime':datetime.timedelta(0),
                'wasteTime':datetime.timedelta(0),
                }
    # Based on the queue, increment the timers and counters accordingly
    # increment the number of jobs
    qs[i.queue]['numJobs']+=1
    # Add the time the job had to wait before it was started
    qs[i.queue]['waitTime']+=i.waitTime
    # Work out the CPU time, this is the wall clock time multiplied by the
    # number of slots.
    qs[i.queue]['runTime']+=(i.numProcessors*i.runTime)
    # Add the wall clock time
    qs[i.queue]['wallTime']+=i.runTime
    # If the terminfo number is >0, then it was not a normal exit status.  Add
    # the cpu time to the wasted time.
    if i.termInfo.number>0:
        qs[i.queue]['wasteTime']+=(i.numProcessors*i.runTime)
        qs[i.queue]['numFJobs']+=1

Once the stats are gathered, it's just a case of pretty printing them out:

# Print out a summary per queue.
for q in qs.values():
    print "Name: %s" % q['name']
    print " Total Jobs:      %d" % q['numJobs']
    print " Failed Jobs:     %d" % q['numFJobs']
    print " Total Wait Time: %s" % q['waitTime']
    print " Total Wall Time: %s" % q['wallTime']
    print " Total CPU Time:  %s" % q['runTime']
    print " Total Terminated CPU Time: %s" % q['wasteTime']

You can download jobStats.py directly, and you will also need the LSF Python tools.