Wednesday, March 28, 2012

iobsub: get the compute next to the storage and go faster

We have been working with Chris Smith over at distributedbio.com to implement a nifty Openlava storage platform. With Chris' help we have built an "iobsub" inside of Openlava that lets us carefully place workloads based on your current working directory. I've talked about Openlava before; this post provides an update on a neat trick we have been trying out.

First off, source your environment:
[jcuff@iliadaccess03 ~]$ source /opt/openlava-2.0/etc/openlava-client.sh

Now check the queues:
[jcuff@iliadaccess03 ~]$ bqueues
QUEUE_NAME  PRIO STATUS       MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP 
storage     30   Open:Active   -    -    -    -     0     0     0     0

You can now use iobsub:
[jcuff@iliadaccess03 ~]$ iobsub -q -I uname -a
Cannot find fileserver for current working directory.

Oops - silly me, I'm in my home directory, which is on an EMC NS960; there will be no scheduling of jobs on that puppy ;-). Let's change directories to a server I know has compute attached to it...

Check it out: the code knows what directory you are in and finds the appropriate machine to launch on. No hosts need be selected - it is all automagic!
[jcuff@iliadaccess03 ~]$ cd /n/rcss1

[jcuff@iliadaccess03 rcss1]$ iobsub uname -a
Submitting job to run on fileserver rcss1
Job <3734> is submitted to default queue <storage>.

[jcuff@iliadaccess03 rcss1]$ bjobs -d
JOBID USER  STAT  QUEUE    FROM_HOST   EXEC_HOST JOB_NAME SUBMIT_TIME
3734  jcuff DONE  storage  iliadaccess rcss1     uname -a Mar 28 10:55
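
For the curious, the core of a wrapper like this is small: map the working directory to the fileserver that backs it, then hand bsub a host. Here's a minimal sketch of the idea (not our actual code; it assumes the filesystem is NFS mounted as server:/path):

#!/bin/bash
# iobsub sketch: submit the given command to the fileserver behind $PWD.
mount=$(df -P . | awk 'NR==2 {print $1}')
case "$mount" in
  *:*) fs=${mount%%:*} ;;                # NFS mount looks like server:/export/path
  *)   echo "Cannot find fileserver for current working directory." >&2
       exit 1 ;;
esac
echo "Submitting job to run on fileserver $fs"
exec bsub -q storage -m "$fs" "$@"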

The "ut", "bio", and "rpc" load indices make up the secret sauce: they show how heavily each box is loaded and drive the scheduling, and a job starter places things appropriately. Here's a snap from lsload:
[jcuff@iliadaccess03 jcuff]$ lsload -l | awk '{print $6"\t"$14"\t"$15}'
ut    bio     rpc
0%    9.6     0.4
0%    0.6     0.1
0%    0.8     0.0
5%    176.9   272.5
0%    8.1     0.3
0%    78.8    68.9
0%    3.7     2.4
1%    479.1   1900.2
9%    424.6   26.1
0%    0.1     0.0
0%    5.3     0.9
11%   269.1   0.3
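
Those bio and rpc columns are external load indices. In LSF-derived schedulers like Openlava they are fed to the LIM by an ELIM script on each host, which loops forever printing the number of indices followed by name/value pairs. A rough sketch of the shape of such a script (how we actually sample block IO and NFS RPC rates is an assumption here):

#!/bin/bash
# ELIM sketch: report "bio" and "rpc" load indices every 15 seconds.
prev=$(awk '/^rpc/ {print $2}' /proc/net/rpc/nfsd)
while true; do
  sleep 15
  # bio: blocks in + out per second (second vmstat report is a 1s average)
  bio=$(vmstat 1 2 | tail -1 | awk '{print $9 + $10}')
  # rpc: NFS server RPC calls per second since the last sample
  cur=$(awk '/^rpc/ {print $2}' /proc/net/rpc/nfsd)
  rpc=$(( (cur - prev) / 15 ))
  prev=$cur
  # ELIM protocol: <number of indices> <name1> <value1> <name2> <value2>
  echo "2 bio $bio rpc $rpc"
done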

That's pretty sweet, eh? Ok, time to do some "useful" work. Here's an IO test from the NFS system: gzip a 1G file generated from a dd of /dev/urandom...
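(For reference, the 1G test file came from something along these lines; the exact dd flags are assumed:)
dd if=/dev/urandom of=test.dat bs=1M count=1024
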
[jcuff@iliadaccess03 jcuff]$ cat go.sh
/usr/bin/time gzip -c test.dat > test.dat.gz

[jcuff@iliadaccess03 jcuff]$ time ./go.sh

real 1m20.787s
user 0m42.239s
sys 0m2.154s

Compared to our nifty iobsub:
[jcuff@iliadaccess03 jcuff]$ iobsub -o out.dat go.sh
Submitting job to run on fileserver rcss1
Job <3735> is submitted to default queue <storage>.

real 0m58.412s
user 0m56.946s
sys 0m1.457s

22 seconds faster, but fair enough: gzip is pretty heavily CPU bound. So we ran another experiment with bwa, which, if you do it wrong, can really pound on NFS storage! We set up 50 individual bwa jobs and ran them over NFS, then did the same through Openlava and our new iobsub. The jobs in the Openlava test ran directly on the storage, but only 8 at any one time (there are only 8 procs on the storage box). The other test we let run 50-way parallel over NFS - more CPUs should be faster, right? Turns out that for IO bound code (as we all know) this is certainly not the case.
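
The Openlava submission was nothing fancier than a loop along these lines (a sketch; run_bwa.sh is a hypothetical wrapper that aligns one sample with bwa):

cd /n/rcss1
for i in $(seq 1 50); do
  iobsub -o bwa.$i.log ./run_bwa.sh sample_$i.fastq
done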

Less in this case certainly means more!

Then we looked at the total time it took from start to finish for all 50 jobs to complete on both versions of the system:

NFS 50 way //: 63 min
Openlava 8 cores: 45 min

So that's an 18 minute difference favoring Openlava, an 18/63 ≈ 28.5% decrease in runtime. Not bad. It confirms our result: controlled, scheduled jobs running directly attached to the storage, even with only 8 procs, beat a 50-way parallel free-for-all over NFS!

We have had a few of our friendly researchers work on this, and most of them are seeing the benefit. We will work with David Bigagli to get our code folded into the distribution so all can benefit from "iobsub".

So far this looks like a great win for Openlava and for us!


[any opinions here are all mine, and have absolutely nothing to do with my employer]
(c) 2012 James Cuff