Wednesday, March 28, 2012

iobsub: get the compute next to the storage and go faster

We have been working with Chris Smith over at distributedbio.com to implement a nifty Openlava storage platform. With Chris' help we have invented an "iobsub" inside of Openlava that lets us carefully place workloads based on your current directory. I've talked about Openlava before; this post provides an update on a neat trick we have been trying out.
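The real iobsub is Chris's work inside Openlava and isn't reproduced here, but the idea is simple enough to sketch in a few lines of shell. Everything in this sketch is illustrative: the queue name, the way we discover the fileserver, and the flags passed to bsub are our guess at one way to do it, not the actual implementation.

#!/bin/bash
# toy-iobsub: illustrative sketch only, not the real iobsub.
# Idea: work out which fileserver exports the current working directory,
# then ask Openlava to run the job on that host in the "storage" queue.

fs_source=$(df -P . | awk 'NR==2 {print $1}')   # e.g. "rcss1:/rcss1" or "/dev/sdb"
server=${fs_source%%:*}                          # host part, if there is one

# no "host:/path" style source means there is no fileserver to target
if [ "$server" = "$fs_source" ]; then
    echo "Cannot find fileserver for current working directory." >&2
    exit 1
fi

# the real code also checks that the fileserver is actually an Openlava
# execution host; here we simply hand the decision over to bsub.
echo "Submitting job to run on fileserver $server"
exec bsub -q storage -m "$server" "$@"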

First off, source your environment:
[jcuff@iliadaccess03 ~]$ source /opt/openlava-2.0/etc/openlava-client.sh

Now check the queues:
[jcuff@iliadaccess03 ~]$ bqueues
QUEUE_NAME  PRIO STATUS       MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP 
storage     30   Open:Active   -    -    -    -     0     0     0     0

You can now use iobsub:
[jcuff@iliadaccess03 ~]$ iobsub -q -I uname -a
Cannot find fileserver for current working directory.

Oops, silly me: I'm in my home directory, which lives on an EMC NS960, so there will be no scheduling of jobs on that puppy ;-). Let's change directories to a server I know has compute attached to it...

Check it out: the code knows what directory you are in and finds the appropriate machine to launch on. No hosts need to be selected; it is all automagic!
[jcuff@iliadaccess03 ~]$ cd /n/rcss1

[jcuff@iliadaccess03 rcss1]$ iobsub uname -a
Submitting job to run on fileserver rcss1
Job <3734> is submitted to default queue <storage>.

[jcuff@iliadaccess03 rcss1]$ bjobs -d
JOBID USER  STAT  QUEUE    FROM_HOST   EXEC_HOST JOB_NAME SUBMIT_TIME
3734  jcuff DONE  storage  iliadaccess rcss1     uname -a Mar 28 10:55

The "ut", "bio" and "rpc" load indices make up the secret sauce: they show how heavily each box is loaded and drive the scheduling, and a job starter places things appropriately (a sketch of how the extra indices can be fed in follows the lsload snap). Here's a snap from lsload:
[jcuff@iliadaccess03 jcuff]$ lsload -l | awk '{print $6"\t"$14"\t"$15}'
ut    bio     rpc
0%    9.6     0.4
0%    0.6     0.1
0%    0.8     0.0
5%    176.9   272.5
0%    8.1     0.3
0%    78.8    68.9
0%    3.7     2.4
1%    479.1   1900.2
9%    424.6   26.1
0%    0.1     0.0
0%    5.3     0.9
11%   269.1   0.3
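We haven't published our elim (the external load information manager that feeds "bio" and "rpc" into lim), and the index names also have to be declared in lsf.shared and mapped in lsf.cluster, which isn't shown here. Purely to illustrate the mechanism, a toy elim might look roughly like this; the counter sources below are stand-ins, not what we actually sample:

#!/bin/bash
# toy elim: rough, illustrative sketch only -- not our production elim.
# An Openlava/LSF elim lives in LSF_SERVERDIR and periodically writes
# "<count> <name1> <value1> ... <nameN> <valueN>" to stdout; lim picks
# the values up and lsload reports them as external load indices.

while true; do
    # cumulative local disk traffic on sdb in MB (sectors read + written);
    # a real elim would difference these per interval to get a rate
    bio=$(awk '/sdb/ {print ($6+$10)/2048; exit}' /proc/diskstats)

    # cumulative NFS server RPC call count from the "rpc" line
    rpc=$(awk '/^rpc/ {print $2; exit}' /proc/net/rpc/nfsd)

    echo "2 bio ${bio:-0} rpc ${rpc:-0}"
    sleep 15
done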

That's pretty sweet, eh? Ok, time to do some "useful" work. Here's an I/O test over NFS: gzip a 1G file generated from a dd of /dev/urandom...
[jcuff@iliadaccess03 jcuff]$ cat go.sh
/usr/bin/time gzip -c test.dat > test.dat.gz

[jcuff@iliadaccess03 jcuff]$ time ./go.sh

real 1m20.787s
user 0m42.239s
sys 0m2.154s

Compared to our nifty iobsub:
[jcuff@iliadaccess03 jcuff]$ iobsub -o out.dat go.sh
Submitting job to run on fileserver rcss1
Job <3735> is submitted to default queue <storage>.

real 0m58.412s
user 0m56.946s
sys 0m1.457s

22 seconds faster, but fair enough: gzip is pretty heavily CPU bound. So we ran another experiment with bwa, which, if you do it wrong, can really pound on NFS storage! We set up 50 individual bwa jobs and ran them over NFS, and then did the same through Openlava and our new iobsub. The jobs in the Openlava test ran directly on the storage, but only 8 at any one time (there are only 8 procs on the storage box). The other test we let run 50-way parallel over NFS; more CPUs should be faster, right? Turns out that for IO bound code (as we all know) this is certainly not the case.

Less in this case certainly means more!
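For the curious, the iobsub half of that test amounted to a loop along these lines (the directory, file names and exact bwa invocation are made up for illustration; the real runs used our own data):

# illustrative only: 50 single-end bwa alignments, submitted from the
# directory on the rcss1 fileserver so iobsub pins them to that box
cd /n/rcss1/jcuff/bwa_test

for i in $(seq -w 1 50); do
    iobsub -o "sample_${i}.log" \
        bwa aln -f "sample_${i}.sai" ref.fa "sample_${i}.fq"
done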

Then we looked at the total time it took from start to finish for all 50 jobs to complete on both versions of the system:

NFS 50 way //: 63 min
Openlava 8 cores: 45 min

So that's an 18 minute difference favoring Openlava, and a 28.5% decrease in runtime. Not bad. It confirms our result: controlled and scheduled code running directly attached to the storage, even with only 8 procs, beats a 50-way parallel run over NFS!

We have had a few of our friendly researchers work on this, and most of them are seeing the benefit. We will work with David Bigagli to get our code folded into the distribution so all can benefit from "iobsub".

So far this looks like a great win for Openlava and for us!

Friday, March 16, 2012

local /scratch turned up to 11: thanks much mr rdma!

Update on: http://blog.jcuff.net/2012/03/glusterfs-80tb-in-seconds-flat.html

Here's another recipe:

Take six little 500G spindles, each managing about 100MB/s, on six little compute boxes, add 3oz of our new-found friend RDMA and a soupçon of IB network we just happened to find lying around the kitchen ;-)
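The create step didn't make it into this capture, but from the volume info below it would have been along these lines (reconstructed, not copied from the terminal):

# reconstructed from the "gluster volume info ibgfs" output below
gluster volume create ibgfs stripe 6 transport rdma \
    itc031:/scratch/gfs itc032:/scratch/gfs itc012:/scratch/gfs \
    itc021:/scratch/gfs itc011:/scratch/gfs itc022:/scratch/gfs
gluster volume start ibgfs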
[root@itc011 gfs]# gluster volume info ibgfs

Volume Name: ibgfs
Type: Stripe
Status: Started
Number of Bricks: 6
Transport-type: rdma
Bricks:
Brick1: itc031:/scratch/gfs
Brick2: itc032:/scratch/gfs
Brick3: itc012:/scratch/gfs
Brick4: itc021:/scratch/gfs
Brick5: itc011:/scratch/gfs
Brick6: itc022:/scratch/gfs
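The client mount isn't shown either; from the df output further down it would have been roughly the following (depending on the glusterfs version, an rdma-only volume may also need an explicit transport mount option):

# on the client (itc071), reconstructed from the df output below
mkdir -p /test
mount -t glusterfs itc022:/ibgfs /test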

We love it when things scale in mostly straight lines for both size and speed:
[root@itc071 /test]# df -H .
Filesystem             Size   Used  Avail Use% Mounted on
glusterfs#itc022:/ibgfs
                       2.8T    54G   2.6T   3% /test

[root@itc071 /test]# time dd if=/dev/zero of=test3.dat bs=1024k count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 14.1454 seconds, 741 MB/s

Thursday, March 15, 2012

glusterfs @ 80TB in seconds flat

tl;dr:
[root@rcss1 /]# df -H /rcss_gfs
Filesystem             Size   Used  Avail Use% Mounted on
rcss1:/rcss             80T   137M    80T   1% /rcss_gfs

It was all so very, very complicated!

We started off with 4 C2100s, each with 20TB of local disk:
[root@rcss1 /rcss_gfs]# df -H /rcss1
Filesystem             Size   Used  Avail Use% Mounted on
/dev/sdb                20T    11G    20T   1% /rcss1

then off we went... first install the code:
[root@rcss1 /]# yum install glusterfs-server


[root@rcss1 /]#  /etc/init.d/glusterd start
Starting glusterd:                                         [  OK  ]

[root@rcss1 /]# chkconfig glusterd on

Probe the peers, and connect them:
[root@rcss1 /]# gluster peer probe rcss2
Probe successful
[root@rcss1 /]# gluster peer probe rcss3
Probe successful
[root@rcss1 /]# gluster peer probe rcss4
Probe successful

[root@rcss1 /]# gluster peer status
Number of Peers: 3

Hostname: rcss2
Uuid: 0d1170c3-b16b-4084-9ecf-c22865fd2fa8
State: Peer in Cluster (Connected)

Hostname: rcss3
Uuid: 171362a0-d767-4fd6-a3f6-561a43fd2b69
State: Peer in Cluster (Connected)

Hostname: rcss4
Uuid: 899be5de-f560-4cc2-ac58-10a4bdaa5574
State: Peer in Cluster (Connected)

Make a filesystem; we are striping here:
[root@rcss1 /]# gluster volume create rcss stripe 4 transport tcp rcss1:/rcss1 rcss2:/rcss2 rcss3:/rcss3 rcss4:/rcss4
Creation of volume rcss has been successful. 
Please start the volume to access data.

[root@rcss1 /]# gluster volume info

Volume Name: rcss
Type: Stripe
Status: Created
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: rcss1:/rcss1
Brick2: rcss2:/rcss2
Brick3: rcss3:/rcss3
Brick4: rcss4:/rcss4


[root@rcss1 /]# gluster volume start rcss
Starting volume rcss has been successful


[root@rcss1 /]# mkdir /rcss_gfs

[root@rcss1 /]# mount -t glusterfs rcss1:/rcss /rcss_gfs/

[root@rcss1 /]# df -H /rcss_gfs
Filesystem             Size   Used  Avail Use% Mounted on
rcss1:/rcss             80T   137M    80T   1% /rcss_gfs
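If you want that mount to come back after a reboot, the usual trick is an fstab entry along these lines (not part of the original recipe):

# /etc/fstab entry for the gluster client mount (illustrative)
rcss1:/rcss  /rcss_gfs  glusterfs  defaults,_netdev  0 0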

Test it:
[root@rcss1 /rcss_gfs]# dd if=/dev/zero of=test2.dat bs=1024k count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 77.4103 s, 135 MB/s

It took longer to write this posting than to build a clustered filesystem.

Update... client testing: 6 clients manage to fill the inter tubes ;-)





[any opinions here are all mine, and have absolutely nothing to do with my employer]
(c) 2011 James Cuff