Wednesday, December 17, 2014

Of big microscopes and even bigger data...

We recently installed one of these awesome electron microscopes... In the center I help PI, we are imaging brains, but more about that another time. Right now this is all about getting this thing running, and running at speed, and some lovely UNIX geekery... I don't get anywhere near enough time these days to get my paws on a CLI, but I needed to stick my nose into this one!


It's extremely cool looking, eh? However, it needs a fair amount of horsepower just to "catch" the data that streams off it.  It is also a scientific instrument, so of course the file system ends up being more than a little bit hairy.  For reference, here is the output of a single "run":
[root@storage]# du -sh .
6.6T .

[root@storage]# ls -RU | wc -l
1,425,686

[root@storage]# find -P . -type f | rev | cut -d/ -f2- | rev |  cut -d/ -f1-2 | cut -d/ -f2- | sort | uniq -c
  65226 001
  62994 002
  67458 003
  67954 004
  65226 005
  62994 006
  67458 007
  67954 008
  65226 009
  62994 010
  67458 011
  67954 012
  65226 013
  62994 014
  67458 015
  67954 016
  65226 017
  62994 018
  67458 019
  67954 020
  65226 021
   8559 022

So about 1.4 million files in 6.6T with ca. 65,000 per dir and each image is about 612K...
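
As a quick cross-check on that file count, the per-directory numbers from the `uniq -c` listing can be totalled with a little awk. This is only a sketch: `total_files` is a hypothetical helper name, not something from the original run. Fed the listing above, it gives 1391945.

```shell
# Sum the first column of `uniq -c`-style output read from stdin.
# total_files is a hypothetical helper name used for illustration only.
total_files() {
    awk '{ total += $1 } END { print total + 0 }'
}

# Hypothetical usage inside a run directory (same pipeline as above):
#   find -P . -type f | rev | cut -d/ -f2- | rev | cut -d/ -f1-2 \
#       | cut -d/ -f2- | sort | uniq -c | total_files
```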

Please stop me if you have heard *any* of this before :-)

Hehehehe :-)

Anyway, our first task was to catch this stuff.  It flies in from the instrument at a rate of about 3TB an hour, out of eight separate Windows acquisition servers writing directly to a CIFS mount -- yeah I know, hashtag awesome right? More on Samba tuning at scale in another post...

So we benched our storage, an MD3260 with a couple of MD3260e expansion bricks making for a nice 0.6PB single image file system made out of 180 spindles tucked behind an R720.

Nothing too exotic, and at the 3TB/hr design spec we need only a dedicated 10G, so we double bagged it, popped a pair of 10GbE cards in the box, spun up some LACP and we were off to the races!
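
For the curious, here is the back-of-the-envelope version of "we need only a dedicated 10G". It is a rough sketch assuming decimal units (1 TB = 8000 Gbit) and ignoring protocol overhead; `tb_per_hour_to_gbit` is a made-up helper name.

```shell
# Convert a sustained rate in TB/hour (decimal TB) into Gbit/s.
# Assumes 1 TB = 8000 Gbit and ignores any network protocol overhead.
tb_per_hour_to_gbit() {
    awk -v tb="$1" 'BEGIN { printf "%.2f\n", tb * 8000 / 3600 }'
}

tb_per_hour_to_gbit 3   # prints 6.67
```

So the design rate fits inside a single 10G link with room to spare, before any overheads are counted.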

Until we weren't... Do you see the problem here:
41252 be/4 root 0.00 B/s  247.65 M/s  0.00 % 78.54 % dd if=/dev/zero of=test.dat bs=1024k count=1000000000

Yeah, so that's 250MB/s peak, on the box, with no network in the way, direct to disk with caches working - which is about 0.9TB/hour...  Oh and this was also at about the same time the imaging center director called our group telling us that the microscope was broken, and that he thinks our network is very broken... Yep - a bad day in paradise this sure was turning out to be... and I couldn't even blame it on the network this time! :-)
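
If you want to repeat this sort of local write test, something along these lines is one way to do it. Treat it as a sketch rather than the exact command from this post: `bench_write` and `mb_per_s_to_tb_per_hour` are hypothetical helper names, `conv=fdatasync` is a GNU dd option that makes dd flush to disk before it reports a rate, and the conversion assumes decimal units.

```shell
# Write COUNT_MIB mebibytes of zeros to FILE and flush before dd reports.
# conv=fdatasync is GNU dd specific; this is not the command used in the post.
bench_write() {
    file=$1
    count_mib=$2
    # dd prints its summary on stderr; keep only the final throughput line.
    dd if=/dev/zero of="$file" bs=1M count="$count_mib" conv=fdatasync 2>&1 | tail -n 1
}

# Convert MB/s into TB/hour (decimal units).
mb_per_s_to_tb_per_hour() {
    awk -v mbs="$1" 'BEGIN { printf "%.1f\n", mbs * 3600 / 1000000 }'
}

mb_per_s_to_tb_per_hour 250   # prints 0.9
```

A hypothetical run would be `bench_write /fs/test.dat 10240`, followed by feeding the reported MB/s into the converter to compare against the 3TB/hr target.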

So we are really not doing so well here. I poked about inside some of our other boxes... we run loads of this stuff... At first I was seeing the same results on some, and on others we were just fine... until I stumbled across one that was pulling nearly 800MB/s... I looked just a little closer at the config for the one that was working as I thought it was...

BLOCK SIZE!!
The default shipping is 4K, which is no good for streaming writes!


Arrggh - the sort of thing we used to worry about in the '90s was back and in full effect. Flipped the button, all better.  Still not quite seeing decent performance though; the design spec with this number of spindles and 4 x 6Gb/s SAS wires should peak at... urrm, types into google...
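
A rough way to get a ceiling for the SAS side without a search engine is sketched below. The assumptions are loud ones: four independent links at the quoted 6 Gb/s line rate, 8 bits per byte, and no allowance for encoding or protocol overhead, so the real figure will be lower. `sas_raw_gbytes_per_s` is a made-up helper name.

```shell
# Raw aggregate bandwidth in GB/s for LINKS links at GBIT Gbit/s each.
# Assumption: simple line-rate arithmetic, no encoding or protocol overhead.
sas_raw_gbytes_per_s() {
    awk -v links="$1" -v gbit="$2" 'BEGIN { printf "%.1f\n", links * gbit / 8 }'
}

sas_raw_gbytes_per_s 4 6   # prints 3.0
```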


So I bust out a copy of the awesome bwm-ng:


Doh! We are not striping, so only using one of the four available 6Gb/s SAS lines... LVM has two modes of operation, Linear and Striping... we were using the Linear one, which was no good... so let's go fix it!
[root@storage]# umount /fs

[root@storage]# lvremove /dev/store_md32xx_vg/store_md32xx_lv
Do you really want to remove active logical volume store_md32xx_lv? [y/n]: y
  Logical volume "store_md32xx_lv" successfully removed

[root@storage]# lvcreate --extents 100%FREE --stripes 10 --stripesize 256 --name store_md32xx_lv store_md32xx_vg
  Logical volume "store_md32xx_lv" created

[root@storage]# lvs --segments
  LV                  VG                  Attr       #Str Type    SSize  
  store_md32xx_lv     store_md32xx_vg     -wi-a-----   10 striped 545.73t
  lv_root             vg_root             -wi-ao----    1 linear    1.09t

[root@storage]# mkfs.xfs /dev/mapper/store_md32xx_vg-store_md32xx_lv 

[root@storage]# mount -a

[root@storage fs]# dd if=/dev/zero of=test.dat bs=1024k count=1000000

Yay! We are now CPU bound on this dd... :-)

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
17773 root      20   0  103m 2780 1656 R 100.0  0.0   8:26.73 dd 

And now we are striped - so much better, nice balance across LUNs!
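
Why does striping balance the load? Here is a simplified model of the mapping, assuming the `--stripes 10 --stripesize 256` values used above (stripe size in KiB). It is an illustration of round-robin striping in general, not LVM's actual code, and `stripe_for_offset` is a hypothetical helper.

```shell
# Which stripe (0-based physical volume index) holds a given logical offset?
#   $1: offset into the logical volume, in KiB
#   $2: number of stripes (as in lvcreate --stripes)
#   $3: stripe size in KiB (as in lvcreate --stripesize)
# Simplified round-robin model; not taken from the LVM sources.
stripe_for_offset() {
    offset_kib=$1
    stripes=$2
    stripesize_kib=$3
    echo $(( (offset_kib / stripesize_kib) % stripes ))
}

# Walk a 4 MiB sequential write in 256 KiB steps and count distinct stripes hit.
offset=0
while [ "$offset" -lt 4096 ]; do
    stripe_for_offset "$offset" 10 256
    offset=$(( offset + 256 ))
done | sort -n | uniq | wc -l
```

With a linear volume the same sequential write would sit on one physical volume until that volume filled up, which is why only one SAS path was busy.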


Major kudos to Justin Weissig's great corner of the internet for helping out!


We also removed the cache mirror - this is a catch-it-as-fast-as-you-can system, so we can take the risk of controller issues for this application (don't try this at home kids!)... we put it back in the end, but wanted to make sure it was not a bottleneck.


And here we are, all finished, running eight at a time and pushing 2.6GB/s (woot!):


At this rate we can also support two more microscopes all off the same kit... with each of them running at full tilt!
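
The headroom claim is easy to sanity check. The sketch below assumes decimal units and that each microscope needs the full 3TB/hr design rate; `instruments_supported` is a hypothetical helper name.

```shell
# How many instruments at TB_PER_HOUR each fit into an aggregate GB_PER_S?
#   $1: aggregate write rate in GB/s
#   $2: per-instrument rate in TB/hour
# Decimal units throughout; the result is rounded down to whole instruments.
instruments_supported() {
    awk -v gbs="$1" -v tbh="$2" 'BEGIN { printf "%d\n", int(gbs * 3600 / 1000 / tbh) }'
}

instruments_supported 2.6 3   # prints 3
```

2.6GB/s works out to a bit over 9TB an hour, so three instruments at 3TB/hr each: the one we have plus two more.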

How's about them apples for some serious price/performance eh? :-)


Oh and one more thing...

Hashtag BIG DATA... :-)


[any opinions here are all mine, and have absolutely nothing to do with my employer]
(c) 2011 James Cuff