Thursday, February 5, 2015

Please, purchase my storage solution....

CUE: Story opens in a small office at a research computing department:

Endearing Storage Vendor: ".... so, now you have seen our technology, you will want to purchase our one of a kind "storage solution" you will be inordinately happy and immediately absolved of any and all future storage issues... forever. We guarantee it! We would truly love to partner with you, we have a unique, one of a kind system. Once we install your system, you can basically take two weeks off, but also in the meantime we will arrange to get your hair to grow back, and I will buy you many beautiful steak dinners... you are feeling very sleepy... but very satisfied with your decision to partner with us, it is a one of a kind product, did I tell you that our CTO invented....."

(beautiful harp music plays in the background)

Research Computing Director [Dreaming] : Oh wow, this stuff sounds absolutely fantastic, I bet I could finally sleep at night, the milk would never spill or go sour anymore. Life would finally be full of unicorns and rainbows! I so much want to live in this fantastic land of flawless storage, unlimited capacity, endless feature sets, complete 100.1% reliability and uptime, oh it's going to be so utterly awesome. In this world storage never, ever goes bad. Hold that thought I NEED to live in this world!! I MUST buy this storage array... I have to raise a PO!....

Endearing Storage Vendor: When I click my fingers you will awake, refreshed and ready to place your purchase order... 3..... 2...... 1.....

[Click] (director wide awake)

Research Computing Director [Sweating]: Whoah! What! Hang on! Wait Nooooo!!

CUE: Fade to black...

So, all joking aside I've been doing this job, and jobs much like it for years. I actually do know the exact storage system it is that exists in this dream from our little story above. And, well so given we are all friends here I'll take a moment to share the answer with you, let's keep it our little secret though, we should not let this trade secret get out.

Ok, so are you ready? Ok, so it's this one:

Not a single one of them!

Yep - you heard right folks, not a single one. I know I'm like a total heretic right?

You've probably all heard the endearing storage vendor promises... I have them in the archive, somewhere, let me go dig them out for you...

CUE: The clip of "The top 50 Most Endearing Storage Vendor Quotes":

"The competition are light years behind our technology! They are slower, more expensive, and totally unreliable, I mean they basically have NO clue! Our CTO literally invented the binary system!"

"That custom Linux kernel you hand rolled may be clever, but it does not scale. Our custom fork of Plan9 we use to power our ARM powered ASICS - it's quite literally lightyears ahead of the competition"

"Here, how about this... you can try our storage for no cost. I'll ask my manager so you can have a little bit for free - don't worry, we can talk price after your first petabyte migration"

"We vet every single patch upgrade before we release to our customers - rolling upgrades from any point release result in zero downtime"

"This storage will basically NEVER fail - people like Harva... oops sorry I can't disclose our clients, but they think it's totally wonderful, I can set up a call with Dr. X, he will totally vouch for how awesome we are."

"Let's not talk price just yet, let me show you how we use quantum laser effects to increase our redundancy and reliability"

"The next version has a completely redesigned API and REST interface, oh and it will be a seamless data in place update - don't worry"

"I want to take a little time to explain to you about our differential value"

"Let me take a moment to explain how we use a stronger steel frame for our cabinets, it is a key differentiator"

"The drives have a perpetual motion device as bearings, you can basically think of them as "physical flash drives""

"We run one of the two top advanced storage manufacturing plants located south of Basildon"

"Our disk magnets are sourced from an ancient salt mine just south of Las Vegas"

"We are in one of the top one worldwide soda manufacturers, we would tell you but we keep our clients confidential"

"We are unique in the market place. Our product is one of a kind. You need to understand our differential value. Let me set up a call with our CTO, so he can explain how this works at a deep technical level. Did we tell you our CTO invented the binary system?"

"You guys shouldn't waste your time building your own storage. We have an end to end solution for you."

"Putting all your storage under our single name space with our amazing technology will just make everything easier."

"Did I tell you already that our CTO invented the binary system?"

"Would you mind if I called some of your Faculty directly so I can show them our value? I don't want to go over your head or anything, but I really need to show them the value of our system, so they can see why you should buy this system."

"... and this was when our founders invented magnetism"

"Great question! Cluster quorum is maintained by a remote software as a service cloud"

"Our storage array was certified by the TSA, and is in use at 5 of the national airlines that fly out of Canada, we could tell you but we want to keep our clients confidential"

"Through our technology we have effectively achieved 200 nines of reliability, and 800 days of uptime a year"

"We have essentially redesigned how RAID works, let us show you the following algebra..."

"It is essentially a software defined storage stack written into a dedicated FPGA so it's very flexible..."

"You basically don't need backups any more!"

"Great question! I'll circle back with engineering and get right back to you - Steve be sure to take a note on that - great question!"

"I'll skip over these marketing slides so we can do a deep dive on our technology... oh just one thing while we are here, we do as you can see from this slide sell to all of your competitors, but anyway, let's get to the technology, oh and this customer here purchased 500 petabytes, ok moving on..."

"We call this feature RAID ONE MILLION. Yeah I know right? It really is literally that good."

"Cache coherence is on our roadmap"

"Hey let's get a round table with your engineering team. I'll bring our top people in so we can show your team our differential value, once your engineers see this they will be ready to convince you to purchase this storage."

"Great question! File locking is absolutely due for the next release"

Oh and the best ever...?

"This product literally pays for itself!"

So... I dunno about you, but unless this disk array prints twenty freaking dollar bills, that thing ain't paying for anything, least of all itself!

So as I said, it's been my day job to be "sold" to for a number of years now. I've quite possibly heard them all. They also say the easiest thing in the world is to sell a salesman, and I've been told that I'm a bit of a salesman, or at least I've been seen to play one on the T.V...

Even so...



p.s. I shall never, ever disclose my sources of "ESV" tee hee :-)

Wednesday, December 17, 2014

Of big microscopes and even bigger data...

We recently installed one of these awesome electron microscopes... In the center I help PI, we are imaging brains, but more about that another time. Right now this is all about getting this thing running, and running at speed, and some lovely UNIX geekery... I don't get anywhere near enough time these days to get my paws on a CLI, but I needed to stick my nose into this one!

It's extremely cool looking eh? However, it needs a fair amount of horsepower to just even "catch" the data that streams off it.  It is also a scientific instrument so of course the file system obviously ends up being more than just a little bit hairy.  For reference, here is the output of a single "run":
[root@storage]# du -sh .
6.6T .

[root@storage]# ls -RU | wc -l

[root@storage]# find -P . -type f | rev | cut -d/ -f2- | rev |  cut -d/ -f1-2 | cut -d/ -f2- | sort | uniq -c
  65226 001
  62994 002
  67458 003
  67954 004
  65226 005
  62994 006
  67458 007
  67954 008
  65226 009
  62994 010
  67458 011
  67954 012
  65226 013
  62994 014
  67458 015
  67954 016
  65226 017
  62994 018
  67458 019
  67954 020
  65226 021
   8559 022

So 1/2 million files in 6.6T with ca. 65,000 per dir and each image is about 612K...
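If the rev/cut gymnastics above make your eyes water, the same per-directory tally can be sketched in Python. This is only an illustration, not what we ran on the box: the function name count_files_by_top_dir is made up for this example, and files sitting directly in the starting directory get lumped under "." (which is also what the shell pipeline does with them).

```python
import os
from collections import Counter


def count_files_by_top_dir(root):
    """Count regular files under root, grouped by first-level directory."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        # First path component below root; "." for files directly in root.
        top = rel.split(os.sep)[0]
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and other non-regular files, roughly matching
            # `find -P . -type f`.
            if os.path.isfile(path) and not os.path.islink(path):
                counts[top] += 1
    return counts
```

Calling count_files_by_top_dir(".") from the top of a run and printing the sorted items gives the same shape of listing as the sort | uniq -c output.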

Please stop me if you have heard *any* of this before :-)

Hehehehe :-)

Anyway, well, so our first task was to catch this stuff.  It flies in from the instrument at a rate of about 3TB an hour, out of eight distinct and separate Windows acquisition servers writing out directly to a CIFS mount -- yeah I know, hashtag awesome right? More on Samba tuning at scale in another post...

So we benched our storage, a MD3260 with a couple of MD3260e expansion bricks making for a nice 0.6PB single image file system made out of 180 spindles tucked behind an R720.

Nothing too exotic, and at 3TB/hr design spec we need only a dedicated 10G, so we double bagged it, and popped a pair of 10 gee bee cards in the box, spun up some LACP and so we were off to the races!

Until we weren't... Do you see the problem here:
41252 be/4 root 0.00 B/s  247.65 M/s  0.00 % 78.54 % dd if=/dev/zero of=test.dat bs=1024k count=1000000000

Yeah, so that's 250MB/s peak, on the box, with no network in the way, direct to disk with caches working - which is about 0.9TB/hour...  Oh and this was also at about the same time the imaging center director called our group telling us that the microscope was broken, and that he thinks our network is very broken... Yep - a bad day in paradise this sure was turning out to be... and I couldn't even blame it on the network this time! :-)
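For the back-of-the-envelope crowd, here is the unit juggling behind those numbers as a tiny Python sketch. It assumes decimal units throughout (1 TB = 1,000,000 MB = 8,000 Gbit), and the function names are just ones I made up for the example.

```python
def mb_per_s_to_tb_per_hour(mb_per_s):
    """Convert MB/s to TB/hour using decimal units (1 TB = 1,000,000 MB)."""
    return mb_per_s * 3600 / 1_000_000


def tb_per_hour_to_gbit_per_s(tb_per_hour):
    """Convert TB/hour to Gbit/s (1 TB = 8,000 Gbit)."""
    return tb_per_hour * 8_000 / 3600


if __name__ == "__main__":
    # What dd managed on the box versus what the instrument needs on the wire.
    print("dd rate:     %.2f TB/hour" % mb_per_s_to_tb_per_hour(250))
    print("design spec: %.2f Gbit/s" % tb_per_hour_to_gbit_per_s(3))
```

So 250MB/s is 0.9TB/hour, and the 3TB/hour design spec works out to a shade under 7 Gbit/s, which is why a single dedicated 10G link should have been plenty.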

So we are really not doing so well here. I poked about inside some of our other boxes... we run loads of this stuff... At first I was seeing the same results on some, and on others we were just fine... until I stumbled across one that was pulling nearly 800MB/s... I looked just a little closer at the config for the one that was working as I thought it was...

The default shipping is 4K, which is no good for streaming writes!

Arrggh - the sort of thing we used to worry about in the '90s was back and in full effect. Flipped the button, all better.  Still not quite seeing decent performance though, the design spec with this number of spindles and 4 x 6Gb/s SAS wires should peak at... urrm types into google...

So I bust out a copy of the awesome bwm-ng:

Doh! We are not striping, so only using one of the four available 6Gb/s SAS lines... LVM has two modes of operation, Linear and Striping... we were using the Linear one, which was no good... so let's go fix it!
[root@storage]# umount /fs

[root@storage]# lvremove /dev/store_md32xx_vg/store_md32xx_lv
Do you really want to remove active logical volume store_md32xx_lv? [y/n]: y
  Logical volume "store_md32xx_lv" successfully removed

[root@storage]# lvcreate --extents 100%FREE --stripes 10 --stripesize 256 --name store_md32xx_lv store_md32xx_vg
  Logical volume "store_md32xx_lv" created

[root@storage]# lvs --segments
  LV                  VG                  Attr       #Str Type    SSize  
  store_md32xx_lv     store_md32xx_vg     -wi-a-----   10 striped 545.73t
  lv_root             vg_root             -wi-ao----    1 linear    1.09t

[root@storage]# mkfs.xfs /dev/mapper/store_md32xx_vg-store_md32xx_lv 

[root@storage]# mount -a

[root@storage fs]# dd if=/dev/zero of=test.dat bs=1024k count=1000000

Yay! we are now cpu bound on this dd… :-)

17773 root      20   0  103m 2780 1656 R 100.0  0.0   8:26.73 dd 

And now we are striped - so much better, nice balance across luns!
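To see why linear versus striped matters so much for a streaming write, here is a toy Python model of where a logical offset lands. It is a simplified sketch, not LVM's actual code: it borrows the 10 stripes and 256 KiB stripe size from the lvcreate line above, assumes every LUN is the same (hypothetical) 50 TiB, and all the function names are invented for the example.

```python
STRIPE_SIZE = 256 * 1024   # bytes, matching --stripesize 256 (KiB)
STRIPES = 10               # matching --stripes 10


def striped_target(offset, stripes=STRIPES, stripe_size=STRIPE_SIZE):
    """Return (device index, offset on that device) for a striped volume."""
    chunk = offset // stripe_size
    device = chunk % stripes
    device_offset = (chunk // stripes) * stripe_size + offset % stripe_size
    return device, device_offset


def linear_target(offset, device_size):
    """Return (device index, offset on that device) for a linear volume."""
    return offset // device_size, offset % device_size


def devices_touched(start, length, target):
    """Devices hit by a sequential write, sampled once per stripe-sized chunk."""
    return {target(off)[0] for off in range(start, start + length, STRIPE_SIZE)}


if __name__ == "__main__":
    ten_mib = 10 * 1024 * 1024
    device_size = 50 * 1024 ** 4  # hypothetical 50 TiB per LUN
    print(sorted(devices_touched(0, ten_mib, striped_target)))
    print(sorted(devices_touched(0, ten_mib,
                                 lambda off: linear_target(off, device_size))))
```

A 10 MiB sequential write touches all ten devices in the striped layout, but only device 0 in the linear one, which is the "one SAS line doing all the work" picture we were staring at.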

Major kudos to Justin Weissig's great corner of the internet for helping out!

We also removed the cache mirror - this is a catch it as fast as you can system, we can take the risk of controller issues for this application (don't try this at home kids!)... we put it back in the end, but wanted to make sure it was not a bottleneck.

And here we are all finished running eight at a time and pushing 2.6GB/s (woot!):

At this rate we can also support two more microscopes all off the same kit... with each of them running at full tilt!
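The headroom claim is easy enough to sanity check. A minimal sketch, assuming decimal units and that every instrument streams at the same 3TB/hour design spec (the function name is made up for the example):

```python
import math


def instruments_supported(storage_gb_per_s, instrument_tb_per_hour):
    """How many instruments at a given TB/hour fit into a storage GB/s budget."""
    storage_tb_per_hour = storage_gb_per_s * 3600 / 1000
    return math.floor(storage_tb_per_hour / instrument_tb_per_hour)


if __name__ == "__main__":
    print(instruments_supported(2.6, 3))
```

2.6GB/s is a little over 9TB/hour, so three instruments at 3TB/hour each: the one we have, plus two more.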

How's about them apples for some serious price/performance eh? :-)

Oh and one more thing...

Hashtag BIG DATA... :-)

Thursday, October 23, 2014

OdyBot And Pointy Haired Alerting! AKA: Grumpy old man shakes fist at web two dot oh!

Hi all,

It's been ages since I last posted. We have been super busy in the day job running our monster high performance computing infrastructure and keeping lots of petabytes spinning and many scientists and researchers happy. I wanted to quickly note that we recently had a new addition to our group, supporting the Odyssey cluster.

Let me introduce:


Behind the scenes there are all sorts of fun activities, like checking that our data centers are neat and tidy and doing lots of awesome science:

from Harvard FAS Research Computing on Vimeo.

and sometimes just chilling out riding a skateboard around the yard:

OdyBot Gets Schooled
from Harvard FAS Research Computing on Vimeo.

You can find out all about OdyBot over at

Meanwhile we have had a couple of integration issues back at the ranch making sure that our awesome RC support staff are able to answer questions as our community asks them, and I think my old school methods finally got the better of me yesterday...

So, we use two online web services, Userlike and HipChat, to provide our external "voice" for OdyBot and for our internal communications. We wanted a quick way to post an alert to our main chat room when the operator count became zero.

Simple eh?

Well, kinda... and given that I'm not a child of the web 2.0 world I went about it in true UNIX style. A python script, with a unix pipe to a perl script... I'm sure sometimes I just do this stuff to wind up my team. :-) Anyway, here's the hipchat part, based on the awesome script with the following 2 second changes to allow it to read from stdin, and quote out the <CR>'s for all that down stream JSON cleverness...
bash-3.2$ diff 
>             "message=s"=> \$optionMessage,
< while (<>){
<   $optionMessage .= $_;
< }
< $optionMessage=~s/\n/\\n/g;

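For the non-Perl folks, the stdin slurp and newline quoting in that diff can be sketched in Python as below. This is an illustration only: the function names are mine, and the {"message": ...} payload shape is an assumption borrowed from the script's "message" option rather than anything confirmed by the chat API.

```python
import json


def read_message(stream):
    """Slurp the whole stream, like the while (<>) loop in the Perl."""
    return stream.read()


def escape_newlines(text):
    """Replace each newline with the two characters backslash and n."""
    return text.replace("\n", "\\n")


def json_encode_message(text):
    """Alternative: let the json module handle all of the escaping."""
    return json.dumps({"message": text})
```

Feeding sys.stdin to read_message and passing the result through escape_newlines mirrors what the patched script does; json_encode_message is the less hand-rolled route, since it also takes care of quotes and backslashes.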
The nice thing about this is you can quickly post random stdin stuffs to your chat rooms:

Then I busted out some extremely suspect python:
bash-3.2$ cat 
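import httplib2
import os
import json
import sys

# The request URL did not survive in this listing, so it is read from the
# environment here; API_HOST needs to hold the full operators endpoint.
API_HOST = os.environ.get('API_HOST') or ''
# Assumption: the token comes from the environment too (its definition was lost).
API_TOKEN = os.environ.get('API_TOKEN') or ''
c = 0
h = httplib2.Http()

resp, content = h.request(
    API_HOST,
    'GET',
    headers={
        'Authorization': API_TOKEN
    })
if resp.status == 200:
    data = json.loads(content)
else:
    print('Error status=%s' % resp.status)
    sys.exit(1)

# Count the operators that currently have an online slot
for x in data:
    if (x['slots']['online'] == 1):
        c = c + 1

cc = 1

if (c>0):
    print("OdyBot Operators Online...")
    for x in data:
        if (x['slots']['online'] == 1):
            print("%d) %s" % (cc, x['name']))
            cc = cc + 1
else:
    print("Warning!  No OdyBot Ops Are Online!!")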

Which works by just querying the Userlike API:
bash-3.2$ ./ 
OdyBot Operators Online...
1) James Cuff
2) Bob Freeman
3) John Noss
4) Dan Caunt

And there you have it - simple alerting to the main chat room when you need it to tell you that folks are not on the wire taking our OdyBot support requests! This is actually pretty important, we absolutely don't want to have our community waiting, and we have invested a lot of time and effort into the OdyBot concept so that our community can contact us. We also have an open OdyBot community list that folks inside and outside of Harvard can use to post questions, tips and techniques, although it's only just starting to ramp up right now.

Question is:

Could you do this integration in an even more ghetto/bandaid/baling wire fashion? :-)

p.s. we are also testing Slack as a replacement for HipChat as I type. For pretty much the exact same reasons we had to replace Zopim with Userlike yesterday - as you can see, even with shoddy perl and python scripts, it is all about the integration these days!

No matter how ghetto the methods ;-) #allhailtheunixpipe

Friday, July 18, 2014

Of style and science...

There are times in your career that you really, really remember.

This was one of those times.

My then head of department, the dearly departed and most wonderful Professor Dame Louise Johnson wrote this note to my D. Phil. supervisor Geoff. back in 1997.  Geoff. recently sent me a copy while clearing out space to move into his fabulous new building over in Dundee.

To this day, I love that Louise who was an absolute scientific powerhouse said of my research:

"we thought the science was fine"

Although more importantly, her feedback about their concern for my writing style was what has really stuck with me over the years!

Nowt's changed much! ;-)

Tuesday, June 24, 2014

Ohai Linux! So you are a network switch now...

Decided to see what the fuss was all about surrounding these open source switches. Plus the rocket powered turtle really did pique my interest ;-)


I built all of this on a CentOS release 6.5 (Final), and I wanted to build everything from source to really see how ONIE worked from the ground up. Don't try this at home kids, there is no reason to try and damage yourself.
git clone

Needed to add some deps, this was a little painful to find what was missing, much make fail, make, fail repeat, but this should be enough for most folks to run so you don't have to go through the iterations I did - this is a monster build. I learned a lot here, never having used "realpath" for example, or any of the syslinux kit which is fab!
sudo yum install realpath
sudo yum install gperf
sudo yum install stgit
sudo yum install texinfo
sudo yum install glibc-static
sudo yum install libexpat-devel
sudo yum install python-devel
sudo yum install fakeroot
sudo yum install syslinux syslinux-devel syslinux-extlinux syslinux-perl
sudo ln -s /usr/share/syslinux /usr/lib/syslinux

Oh and get a fresh autoconf if you are on CentOS 6.5
tar zxvf autoconf-latest.tar.gz
cd autoconf-2.69/
./configure
make
sudo make install

And away we go!
[jcuff@jcair-vm build-config]$ make -j4 MACHINE=kvm_x86_64 all recovery-iso

mkdir: created directory `/home/jcuff/onie/build'
mkdir: created directory `/home/jcuff/onie/build/images'
mkdir: created directory `/home/jcuff/onie/build/download'
mkdir: created directory `/home/jcuff/onie/build/kvm_x86_64-r0'
mkdir: created directory `/home/jcuff/onie/build/kvm_x86_64-r0/stamp'
mkdir: created directory `/home/jcuff/onie/build/kvm_x86_64-r0/initramfs'
==== Getting Linux ====
2014-06-11 14:50:54 URL: [65143140/65143140] -> "/home/jcuff/onie/build/download/linux-3.2.35.tar.xz" [1]
linux-3.2.35.tar.xz: OK

wheee! (get a large beverage, this bit takes a while!)
[jcuff@jcair-vm build-config]$ ls -ltra ../build/images/
total 34212
drwxrwxr-x. 7 jcuff jcuff     4096 Jun 11 17:24 ..
-rw-rw-r--. 1 jcuff jcuff  3301792 Jun 11 18:29 kvm_x86_64-r0.vmlinuz
-rw-rw-r--. 1 jcuff jcuff  5284988 Jun 13 11:23 kvm_x86_64-r0.initrd
-rw-rw-r--. 1 jcuff jcuff  8603253 Jun 13 11:23 onie-updater-x86_64-kvm_x86_64-r0
drwxrwxr-x. 2 jcuff jcuff     4096 Jun 13 11:29 .
-rw-rw-r--. 1 jcuff jcuff 17825792 Jun 13 11:30 onie-recovery-x86_64-kvm_x86_64-r0.iso

Make a disk:
[root@jcair-vm onie]# dd if=/dev/zero of=/tmp/onie-x86-demo.img bs=1M count=256
256+0 records in
256+0 records out
268435456 bytes (268 MB) copied, 0.272711 s, 984 MB/s

Spin up the kvm!
[root@jcair-vm onie]# sudo /usr/libexec/qemu-kvm -m 1024 -name onie -boot order=cd,once=d -cdrom /tmp/onie.iso -net nic,model=e1000 -vnc -vga std -drive file=/tmp/onie-x86-demo.img,media=disk,if=virtio,index=0 -serial telnet:localhost:9000,server

And you are golden!
ONIE: Starting ONIE Service Discovery
Info: Found static url: file:///lib/onie/onie-updater
ONIE: Executing installer: file:///lib/onie/onie-updater
Verifying image checksum ... OK.
Preparing image archive ... OK.
ONIE: Version       : master-201406241118-dirty
ONIE: Architecture  : x86_64
ONIE: Machine       : kvm_x86_64
ONIE: Machine Rev   : 0
ONIE: Config Version: 1
Installing ONIE on: /dev/vda
Pre installation hook
Post installation hook

Remove the CD from your config and you can now boot the live version, and if everything has worked out, the discovery process will work and you can now ping the UK from the USA...
ONIE: Rescue Mode ...
Version   : master-201406241118-dirty
Build Date: 2014-06-24T11:40-0400
Info: Mounting kernel filesystems... done.
Info: Mounting LABEL=ONIE-BOOT on /mnt/onie-boot ...
Running demonstration platform init pre_arch routines...
Running demonstration platform init post_arch routines...
Info: Using eth0 MAC address: 52:54:00:2b:63:f6
Info: eth0:  Checking link... up.
Info: Trying DHCPv4 on interface: eth0
ONIE: Using DHCPv4 addr: eth0: /
Starting: dropbear ssh daemon... done.
Starting: telnetd... done.
discover: Rescue mode detected.  Installer disabled.

Please press Enter to activate this console. 

ONIE:/ # onie-sysinfo -a
VM-1234567890 52:54:00:2b:63:f6 master-201406241118-dirty 42623 kvm_x86_64 0 x86_64-kvm_x86_64-r0 x86_64 1 gpt 2014-06-24T11:40-0400

ONIE:/ # ping
PING ( 56 data bytes
64 bytes from seq=0 ttl=61 time=108.473 ms
64 bytes from seq=1 ttl=61 time=103.824 ms
64 bytes from seq=2 ttl=61 time=103.238 ms
--- ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 103.238/105.178/108.473 ms

p.s. for extra twisted points this is ONIE running on linux KVM, inside virtualbox, on a mac on a pair of different layer three networks... it becomes a little confusing to run commands, but always makes me chuckle that a mac laptop is basically a little data center at this point :-)
jcair:~ jcuff$ uname -v

Darwin Kernel Version 13.2.0: Thu Apr 17 23:03:13 PDT 2014; root:xnu-2422.100.13~1/RELEASE_X86_64

jcair:~ jcuff$ ssh -p 2222 root@ ssh uname -a

Linux onie 3.2.35-onie+ #1 SMP Tue Jun 24 11:30:01 EDT 2014 x86_64 GNU/Linux


Monday, May 5, 2014

compressing DRAM with ZRAM for fun and profit?


Can you use compressed DRAM for science if you don't quite have enough memory?


I'm going to file this under "Great idea, but my execution is slightly suspect"

Anyway, here's an example set up of compressed swap files:
[root@jcair-vm ~]# modprobe zram

[root@jcair-vm ~]# mkswap /dev/zram0
Setting up swapspace version 1, size = 104860756 KiB
no label, UUID=58476253-ad5a-4595-9bec-60bd09d76d30

[root@jcair-vm ~]# mkswap /dev/zram1
Setting up swapspace version 1, size = 104860756 KiB
no label, UUID=ed5d0f85-0245-472e-902e-0e94a743cbe0

[root@jcair-vm ~]# swapon -p5 /dev/zram0 
[root@jcair-vm ~]# swapon -p5 /dev/zram1

[root@jcair-vm ~]# swapon -s
Filename                                Type            Size    Used    Priority
/dev/zram0                              partition       104860752       0       5
/dev/zram1                              partition       104860752       0       5

Clearly without the zram setup above, stress fails right out of the gate:
[root@jcair-vm ~]# stress --vm-bytes 2344600024 -m 2 --vm-keep
stress: info: [6063] dispatching hogs: 0 cpu, 0 io, 2 vm, 0 hdd
stress: FAIL: [6063] (415) -- worker 6065 got signal 9
stress: WARN: [6063] (417) now reaping child worker processes
stress: FAIL: [6063] (451) failed run completed in 10s

But, running a stress test with a memory allocation much bigger than the host seems to work just fine and dandy once we have our zram swap files like those noted above:
[root@jcair-vm ~]# stress --vm-bytes 2344600024 -m 2 --vm-keep
stress: info: [5383] dispatching hogs: 0 cpu, 0 io, 2 vm, 0 hdd

top - 10:41:17 up 13 days, 22:03,  4 users,  load average: 2.91, 0.87, 0.29
Tasks: 192 total,   4 running, 188 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.1%us, 74.8%sy,  0.0%ni, 11.5%id,  0.0%wa,  0.1%hi, 13.4%si,  0.0%st
Mem:   3923468k total,  3840852k used,    82616k free,     5368k buffers
Swap: 209721504k total,   626964k used, 209094540k free,    36932k cached

 5385 root      20   0 2242m 1.2g  124 R 96.4 31.3   0:48.12 stress
 5384 root      20   0 2242m 1.2g  124 R 84.0 32.0   0:48.12 stress

Yay! So - this looks like it could work!

And so here we go with a genome aligner to see if this works. This will be a good test as it writes real data structures into memory, stress was doing a block fill. So first up let's try w/o enough ram:
[root@jcair-vm ~]# cat 
./bowtie2/bowtie2 -x ./hg19 -p 4  <( zcat Sample_cd1m_3rdrun_1_ATCACG.R1.fastq.gz)

[root@jcair-vm ~]# ./ 
Out of memory allocating the ebwt[] array for the Bowtie index.  Please try
again on a computer with more memory.

Error: Encountered internal Bowtie 2 exception (#1)

Command: /root/bowtie2/bowtie2-align-s --wrapper basic-0 -x ./hg19 -p 4 /dev/fd/63 
(ERR): bowtie2-align exited with value 1

Ok, fair enough, so we have a reproducer.

Let's now set up a run with the right amount of physical ram:
[root@jcair-vm ~]# ./bowtie2/bowtie2 -x ./hg19 -p 4 <(cat cuff.fastq) -S out.dat &

7467 root 20 0 3606m 3.3g 1848 S 389.3 58.3 51:37.25 bowtie2-align-s

And we have a result!
[root@jcair-vm ~]# time ./bowtie2/bowtie2 -x ./hg19 -p 4 <(cat cuff.fastq)  -S out.dat 
13558597 reads; of these:
  13558597 (100.00%) were unpaired; of these:
  11697457 (86.27%) aligned 0 times
    545196 (4.02%) aligned exactly 1 time
   1315944 (9.71%) aligned >1 times
13.73% overall alignment rate

Ok, so let's shrink the memory of the machine and see if we can run with zram.

Let's also give zram and physical swap the same priority, so the kernel round-robins pages between the two and the I/O should stay nicely balanced. The stress run worked, so our theory is that real data and in-memory structures should also compress, and we ought to be able to get at least a 1:2 or 1:1.5 ratio out of the memory. I settled on a 3G machine with 3G of zram and some physical swap as well:
[jcuff@jcair-vm ~]$ swapon -s
Filename                                Type            Size    Used    Priority
/dev/zram0                              partition       2947008 614124  1
/dev/dm-1                               partition       4063224 614176  1

When running it did result in *much* smaller RES (982m vs 3.3G from native example):

 2350 root      20   0 3606m 982m 1020 S 20.8 33.3  12:26.74 bowtie2-align-s

Things chugged along, but I was not seeing this ending any time soon, so I truncated the read file dramatically to ca. 5k reads to see if I could get a quick comparison between zram, hybrid zram and swap, and plain boring old swap files.

As you can see below, only "boring old swap" resulted in anything sensible. The zram alone caused some rather spectacular OOM errors and obvious system instability, it was kinda fun though. You can also see below various versions we tried out, none of which actually worked, but we are also not totally alone here either.

Oh and: "Just the right amount of memory" - like Goldilocks, that one worked ;-)
Machine with memory too small:          (ERR): bowtie2-align exited

3G zram:                                sshd invoked oom-killer: gfp_mask=0x200da

Hybrid 3G zram + 4G physical swap:      6m 25.285s

Hybrid 500MB zram + 4G physical swap:   1m 51.029s

Regular /dev/dm-1 swap file:            0m 29.741s

Machine with enough ram:                0m 12.698s
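To put those timings on one scale, here is a quick Python sketch that turns each into a slowdown factor against the "enough ram" run. The parsing helper is just something I knocked up for this table, and the two runs that never finished are left out.

```python
import re


def to_seconds(text):
    """Parse a `time`-style duration such as '6m 25.285s' into seconds."""
    match = re.fullmatch(r"(\d+)m\s*([\d.]+)s", text.strip())
    if match is None:
        raise ValueError("unrecognised duration: %r" % text)
    return int(match.group(1)) * 60 + float(match.group(2))


RUNS = [
    ("Hybrid 3G zram + 4G physical swap", "6m 25.285s"),
    ("Hybrid 500MB zram + 4G physical swap", "1m 51.029s"),
    ("Regular /dev/dm-1 swap file", "0m 29.741s"),
    ("Machine with enough ram", "0m 12.698s"),
]

if __name__ == "__main__":
    baseline = to_seconds(RUNS[-1][1])
    for label, elapsed in RUNS:
        print("%-38s %5.1fx" % (label, to_seconds(elapsed) / baseline))
```

That comes out at roughly 30x slower for the big hybrid, about 9x for the small hybrid, and a bit over 2x for plain old swap.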

In summary... NO PROFIT this time :-(

Still a neat idea - just don't try this at home kids!

Thursday, April 17, 2014

of painting, retirement plans and minimum wage

So my lovely "painting diva by night" Michele Clamp bangs out some epic watercolors...

Michele totally scored today! A great friend of ours bought one of her paintings. For ONE HUNDRED DOLLARS! Tonight we decided to look at how we are going to fund our new found retirement from paintings! Here's the transaction, I kid you not, she literally made ONE HUNDRED DOLLARS!

And here is the lovely (now sold!) "Pig in Clover" in his new rather resplendent frame waiting to go to the CCAE to hang out with his chums in the rest of Michele's exhibition...

I did think at the time that charging ONE HUNDRED BUCKS was a bit steep, especially to a great and close friend of ours, so I asked Michele to pull together the numbers.

It was rather sad as you can see:
Actual Painting                 1 hour
Buying frame                    1 hour
Framing                         1 hour
Ferrying to/from gallery        1 hour

Paint                           $1.00
Paper                           $1.00
Brush wear and tear             $1.00
Frame                           $15.00
Sale price                      $100.00
Minus fees to the lovely CCAE   $50.00

Net                             $32.00 

Income @ 4 hours                $8.00 /hour

Which is exactly the minimum wage in the state of MA: eight bucks an hour.

Clearly not quite time to retire yet!

Especially given our current sales rate is about one painting every six months, which puts Michele well... yeah best we don't even bother with that math, it would not keep us in adult beverages.


Thursday, April 10, 2014

of schools and of school districts

So folks in the USA worry a lot about where to send their kids to school. Entire family decisions are made and based upon locating to the right regions and towns and cities in America so their kids can get "the best education they can afford".  It's a very big dealio.

For example has huge sections of school data built right into the purchase section for any property. Here for example is a $900,000 home in a town called Sudbury in Massachusetts (it's a bit posh, but I wanted to use it as an example - we would never live there!). The yearly council and property taxes for this particular place come in at over $16,000. But check this out for some of the local schools - you can clearly see where the money goes right?

Anyway so I want to tell you about where I carried out my high school education...

Tulketh High School, Tag Lane, Preston, England.

I was there in the mid to late eighties. Other than white socks being a formal, and required part of the male and female uniform - it was not all that bad a place.  Sure the bar was extremely low, it was a pretty poor neck of the woods. At the time we were living in nearby council assisted housing and didn't have two pennies to rub together. But there at Tulketh the teachers (for the most part - but I'll get to that later) tried their very best to teach us reprobates the three R's.

I remember fondly our English, Latin and French teachers in particular being absolutely great and my Chemistry teacher, well he was the chap who first introduced me to a 480z... and I guess given my current occupation you could call what I do now as being the rest of his history!  So a good teaching staff in a pretty shitty location, but with absolute hearts of gold.  Certainly there was no $16,000 a year council tax, heck I imagine at the time you could probably buy our entire house for that!

For balance, just in case folks think I'm getting all wet and starry eyed, I did unfortunately have a math teacher who in my later years at Tulketh helped me achieve a very solid "D"...  I can tell you that "D" looked totally amazing next to all my other "A's" when I later applied to do my A levels! Anyway, I retook the math course and achieved a straight A on the second go round.  Once I had a teacher that actually taught me the syllabus... but as I said, the bar at Tulketh really was pretty gosh darn low.  I hold no grudges - it was utterly amazing I made it to University to be honest!

Unfortunately for Tulketh, it turned out that things did not get a whole lot better after I left the area either.  I've no background as to why this is, although I do have some personal ideas.  But mainly one.

Teachers are not paid anywhere near enough $ to stay in the profession. 

Couple that with the total reprobates (sorry pupils) that hung out at our school when I was there, I can hardly imagine how difficult it must have been to even get up in the morning to go to work...  Some of our classes were merely a study in chaos theory rather than anything approximating education.  I can't imagine that part ever really improved any.  Certainly not for the teachers.

For example, the latest performance figures from the BBC, back in 2004, show the school ranked 90th out of a possible 93, with a 15% success rate at GCSE.  That's altogether just on the other side of absolutely grim, however you want to do the statistics....

I've not been back to that part of the UK in about 10 years. Google Maps currently shows it to be not doing all that well, and I hear the rest of the building is also all boarded up now.

From what I can see, after failing fully at education in 2003 or so they attempted to make the school a "sports centre of excellence", and then basically gave up on the whole thing and closed it some time in 2009.

Very, very sad.

And worse still - it very much looks like the new rebooted version of this school, albeit at a new site with a big fancy building, and in a fancier postal code with huge multimillion pound building investments, (but with probably the very same dodgy pupils - remember I was one ;-)) is also still not doing so very well either...  Do we think the teachers all got paid more?  I doubt it.  And yet the report below STILL blames the teachers!  It can't be the twenty five million pound building, it has to be the teachers!


When will they ever learn?  Just pay the bloody teachers!

However in this recent and much more positive news:

"The report addressed suggestions raised including references made to Tulketh High School as a possible alternative to the proposed new secondary school.

A new secondary school is currently in phase two of the proposal, and the report said: “Tulketh High School is closed now and whether it should be retained in abeyance until such time as it might be needed is a matter for the County Council as the Local Education Authority.

“It is, however, an option that should be discussed further.”

Gives me hope that the place where I first learned to program a computer may well be able to dig itself out of the rather nasty corner it is in right now.

And for fun, I'll leave you with a picture of what prefects were given to denote their status (courtesy of Tina Kelly from a Facebook post).  I also remember holding one of these badges with great pride!  It was the only time that this little nerd could get his own back on those big bullies - basically by handing out detention slips!  Ah!  Happy days! :-)

Must say, it does have a slight imperial look to it... and I guess we know how well that kind of thing always works out - right?

Tulketh High, there will always be a place for you in my heart - you are clearly gone, but you will never, ever be forgotten, neither will all the amazing teachers there who helped me on my way!

Thursday, December 5, 2013

Watch out for those Ts and the Cs folks!

This is one for technology folks, support folks and basically anyone in customer service...

Cable Co's have a really, really bad rap these days, and not all of it is warranted, but sometimes the SLA seems far larger and more important to defend than any customer relationship. It has actually become rather sad and uncomfortable for all, including the support folks. I'll try and illustrate with a personal example...

We recently moved to a "no glass area" of Massachusetts: awesome new house, but alas only copper in the area. We had FiOS connectivity for over seven years at our old place and had also recently added a TV package to our business line. This addition needed us to move packages, albeit with the same company and the same piece of glass on the wall. As per usual there was clearly some internal company complexity, in that the same glass that provides packets for TCP needed to be reconfigured to also deliver the TV.

No biggy.

So moving day came and went (we are planning to rent out the old place, but I digress), so I called Verizon to disconnect the service. Should be no muss, no fuss right? Wrong. We now owed a $190.00 fee to disconnect the service we have had in there for seven years. Seemingly this is because we also paid an additional $50.00 per month on top of our existing Internet only plan earlier this year to add TV service. According to the big complicated Verizon computer we are now considered under a "new" contract.

All our previous history of on-time payments was for naught. Poof! All gone!

After getting absolutely nowhere by phone I took this to the twitters... I've seen people be successful with this whole new modern technology thing, so why not? I'm pretty sure I can have a crack at this. After all we are talking about one hundred and ninety bucks, and that my friends is a fair old number of bud lights that is :-)

The folks at Verizon did reply, and we had a lovely chat. At one point it really did look like we were going to get somewhere, but in the end we did not get anywhere. Still, it was a really nice chat; I drank tea while typing, it was lovely.

I really ought to have searched the internet first before attempting this clearly futile exercise. Seemingly this type of thing happens an awful lot. I should have also checked my own weblog, different story, similar ending.

I guess at this point, I'll just pay up and remind our future tenants to check any contracts, and much like the airlines say at the end of flights, I'll remind them "that they too have choices in who they fly with".

It's all a bit of a shame really, and having seen the recent South Park satirical send up, it is only going to get way, way worse for all of us.

Tuesday, November 19, 2013

Accelerating science?

Catching up on the SC13 news via twitter last night gave me some new found focus after a rather disappointing day "optimizing" a popular MCMC algorithm in the life sciences:

Finally it looks like we are going to see some really interesting workloads used as HPC benchmarks, it's going to be a whole lot more fun than simply diagonalizing matrices...

Anyway, so back to yesterday, I was disappointed yeah? Well we are a service shop here in the day job. We have 50+ thousand x86 CPUs in our cluster, but no matter, there are always times when it is not enough. We have a bunch of GPGPU and Phi cards being used for all sorts of science, and until yesterday I'd never really tried things in anger myself. Full disclosure: I do a lot more management these days than coding, but I still think there is life in this old dog to have a poke at code and computers every now and then ;-)

So first up - MrBayes. This is a really popular piece of code that also uses the awesome LibHMSBeagle (love the name!). From the website, Beagle is "a high-performance library that can perform the core calculations at the heart of most Bayesian and Maximum Likelihood phylogenetics packages. It can make use of highly-parallel processors such as those in graphics cards (GPUs) found in many PCs."

Fantastic - this should be cake!

So I dived right in and pulled the source and started to target one of our Phi boards:

if test "$enable_phi" = yes; then

AC_MSG_ERROR (Intel Phi not supported on this system)

Well that's kinda cute eh?

Anyway, the boys and girls over at Beagle are clearly still working on it, no matter, I pushed on. Remember I'm not at the skill level of these folks; they literally write code like this to optimize MCMC and other algorithms on physical hardware. For example, this piece extracted from libhmsbeagle/CPU/BeagleCPU4StateSSEImpl.hpp for SSE is the stuff of legend:
/* Loads (transposed) finite-time transition matrices into SSE vectors */
#define SSE_PREFETCH_MATRICES(src_m1, src_m2, dest_vu_m1, dest_vu_m2) \
        const double *m1 = (src_m1); \
        const double *m2 = (src_m2); \
        for (int i = 0; i < OFFSET; i++, m1++, m2++) { \
                dest_vu_m1[i][0].x[0] = m1[0*OFFSET]; \
                dest_vu_m1[i][0].x[1] = m1[1*OFFSET]; \
                dest_vu_m2[i][0].x[0] = m2[0*OFFSET]; \
                dest_vu_m2[i][0].x[1] = m2[1*OFFSET]; \
                dest_vu_m1[i][1].x[0] = m1[2*OFFSET]; \
                dest_vu_m1[i][1].x[1] = m1[3*OFFSET]; \
                dest_vu_m2[i][1].x[0] = m2[2*OFFSET]; \
                dest_vu_m2[i][1].x[1] = m2[3*OFFSET]; \
        }

Hardcore stuff indeed!

Anyway, after a fair amount of poke and prod I did achieve nirvana: MPI MrBayes running on the Phi! Trust me it's not just as simple as "icc -mmic", and I also had to go forward without the lovely LibHMSBeagle... but it did work! I've talked about Phi ports before, even went nuts and moved Perl over to an early board last year.

So I used this small example in the MrBayes distribution to do a looksee at some Spiny Lizards (Sceloporus magister). This is a pretty simple data set of 123 taxa and only 1,606 bases, so a pretty bog standard run. We ran with mcmcp ngen=100000 nruns=4 nchains=4 printfreq=10000 samplefreq=1000; to keep it simple:

[jcuff@-mic0 examples]$ mpirun -np 16 ../mb ./sceloporus.nex
                            MrBayes v3.2.2 x64

                      (Bayesian Analysis of Phylogeny)

                             (Parallel version)
                         (16 processors available)

              Distributed under the GNU General Public License

               Type "help" or "help <command>" for information
                     on the commands that are available.

                   Type "about" for authorship and general
                       information about the program.

   Executing file "./sceloporus.nex"
   DOS line termination
   Longest line length = 1632
   Parsing file
   Expecting NEXUS formatted file
   Reading data block
      Allocated taxon set
      Allocated matrix
      Defining new matrix with 123 taxa and 1606 characters
      Data is Dna
      Missing data coded as ?
      Gaps coded as -
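
A rough back-of-the-envelope sketch of what those mcmcp settings imply, in plain Python. The helper name `run_shape` is hypothetical and is not anything MrBayes provides, and the sample count assumes one sample every samplefreq generations without counting the starting state:

```python
# Values copied from the mcmcp line in the post.
SETTINGS = {
    "ngen": 100_000,
    "nruns": 4,
    "nchains": 4,
    "printfreq": 10_000,
    "samplefreq": 1_000,
}


def run_shape(ngen, nruns, nchains, printfreq, samplefreq):
    """Derive simple counts from the run settings (illustrative helper, not a MrBayes API)."""
    return {
        # Every run carries its own set of chains.
        "total_chains": nruns * nchains,
        # Assumption: one sample per samplefreq generations, starting state not counted.
        "samples_per_run": ngen // samplefreq,
        # Assumption: one progress line per printfreq generations.
        "progress_lines": ngen // printfreq,
    }


shape = run_shape(**SETTINGS)
for name, value in shape.items():
    print(f"{name}: {value}")
```

The 16 total chains line up with the 16 processors reported in the banner, which is presumably why -np 16 was the natural choice for this run.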

So off we went with two identical versions, both built with the configure line below so that the standard non-SSE likelihood calculator (single-precision) is used. I know there are more threads on the Phi, but I wanted to see some quick comparisons. Also, because this is an MPI code, it is not well suited to the highly threaded architecture on the Phi card, so more threads actually make this way worse. Anyway off we go:
[jcuff@phi src]$ ./configure --enable-mpi=yes --with-beagle=no --enable-sse=no
Some slight runtime issues: the two-socket box managed between 3:35 and 2:51 mins, but the Phi took 1 hour 11 mins...
[jcuff@box0 examples]$ mpirun -np 16 ../mb.x86 sceloporus.nex 

       10000 -- (-12413.627) [...15 remote chains...] -- 0:03:35 (no sse)
       10000 -- (-12416.929) [...15 remote chains...] -- 0:02:51 (with sse)

[jcuff@mic0 examples]$ mpirun -np 16 ../mb.phi sceloporus.nex 

       10000 -- (-12375.516) [...15 remote chains...] -- 1:11:42 (no sse)

Now I know "I'M DOING THIS WRONG(tm)" - this is why blogs have comments ;-)

However, first off it is just not that simple to get a native version of a complex code running, and I know there are many ways to speed up code on the Phi, it is after all a very different architecture. I wondered how our researchers would do this, they are life scientists and generally go after algorithms that work out of the box so they can get on with the rest of their day jobs. I'm seeing a 23x issue here - this was too much of a gap for someone of my limited skills to manage.
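
For the curious, the size of that gap can be worked out from the three times printed in the logs above. This is only a sketch: it takes the h:mm:ss figures exactly as quoted and the helper `to_seconds` is my own naming.

```python
def to_seconds(hms: str) -> int:
    """Convert an h:mm:ss string, as printed in the logs, to whole seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Times copied from the x86 and Phi log lines quoted in the post.
x86_no_sse = to_seconds("0:03:35")
x86_sse = to_seconds("0:02:51")
phi_no_sse = to_seconds("1:11:42")

slowdown_vs_no_sse = phi_no_sse / x86_no_sse
slowdown_vs_sse = phi_no_sse / x86_sse

print(f"Phi vs x86 (no sse):   {slowdown_vs_no_sse:.1f}x slower")
print(f"Phi vs x86 (with sse): {slowdown_vs_sse:.1f}x slower")
```

Depending on which x86 build is used as the baseline, that works out at somewhere between about 20x and 25x.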

So I toddled off to a machine with a GPGPU (well actually four of them, but that's a story for another day), to go use LibBeagle in anger. Turns out that some of the beagle functions are not all that well documented, but we found them:
MrBayes > showbeagle

   Available resources reported by beagle library:
        Resource 0:
        Name: CPU

        Resource 1:
        Name: Tesla K20c
        Desc: Global memory (MB): 4800 | Clock speed (Ghz): 0.71 | Number of cores: 2496

        Resource 2:
        Name: Tesla K20c
        Desc: Global memory (MB): 4800 | Clock speed (Ghz): 0.71 | Number of cores: 2496

        Resource 3:
        Name: Tesla K20c
        Desc: Global memory (MB): 4800 | Clock speed (Ghz): 0.71 | Number of cores: 2496

        Resource 4:
        Name: Tesla K20c
        Desc: Global memory (MB): 4800 | Clock speed (Ghz): 0.71 | Number of cores: 2496

So let's compare these two:

MrBayes > set usebeagle=yes beagledevice=gpu
[jcuff@gpu01 examples]$
real    1m37.645s
user    1m36.917s
sys     0m0.154s

[1]-  Done                    time ../src/mb sceloporus.nex > out.x86

[jcuff@gpu01 examples]$
real    2m19.238s
user    1m53.320s
sys     0m19.802s

[2]+  Done                    time ../src/mb sceloporus.nex.gpu > out.gpu

Well - once again for this workload we are clearly not "Accelerating science".
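
A quick sketch of the same comparison in numbers, using only the "real" and "sys" lines from the two time outputs above. The parser `parse_time` is a hypothetical helper and only handles the NmS.SSSs form shown in the post:

```python
import re


def parse_time(stamp: str) -> float:
    """Parse a bash `time` value such as '1m37.645s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unexpected time format: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# Job [1] wrote out.x86, job [2] wrote out.gpu (values copied from the post).
cpu_real = parse_time("1m37.645s")
gpu_real = parse_time("2m19.238s")
cpu_sys = parse_time("0m0.154s")
gpu_sys = parse_time("0m19.802s")

gpu_vs_cpu = gpu_real / cpu_real

print(f"GPU wall clock is {gpu_vs_cpu:.2f}x the CPU wall clock")
print(f"Extra system time on the GPU run: {gpu_sys - cpu_sys:.1f}s")
```

So on wall clock the GPU run took a bit over 40% longer than the plain CPU run, and it also spent far more time in the kernel.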

On the other hand, I do have to say that I am really looking forward to watching developments in libraries such as LibHMSBeagle. For example, once those folks get the AVX and all the other magic running natively, I'm sure we will see massive increases in speed with all the accelerator boards. Until then, I'm sticking with bog standard x86 boxes and a halfway decent interconnect until things calm down a bit. Remember this is also a rather simple set of code with only 200k lines of code...
[jcuff@phi mrbayes_3.2.2]$ wc -l beagle-lib-read-only/* src/* | grep total
206,791 total

Just imagine what some of the behemoths we deal with daily would be like! I actually think back to the head of this post. It is a real shame that we have collectively chased the "linpack" dream, because based on all of the digging, poking and prodding yesterday, the highest performing boxes on the planet right now:
Site:          National Super Computer Center in Guangzhou
Manufacturer:  NUDT
Cores:         3,120,000
Rmax:          33,862.7 TFlop/s
Rpeak:         54,902.4 TFlop/s
Power:         17,808.00 kW
Memory:        1,024,000 GB
Interconnect:  TH Express-2
OS:            Kylin Linux
Compiler:      icc
Math Library:  Intel MKL-11.0.0
MPI:           MPICH2 with a customized GLEX channel

for this particular workload (and I'm going to argue for many others) probably could not actually help our researchers. We are seeing more and more accelerated systems in the top500; there are at least 4 in the top 10 announced yesterday, for example! These represent huge investments, and based on the simple back of the envelope example I present here, we are going to need a whole new class of researchers and scientists to program these to get the equivalent amount of awesome science out of them that we do out of the "bog standard".
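
As a side note on the "linpack" numbers in that listing, here is a small sketch of two ratios that fall straight out of the quoted figures: how much of the theoretical peak the HPL run actually reached, and the flops delivered per watt. The variable names are mine; the inputs are copied from the listing above.

```python
# Figures copied from the top500 listing quoted in the post.
rmax_tflops = 33_862.7    # measured HPL performance
rpeak_tflops = 54_902.4   # theoretical peak
power_kw = 17_808.0       # reported power draw

# Fraction of theoretical peak achieved on HPL.
hpl_efficiency = rmax_tflops / rpeak_tflops

# Energy efficiency: convert TFlop/s to GFlop/s and kW to W.
gflops_per_watt = (rmax_tflops * 1_000) / (power_kw * 1_000)

print(f"HPL efficiency: {hpl_efficiency:.1%}")
print(f"Energy efficiency: {gflops_per_watt:.2f} GFlop/s per watt")
```

Even on the benchmark the machine was tuned for, a bit under two thirds of peak is realised, which gives some feel for how far a code that is not tuned at all might fall.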

Postscript: our Phi run also kinda did this after a few iterations:
Command exited with non-zero status 11

and our CUDA run with mpirun -np 8 also caused a machine with 1TB of DRAM (!) to:
Nov 18 15:40:31 kernel: mb invoked oom-killer: gfp_mask=0x82d4
Nov 18 15:40:31 kernel: Out of memory: Kill process 4282 sacrifice child
Clearly much work still left to do folks!

In the meantime I shall embrace our new HPCG overlords!

Quote: "For example, the Titan system at Oak Ridge National Laboratory has 18,688 nodes, each with a 16-core, 32 GB AMD Opteron processor and a 6GB Nvidia K20 GPU[2]. Titan was the top-ranked system in November 2012 using HPL. However, in obtaining the HPL result on Titan, the Opteron processors played only a supporting role in the result. All floating-point computation and all data were resident on the GPUs. In contrast, real applications, when initially ported to Titan, will typically run solely on the CPUs and selectively off-load computations to the GPU for acceleration."


[any opinions here are all mine, and have absolutely nothing to do with my employer]
(c) 2011 James Cuff