Friday, September 18, 2015

12,290,000 V.92 dial-up modems...

It's that time of year for a new fast file system.

Here's a teaser from a single box with a pair of disks that we are building...
echo 1 > /proc/sys/vm/swappiness

mdadm -C /dev/md0 --level=raid0 --raid-devices=2 /dev/disk0 /dev/disk1

mke2fs -E nodiscard /dev/md0

mount /dev/md0 /mnt/test

time dd if=/dev/zero of=/mnt/test/cuff1.dat bs=1024k count=100000 oflag=direct &
time dd if=/dev/zero of=/mnt/test/cuff2.dat bs=1024k count=100000 oflag=direct &
time dd if=/dev/zero of=/mnt/test/cuff3.dat bs=1024k count=100000 oflag=direct &
time dd if=/dev/zero of=/mnt/test/cuff4.dat bs=1024k count=100000 oflag=direct &
time dd if=/dev/zero of=/mnt/test/cuff5.dat bs=1024k count=100000 oflag=direct &
time dd if=/dev/zero of=/mnt/test/cuff6.dat bs=1024k count=100000 oflag=direct &
time dd if=/dev/zero of=/mnt/test/cuff7.dat bs=1024k count=100000 oflag=direct &
time dd if=/dev/zero of=/mnt/test/cuff8.dat bs=1024k count=100000 oflag=direct &
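
If you'd rather have one aggregate wall clock number than eight per-stream ones, the same run can be wrapped up like this (a sketch of the identical commands, not what I ran above):

time ( for i in 1 2 3 4 5 6 7 8; do
  dd if=/dev/zero of=/mnt/test/cuff$i.dat bs=1024k count=100000 oflag=direct &
done; wait )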

du -sh /mnt/test/cuff*
98G     /mnt/test/cuff1.dat
98G     /mnt/test/cuff2.dat
98G     /mnt/test/cuff3.dat
98G     /mnt/test/cuff4.dat
98G     /mnt/test/cuff5.dat
98G     /mnt/test/cuff6.dat
98G     /mnt/test/cuff7.dat
98G     /mnt/test/cuff8.dat

Gives me this for writing out about 0.8 of a terabyte in parallel:

Eight threads @ 3.6GB/s

104857600000 bytes (105 GB) copied, 233.532 s, 449 MB/s
104857600000 bytes (105 GB) copied, 233.652 s, 447 MB/s
104857600000 bytes (105 GB) copied, 233.671 s, 449 MB/s
104857600000 bytes (105 GB) copied, 233.732 s, 450 MB/s
104857600000 bytes (105 GB) copied, 233.758 s, 449 MB/s
104857600000 bytes (105 GB) copied, 233.658 s, 449 MB/s
104857600000 bytes (105 GB) copied, 233.604 s, 448 MB/s
104857600000 bytes (105 GB) copied, 233.371 s, 449 MB/s

And gives me this for reading it all back in:

Eight threads @ 5.4GB/s

104857600000 bytes (105 GB) copied, 155.801 s, 673 MB/s
104857600000 bytes (105 GB) copied, 155.628 s, 674 MB/s
104857600000 bytes (105 GB) copied, 155.577 s, 674 MB/s
104857600000 bytes (105 GB) copied, 155.526 s, 674 MB/s
104857600000 bytes (105 GB) copied, 155.325 s, 675 MB/s
104857600000 bytes (105 GB) copied, 155.323 s, 675 MB/s
104857600000 bytes (105 GB) copied, 154.346 s, 679 MB/s
104857600000 bytes (105 GB) copied, 153.967 s, 681 MB/s

We have 16 of these "things", with enough IB network attached, so we should see roughly 57GB/s write and 86GB/s read, or the same bandwidth as some twelve million V.92 (56K) dial-up modems.
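
The modem math, for the skeptical - take the aggregate read bandwidth in bits and divide by 56kbit/s (bc will happily oblige):

echo "86 * 8 * 10^9 / 56000" | bc
12285714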

Yep, this isn't your grandpappy's flash disk ;-)

Wednesday, September 9, 2015

IT organizational health...

Scene One: A doctor's office on main campus

Enter stage left: IT leader (ITL)

Doctor: "Hello, ITL please do take a seat..."

ITL: "Hello Doctor, I can't thank you enough for taking time to see me for my annual organizational health and performance review!"

Doctor: "No problem ITL, wonderful to see you again. However, I do have some rather unfortunate news about your overall organizational health I'm afraid..."

ITL: "Yes? That's interesting because since last year when we last spoke, we've been making significant strides to improve our overall organizational health. For example, we've hired many external consultants and project managers who have been helping us not only eat better and exercise more, but also help us spend all our money and manage our deliverables and projects with key stakeholders!"

Doctor: "Well I'm not sure quite how to put this..."

ITL: "Go on, I'm listening"

Doctor: "Well. It's the right arm see. It's rotten"

ITL: "Really? It looks fine to me. In fact, last year we gave right arm 10 out of 10 in its performance review - a stella all round performer is right arm. We rely on right arm most heavily for a great number of tasks!"

Doctor: "Well, I don't know what else to say, but it's rotten. It's also kinda ruining the rest of the organization for you I'm afraid"

ITL: "Wow - so ok then, what do you suggest we do with this whole right arm situation?"

Doctor: "It's going to have to come off I'm afraid. Nothing else for it, we will simply have to remove it"

ITL: "Well. The leadership team are very fond of right arm. We all look to right arm as a constant source of support..."

Doctor: "I understand, but right arm is the root cause of all of the organizational health issues you are facing. It simply just has to be removed, and replaced with a better quality right arm. Don't worry we have done this procedure a number of times, it's mostly painless and the new arm... well I'm sure you will be most impressed with it!"

ITL: "I'm slightly worried we will not complete a few of our projects without right arm, but I'm willing to compromise some. You've also shown me that the current right arm maybe isn't quite as useful as we thought it was. So if we are to remove right arm..."

Doctor: "... and then replace right arm! We can't just remove it, that would end very poorly!"

ITL: "....whatever, anyway... I will want to reinvest the cost savings of running this whole right arm department. So, how about instead of these expensive solutions involving replacing right arm we simply recruit two more left feet to help out while you work on this whole right arm issue?"

Doctor: "Addition of two left feet may further compromise the stability of your organization. I would strongly recommend against this two left feet approach, especially if you cut off your right arm..."

ITL: "I read in the New York Times that organizations with two left feet are far more agile, I think that was what the article said, I did read it rather quickly. However to me, it sounds like if we remove this whole right arm situation, then the best idea would be to immediately move the budget from the right arm department and hire an additional left foot at once. I've heard from the left foot department that they are terribly understaffed, and could really do with additional support"

Doctor: "So quick question while we are on this left foot topic..."

ITL: "Sure what is it?"

Doctor: "Assume we do hire this additional left foot. And let's also assume as the New York Times article suggests it all works splendidly, and that things don't all just simply just fall over..."

ITL: "Yes?"

Doctor: "What do you suggest we do with the existing right foot department?"

ITL: "We probably ought to have outsourced right foot activities to an off shore service years ago! I hear many IT leaders are doing this. Also the feedback from the left foot group is that right foot never really helped them so much. Least in anyway that they could see. The left foot department have also been advocating we hire more left feet for a while now"

Doctor: "So you think that removing your right arm, having two left feet, and outsourcing all your right foot operations is going to be a good idea for the organization health as a whole?"

ITL: "I really do think so. Doctor, I just can't thank you enough! I love these little annual chats we have. See you next year! Cheerio!"

Doctor: "I can't even, I simply just can't..."

(fade to black)

Friday, July 31, 2015

How does your cluster sound?

So I was on the airplane coming back from XSEDE15 in St. Louis, and got to thinking about all the amazing visualizations that were on display. I wondered. What would a cluster sound like? On our HPC cluster we have millions of jobs running each month, and often 10-20,000 running simultaneously. So I decided to go on a hunt for a MIDI player and a MIDI file format generator.

Found both in seconds, the internet is awesome!

First up, a player for OSX (if you don't do the --with-libsndfile build, it won't work at the end):
brew install libsndfile lame
brew install --with-libsndfile fluidsynth

Now download a soundfont (wow this takes me back!)


and we have the musics!

Jamess-MacBook-Pro:GeneralUser GS 1.44 FluidSynth jcuff$ fluidsynth -i ./GeneralUser\ GS\ FluidSynth\ v1.44.sf2 demo\ MIDIs/All\ Night\ Long.mid
FluidSynth version 1.1.6
Copyright (C) 2000-2012 Peter Hanappe and others.
Distributed under the LGPL license.
SoundFont(R) is a registered trademark of E-mu Systems, Inc.

Ok so we can play a midi file from the CLI. Time to write one now. And yet again, the internet provides:

So here we go:
Jamess-MacBook-Pro:~ jcuff$ wget
Jamess-MacBook-Pro:~ jcuff$ unzip 
Jamess-MacBook-Pro:~ jcuff$ cd MIDIUtil-0.89
Jamess-MacBook-Pro:~ jcuff$ ls
Jamess-MacBook-Pro:~ jcuff$ python ./setup.py install
Jamess-MacBook-Pro:~ jcuff$ sudo python ./setup.py install
Jamess-MacBook-Pro:~ jcuff$ python ./examples/ 
Jamess-MacBook-Pro:~ jcuff$ file output.mid 

output.mid: Standard MIDI data (format 1) using 1 track at 1/960

Jamess-MacBook-Pro:MIDIUtil-0.89 jcuff$ fluidsynth -v -i ../GeneralUser\ GS\ 1.44\ FluidSynth/GeneralUser\ GS\ FluidSynth\ v1.44.sf2 output.mid 
FluidSynth version 1.1.6
Copyright (C) 2000-2012 Peter Hanappe and others.
Distributed under the LGPL license.
SoundFont(R) is a registered trademark of E-mu Systems, Inc.

fluidsynth: noteon 0 60 100 00000 0.975 1.113 0.000 0
fluidsynth: noteoff 0 60 0 00000 1.612 1

Ok - so we can play notes from the command line. Time to knock up a parser for the sacct data; let's use -p and -l so we can actually parse the data that comes out… :-)

[root@sa01 tmp]# sacct -p -l > /tmp/sacct.dat
[root@sa01 tmp]# wc -l /tmp/sacct.dat
58651 /tmp/sacct.dat
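
One note before the parser: the script later in this post hard-codes the AllocCPUS, Elapsed and State columns as ff[21], ff[22] and ff[23], and the exact positions vary with your Slurm version, so it's worth eyeballing the header first (remembering python's split() counts from zero):

[root@sa01 tmp]# head -1 /tmp/sacct.dat | tr '|' '\n' | cat -n | egrep -i 'alloccpus|elapsed|state'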

Ok so we have some rich sacct data for the 58,651 jobs in the system right now. Let's look at the ones that completed and ran for less than a day, write some weird chord generator, throw in Michele Clamp's epic python from a rather extreme one hour pair programming episode, and we give you…


Here's the code for your enjoyment and modification.

So what does your cluster sound like?


You can get at your sound file to post to your SoundCloud site with this:
Jamess-MacBook-Pro:MIDIUtil-0.89 jcuff$ python ./ -f tt -s 1 -e 4

Jamess-MacBook-Pro:MIDIUtil-0.89 jcuff$ fluidsynth -F out.wav -i ../GeneralUser\ GS\ 1.44\ FluidSynth/GeneralUser\ GS\ FluidSynth\ v1.44.sf2 output.mid 
FluidSynth version 1.1.6
Copyright (C) 2000-2012 Peter Hanappe and others.
Distributed under the LGPL license.
SoundFont(R) is a registered trademark of E-mu Systems, Inc.

Rendering audio to file 'out.wav'..

Jamess-MacBook-Pro:MIDIUtil-0.89 jcuff$ lame out.wav 
LAME 3.99.5 64bits (
Using polyphase lowpass filter, transition band: 16538 Hz - 17071 Hz
Encoding out.wav to out.mp3
Encoding as 44.1 kHz j-stereo MPEG-1 Layer III (11x) 128 kbps qval=3
    Frame          |  CPU time/estim | REAL time/estim | play/CPU |    ETA 
 27776/27776 (100%)|    0:17/    0:17|    0:18/    0:18|   40.386x|    0:00 
   kbps        LR    MS  %     long switch short %
  128.0       50.4  49.6        99.6   0.3   0.2
Writing LAME Tag...done
ReplayGain: +10.1dB

Jamess-MacBook-Pro:MIDIUtil-0.89 jcuff$ open out.mp3 

And finally, here's the code!
Jamess-MacBook-Pro:MIDIUtil-0.89 jcuff$ cat ~/Downloads/ 

# wget
# unzip 
# cd MIDIUtil-0.89
# sudo python ./setup.py install

# Fluidsynth has better noises :
# brew install --with-libsndfile fluidsynth
# wget
# unzip this into a directory

# python -f sacct.dat -s 17 -e 25  (Writes into output.mid)

# fluidsynth -F output.wav -i ./GeneralUser\ GS\ 1.44\ FluidSynth/GeneralUser\ GS\ FluidSynth\ v1.44.sf2 output.mid

from argparse  import ArgumentParser
from random    import randint

import math
import re

from midiutil.MidiFile import MIDIFile

parser        = ArgumentParser(description = 'Convert sacct data to midi')

parser.add_argument('-f','--file'     ,      help="The sacct data file")
parser.add_argument('-s','--programstart'  , help="The instrument program start")
parser.add_argument('-e','--programend'  ,   help="The instrument program end")
parser.add_argument('-b','--bpm'  ,          help="Beats per minute")

args = parser.parse_args()

programstart = 1 
programend   = 1
bpm          = 120

if args.programstart is not None:
   programstart = int(args.programstart)

if args.programend is not None:
   programend= int(args.programend)

if args.bpm is not None:
   bpm = int(args.bpm)

fh = open(args.file)

MyMIDI = MIDIFile(1)

track = 0
time  = 0

MyMIDI.addTrackName(track,time,"Sample Track ")
MyMIDI.addTempo(track,time, bpm)

lnum = 0 
daysecs = 24*60*60

maxtime  = 0

for line in fh:
   lnum = lnum + 1

   if lnum == 1:
      continue   # skip the sacct header line

   line    = line.rstrip('\n')
   ff      = line.split('|')

   cores   = int(ff[21])
   elapsed = ff[22]
   status  = ff[23]

   if lnum%10 != 0:
      continue   # only sample every tenth job, or the tune never ends

   if status != "COMPLETED":
      continue   # only completed jobs get to play

   tt = elapsed.split(':')
   channel  = 0

   duration = cores+1
   volume   = 90

   if len(tt) == 3 and elapsed > 0 and cores != 0 and '-' not in elapsed:

     program = randint(programstart,programend)  # pick a random instrument from the requested range
     secs    = int(tt[0])*60*60 + int(tt[1])*60 + int(tt[2])
     newsecs = 10 + int(secs*(127-10)/daysecs)   # scale 0..24h of runtime onto MIDI pitches 10..127
     #time    = secs*120.0/float(daysecs)
     #pitch   = cores
     pitch    = newsecs

     MyMIDI.addProgramChange(track,channel, time, program)

     print tt[0],tt[1],tt[2],secs,newsecs,daysecs,program,time

     MyMIDI.addNote(track,channel,pitch,time,duration,volume)

     # the "weird chord": the same note again an octave up, clamped to the MIDI range
     tmppitch = pitch + 12
     if tmppitch > 127:
       tmppitch = 127
     MyMIDI.addNote(track,channel,tmppitch,time,duration,volume)

     if time > maxtime:
        maxtime = time

     tmp = float((math.sqrt(cores))/2.0)

     if tmp > 10.0:
       tmp = 10.0
     time = float(time) + tmp

print maxtime

i =  0


while i < maxtime:
   i = i + 1

# And write it to disk.
binfile = open("output.mid", 'wb')
MyMIDI.writeFile(binfile)
binfile.close()

Thursday, April 16, 2015

Of huge pages and huge performance hits, are we alone?

We do a fair amount of sequence analysis here. One thing we do a lot of is trimming sequence data. The files are somewhat large. I'm not allowed to call this "big data" :-) There's a neat trimming code called "trimmomatic" (awesome name eh?). It's a simple enough piece of java, but it interacts poorly with our machines, and it turns out it is not alone as a code.

We have a huge page table issue.

A very big one.

So it turns out khugepaged manages collapsing regular pages into huge pages in memory, and when you have a large code such as this one, which pulls a pair of 7GB data files together, modifies them, and then tries to get them back out to disk as fast as it can, you can see it will stress any machine. We used local storage for this and kept it simple.
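
If you want to watch the huge page machinery doing its thing while a job like this runs, here are a couple of knobs on our RHEL6-era boxes (paths may differ on your kernel):

grep AnonHugePages /proc/meminfo     # anonymous memory currently backed by huge pages
egrep 'thp_' /proc/vmstat            # THP allocation and collapse counters
cat /sys/kernel/mm/transparent_hugepage/enabled   # [always] means THP is on

Anyway, on with the run: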
[jcuff@regal01 dist]# time java -jar ./jar/trimmomatic-0.33.jar PE -phred33 ./FR4_P_pilosa_CTTGTA.R1.fastq.gz ./FR4_P_pilosa_CTTGTA.R2.fastq.gz outr1.dat outr2.dat outr1un.dat outr2un.dat ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

TrimmomaticPE: Started with arguments: -phred33 ./FR4_P_pilosa_CTTGTA.R1.fastq.gz ./FR4_P_pilosa_CTTGTA.R2.fastq.gz outr1.dat outr2.dat outr1un.dat outr2un.dat ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
Multiple cores found: Using 16 threads

ILLUMINACLIP: Using 1 prefix pairs, 0 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences

Input Read Pairs: 25776503 Both Surviving: 23790480 (92.30%) Forward Only Surviving: 1901563 (7.38%) Reverse Only Surviving: 49159 (0.19%) Dropped: 35301 (0.14%)

TrimmomaticPE: Completed successfully

So how did we do?

real 57m47.317s
user 44m24.784s
sys 435m41.152s

Yeah that's pretty slow.

While we were running we saw khugepaged @ 100% in top and then in "perf":

25022 root      39  19    0    0    0 R 100.0  0.0  6:23.13 khugepaged

[root@regal01 dist]# perf top

Samples: 191K of event 'cycles', Event count (approx.): 104146992710
 75.44%  [kernel]             [k] _spin_lock_irqsave
  4.21%  [kernel]             [k] _spin_lock_irq
  1.02%              [.] logicalSubscript
  0.83%  [kernel]             [k] ____pagevec_lru_add

Never good to be in _spin_lock_irq

Now let's take out our THP (transparent huge pages):

[root@regal01 dist]# echo never > /sys/kernel/mm/transparent_hugepage/enabled

How did we do?
[jcuff@regal01 dist]# time java -jar ./jar/trimmomatic-0.33.jar PE -phred33 ./FR4_P_pilosa_CTTGTA.R1.fastq.gz ./FR4_P_pilosa_CTTGTA.R2.fastq.gz outr1.dat outr2.dat outr1un.dat outr2un.dat ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

TrimmomaticPE: Started with arguments: -phred33 ./FR4_P_pilosa_CTTGTA.R1.fastq.gz ./FR4_P_pilosa_CTTGTA.R2.fastq.gz outr1.dat outr2.dat outr1un.dat outr2un.dat ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
Multiple cores found: Using 16 threads

ILLUMINACLIP: Using 1 prefix pairs, 0 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences

Input Read Pairs: 25776503 Both Surviving: 23790480 (92.30%) Forward Only Surviving: 1901563 (7.38%) Reverse Only Surviving: 49159 (0.19%) Dropped: 35301 (0.14%)

TrimmomaticPE: Completed successfully

Drum roll please!
real 3m23.022s
user 6m22.666s
sys 0m38.528s


That's a pretty big deal. I'll always take a ~17x speed up if I can get it.

Turns out we are not alone:

There's an ongoing discussion here about the current state of the art.

In the meantime we are going to disable our THP.
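
To make that stick across reboots on our RHEL6-era boxes, the simplest sketch I know of is rc.local (the defrag knob is worth setting too where it exists):

cat >> /etc/rc.d/rc.local <<'EOF'
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
EOF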

Have others seen this? Happy to see the comments.

Thursday, February 5, 2015

Please, purchase my storage solution....

CUE: Story opens, In a small office at a research computing department:

Endearing Storage Vendor: ".... so, now you have seen our technology, you will want to purchase our one of a kind "storage solution" you will be inordinately happy and immediately absolved of any and all future storage issues... forever. We guarantee it! We would truly love to partner with you, we have a unique, one of a kind system. Once we install your system, you can basically take two weeks off, but also in the meantime we will arrange to get your hair to grow back, and I will buy you many beautiful steak dinners... you are feeling very sleepy... but very satisfied with your decision to partner with us, it is a one of a kind product, did I tell you that our CTO invented....."

(beautiful harp music plays in the background)

Research Computing Director [Dreaming] : Oh wow, this stuff sounds absolutely fantastic, I bet I could finally sleep at night, the milk would never spill or go sour anymore. Life would finally be full of unicorns and rainbows! I so much want to live in this fantastic land of flawless storage, unlimited capacity, endless feature sets, complete 100.1% reliability and uptime, oh it's going to be so utterly awesome. In this world storage never, ever goes bad. Hold that thought I NEED to live in this world!! I MUST buy this storage array... I have to raise a PO!....

Endearing Storage Vendor: When I click my fingers you will awake, refreshed and ready to place your purchase order... 3..... 2...... 1.....

[Click] (director wide awake)

Research Computing Director [Sweating]: Whoah! What! Hang on! Wait Nooooo!!

CUE: Fade to black...

So, all joking aside, I've been doing this job, and jobs much like it, for years. I actually do know the exact storage system that exists in the dream from our little story above. And, given we are all friends here, I'll take a moment to share the answer with you. Let's keep it our little secret though; we should not let this trade secret get out.

Ok, so are you ready? Ok, so it's this one:

Not a single one of them!

Yep - you heard right folks, not a single one. I know I'm like a total heretic right?

You've probably all heard the endearing storage vendor promises... I have them in the archive, somewhere, let me go dig them out for you...

CUE: The clip of "The top 50 Most Endearing Storage Vendor Quotes":

"The competition are light years behind our technology! They are slower, more expensive, and totally unreliable, I mean they basically have NO clue! Our CTO literally invented the binary system!"

"That custom Linux kernel you hand rolled may be clever, but it does not scale. Our custom fork of Plan9 we use to power our ARM powered ASICS - it's quite literally lightyears ahead of the competition"

"Here, how about this... you can try our storage for no cost. I'll ask my manager so you can have a little bit for free - don't worry, we can talk price after your first petabyte migration"

"We vet every single patch upgrade before we release to our customers - rolling upgrades from any point release result in zero downtime"

"This storage will basically NEVER fail - people like Harva... oops sorry I can't disclose our clients, but they think it's totally wonderful, I can set up a call with Dr. X, he will totally vouch for how awesome we are."

"Let's not talk price just yet, let me show you how we use quantum laser effects to increase our redundancy and reliability"

"The next version has a completely redesigned API and REST interface, oh and it will be a seamless data in place update - don't worry"

"I want to take a little time to explain to you about our differential value"

"Let me take a moment to explain how we use a stronger steel frame for our cabinets, it is a key differentiator"

"The drives have a perpetual motion device as bearings, you can basically think of them as "physical flash drives"

"We run one of the two top advanced storage manufacturing plants located south of Basildon"

"Our disk magnets are sourced from an ancient salt mine just south of Las Vegas"

"We are in one of the top one worldwide soda manufacturers, we would tell you but we keep our clients confidential"

"We are unique in the market place. Our product is one of a kind. You need to understand our differential value. Let me set up a call with our CTO, so he can explain how this works at a deep technical level. Did we tell you our CTO invented the binary system?"

"You guys shouldn't waste your time building your own storage. We have an end to end solution for you."

"Putting all your storage under our single name space with our amazing technology will just make everything easier."

"Did I tell you already that our CTO invented the binary system?"

"Would you mind if I called some of your Faculty directly so I can show them our value? I don't want to go over your head or anything, but I really need to show them the value of our system, so they can see why you should buy this system."

"... and this was when our founders invented magnetism"

"Great question! Cluster quorum is maintained by a remote software as a service cloud"

"Our storage array was certified by the TSA, and is in use at 5 of the national airlines that fly out of Canada, we could tell you be we want to keep our clients confidential"

"Through our technology we have effectively achieved 200 nines of reliability, and 800 days of uptime a year"

"We have essentially redesigned how RAID works, let us show you the following algebra..."

"It is essentially a software defined storage stack written into a dedicated FPGA so it's very flexible..."

"You basically don't need backups any more!"

"Great question! I'll circle back with engineering and get right back to you - Steve be sure to take a note on that - great question!"

"I'll skip over these marketing slides so we can do a deep dive on our technology... oh just one thing while we are here, we do as you can see from this slide sell to all of your competitors, but anyway, let's get to the technology, oh and this customer here purchased 500 petabytes, ok moving on..."

"We call this feature RAID ONE MILLION. Yeah I know right? It really is literally that good."

"Cache coherence is on our roadmap"

"Hey let's get a round table with your engineering team. I'll bring our top people in so we can show your team our differential value, once your engineers see this they will be ready to convince you to purchase this storage."

"Great question! File locking is absolutely due for the next release"

Oh and the best ever...?

"This product literally pays for itself!"

So... I dunno about you, but unless this disk array prints twenty freaking dollar bills, that thing ain't paying for anything, least of all itself!

So as I said, it's been my day job to be "sold" to for a number of years now. I've quite possibly heard them all. They also say the easiest thing in the world is to sell to a salesman, and I've been told that I'm a bit of a salesman, or at least I've been seen to play one on the T.V...

Even so...



p.s. I shall never, ever disclose my sources of "ESV" tee hee :-)

Wednesday, December 17, 2014

Of big microscopes and even bigger data...

We recently installed one of these awesome electron microscopes... In the center I help lead, we are imaging brains, but more about that another time. Right now this is all about getting this thing running, and running at speed, with some lovely UNIX geekery... I don't get anywhere near enough time these days to get my paws on a CLI, but I needed to stick my nose into this one!

It's extremely cool looking eh? However, it needs a fair amount of horsepower just to "catch" the data that streams off it. It is also a scientific instrument, so of course the file system ends up being more than just a little bit hairy. For reference, here is the output of a single "run":
[root@storage]# du -sh .
6.6T .

[root@storage]# ls -RU | wc -l

[root@storage]# find -P . -type f | rev | cut -d/ -f2- | rev |  cut -d/ -f1-2 | cut -d/ -f2- | sort | uniq -c
  65226 001
  62994 002
  67458 003
  67954 004
  65226 005
  62994 006
  67458 007
  67954 008
  65226 009
  62994 010
  67458 011
  67954 012
  65226 013
  62994 014
  67458 015
  67954 016
  65226 017
  62994 018
  67458 019
  67954 020
  65226 021
   8559 022

So 1/2 million files in 6.6T with ca. 65,000 per dir and each image is about 612K...
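
By the way, that find/rev/cut monster above just counts files per acquisition directory; a simpler sketch that keys on the top-level directory gets much the same answer, assuming the run directories sit at the top of the tree like they do here:

[root@storage]# find . -type f | awk -F/ '{print $2}' | sort | uniq -c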

Please stop me if you have heard *any* of this before :-)

Hehehehe :-)

Anyway, our first task was to catch this stuff. It flies in from the instrument at a rate of about 3TB an hour, out of eight distinct and separate Windows acquisition servers writing directly to a CIFS mount -- yeah I know, hashtag awesome right? More on SAMBA tuning at scale in another post...

So we benched our storage: an MD3260 with a couple of MD3260e expansion bricks, making for a nice 0.6PB single-image file system built out of 180 spindles tucked behind an R720.

Nothing too exotic, and at 3TB/hr design spec we need only a dedicated 10G, so we double bagged it, and popped a pair of 10 gee bee cards in the box, span up some LACP and so we were off to the races!
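
For the curious, "double bagged" here is nothing more exotic than RHEL6-style LACP bonding; a minimal sketch, with made-up interface names and addressing rather than our production config:

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
BONDING_OPTS="mode=802.3ad miimon=100 xmit_hash_policy=layer3+4"
IPADDR=192.0.2.10
NETMASK=255.255.255.0
BOOTPROTO=none
ONBOOT=yes

# /etc/sysconfig/network-scripts/ifcfg-eth2 (and the same again for eth3)
DEVICE=eth2
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
ONBOOT=yes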

Until we weren't... Do you see the problem here:
41252 be/4 root 0.00 B/s  247.65 M/s  0.00 % 78.54 % dd if=/dev/zero of=test.dat
bs=1024k count=1000000000

Yeah, so that's 250MB/s peak, on the box, with no network in the way, direct to disk with caches working - which is about 0.9TB/hour... Oh, and this was at about the same time the imaging center director called our group telling us that the microscope was broken, and that he thought our network was very broken... Yep, a bad day in paradise this sure was turning out to be... and I couldn't even blame it on the network this time! :-)

So we are really not doing so well here. I poked about inside some of our other boxes... we run loads of this stuff... At first I was seeing the same results on some, and on others we were just fine... until I stumbled across one that was pulling nearly 800MB/s... so I looked just a little closer at the config of the one that was working the way I expected...

The default setting as shipped is 4K, which is no good for streaming writes!

Arrggh - the sort of things we used to worry about in the '90s were back and in full effect. Flipped the button, all better. Still not quite seeing decent performance though; the design spec with this number of spindles and 4 x 6Gb/s SAS wires should peak at... urrm, types into google...

So I bust out a copy of the awesome bwm-ng:

Doh! We are not striping, so only using one of the four available 6Gb/s SAS lines... LVM has two modes of operation, Linear and Striping... we were using the Linear one, which was no good... so let's go fix it!
[root@storage]# umount /fs

[root@storage]# lvremove /dev/store_md32xx_vg/store_md32xx_lv
Do you really want to remove active logical volume store_md32xx_lv? [y/n]: y
  Logical volume "store_md32xx_lv" successfully removed

[root@storage]# lvcreate --extents 100%FREE --stripes 10 --stripesize 256 --name store_md32xx_lv store_md32xx_vg
  Logical volume "store_md32xx_lv" created

[root@storage]# lvs --segments
  LV                  VG                  Attr       #Str Type    SSize  
  store_md32xx_lv     store_md32xx_vg     -wi-a-----   10 striped 545.73t
  lv_root             vg_root             -wi-ao----    1 linear    1.09t

[root@storage]# mkfs.xfs /dev/mapper/store_md32xx_vg-store_md32xx_lv 

[root@storage]# mount -a

[root@storage fs]# dd if=/dev/zero of=test.dat bs=1024k count=1000000

Yay! We are now cpu bound on this dd… :-)

17773 root      20   0  103m 2780 1656 R 100.0  0.0   8:26.73 dd 

And now we are striped - so much better, nice balance across LUNs!

Major kudos to Justin Weissig's great corner of the internet for helping out!

We also removed the cache mirror - this is a catch-it-as-fast-as-you-can system, so we can take the risk of controller issues for this application (don't try this at home kids!)... we put it back in the end, but wanted to make sure it was not a bottleneck.

And here we are all finished running eight at a time and pushing 2.6GB/s (woot!):

At this rate we can also support two more microscopes all off the same kit... with each of them running at full tilt!

How's about them apples for some serious price/performance eh? :-)

Oh and one more thing...

Hashtag BIG DATA... :-)

Thursday, October 23, 2014

OdyBot And Pointy Haired Alerting! AKA: Grumpy old man shakes fist at web two dot oh!

Hi all,

It's been ages since I last posted. We have been super busy in the day job running our monster high performance computing infrastructure and keeping lots of petabytes spinning and many scientists and researchers happy. I wanted to quickly note that we recently had a new addition to our group, supporting the Odyssey cluster.

Let me introduce:


Behind the scenes there are all sorts of fun activities, like checking that our data centers are neat and tidy and doing lots of awesome science:

(video from Harvard FAS Research Computing on Vimeo)

and sometimes just chilling out riding a skateboard around the yard:

(video: "OdyBot Gets Schooled", from Harvard FAS Research Computing on Vimeo)

You can find out all about OdyBot over at

Meanwhile we have had a couple of integration issues back at the ranch making sure that our awesome RC support staff are able to answer questions as our community asks them, and I think my old school methods finally got the better of me yesterday...

So, we use two online web services, Userlike and hipchat, to provide our external "voice" for OdyBot and for our internal communications. We wanted a quick way to post an alert to our main chat room when the operator count became zero.

Simple eh?

Well, kinda... and given that I'm not a child of the web 2.0 world I went about it in true UNIX style: a python script, with a unix pipe to a perl script... I'm sure sometimes I just do this stuff to wind up my team. :-) Anyway, here's the hipchat part, based on the awesome script, with the following two-second changes to allow it to read from stdin and quote out the <CR>'s for all that downstream JSON cleverness...
bash-3.2$ diff 
>             "message=s"=> \$optionMessage,
< while (<>){
<   $optionMessage .= $_;
< }
< $optionMessage=~s/\n/\\n/g;

The nice thing about this is you can quickly post random stdin stuffs to your chat rooms:

Then I busted out some extremely suspect python:
bash-3.2$ cat 
import httplib2
import os
import json

API_HOST  = os.environ.get('API_HOST') or ''
API_TOKEN = os.environ.get('API_TOKEN')    # your Userlike API token

c = 0
h = httplib2.Http()

# GET the operator list from the Userlike API (the URL itself lives in API_HOST)
resp, content = h.request(API_HOST, 'GET',
        headers={'Authorization': API_TOKEN})

data = []
if resp.status == 200:
    data = json.loads(content)
else:
    print 'Error status=%s' % resp.status

for x in data:
    if (x['slots']['online'] == 1):
      c = c + 1

cc = 1

if (c>0):
    print "OdyBot Operators Online..."
    for x in data:
      if (x['slots']['online'] == 1):
        print ("%d) %s" % (cc, x['name']))
        cc = cc + 1
else:
    print "Warning!  No OdyBot Ops Are Online!!"

Which works by just querying the Userlike API:
bash-3.2$ ./ 
OdyBot Operators Online...
1) James Cuff
2) Bob Freeman
3) John Noss
4) Dan Caunt

And there you have it - simple alerting to the main chat room when you need it to tell you that folks are not on the wire taking our OdyBot support requests! This is actually pretty important, we absolutely don't want to have our community waiting, and we have invested a lot of time and effort into the OdyBot concept so that our community can contact us. We also have an open OdyBot community list that folks inside and outside of Harvard can use to post questions, tips and techniques, although it's only just starting to ramp up right now.
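
For completeness, the whole contraption really is just the pipe described above; with hypothetical script names it looks like this, dropped into cron every few minutes:

bash-3.2$ ./userlike_ops.py | ./hipchat_message.pl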

Question is:

Could you do this integration in an even more ghetto/bandaid/baling wire fashion? :-)

p.s. we are also testing SLACK as a replacement for hipchat as I type. For pretty much the exact same reasons we had to replace Zopim with Userlike yesterday - as you can see, even with shoddy perl and python scripts, it is all about the integration these days!

No matter how ghetto the methods ;-) #allhailtheunixpipe

Friday, July 18, 2014

Of style and science...

There are times in your career that you really, really remember.

This was one of those times.

My then head of department, the dearly departed and most wonderful Professor Dame Louise Johnson, wrote this note to my D. Phil. supervisor Geoff back in 1997. Geoff recently sent me a copy while clearing out space to move into his fabulous new building over in Dundee.

To this day, I love that Louise who was an absolute scientific powerhouse said of my research:

"we thought the science was fine"

Although more importantly, her feedback about their concern for my writing style was what has really stuck with me over the years!

Nowt's changed much for me. Be at rest, Louise

Tuesday, June 24, 2014

Ohai Linux! So you are a network switch now...

Decided to see what the fuss was all about surrounding these open source switches. Plus the rocket powered turtle really did pique my interest ;-)


I built all of this on a CentOS release 6.5 (Final), and I wanted to build everything from source to really see how ONIE worked from the ground up. Don't try this at home kids, there is no reason to try and damage yourself.
git clone

I needed to add some deps. It was a little painful to find what was missing (much make, fail, make, fail, repeat), but the list below should be enough for most folks so you don't have to go through the iterations I did - this is a monster build. I learned a lot here, never having used "realpath" for example, or any of the syslinux kit, which is fab!
sudo yum install realpath
sudo yum install gperf
sudo yum install stgit
sudo yum install texinfo
sudo yum install glibc-static
sudo yum install libexpat-devel
sudo yum install python-devel
sudo yum install fakeroot
sudo yum install syslinux syslinux-devel syslinux-extlinux syslinux-perl
sudo ln -s /usr/share/syslinux /usr/lib/syslinux

Oh and get a fresh autoconf if you are on CentOS 6.5
tar zxvf autoconf-latest.tar.gz
cd autoconf-2.69/
./configure
make
sudo make install

And away we go!
[jcuff@jcair-vm build-config]$ make -j4 MACHINE=kvm_x86_64 all recovery-iso

mkdir: created directory `/home/jcuff/onie/build'
mkdir: created directory `/home/jcuff/onie/build/images'
mkdir: created directory `/home/jcuff/onie/build/download'
mkdir: created directory `/home/jcuff/onie/build/kvm_x86_64-r0'
mkdir: created directory `/home/jcuff/onie/build/kvm_x86_64-r0/stamp'
mkdir: created directory `/home/jcuff/onie/build/kvm_x86_64-r0/initramfs'
==== Getting Linux ====
2014-06-11 14:50:54 URL: [65143140/65143140] -> "/home/jcuff/onie/build/download/linux-3.2.35.tar.xz" [1]
linux-3.2.35.tar.xz: OK

wheee! (get a large beverage, this bit takes a while!)
[jcuff@jcair-vm build-config]$ ls -ltra ../build/images/
total 34212
drwxrwxr-x. 7 jcuff jcuff     4096 Jun 11 17:24 ..
-rw-rw-r--. 1 jcuff jcuff  3301792 Jun 11 18:29 kvm_x86_64-r0.vmlinuz
-rw-rw-r--. 1 jcuff jcuff  5284988 Jun 13 11:23 kvm_x86_64-r0.initrd
-rw-rw-r--. 1 jcuff jcuff  8603253 Jun 13 11:23 onie-updater-x86_64-kvm_x86_64-r0
drwxrwxr-x. 2 jcuff jcuff     4096 Jun 13 11:29 .
-rw-rw-r--. 1 jcuff jcuff 17825792 Jun 13 11:30 onie-recovery-x86_64-kvm_x86_64-r0.iso

Make a disk:
[root@jcair-vm onie]# dd if=/dev/zero of=/tmp/onie-x86-demo.img bs=1M count=256
256+0 records in
256+0 records out
268435456 bytes (268 MB) copied, 0.272711 s, 984 MB/s

Spin up the kvm!
[root@jcair-vm onie]# sudo /usr/libexec/qemu-kvm -m 1024 -name onie -boot order=cd,once=d -cdrom /tmp/onie.iso -net nic,model=e1000 -vnc -vga std -drive file=/tmp/onie-x86-demo.img,media=disk,if=virtio,index=0 -serial telnet:localhost:9000,server

And you are golden!
ONIE: Starting ONIE Service Discovery
Info: Found static url: file:///lib/onie/onie-updater
ONIE: Executing installer: file:///lib/onie/onie-updater
Verifying image checksum ... OK.
Preparing image archive ... OK.
ONIE: Version       : master-201406241118-dirty
ONIE: Architecture  : x86_64
ONIE: Machine       : kvm_x86_64
ONIE: Machine Rev   : 0
ONIE: Config Version: 1
Installing ONIE on: /dev/vda
Pre installation hook
Post installation hook

Remove the CD from your config and you can now boot the live version, and if everything has worked out, the discovery process will work and you can now ping the UK from the USA...
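
"Remove the CD" here just means dropping the -cdrom and the once=d from the earlier qemu line; a sketch (the -vnc display argument is a guess on my part):

sudo /usr/libexec/qemu-kvm -m 1024 -name onie -boot order=c \
  -net nic,model=e1000 -vnc :0 -vga std \
  -drive file=/tmp/onie-x86-demo.img,media=disk,if=virtio,index=0 \
  -serial telnet:localhost:9000,server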
ONIE: Rescue Mode ...
Version   : master-201406241118-dirty
Build Date: 2014-06-24T11:40-0400
Info: Mounting kernel filesystems... done.
Info: Mounting LABEL=ONIE-BOOT on /mnt/onie-boot ...
Running demonstration platform init pre_arch routines...
Running demonstration platform init post_arch routines...
Info: Using eth0 MAC address: 52:54:00:2b:63:f6
Info: eth0:  Checking link... up.
Info: Trying DHCPv4 on interface: eth0
ONIE: Using DHCPv4 addr: eth0: /
Starting: dropbear ssh daemon... done.
Starting: telnetd... done.
discover: Rescue mode detected.  Installer disabled.

Please press Enter to activate this console. 

ONIE:/ # onie-sysinfo -a
VM-1234567890 52:54:00:2b:63:f6 master-201406241118-dirty 42623 kvm_x86_64 0 x86_64-kvm_x86_64-r0 x86_64 1 gpt 2014-06-24T11:40-0400

ONIE:/ # ping
PING ( 56 data bytes
64 bytes from seq=0 ttl=61 time=108.473 ms
64 bytes from seq=1 ttl=61 time=103.824 ms
64 bytes from seq=2 ttl=61 time=103.238 ms
--- ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 103.238/105.178/108.473 ms

p.s. for extra twisted points this is ONIE running on linux KVM, inside virtualbox, on a mac on a pair of different layer three networks... it becomes a little confusing to run commands, but always makes me chuckle that a mac laptop is basically a little data center at this point :-)
jcair:~ jcuff$ uname -v

Darwin Kernel Version 13.2.0: Thu Apr 17 23:03:13 PDT 2014; root:xnu-2422.100.13~1/RELEASE_X86_64

jcair:~ jcuff$ ssh -p 2222 root@ ssh uname -a

Linux onie 3.2.35-onie+ #1 SMP Tue Jun 24 11:30:01 EDT 2014 x86_64 GNU/Linux


Monday, May 5, 2014

compressing DRAM with ZRAM for fun and profit?


Can you use compressed DRAM for science if you don't quite have enough memory?


I'm going to file this under "Great idea, but my execution is slightly suspect"

Anyway, here's an example set up of compressed swap files:
[root@jcair-vm ~]# modprobe zram

[root@jcair-vm ~]# mkswap /dev/zram0
Setting up swapspace version 1, size = 104860756 KiB
no label, UUID=58476253-ad5a-4595-9bec-60bd09d76d30

[root@jcair-vm ~]# mkswap /dev/zram1
Setting up swapspace version 1, size = 104860756 KiB
no label, UUID=ed5d0f85-0245-472e-902e-0e94a743cbe0

[root@jcair-vm ~]# swapon -p5 /dev/zram0 
[root@jcair-vm ~]# swapon -p5 /dev/zram1

[root@jcair-vm ~]# swapon -s
Filename                                Type            Size    Used    Priority
/dev/zram0                              partition       104860752       0       5
/dev/zram1                              partition       104860752       0       5
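
One gotcha if you are following along: on most kernels you have to size each zram device before mkswap will take it, and ask the module for more than one device up front; a sketch matching the 100G sizes above:

[root@jcair-vm ~]# modprobe zram num_devices=2
[root@jcair-vm ~]# echo 100G > /sys/block/zram0/disksize
[root@jcair-vm ~]# echo 100G > /sys/block/zram1/disksize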

Clearly without the zram setup above, stress fails right out of the gate:
[root@jcair-vm ~]# stress --vm-bytes 2344600024 -m 2 --vm-keep
stress: info: [6063] dispatching hogs: 0 cpu, 0 io, 2 vm, 0 hdd
stress: FAIL: [6063] (415) -- worker 6065 got signal 9
stress: WARN: [6063] (417) now reaping child worker processes
stress: FAIL: [6063] (451) failed run completed in 10s

But, running a stress test with a memory allocation much bigger than the host seems to work just fine and dandy once we have our zram swap files like those noted above:
[root@jcair-vm ~]# stress --vm-bytes 2344600024 -m 2 --vm-keep
stress: info: [5383] dispatching hogs: 0 cpu, 0 io, 2 vm, 0 hdd

top - 10:41:17 up 13 days, 22:03,  4 users,  load average: 2.91, 0.87, 0.29
Tasks: 192 total,   4 running, 188 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.1%us, 74.8%sy,  0.0%ni, 11.5%id,  0.0%wa,  0.1%hi, 13.4%si,  0.0%st
Mem:   3923468k total,  3840852k used,    82616k free,     5368k buffers
Swap: 209721504k total,   626964k used, 209094540k free,    36932k cached

 5385 root      20   0 2242m 1.2g  124 R 96.4 31.3   0:48.12 stress
 5384 root      20   0 2242m 1.2g  124 R 84.0 32.0   0:48.12 stress

Yay! So - this looks like it could work!

And so here we go with a genome aligner to see if this works. This will be a good test as it writes real data structures into memory, whereas stress was doing a simple block fill. So first up, let's try without enough ram:
[root@jcair-vm ~]# cat 
./bowtie2/bowtie2 -x ./hg19 -p 4  <( zcat Sample_cd1m_3rdrun_1_ATCACG.R1.fastq.gz)

[root@jcair-vm ~]# ./ 
Out of memory allocating the ebwt[] array for the Bowtie index.  Please try
again on a computer with more memory.

Error: Encountered internal Bowtie 2 exception (#1)

Command: /root/bowtie2/bowtie2-align-s --wrapper basic-0 -x ./hg19 -p 4 /dev/fd/63 
(ERR): bowtie2-align exited with value 1

Ok, fair enough, so we have a reproducer.

Let's now set up a run with the right amount of physical ram:
[root@jcair-vm ~]# ./bowtie2/bowtie2 -x ./hg19 -p 4 <(cat cuff.fastq) -S out.dat &

7467 root 20 0 3606m 3.3g 1848 S 389.3 58.3 51:37.25 bowtie2-align-s

And we have a result!
[root@jcair-vm ~]# time ./bowtie2/bowtie2 -x ./hg19 -p 4 <(cat cuff.fastq)  -S out.dat 
13558597 reads; of these:
  13558597 (100.00%) were unpaired; of these:
  11697457 (86.27%) aligned 0 times
    545196 (4.02%) aligned exactly 1 time
   1315944 (9.71%) aligned >1 times
13.73% overall alignment rate

Ok, so let's shrink the memory of the machine and see if we can run with zram.

Let's also set the same priority and round robin between physical swap and zram, so each can write/read a block in turn - should be nice balanced I/O. The stress test worked, so our theory is that the in-memory data structures should compress, and we ought to get at least a 1.5:1 or 2:1 ratio out of the memory. I settled on a 3G machine with a 3G compressed swap and some physical swap also:
[jcuff@jcair-vm ~]$ swapon -s
Filename                                Type            Size    Used    Priority
/dev/zram0                              partition       2947008 614124  1
/dev/dm-1                               partition       4063224 614176  1

When running, it did result in a *much* smaller RES (982m vs 3.3G from the native example):

 2350 root      20   0 3606m 982m 1020 S 20.8 33.3  12:26.74 bowtie2-align-s

Things chugged along, but I was not seeing this ending any time soon, so I truncated the read file dramatically to ca. 5k reads to see if I could get a quick comparison between zram alone, hybrid zram plus swap, and plain old boring swap files.

As you can see below, only "boring old swap" resulted in anything sensible. The zram alone caused some rather spectacular OOM errors and obvious system instability - it was kinda fun though. You can also see below the various versions we tried out, none of which actually worked, but we are also not totally alone here either.

Oh and: "Just the right amount of memory" - like Goldilocks, that one worked ;-)
Machine with memory too small:          (ERR): bowtie2-align exited

3G zram:                                sshd invoked oom-killer: gfp_mask=0x200da

Hybrid 3G zram + 4G physical swap:      6m 25.285s

Hybrid 500MB zram + 4G physical swap:   1m 51.029s

Regular /dev/dm-1 swap file:            0m 29.741s

Machine with enough ram:                0m 12.698s

In summary... NO PROFIT this time :-(

Still a neat idea - just don't try this at home kids!

[any opinions here are all mine, and have absolutely nothing to do with my employer]
(c) 2011 James Cuff