Thursday, June 30, 2011

576 terabytes? Oh yes sir, that will do nicely!

We needed a moderately accessible, cost-effective nearline storage solution. It arrived yesterday (our account team busted their tails to get this in before our year end!). This morning we had a little "rack and stack" party, and things are looking pretty fabulous. We have spoken in the past about our slightly "ghetto" storage arrays:

http://blog.jcuff.net/2011/05/big-fat-storage-in-extreme-hurry.html

http://blog.jcuff.net/2011/03/diy-tb-and-rc-playing-catch-up-quick.html

But in Research Computing we sometimes need a slightly more supportable option. It costs a bit more $, but does not require rubber bands or the wiggling of SATA cables ;-)

So first off, we started with some 6Gb/s SAS connections:
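As a quick sanity check that each phy actually negotiated the full 6Gb/s, something like this minimal sketch does the trick on a Linux host. It assumes the stock /sys/class/sas_phy sysfs layout; it's an illustration, not our actual tooling:

#!/usr/bin/env python3
# Minimal sketch: confirm each SAS phy negotiated the full 6Gb/s link rate.
# Assumes a Linux host exposing the standard /sys/class/sas_phy tree;
# nothing here is specific to any particular array.
from pathlib import Path

for phy in sorted(Path("/sys/class/sas_phy").iterdir()):
    rate = (phy / "negotiated_linkrate").read_text().strip()
    print(f"{phy.name}: {rate}")  # the kernel reports e.g. "6.0 Gbit"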


Wednesday, June 29, 2011

getting yer old school on with tru64 via es40 emulation

Emulators are fun! Especially when I get to go back to my early days of computing. Back in the day I used to run a 400-CPU Alpha cluster over at the Sanger. It was all Tru64, all the time. We started out with 4.0F and ended, when Alpha itself ended, at 5.1B - fun times indeed. Here's a little picture of the Tru64 DS10 pie box cluster as it was being built at that time in a machine room at the EBI.



Wednesday, June 22, 2011

function over beauty: of busted up feet and awesome science!

This time last year, my director of scientific computing sent this link around to the team with a statement defining this picture as the epitome of research computing.

I agreed!


tokyo dr grape machine



Monday, June 20, 2011

2008-2011 RIP our first "cluster"

Our first cluster debuted on the June 2008 Top500 list at number 61. It was our first tightly coupled system, with 4,096 cores on DDR IB, 16TB of DRAM, and a draw of about 200KW of power. We lasted exactly three years on the list with our 32TF Rmax, 38TF Rpeak machine, sustaining 85% efficiency:

http://www.top500.org/site/history/2951

List                Position
------------------------------
June       2008          61
November   2008          81
June       2009         136
November   2009         184
June       2010         264
November   2010         433
June       2011     Poof! Gone!
------------------------------

We have grown significantly since, and have ca. 13,000 cores online, but it sure is sad to see how fast these systems conk out, with millions of $ being spent on them. It lasted just about as long as the warranty on the box did. 32TF is a bit pathetic these days: the K machine has just come in at slightly over 8,200TF, a whopping 256 times faster than our little box!
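For the curious, the arithmetic on those figures holds up; a quick back-of-envelope sketch using only the numbers quoted above:

# Back-of-envelope check on the figures quoted above.
rmax_tf = 32.0       # our Rmax (teraflops)
rpeak_tf = 38.0      # our theoretical peak (teraflops)
k_rmax_tf = 8200.0   # the K machine's June 2011 number, roughly

print(f"Linpack efficiency: {rmax_tf / rpeak_tf:.0%}")    # -> 84%, i.e. the ~85% above
print(f"K machine speedup:  {k_rmax_tf / rmax_tf:.0f}x")  # -> 256x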

Rest in peace dearest little Dell box, you did great science!

Thursday, June 16, 2011

research computing phrase book, redefining "fail" and explaining "win"

Michele and I have been in the US for 8 or so years now, and one thing that never ceases to make us laugh is my apparent lack of concern for dropping euphemisms into conversations and seeing folks' reactions.

Such events often happen in large 20+ person meetings or IRC/XMPP chat rooms, but more often than not they happen inside emails to our team.

So, I thought I'd put some of my more "confusing" expressions up here on the interwebs.

If nothing else just for fun.


Tuesday, June 7, 2011

xfs_repair testing: double disk raid5 failure... yeah sure no problem... ;-)

So I was chatting with Jeff Layton the other day about xfs_repair times for "big" file systems. All based around the following little twitter thread:

https://twitter.com/#!/jamesdotcuff/status/76694427938201600

So I took one of our large, non-production 100+TB filesystems and threw some files down, with the idea of crashing the file system and seeing how we got on with the repair. Basically, we wanted to prove out how the number and type of files affect rebuild/repair times on a large file system.

Sounds simple enough right?
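The file-population step itself is the easy part. Here's a minimal sketch of the sort of harness involved; the mount point, counts, and sizes below are made-up illustrations, not our actual test parameters:

#!/usr/bin/env python3
# A minimal sketch of a file-population harness for repair-time testing.
# The mount point, counts, and sizes are made up for illustration --
# the real exercise sweeps these parameters across runs.
import os

TARGET = "/mnt/xfs_test"   # hypothetical mount point of the scratch fs
NUM_DIRS = 100             # spread the load across directories
FILES_PER_DIR = 10_000     # many small files stress the metadata btrees
FILE_SIZE = 64 * 1024      # 64KB apiece

payload = os.urandom(FILE_SIZE)

for d in range(NUM_DIRS):
    dirpath = os.path.join(TARGET, f"dir{d:04d}")
    os.makedirs(dirpath, exist_ok=True)
    for f in range(FILES_PER_DIR):
        with open(os.path.join(dirpath, f"file{f:06d}"), "wb") as fh:
            fh.write(payload)

With the files down, the rest of the exercise is yanking two drives from the RAID5 set and timing xfs_repair on what's left; varying FILES_PER_DIR and FILE_SIZE between runs is what separates the file-count effect from the file-size effect.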



[any opinions here are all mine, and have absolutely nothing to do with my employer]
(c) 2011 James Cuff