Sunday, April 1, 2012

xfs minus fun and profit.

isrv:~ jcuff$ date
Sun Apr 1 02:51:38 EDT 2012

This unfortunately is not an April fool...

We lost a major file system today. The reason: a bad PCI riser card. (meh)

A huge team effort, including our vendor and our group here in research computing, is trying to put Humpty Dumpty back together again. Unfortunately there is still yolk and eggshell all over the place. You know it is bad when your vendor is five flights up at the top of a rack replacing hardware for you at 12:00am.

Yet your file system still looks like this at 2:51am under xfs_repair after various attempts to convince it to play nicely, even though all the hardware has finally checked out.

Looks like it is going to be a very long night... or is it morning as I write this?

I've been at big U's for a while and I don't think many "directors" would ever do this. I, on the other hand, gave up worrying about this pseudo "staffing structure" a long, long time ago. When things go pop, no matter where you are in the pecking order, you roll up your sleeves, get on the console and help your boys and girls out as much as you possibly can. I'm writing this web log entry as I watch our shared console scroll past.

Both my team and I really, really care about the data, and even more so about the science. We had one lovely grad student who needed his data for a thesis defense on Monday morning. We have backups, so I'm hoping we got the data to him OK.

Life in scientific computing is a lot like this at times: hardware will let you down, but your team and your vendors never should, and in my experience both very rarely have.

[any opinions here are all mine, and have absolutely nothing to do with my employer]
(c) 2011 James Cuff