Monday, April 16, 2012

Colorado ISTeC / CVMBS talk

Looking forward to talking at Colorado State on Weds!

Sunday, April 1, 2012

turn xfs_repair up to eleven with ag_stride...

If you are in a hurry to repair an XFS volume (and to be honest, who isn't!), it turns out you can use this trick to run xfs_repair in parallel. The ag_stride option chunks up the allocation groups by a stride size:
xfs_repair -o bhash=16384 -o ihash=16384 -o ag_stride=16 \
/dev/mapper/deep-deep_lv >& /tmp/repair.log

and away we go, carving up the allocation groups into 16-member chunks:
- agno = 0
- agno = 16
- agno = 32
- agno = 48
- agno = 64
- agno = 80
- agno = 96
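
Judging by the output above, each chunk starts at a multiple of the stride. Here is a minimal sketch of that arithmetic. The AG count of 112 is a made-up value for illustration, not the real count on this array, and `stride_starts` is a hypothetical helper rather than anything shipped with xfsprogs:

```shell
#!/bin/sh
# stride_starts AGCOUNT STRIDE
# Print the first allocation group number of each stride-sized chunk.
# Illustration only: this mirrors the "agno = N" lines in the log,
# it is not how xfs_repair itself is implemented.
stride_starts() {
    agcount=$1
    stride=$2
    ag=0
    while [ "$ag" -lt "$agcount" ]; do
        echo "$ag"
        ag=$(( ag + stride ))
    done
}

# 112 is an assumed AG count; the real value can be read from the
# superblock, e.g. xfs_db -r -c "sb 0" -c "print agcount" <device>
stride_starts 112 16
```

With those assumed numbers the function prints 0, 16, 32, 48, 64, 80 and 96, the same starting points that show up in the repair log.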

and you get a spot of progress in the output:
- 12:44:35: process known inodes and inode discovery 
            13,811,776 of 151,658,048 inodes done

- 12:44:35: Phase 3: elapsed time 14 minutes, 31 seconds 
            processed 951,442 inodes per minute

- 12:44:35: Phase 3: 9% done - estimated remaining time 
            2 hours, 24 minutes, 52 seconds
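
The estimate is consistent with a plain linear extrapolation: inodes left divided by the observed rate. A small sketch to check that, using only the figures printed above. `eta_hms` is a hypothetical helper, and I am assuming, not asserting, that xfs_repair computes its estimate this way:

```shell
#!/bin/sh
# eta_hms TOTAL DONE RATE_PER_MINUTE
# Linear extrapolation of remaining time from inodes left and the
# observed inodes-per-minute rate. Needs 64-bit shell arithmetic,
# which bash and dash provide on 64-bit systems.
eta_hms() {
    total=$1
    done_count=$2
    rate=$3
    secs=$(( (total - done_count) * 60 / rate ))
    printf '%d hours, %d minutes, %d seconds\n' \
        $(( secs / 3600 )) $(( secs % 3600 / 60 )) $(( secs % 60 ))
}

# Figures taken from the Phase 3 progress report above.
eta_hms 151658048 13811776 951442
```

This prints "2 hours, 24 minutes, 52 seconds", which lines up with what the repair reported.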

This is really useful when you want to let folks know that progress is being made on any given repair. Our file system was extremely sick yesterday due to a massive and extremely rare hardware failure on the PCI riser.

So having this has been a huge help in guessing how long this part of the recovery is going to take. We are also restoring from backup, but I wanted to see if we could salvage the data on this 400T array while the guys worked like absolute trojans getting the backups and read-only copies of important files to our customers. Here's where we are as I type this: phase three looks to be on track at about 2.5 hours, and hopefully we don't see anything horrible in phases four through seven. Our error log alone currently stands at over 5.5GB...
[root@nssdeep /tmp]# du -sh repair.log 
5.4G repair.log  (eek!)

[root@nssdeep /tmp]# grep remaining repair.log
12:44:35: Phase 3: 9% done - estimated remaining time 2 hours, 24 minutes
12:59:35: Phase 3: 22% done - estimated remaining time 1 hour, 43 minutes
13:14:35: Phase 3: 35% done - estimated remaining time 1 hour, 22 minutes
13:29:35: Phase 3: 48% done - estimated remaining time 1 hour, 3 minutes
13:44:35: Phase 3: 60% done - estimated remaining time 47 minutes
13:59:35: Phase 3: 73% done - estimated remaining time 32 minutes
14:14:35: Phase 3: 86% done - estimated remaining time 16 minutes
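
If you only want the most recent estimate rather than the whole history, something like the following works against a log in the format shown above. Both function names are made up for this sketch, and the pattern assumes the progress line stays in the "Phase N: P% done - estimated remaining time" shape:

```shell
#!/bin/sh
# latest_progress LOGFILE
# Print the most recent "estimated remaining time" line from the log.
latest_progress() {
    grep 'estimated remaining time' "$1" | tail -n 1
}

# percent_done LOGFILE
# Print just the percentage from that most recent line.
percent_done() {
    latest_progress "$1" |
        sed -n 's/.*Phase [0-9]*: \([0-9]*\)% done.*/\1/p'
}
```

Run it as `percent_done /tmp/repair.log`, or wrap `latest_progress` in `watch` if you would rather not keep retyping the grep.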

However, it turned out that our version of xfs_repair was also segfaulting, just like the one Tom Crane talked about on the XFS mailing list. So we followed his lead and pulled in the git repo this morning to see if we could speed up this four-day nightmare!
    47 11:21 git clone git://
    48 11:21 cd xfsprogs/
    49 11:22 yum install uuid-devel
    50 11:23 yum install libuuid-devel
    51 11:23 make
    52 11:25 make install

and then we were off to the races! The current git copy:
[root@nssdeep /tmp]# xfs_repair -V
xfs_repair version 3.1.8

has all you need to not segfault when attempting recovery of filesystems. It reduced what was looking like a four-day recovery event to what looks like it will be four or so hours. You can see the system hitting disk with "iotop" (never leave home w/o this!):
Total DISK READ: 88.41 M/s | Total DISK WRITE: 2.79 M/s
16207 be/4 root     1995.30 K/s  109.76 K/s  0.00 % 99.99 % xfs_repair
16164 be/4 root        3.65 M/s   94.08 K/s  0.00 % 99.15 % xfs_repair
16194 be/4 root        3.77 M/s  156.80 K/s  0.00 % 99.11 % xfs_repair 
16187 be/4 root        2.89 M/s   54.88 K/s  0.00 % 98.41 % xfs_repair
16169 be/4 root        2.63 M/s  125.44 K/s  0.00 % 97.90 % xfs_repair 
16061 be/4 root        2.20 M/s  160.72 K/s  0.00 % 97.68 % xfs_repair
16082 be/4 root        2.87 M/s  101.92 K/s  0.00 % 97.10 % xfs_repair 
16104 be/4 root        3.78 M/s  152.88 K/s  0.00 % 96.90 % xfs_repair 
16093 be/4 root        4.11 M/s  172.48 K/s  0.00 % 96.86 % xfs_repair 
16074 be/4 root        4.05 M/s  117.60 K/s  0.00 % 96.84 % xfs_repair
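
A quick way to confirm the repair really is running in parallel is to count the xfs_repair lines in that listing. This is only a sketch: `count_repair_workers` is a hypothetical helper, and it assumes each thread shows up on its own line with the word xfs_repair somewhere in it, as in the iotop output above:

```shell
#!/bin/sh
# count_repair_workers
# Read iotop-style lines on stdin and count those that mention
# xfs_repair as a whole word. Prints 0 when there are none.
count_repair_workers() {
    grep -c -w 'xfs_repair'
}
```

Feed it a single batch-mode snapshot, for example `iotop -b -o -n 1 | count_repair_workers`. A count of one would suggest you are back on a single thread.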

We noticed that without this version of xfs_repair you are stuck on one thread, hoping your file system will come back.

We absolutely owe Christoph Hellwig, Tom Crane from Royal Holloway, and the ever-present Dave Chinner a huge debt of thanks for pointing us in the right direction on their mailing list - not the first time their knowledge and skill have been a real bacon saver!

xfs minus fun and profit.

isrv:~ jcuff$ date
Sun Apr 1 02:51:38 EDT 2012

This unfortunately is not an April fool...

We lost a major file system today. The reason: a bad PCI riser card. (meh)

A huge team effort, including our vendor and our group here in research computing, is under way to get humpty dumpty back together again. Unfortunately there is still yolk and eggshell all over the place. You know it is bad when your vendor is five flights up at the top of a rack replacing hardware for you at 12:00am.

Yet your file system still looks like this at 2:51am under xfs_repair after various attempts to convince it to play nicely, even though all the hardware finally has checked out.

Looks like it is going to be a very long night... or is it morning as I write this?

I've been at big U's for a while and I don't think many "directors" would ever do this. I on the other hand gave up worrying about this pseudo "staffing structure" a long, long time ago. When things go pop, no matter where you are in the pecking order, you roll up your sleeves, get at the console and help your boys and girls out as much as you possibly can. I'm writing this web log entry as I watch our shared console scroll past.

Both my team and I really, really care about the data, and even more so about the science. We had one lovely grad student who needed his data for a thesis defense on Monday morning. We have backups, so I'm hoping we got the data to him ok.

Life in scientific computing is a lot like this at times. Hardware will let you down, but your team and your vendors never should, and in my experience both very rarely have.

[any opinions here are all mine, and have absolutely nothing to do with my employer]
(c) 2011 James Cuff