Tuesday, June 7, 2011

xfs_repair testing: double disk raid5 failure... yeah sure no problem... ;-)

So I was chatting with Jeff Layton the other day about xfs_repair times for "big" file systems, all based around the following little twitter thread:

https://twitter.com/#!/jamesdotcuff/status/76694427938201600

So I took one of our large non-production 100+TB filesystems and threw some files down, with the idea of crashing the file system and seeing how we got on with the repair. Basically, we wanted to prove out how the number and type of files affect rebuild/repair times on a large file system.

Sounds simple enough right?



Well, what transpired was a rather fascinating discovery process that in the end resulted in no filesystem errors and a happy xfs_repair. Future blog posts will just use a reset /SYS from the obm rather than my rather crazy "let us fail two drives at once" approach!

If folks do happen to know of awesome quick ways to crash a file system, feel free to let us know in the comments. I'm sure there is some 1337-fu that xfs_db could do to basically mark a filesystem as dirty without me having to go through these shenanigans below :-)
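For the record, the kind of thing I have in mind (a sketch only - I have not run this here, and the blocktrash command is only available in debug builds of xfs_db):

# Untested sketch: deliberately trash random metadata blocks on an
# *unmounted* filesystem so xfs_repair has some real work to do
# (blocktrash requires blockget to run first, and a debug xfs_db build)
umount /mnt
xfs_db -x -c blockget -c "blocktrash -n 512" /dev/mapper/solexa02_vg-solexa02_lv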

First let us make a whole bunch of files; a million ought to do for a starter! The intent was to increase these and graph out number of files vs repair time - not exactly rocket science, this here little project ;-):
[root@solexa02 4]# cat ../files.sh
#!/bin/bash
# Write one million 10MB files of zeroes - roughly 10TB all told
c=1
while [ $c -le 1000000 ]
do
    dd if=/dev/zero of=$c.dd bs=1M count=10
    (( c++ ))
done
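A side note for future runs: a million 10MB files written one at a time by dd is slow going. Something like this untested one-liner could fan the writes out in parallel instead:

# Untested alternative: eight parallel dd writers via xargs -P
seq 1 1000000 | xargs -P 8 -I{} dd if=/dev/zero of={}.dd bs=1M count=10 2>/dev/null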

Now let's get rather mean and fail out a pair of drives:
[root@solexa02 full]# mdadm --manage --set-faulty /dev/md13 /dev/dsk/dsk44
mdadm: set /dev/dsk/dsk44 faulty in /dev/md13
[root@solexa02 full]# mdadm --manage --set-faulty /dev/md13 /dev/dsk/dsk36
mdadm: set /dev/dsk/dsk36 faulty in /dev/md13
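At this point /proc/mdstat and mdadm --detail should both be showing md13 with two failed members (standard md tooling - sketched here, not captured at the time):

# Confirm the carnage: two failed devices in md13
cat /proc/mdstat
mdadm --detail /dev/md13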

Ok, so did we damage the file system? Oh yeah, we totally damaged it:
Buffer I/O error on device dm-0, logical block 23623967467
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 23623967468
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 23623967469
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 23623967470
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 23623967471
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 23623967472
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 23623967473
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 23623967474
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 23623967475
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 23623967476
lost page write due to I/O error on dm-0
Device dm-0, XFS metadata write error block 0x2bfffffd41 in dm-0
Device dm-0, XFS metadata write error block 0x35fffffca1 in dm-0
[root@solexa02 full]# cp: reading `1/1361.dd': Input/output error

Hung up pretty nicely: unmount just hangs, and the system is a mess.
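One untested aside for a wedged mount like this: a lazy unmount will at least detach the mount point from the namespace even while the device underneath stays busy:

# Untested aside: lazily detach the wedged mount point
umount -l /mnt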

Time to attempt to clean up (I wanted to do this online so we could script future runs):
[root@solexa02 ~]# mdadm -r /dev/md13 /dev/dsk/dsk44
mdadm: hot removed /dev/dsk/dsk44
[root@solexa02 ~]# mdadm -r /dev/md13 /dev/dsk/dsk36
mdadm: hot removed /dev/dsk/dsk36

Put them back? Hrrm, this is not going at all well... :-)
[root@solexa02 ~]# mdadm /dev/md13 -a /dev/dsk/dsk44
mdadm: Cannot open /dev/dsk/dsk44: Device or resource busy
[root@solexa02 ~]# mdadm /dev/md13 -a /dev/dsk/dsk36
mdadm: Cannot open /dev/dsk/dsk36: Device or resource busy

Can we deactivate it?
[root@solexa02 ~]# vgchange -a n solexa02_vg
Can't deactivate volume group "solexa02_vg" with 1 open logical volume(s)

[root@solexa02 ~]# lvchange -a n solexa02_vg
LV solexa02_vg/solexa02_lv in use: not deactivating
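That open count is the filesystem itself, still mounted and busy; something like fuser (from psmisc) would show exactly what is holding the device - sketched, not captured:

# Who still has the LV open?
fuser -vm /dev/mapper/solexa02_vg-solexa02_lv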

Nope, going to have to force it (dmsetup is normally awesome at this stuff):
[root@solexa02 ~]# dmsetup info -c
Name                    Maj Min Stat Open Targ Event  UUID
solexa02_vg-solexa02_lv 253   0 L--w    1    4      0 LVM-KBIpmz0ZhpnFQMoCsACqnykHcEPjpufQ28en3nqKkTTOuxVg9h80SZTLvOwqvMv2

[root@solexa02 ~]# dmsetup remove -force solexa02_vg-solexa02_lv

[HANG]

Ok, sigh... try a reboot; nope, toast! Here we go off to the lights out manager then; it is a Sun x4500 series box, so we can ssh to the console:
-bash-3.1$ ssh root@solexa02-obm
Password:

Sun(TM) Integrated Lights Out Manager

Version 2.0.2.5

Copyright 2007 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.

-> reset /SYS
Are you sure you want to reset /SYS (y/n)? y
Performing hard reset on /SYS

-> start /SP/console
Are you sure you want to start /SP/console (y/n)? y

Serial console started. To stop, type ESC (

Ok, back up. md still looks shabby for the array we busted up:
md13 : inactive sdu[0] sdas[1] sdak[2] sda[10] sdi[8] sdag[7] sdao[6] sdq[5] sde[4] sdm[3]
29302664320 blocks super non-persistent

Time to get it back online again...
[root@solexa02 ~]# mdadm --assemble /dev/md13
mdadm: /dev/md13 assembled from 8 drives - not enough to start the array.

Oh pants!

This is not cool - negative numbers are bad. Even --assemble --force is not happy trying to put back our busted, failed-out drives:
mdadm: added /dev/dsk/dsk36 to /dev/md13 as -1
mdadm: added /dev/dsk/dsk44 to /dev/md13 as -1
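Had the array been built with persistent superblocks, mdadm --examine on each member would have dumped the original device order, chunk size, and layout to recreate against; the "super non-persistent" in mdstat above would explain why that information is not there to be had. For the record, something like:

# For arrays *with* persistent superblocks, recover the geometry first:
mdadm --examine /dev/dsk/dsk20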

Nothing else left to do other than fully recreate the array - surely this will screw up the storage:
[root@solexa02 ~]# mdadm --create /dev/md13 --level=5 --raid-devices=10 /dev/dsk/dsk20 /dev/dsk/dsk44 /dev/dsk/dsk36 /dev/dsk/dsk12 /dev/dsk/dsk4 /dev/dsk/dsk16 /dev/dsk/dsk40 /dev/dsk/dsk32 /dev/dsk/dsk8 /dev/dsk/dsk0
mdadm: Defaulting to version 1.-1 metadata
Continue creating array? y
mdadm: array /dev/md13 started.
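A fresh raid5 create kicks off a parity rebuild onto one of the members; progress shows up in /proc/mdstat:

# Keep an eye on the parity rebuild
watch -n 5 cat /proc/mdstat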

Amazingly, what follows shows the array to be fine; not only that, the storage works and the files are still there... xfs/lvm/mdadm == win! In hindsight this makes sense: recreating with the same level, device order, and chunk size writes fresh superblocks but leaves the data blocks alone, and the resync only rewrites parity, so the data underneath survives:
[root@solexa02 ~]# pvscan
PV /dev/md10 VG solexa02_vg lvm2 [30.02 TB / 0 free]
PV /dev/md11 VG solexa02_vg lvm2 [30.02 TB / 0 free]
PV /dev/md12 VG solexa02_vg lvm2 [27.29 TB / 0 free]
PV /dev/md13 VG solexa02_vg lvm2 [24.56 TB / 0 free]
Total: 4 [111.89 TB] / in use: 4 [111.89 TB] / in no VG: 0 [0 ]
[root@solexa02 ~]#

[root@solexa02 ~]# lvscan
inactive '/dev/solexa02_vg/solexa02_lv' [111.89 TB] inherit

[root@solexa02 ~]# lvchange -a y solexa02_vg

[root@solexa02 ~]# lvscan
ACTIVE '/dev/solexa02_vg/solexa02_lv' [111.89 TB] inherit

Ok, so we are back... now to check the filesystem with a no-modify run just to see:
[root@solexa02 ~]# time xfs_repair -n /dev/mapper/solexa02_vg-solexa02_lv 
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

real    1m22.975s
user    0m34.778s
sys     0m20.444s

Well, it all looks good to me! 1:22 for the "repair" run.
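The real thing would simply drop the no-modify flag:

# Actual repair, same device, no -n
time xfs_repair /dev/mapper/solexa02_vg-solexa02_lv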
[root@solexa02 ~]# mount /dev/mapper/solexa02_vg-solexa02_lv /mnt

(from /var/log/messages)

XFS mounting filesystem dm-0
Starting XFS recovery on filesystem: dm-0 (logdev: internal)
Ending XFS recovery on filesystem: dm-0 (logdev: internal)

[root@solexa02 ~]# ls -ltra /mnt
total 10240024
-rw-rw-r--  1 root    root   10485760000 May 24 14:13 test.bin
drwxrwxr-x  2 root    root         16384 Jun  2 21:29 bonnie
drwxr-xr-x  4 root    root            46 Jun  6 10:27 .
drwxrwxr-x  4 root    root            79 Jun  6 14:58 jcuff
drwxr-xr-x 26 root    root          4096 Jun  7 12:06 ..

Summary:
1. XFS/LVM/MD is AWESOME!

2. Two-disk RAID failures are a fun way to crash a filesystem,
   but *really* hard to put back together

3. I still have not actually managed to run my automated benchmark,
   but this sure was fun!
Keep an eye out for part two...

Hopefully I can find a better "totally bust up the filesystem" method... :-)
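In the meantime, here is the rough shape of the harness I am after - entirely hypothetical, with make_files and dirty_fs as placeholders for a parameterised files.sh and whichever "bust it up" method wins:

#!/bin/bash
# Hypothetical benchmark loop: xfs_repair time vs number of files
for n in 1000 10000 100000 1000000; do
    make_files $n      # placeholder: populate the fs with $n files
    dirty_fs           # placeholder: damage/dirty the filesystem
    umount /mnt
    /usr/bin/time -o repair-$n.log xfs_repair /dev/mapper/solexa02_vg-solexa02_lv
    mount /dev/mapper/solexa02_vg-solexa02_lv /mnt
done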



[any opinions here are all mine, and have absolutely nothing to do with my employer]
(c) 2011 James Cuff