Wednesday, October 19, 2011

try not to be so worried about xfs >100T

There has been a lot of chatter about >100TB file systems with XFS:

http://blog.jcuff.net/2011/06/xfsrepair-testing-double-disk-raid5.html

http://blog.jcuff.net/2011/05/big-fat-storage-in-extreme-hurry.html

http://scalability.org/?p=3192 (joe's write up is awesome!)

and the thing that started it:

http://www.enterprisestorageforum.com/storage-hardware/the-state-of-file-systems-technology-problem-statement.html

We had this one system start to show very odd behavior yesterday. We had multiple drive failures, plus some self-inflicted injury, that caused the inode link counts to get out of sync for some of our hardlinked directories. We use the awesomeness of rsnapshot to take snaps of some of our filesystems onto large 100T+ XFS filesystems, a bit like this one:

[root@emcbackup9 emcback9]# df -H .
Filesystem             Size   Used  Avail Use% Mounted on
/dev/mapper/emcbackup9_vg-emcbackup9_lv
                       124T    32T    92T  26% /mnt/emcback9
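
For a bit of context, a minimal rsnapshot setup in the spirit of what we run might look something like the sketch below. The paths and retention numbers are made up purely for illustration, and note that rsnapshot's config file wants tabs between fields:

# /etc/rsnapshot.conf (fields separated by tabs, not spaces)
snapshot_root	/mnt/emcback9/snapshots/
interval	daily	7
interval	weekly	4
backup	root@fileserver:/export/lab/	lab/

rsnapshot hardlinks unchanged files between snapshot trees rather than copying them, which is why a filesystem like this ends up carrying an enormous number of hard links.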

We thought we would have to get into it in a big way to repair it, but a quick read of http://oss.sgi.com/projects/xfs/training/xfs_slides_11_repair.pdf and some prior experience had the filesystem back up and sorted in about 10 hours.

The filesystem was not in all that bad a shape, and phase 7 fixed the missing inode metadata swiftly; these were all hardlinks with a simple reference count problem that you can see below. The filesystem was showing an interesting feature though: "rm -rf directory" would give "Directory not empty":

[root@emcbackup9 Lab]# rm -rf Landscape/
rm: cannot remove directory `Landscape/': Directory not empty

but it was clearly empty:

[root@emcbackup9 Lab]# ls -ltra Landscape/
total 0
drwxrwxrwx 1 root root 6 Oct 18 16:02 .
drwxrwxrwx 3 root root 22 Oct 18 16:16 ..

You can clearly see the (1) in the link count from stat below. A directory should always have at least 2 links (its own "." entry plus its name in the parent directory), so something is definitely wrong here ;-)

[root@emcbackup9 Lab]# stat Landscape/
File: `Landscape/'
Size: 6 Blocks: 0 IO Block: 4096 directory
Device: fd00h/64768d Inode: 5577705 Links: 1
Access: (0777/drwxrwxrwx) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2011-10-18 16:17:02.511914105 -0400
Modify: 2011-10-18 16:02:55.424823531 -0400
Change: 2011-10-18 16:16:23.872022080 -0400
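
For comparison, a healthy directory never has a link count of 1: a freshly made directory gets 2 links (its name in the parent plus its own "."), and each subdirectory adds one more via its ".." entry. A quick scratch example you can try anywhere (just an illustration, not from the backup box); stat's %h format prints the hard link count:

$ mkdir demo
$ stat -c '%h' demo
2
$ mkdir demo/sub
$ stat -c '%h' demo
3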

An interesting failure; not a show stopper, just rather annoying when trying to keep the filesystem neat and tidy. The eventual magic we used was:
xfs_repair -P -o bhash=1024 /dev/mapper/emcbackup9_vg-emcbackup9_lv
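
If you want to be a little more careful than we were, the usual dance is to unmount, run a read-only pass first with -n (which reports problems but fixes nothing), and only then run the real repair. A rough sketch, using the same device path as above and with bhash sized to taste for your RAM:

umount /mnt/emcback9
xfs_repair -n /dev/mapper/emcbackup9_vg-emcbackup9_lv
xfs_repair -P -o bhash=1024 /dev/mapper/emcbackup9_vg-emcbackup9_lv
mount /dev/mapper/emcbackup9_vg-emcbackup9_lv /mnt/emcback9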

Here is the tail end of the run, including the phase 7 sweep:
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
resetting inode 3527544 nlinks from 2 to 23787
resetting inode 4215328 nlinks from 1 to 2
resetting inode 4215348 nlinks from 1 to 2
resetting inode 4215354 nlinks from 1 to 2
resetting inode 4586590 nlinks from 1 to 2
resetting inode 4649378 nlinks from 1 to 2
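
As an aside, if you suspect there are more of these lurking, find can hunt for directories whose link count has been knocked down to 1. On a tree this size it will grind away for a good long while, so treat it as a spot check rather than something we actually ran across the whole filesystem:

find /mnt/emcback9 -type d -links 1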

So although my tweet:

https://twitter.com/#!/jamesdotcuff/status/126400580972326912

looked pretty ominous, in the end it was all smiles and sunshine. It was particularly interesting because I was asked about this *exact issue* on twitter yesterday:

https://twitter.com/#!/jamesdotcuff/status/126351096938639360

This is a box with 16G of memory and 4,805,636,416 files on a filesystem that is only 26% full. The xfs_repair bhash flag is your friend on systems like this: the repair chewed through 100% of the available memory.
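
If you are curious how many inodes you are carrying before you kick off a repair (and therefore roughly how hungry xfs_repair is going to be), df -i will tell you; the IUsed column is the number of inodes in use:

df -i /mnt/emcback9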

Mind you, as always, your mileage can and will vary!



[any opinions here are all mine, and have absolutely nothing to do with my employer]
(c) 2011 James Cuff