Tuesday, January 24, 2012

of search and discovery: when storage goes bump in the night

This posting will irritate the sad old "tl;dr" crowd no end, but I don't care about their pseudo self importance or how supposedly busy their vacuous unread lives are. Let me tell you about busy Mr. tl;dr... I'll tell you all about busy. Oh my, I can tell you Mr. Man over there about freaking busy, you have absolutely no idea how busy, mumble, rant, steam..!!.. do you have any, any idea what busy means.... *gasp* (medicate)

(apply one more dose just to be sure)

Phew! Safe place, safe place, safe place! Sorry about that, I seemed to go off on one a little bit there, unlike me that, must be too much stress...

Ok. So it all started with a simple help ticket. It always does right? "Can't access storage, file not found, etc. etc. etc." This time it was not a cautionary tale, it was a real issue, there were no files to be had.

Zip, nada, zilch.

After initial panic we go and bust out our all various sysadm kung-fu and attempt a fix. Oh hey "Mr. tl;dr", sorry! I totally forgot about you over there in the corner! My bad, did not see you were still reading. Ok, I'll tell you what, I'll give you the answer so you can get on with the rest of your day, I totally understand how busy you are and I also know that you are kind of a big deal ;-) here you go...
[root@nssdeep]# xfs_quota -x -c 'limit -p bsoft=10T bhard=10T ncf_mcl'

[root@nssdeep]# echo 2 > /proc/sys/vm/drop_caches

Ok cool, good to get that out of the way, would not want to waste your time.

Cheerio Mr. tl;dr, thanks ever so for visiting!

Rest of us still here? Ok, well I'll carry on then. See the whole thing about this is it's not so much the fix at the end. It is more about how we got there. You know like that old saying "Life is all about the journey and not the destination", blah, blah, blah. Well it turns out that our destination for this little weblog is also kinda off kilter also, but more about that later.

So how did we get here? After fielding the phone calls and snotty emails it started out with this little process on our storage array:
 5414 root      20   0     0    0    0 R 99.9  0.0   3719:21 xfssyncd/dm-3    

Seems innocuous enough right? Well, it turns out that once this little chap spins up there will be no more disk I/O for you, your client machines or the cluster, customers, scientists or anyone. That's generally also the time when the phone starts ringing again... Oh, and guess what, xfssyncd? Yeah, he is a kernel thread, so no strace for you Mr. SysAdm. Oh no, no, you have going to have to work this one out the hard way. So you bust out your tool kit of unix apps and start typing furiously. Here's a very edited highlight of how some of it went down:

(1) netstat -ao | grep nfs | wc -l 82 threads, nothing major move on

(2) df and df -i /mnt/filesystem 1.5 billion inodes, quite a lot, but nothing to major it's a 450TB array

(3) dstat, iostat -m, iostat -n, iftop -B -i eth4, jnettop, system tap assorted probes nothing major, iostat very calm, no IOPS, it looks like a regular day

(4) checked LSF, nothing silly going on here, like 10 or so jobs running

Ok, so this is now clearly very, very strange. Something, or someone is causing our little friend xfssyncd to get in a tizzy. And quite the tizzy by the looks of things. p.s. the phone is still ringing, emails are starting to become more shirty now - it's clear we have a problem. So we do what every self respecting admin would do...

constantly search google and type in random shit until something works

So we found some leads, wsync was in our fstab to provide HA service so that the file system could sync correctly. We pulled it. We also added a variety of probably "dangerous" flags to our config, resulting in this monster by the end of the game:
rw,nobarrier,uquota,prjquota,noatime,nodiratime,allocsize=1g,delaylog,logbufs=8,logbsize=256k,largeio,swalloc,inode64 1 2

Whoah, that's a whole lotta options! Including the most awesome "delaylog", we will talk about that later! So did any of them solve our issue? Nope, not a single one. But we are good systems admins, we search google and we type random shit into the kernel, just need to find some more random shit right?

Off we went. We now found some amazing flags to control xfssyncd... we found all sorts out there, some were decided to not be bugs, others were kinda real, but did not work for us, the lists went on and on (bit like this crappy blog post).

We also found epic perl scripts that could check xfs statistics, we found /proc interfaces that no one had ever heard of we even read the manual! In the end we settled for these options that basically "tune" the flush operation for xfs so it did not wake up too often.
echo 720000 > /proc/sys/fs/xfs/xfssyncd_centisecs
echo 3000 > /proc/sys/fs/xfs/xfsbufd_centisecs
echo 720000 > /proc/sys/fs/xfs/age_buffer_centisecs


There were high fives, dancing, and a fair amount of whoop, whooping! Well there was for about 720,000 centiseconds. Then our little friend "xfssyncd" came back again and our I/O stopped yet again. Then there was much sobbing, wailing and general consternation.

It was a grim time to be in the office I can tell you. So what did we do?

constantly searched google and typed in random shit until something worked

Turns out to not be a bad strategy. We found some great resources for hardware raid. Totally awesome, we had missed setting all of our adaptive read ahead buffers on the storage. Silly us, how could we have been so daft!
server:~# megacli -LDSetProp ADRA -LALL -aALL
So that fixed it right? Yeah no. Not at all. But! When the storage was working, we all convinced ourselves that it was running much more quickly than it was before, not actually ever confirmed or benchmarked this in anyway, but we would take anything we could get at this point.

Oh and we also updated every piece of firmware and bios code we could lay our hands on just incase - none of that worked either, but did make us all feel full of moral fibre! Hurray!

Then we decided to get really fancy. We took our 2.6.32 CENTOS 6 kernel, and hopped on a 3.2.1 keyed off our .config from our standard build, of course we also had a 10Gb/s intel card that took a little "persuasion" to work with our new kernel. Guess what? xfssyncd went away! Hurray, more cheering, cracking open of beers, singing and oh so awesome...! Hang on what's this?
31349 root      20   0     0    0    0 R 48.4  0.0   5:09.82 kworker/4:3
31354 root      20   0     0    0    0 S 45.8  0.0   3:09.28 kworker/7:2
31293 root      20   0     0    0    0 R 40.5  0.0   3:20.26 kworker/2:2
31202 root      20   0     0    0    0 S 29.6  0.0   4:52.67 kworker/5:2
32077 root      20   0     0    0    0 S 24.4  0.0   0:20.58 kworker/8:1
30314 root      20   0     0    0    0 S 21.7  0.0   3:43.72 kworker/1:2
30553 root      20   0     0    0    0 R 20.4  0.0   6:27.70 kworker/3:2
32054 root      20   0     0    0    0 S 17.8  0.0   0:32.40 kworker/11:0

DAMMIT! Well our new threaded 3.2.1 kernel is great, it has multiple workers, of course which now caused all of them to block - Doh! Luckily the new I/O scheduler in 3.2.1 helped the server not just fall over, we were making progress! What else is left to do... we'll like any good doctor, we decided to control the pain with medication. Our medication came in the form of eching either 2 or 3 into /proc/sys/vm/drop_caches. Here is a picture to show how we were effectively controlling this equivalent of file system cancer:

Not bad, but we still had no idea what was causing all this load, it was like this stealth like disease that was taking over our whole operation. Then we had a break. We spotted one of our scientists in LSF running code that we had not seen before. We immediately jumped on it. It was a monster, 10,000 of lines of neuroinformatics code built upon SPM.

This task was going to require a very stiff drink and some alone time. We first wanted to make sure that there was not a hardware or network error. The graph below clearly shows 650,000 pkts/second and about 950MBytes/second - it filled our 10Gb pipe, the storage is plenty fast enough, when treated well it works just lovely!

So why is it not working, and what is wrong with all these blasted deadlocked kernel threads? Time passed (quite a lot of it as it turned out). We did eventually find one slightly interesting feature about this code. It relied heavily upon "sparse files" to generate contrast files for brain images from an fMRI.

Could it really be that sparse files were causing this issue?

Surely not. We re ran 100's of tests, nothing. We could not reproduce this bug. During one of the tests happened to accidentally run a "df" and noticed that one of the many file systems was full. No biggy, not one that we were using, it can't be related right? Turns out it was. It very much was. We had found the root cause:

A sparse file write over NFS to a file system with project quotas
that are near the limit will cause the file system to tip over.

Jan Schulze had found the same thing years and years ago. We wrote Jan, and sure enough he is still regularly rebooting his servers, nobody ever answered him, and given our journey I can imagine why folks were not all that bothered.

If you managed to read this far, you can scroll up to the top of the screen to see how to work around it, until we see this "repaired".

You also may ask, how many developers would consider this a problem? It is basically a completely self inflicted injury in most folks eyes. If you are writing sparse files at the limit of the file system and there is only so much capacity, and if you then make a 5TB sparse file and start to fill it and overflow your project quota:
[jcuff@deep cuff]$ dd of=tt.dat if=/dev/zero bs=1024k seek=5G
Your file system will crash

Finally also remember that it is not all sadness here, we have a 450TB 650,000 pkts/second storage array and server, and always remember to take what Dave Chinner (aka omnipotent xfs dev) says seriously:


He was right, and continues to be right, XFS is an awesome file system. Be sure to check out his video for some amazing futures for XFS: http://www.youtube.com/watch?v=FegjLbCnoBw XFS is getting better as a filesystem each and every day. So what is the moral of this rather convoluted story then? Simple really just never forget to:

constantly search google and type in random shit until something works


Tuesday, January 10, 2012

eyes on the ball...

Today was one of those tough manager moments.

We have an amazing team, but often we need folks to be available off hours and able to pick up on when computers fail, and yet science still needs to happen. Most often this is late on at night or on a weekend or more recently during a holiday, or even more so all of the above!

We wanted to have a super fair, and super simple algorithm so that folks on the team were not penalized or over burdened. We came up with this simple approach, 9-5 is already covered, we all do that already. However for those hours between 6-11pm and the weekends we basically need a "lookout".

The role of this individual is to be the "keep eyes peeled for strange things". This job is not for those monster huge "oh my the sky is falling" critical incidents - we have other ways of dealing with those :-)

More for those requests that would be really awesome to answer before the morning, say because a researcher could really do with their nodes online because of a grant deadline, or maybe a simple binary fix to get an algorithm back up and running again.

So we came up with the idea that folks would be able to rotate in and out of this role once a week. We have quite a few head count so the burden was not all that high. We also wanted this job to be a triage and learning opportunity. The algorithm for this "eyeballs on call" is:

while(EyesOnBall == 1)
  take_ticket(attempt fix);
  if (Can'tFixItOnYourOwn) 
     try raise_a_team_member();
  else record_ticket(forMorning == 1);

Then really, really simply we had perl come to the rescue to build the schema for this idiot proof rotation! Our only challenge right now is how to get this "calendar" to import into Exchange, but then again everything we do has issues with Exchange, so no harm no foul I guess :-)

use Date::Calc qw(Monday_of_Week);
All hail! The all powerful Date::Calc!

We ended up wil an incredibly simple script to print out the Monday of the week that the "guard would change", and who on the rotation would be taking care of being the fresh eyeballs for the upcoming week:
use Date::Calc qw(Monday_of_Week Weeks_in_Year);
@s=("Staff A","Staff B","Staff C","Staff D","Staff E"); $z=0;
for ($i=1;$i<=Weeks_in_Year($ARGV[0]);$i++){
   printf ("%-15s %-4d %-4d %-4d\n",$s[$z],Monday_of_Week($i,$ARGV[0]));
   $z++; if ($z>$#s){$z=0};

toofan: jcuff$ ./rot.pl 2012
Staff A         1    2    2012
Staff B         1    9    2012
Staff C         1    16   2012
Staff D         1    23   2012
Staff E         1    30   2012
Staff A         2    6    2012
Staff B         2    13   2012
Staff C         2    20   2012
Staff D         2    27   2012
Staff E         3    5    2012
Staff A         3    12   2012
Staff B         3    19   2012
Staff C         3    26   2012
Staff D         4    2    2012
Staff E         4    9    2012
Staff A         4    16   2012
Staff B         4    23   2012
Staff C         4    30   2012
Staff D         5    7    2012
Staff E         5    14   2012
Staff A         5    21   2012
Staff B         5    28   2012
Staff C         6    4    2012
Staff D         6    11   2012
Staff E         6    18   2012
Staff A         6    25   2012
Staff B         7    2    2012
Staff C         7    9    2012
Staff D         7    16   2012
Staff E         7    23   2012
Staff A         7    30   2012
Staff B         8    6    2012
Staff C         8    13   2012
Staff D         8    20   2012
Staff E         8    27   2012
Staff A         9    3    2012
Staff B         9    10   2012
Staff C         9    17   2012
Staff D         9    24   2012
Staff E         10   1    2012
Staff A         10   8    2012
Staff B         10   15   2012
Staff C         10   22   2012
Staff D         10   29   2012
Staff E         11   5    2012
Staff A         11   12   2012
Staff B         11   19   2012
Staff C         11   26   2012
Staff D         12   3    2012
Staff E         12   10   2012
Staff A         12   17   2012
Staff B         12   24   2012

It is so simple it hurts, but I personally think it totally reflects the consumate style of research computing that it is not funny :-)

I'll let you know how we get on with this, it's basically what I'm calling "on call lite"!

(c) 2018 James Cuff