Thursday, January 17, 2013

sometimes failure is the only option

From time to time, operations go bump.

As we all race to consolidate, push the limits of our computing and scale up our cloud services (both on and off prem), these things do happen. The ability of one line of code or config to take out the entire shop is clearly on the rise! Yeah, so we have change controls, ITSM, and all sorts of processes to make sure we are "always up".

Sometimes though, bad things happen (tm).

We also tend to forget that we are people, humans, friends and colleagues. Here's a little study in one way to reply to your client community while your Nagios instance is alerting you to 1,000s of down services.


Subject: virtual machines have gone bump.

From: James Cuff
Date: Thu, Jan 17, 2013 at 12:37 PM
To:   hptc-users

Hi all,

We had a PBKAC issue on our network configuration about 20 mins 
ago that rather spectacularly took out a series of hypervisors 
running many of our virtual machines.  We are mostly back and 
working to put humpty dumpty back together again.  You will 
experience a lot of strangeness (license servers etc. websites 
and others) while we resolve this.  We also have a dead hypervisor 
that we are performing surgery on to bring him back up.  

We should be up in another 20mins at the most.  

I'll let you know if our surgery is delayed.


From: James Cuff 
Date: Thu, Jan 17, 2013 at 1:06 PM
To:   hptc-users

Hello again!

The magic pixies that make the network flow data were clearly 
sleeping on the job.  They have had a stern talking to by the 
fabulous staff in networking and research computing, and are now 
back moving our research packets in the right direction.
We also took the chance to upgrade the pixies so they can move 
even more data, these new pixies are 10 times better than the old ones.

So we are back on line, and faster than before!  If you submitted 
a ticket we are in the process of contacting you, replying to the 
tickets may take us a little while.  

Sorry for the interruption, it's been a while since our last PBKAC, 
and as we have grown, our ability to take out hundreds of machines 
with one button also continues to increase.  

My apologies again! 


(c) 2018 James Cuff