Thursday, January 17, 2013

sometimes failure is the only option

From time to time, operations go bump.

As we all race to consolidate, push the limits of our computing and scale up our cloud services (both on and off prem), these things do happen. The ability for one line of code or config to take out the entire shop is clearly on the rise! Yeah, so we have change controls, ITSM, and all sorts of processes to make sure we are "always up".

Sometimes though, bad things happen (tm).

We also tend to forget that we are people, humans, friends and colleagues. Here's a little study in one way to reply to your client community while your Nagios instance is alerting you to thousands of down services.

Enjoy!

Subject: virtual machines have gone bump.
------------------------

From: James Cuff
Date: Thu, Jan 17, 2013 at 12:37 PM
To:   hptc-users


Hi all,

We had a PBKAC issue on our network configuration about 20 mins 
ago that rather spectacularly took out a series of hypervisors 
running many of our virtual machines.  We are mostly back and 
working to put humpty dumpty back together again.  You will 
experience a lot of strangeness (license servers etc. websites 
and others) while we resolve this.  We also have a dead hypervisor 
that we are performing surgery on to bring him back up.  

We should be up in another 20mins at the most.  

I'll let you know if our surgery is delayed.

j.


----------
From: James Cuff 
Date: Thu, Jan 17, 2013 at 1:06 PM
To:   hptc-users


Hello again!

The magic pixies that make the network flow data were clearly 
sleeping on the job.  They have had a stern talking to by the 
fabulous staff in networking and research computing, and are now 
back moving our research packets in the right direction.
  
We also took the chance to upgrade the pixies so they can move 
even more data, these new pixies are 10 times better than the old 
ones! 

So we are back on line, and faster than before!  If you submitted 
a ticket we are in the process of contacting you, replying to the 
tickets may take us a little while.  

Sorry for the interruption, it's been a while since our last PBKAC, 
and as we have grown, our ability to take out hundreds of machines 
with one button also continues to increase.  

My apologies again! 

j.




Monday, January 7, 2013

new nvidia processors

Very few folks will have got my tweet tonight so it is worth explaining ;-)

https://twitter.com/jamesdotcuff/status/288146585223835648

I always have to look at new game technology, because game architectures always inform my future HPC decisions. CES had an early preview tonight, so I just had to stay up to watch this... Others also stayed up...

There is an epic number of GPU cores (72) on the Tegra4, and these chips will soon be in your cell phone! For the HPC folks who are listening, these are the new commodity, low-power processors to follow and watch out for. Also - pay attention to this video streaming thing, I hear it is going to be big. And there is something called "cloud services" to allow folks to see all this from any place on the planet :-)

Here are a couple of snaps I took from the webcast. It was a lot of fun!

HDR on the fly...


And software defined modems...


This should be fun...


They called it "shield", but this is quite something to follow, team beowulf!


[any opinions here are all mine, and have absolutely nothing to do with my employer]
(c) 2011 James Cuff