Thursday, July 18, 2013

the management demands demand management!

Today was one of those hot days here in New England. Our power rates are set based on load during randomly sampled peak days, and they call this stuff "demand management". We use quite a lot of power here in our shop for lots of computers, so this here manager demanded some "demand management" of them thar power loads! Hehehe. Fortunately we were able to power down some of the machines until later this afternoon to do our bit.
Pretty psyched to be retweeted by our Sustainability Management team as well ;-)

These next two pictures are fun. The first one is the compute load: the gray area is 28,000 processors running full tilt this morning, with less and less load up to 12:00 today. What is interesting is comparing that chart to the power chart (kW) you can see below it. We can clearly see two bars of power (the red and green lines) each running at 200kW at full tilt, which drop to about 80kW when we kill the jobs. What I love is that where some jobs are running and not using the whole machine, the power fluctuates in perfect sync. Finally the power drops a further 60kW once we issued the "power off" around 12:30 for that portion of the compute. This also shows how amazingly power efficient modern processors actually are these days, especially when they don't have heavy workloads on them but are still energized.


Pretty cool, eh? Love how these are in sync with each other!

Power is something we all think about here in our computing shop; take this little example below! Yup, those are "megawatts" right there on the y-axis :-)



Wednesday, July 17, 2013

a little landmark from my early days: jpred at 1,000!

Little quick historical post.

Turns out that one of the very first papers I wrote as a grad student hit 1,000 citations today! This was back in the day when "software as a service" had not even been invented.

Totally awesome!

Geoff Barton's great team have continued to support, extend and improve the methods from that initial work. It is now a whole lot fancier than my original set of rather dodgy perl scripts.

So lovely to see this!


Wednesday, July 3, 2013

Of benching a wicked fast storage array with a very dodgy tcsh script...

Testing our new secret filesystem... no caching or silly business going on here, all raw spindles and extremely suspect but reproducible fork bombs courtesy of tcsh.

Where hosts contains 20 machines...
[root@localhost jcuff]# foreach h (`cat hosts`)
foreach? ssh $h /mnt/share/jcuff/dd.sh $h &
foreach? end

And where dd.sh is simply a 16 way fork bomb:
[root@localhost jcuff]# cat dd.sh
#!/bin/tcsh
foreach h (`seq 1 16`)
dd if=/dev/zero of=/mnt/jcuff.$h.$1 bs=1024k count=64000 >& /mnt/jcuff.$h.$1.out &
end

Which makes for THREE HUNDRED AND TWENTY concurrent jobs each writing out 64GB, or in layman's terms this is basically TWENTY TERABYTES of output.
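
(Just to sanity check that arithmetic, 20 hosts times 16 dd's, each dd writing 64GB:)
[root@localhost jcuff]# expr 20 \* 16
320
[root@localhost jcuff]# expr 320 \* 64
20480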

How would your current network attached storage hold up under this type of crazy load?

Could it?

Oh and I should say that this file system has another slightly unique property:
[root@localhost ~]# df -H
Filesystem             Size   Used  Avail Use% Mounted on
10.10.187.101:/export
                       1.2P   1.7T   1.1P   1% /mnt/

This one seemed to: in only 25 minutes we had generated our 20 terabytes!
[root@compute001]# du -sh .
20T

As an aggregate speed we saw:
[root@compute001]# cat *.out | grep bytes \
| awk '{print (gbpersecond=gbpersecond+$8)/1024 " GB/s"}' \
| tail -1

14.1553 GB/s
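
As a quick sanity check (this is just arithmetic on the numbers above, not another measurement), 20TB in roughly 25 minutes works out to about the same figure:
[root@compute001]# echo "scale=1; (20*1024)/(25*60)" | bc
13.6

Close enough to the 14GB/s that the summed dd reports gave us.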

We also confirmed speeds with a fancier MPI run (still powered by tcsh ;-)):
[root@compute001 jcuff]# cat mpi.sh
#!/bin/tcsh
foreach h (`cat hosts`)
ssh $h "/usr/bin/mpirun -host $h -np 36 /mnt/share/openmpi_IOR/src/C/IOR -b 2g -t 1m -F -C -w -k -e -vv -o /mnt/file1_p36_3.$h >& /mnt/file1_p36_3.$h.out" &
end

Needs a whole lot more testing, with like real science and stuff, but as a start, with a three-line dodgy tcsh script, this looks to be a mighty fine file system ;-) This is a teaser; I'm not giving out makes and models or any of that, just wanted to let folks know we are building an awesome monster here! Oh, and also to see if folks will ever stop making fun of me for using tcsh. Hehehe!

Summary: 320 jobs, 20 hosts, 20TB output @ 14GB/s to 1.2PB


Tuesday, July 2, 2013

active directory with go faster stripes!

We ended up with a fair number of processors in the day job...


Turns out LDAP lookups against AD were causing a fair amount of "churn". We lean heavily on nscd to reduce the hammering the domain controllers could take, but were still seeing a whole lot of slow "id" and "getent" calls.

Decided to have a look:
[root@hero0108 tmp]# ps aux | grep nscd
nscd     29929  0.0  0.0 209872  3552 ?        Ssl  12:51   0:00 /usr/sbin/nscd
root     31337  0.0  0.0  61220   780 pts/0    S+   13:42   0:00 grep nscd
[root@hero0108 tmp]# strace -f -p 29929 >& trace.out &

Ok so who am I?
[root@hero0108 tmp]# id jcuff

Seems to generate way more system calls than you would expect!
[root@hero0108 tmp]# wc -l trace.out 
24162 trace.out

Why?
[pid 29936] stat("/etc/ldap.conf", {st_mode=S_IFREG|0644, st_size=611, ...}) = 0
[pid 29936] geteuid()                   = 28
[pid 29936] getsockname(16, {sa_family=AF_INET, sin_port=htons(57986), sin_addr=inet_addr("10.1.1.57")}, [16]) = 0
[pid 29936] getpeername(16, {sa_family=AF_INET, sin_port=htons(636), sin_addr=inet_addr("10.1.1.103")}, [68719476752]) = 0

Ah - so the "cache" part of the name service caching daemon doesn't seem to be caching a whole lot, eh? Time for a change in how we do things. We had read lots and lots about sssd [https://fedorahosted.org/sssd/] so decided to give it a whirl.

In the meantime we also realized that our domain controllers could act as global catalogue servers. I've written a little bit about this before, but we never got around to using the GC. GCs, it turns out, are *fast*. We then checked the little boxes to also replicate the POSIX fields to our DCs, so we can query them via sssd:

ldaps://domain.controller:3269
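
For the record, that is not our exact config, just a minimal sketch of the sort of sssd.conf domain stanza you would point at a GC; the domain name matches the puppet config further down and the search base is made up for illustration:
[sssd]
config_file_version = 2
services = nss, pam
domains = rc.domain

[domain/rc.domain]
id_provider = ldap
auth_provider = ldap
# the GC SSL port (3269) is the interesting bit; the base DN here is hypothetical
ldap_uri = ldaps://domain.controller:3269
ldap_search_base = dc=rc,dc=domain
ldap_schema = rfc2307bis
cache_credentials = true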

So what's the big deal then, and why does changing the port matter? Here are results from a cold cache after initial login, using the nslcd and nscd combo. We have been running that setup for ages and it mostly works:
[jcuff@hero0109 ~]$ time id jcuff

real    0m3.565s
user    0m0.001s
sys     0m0.007s

Versus our new GC + SSSD setup:
[jcuff@holy2a01101 ~]$ time id jcuff

real    0m0.869s
user    0m0.002s
sys     0m0.011s

Boom! Well that's quite a difference eh? Roughly a 4x speed up (3.565s down to 0.869s) right out of the gate. We did have a slight spot of bother this morning though:


We realized that, in configuring thousands of machines, sssd was helpfully enumerating the entire AD on each one to build a lovely fat local cache. Puppet and sjoeboo to the rescue! Check out this marvel of puppet config fun:
class sssd {
  if ($::osfamily == 'RedHat' and $::lsbmajdistrelease == 6) {

    package { 'sssd':
      ensure => installed,
    }
    file { '/etc/sssd/sssd.conf':
      source => 'puppet:///modules/sssd/sssd.conf',
      owner => 'root',
      group => 'root',
      mode => 0600,
      require => Package['sssd'],
      notify => Service['sssd'],
    }
    exec { 'sssd_preseed':
      command => 'curl -s -o /var/lib/sss/db/cache_rc.domain.ldb http://proxy/custom/sssd_cache/cache_rc.domain.ldb',
      creates => '/var/lib/sss/db/cache_rc.domain.ldb',
      require => Package['sssd'],
    }
    service { 'sssd':
      ensure => running,
      enable => true,
      require => Exec['sssd_preseed'],
    }
  }
}

Do you see what he did there?

Yep - it pulls down a pre-populated cache file as part of the initial sssd configuration, guarded by the neat "creates" parameter, so on subsequent puppet runs it never needs to fetch it again. This one simple "curl" saves every server from enumerating 10,000-plus user accounts on its own.
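
The only piece not shown is how that seed file lands on the proxy in the first place. I didn't show that bit, but a hypothetical sketch (assuming the proxy serves /var/www/html/custom/sssd_cache and the sssd domain is called rc.domain, matching the puppet above) would look something like:
# on one reference host that has already built a full cache
[root@reference ~]# service sssd stop
[root@reference ~]# scp /var/lib/sss/db/cache_rc.domain.ldb proxy:/var/www/html/custom/sssd_cache/
[root@reference ~]# service sssd start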

Trust me, w/o this you will have a bad day of it ;-)

Kinda fun no?


[any opinions here are all mine, and have absolutely nothing to do with my employer]
(c) 2011 James Cuff