Tuesday, July 2, 2013

active directory with go faster stripes!

We ended up with a fair number of processors in the day job...


Turns out LDAP lookups via AD were causing a fair amount of "churn". We lean heavily on nscd to reduce the hammering on the domain controllers, but were still seeing a whole lot of slow "id" and "getent" calls.

Decided to have a look:
[root@hero0108 tmp]# ps aux | grep nscd
nscd     29929  0.0  0.0 209872  3552 ?        Ssl  12:51   0:00 /usr/sbin/nscd
root     31337  0.0  0.0  61220   780 pts/0    S+   13:42   0:00 grep nscd
[root@hero0108 tmp]# strace -f -p 29929 >& trace.out &

Ok so who am I?
[root@hero0108 tmp]# id jcuff

Seems to generate way more system calls than you would expect!
[root@hero0108 tmp]# wc -l trace.out 
24162 trace.out

Why?
[pid 29936] stat("/etc/ldap.conf", {st_mode=S_IFREG|0644, st_size=611, ...}) = 0
[pid 29936] geteuid()                   = 28
[pid 29936] getsockname(16, {sa_family=AF_INET, sin_port=htons(57986), sin_addr=inet_addr("10.1.1.57")}, [16]) = 0
[pid 29936] getpeername(16, {sa_family=AF_INET, sin_port=htons(636), sin_addr=inet_addr("10.1.1.103")}, [68719476752]) = 0
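To see which calls dominate, you can tally the syscall names straight out of the trace. A quick sketch (the sample lines below are stand-ins for the real trace.out):

```shell
# Write a couple of stand-in trace lines (the real file comes from
# "strace -f -p <nscd pid>" as above):
cat > trace.out <<'EOF'
[pid 29936] stat("/etc/ldap.conf", {st_mode=S_IFREG|0644, st_size=611, ...}) = 0
[pid 29936] geteuid()                   = 28
[pid 29936] stat("/etc/ldap.conf", {st_mode=S_IFREG|0644, st_size=611, ...}) = 0
EOF

# Splitting on spaces and "(", field 3 is the syscall name; count them:
awk -F'[ (]' '{print $3}' trace.out | sort | uniq -c | sort -rn
```

On the real capture this makes it obvious whether nscd is going back over the wire (connect/getpeername to port 636) on every single lookup.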

Ah - so the "cache" part of the name service cache daemon doesn't seem to be caching a whole lot, eh? Time for a change of how we do things. We had read lots and lots about sssd [https://fedorahosted.org/sssd/] so decided to give it a whirl.
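For reference, nscd's caching behaviour is governed by /etc/nscd.conf; the values below are illustrative, not our production config:

```
# /etc/nscd.conf (excerpt) - illustrative values only
enable-cache            passwd  yes
positive-time-to-live   passwd  600     # seconds a successful lookup stays cached
negative-time-to-live   passwd  20      # seconds a miss stays cached
enable-cache            group   yes
positive-time-to-live   group   3600
```

Even with generous TTLs, every cold entry still means a full round trip through the LDAP stack, which is what the strace above was showing.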

In the meantime we also realized that our domain controllers could act as global catalogue servers. I've written a little bit about this before, but we never got around to using the GC. GCs, it turns out, are *fast*. We then checked the little boxes to also replicate the POSIX fields to our DCs, so we can query them via sssd:

ldaps://domain.controller:3269
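For the curious, pointing sssd at the GC is just a matter of the ldap_uri; here's a minimal sketch of the relevant sssd.conf stanza (the domain name and hostname are placeholders, not our real config):

```
# /etc/sssd/sssd.conf (sketch - names are placeholders)
[sssd]
services = nss, pam
domains = rc.domain

[domain/rc.domain]
id_provider = ldap
auth_provider = ldap
ldap_uri = ldaps://domain.controller:3269   # 3269 = Global Catalogue over SSL
ldap_schema = rfc2307bis
enumerate = true                            # pulls the whole directory down - see below for why that stung
cache_credentials = true
```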

So what's the big deal then, and why does changing the port matter? Here are results from a cold cache after initial login, using the nslcd and nscd combo. We have been using this setup for ages, and it mostly works:
[jcuff@hero0109 ~]$ time id jcuff

real    0m3.565s
user    0m0.001s
sys     0m0.007s

Versus our new GC + SSSD setup:
[jcuff@holy2a01101 ~]$ time id jcuff

real    0m0.869s
user    0m0.002s
sys     0m0.011s
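Doing the arithmetic on those wall-clock times:

```shell
# 3.565s cold via nslcd/nscd vs 0.869s cold via GC + sssd:
awk 'BEGIN { printf "%.1fx faster\n", 3.565 / 0.869 }'
```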

Boom! Well that's quite a difference eh? Roughly a 4x speed up right out of the gate. We did have a slight spot of bother this morning though:


We realized that in configuring thousands of machines, sssd was helpfully enumerating the entire AD to build a lovely fat local cache on each one. Puppet and sjoeboo to the rescue! Check out this marvel of puppet config fun:
class sssd {
  if ($::osfamily == 'RedHat' and $::lsbmajdistrelease == 6) {

    package { 'sssd':
      ensure => installed,
    }
    file { '/etc/sssd/sssd.conf':
      source  => 'puppet:///modules/sssd/sssd.conf',
      owner   => 'root',
      group   => 'root',
      mode    => '0600',
      require => Package['sssd'],
      notify  => Service['sssd'],
    }
    exec { 'sssd_preseed':
      command => 'curl -s -o /var/lib/sss/db/cache_rc.domain.ldb http://proxy/custom/sssd_cache/cache_rc.domain.ldb',
      creates => '/var/lib/sss/db/cache_rc.domain.ldb',
      require => Package['sssd'],
    }
    service { 'sssd':
      ensure  => running,
      enable  => true,
      require => Exec['sssd_preseed'],
    }
  }
}

Do you see what he did there?

Yep - it seeds a pre-populated cache file as part of the initial sssd configuration, guarded by the neat "creates" parameter. On subsequent runs the file already exists, so puppet never fetches it again. That one simple "curl" saves having 10,000-plus user accounts enumerated by every single server.
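The "creates" guard is just shell-level idempotence. In plain shell the exec amounts to something like this (path shortened, and the curl swapped for an echo so the sketch runs anywhere):

```shell
# Stand-in for /var/lib/sss/db/cache_rc.domain.ldb:
CACHE=./cache_rc.domain.ldb

# Puppet only runs the command when the "creates" path is missing:
if [ ! -e "$CACHE" ]; then
  # stand-in for: curl -s -o "$CACHE" http://proxy/custom/sssd_cache/cache_rc.domain.ldb
  echo "pre-seeded cache" > "$CACHE"
fi
```

Run it twice and the second pass does nothing, which is exactly why a node never has to enumerate the directory once it has been seeded.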

Trust me, w/o this you will have a bad day of it ;-)

Kinda fun no?


[any opinions here are all mine, and have absolutely nothing to do with my employer]
(c) 2011 James Cuff