Thursday, July 19, 2012

part two: scientific software as a service sprawl

Seppo asked a great question in the comment field for the last post...

http://blog.jcuff.net/2012/07/scientific-software-as-service-sprawl.html

So I pulled out some desperate perl and pointed it at our database that tracks usage of module load commands. The results are pretty fascinating once you ignore version numbers and look only at the raw main binary (i.e. the different versions of matlab all count as plain matlab; we just wanted to know the main programs). It all looks like fun... here you go:

First up, our checkmodule code in action:
[jcuff@odyssey ~]$ checkmodule -u jcuff | head

Name     Module              Count     Time
jcuff    hpc/rc              19        Thu Jul 19 13:45:53 2012
jcuff    viz/gnuplot-4.5     135       Thu Jul 19 13:45:45 2012
jcuff    hpc/blat-34         150       Thu Jul 19 13:45:45 2012
jcuff    hpc/blastall        146       Thu Jul 19 13:45:45 2012
jcuff    hpc/abyss-1.2.1     242       Thu Jul 19 13:45:45 2012
jcuff    hpc/openmpi-intel   421       Thu Jul 19 13:45:45 2012
jcuff    bio/samtools-0.1.7  127       Thu Jul 19 13:45:45 2012
jcuff    bio/bowtie-0.11.3   150       Thu Jul 19 13:45:45 2012

[SNIP]

As you can see, it tots up how many times I've loaded each module; the list goes on a bit. Then we execute this evil to get a total over all possible users. Yeah, our db layout is a bit odd and we could have done this in a much better way, but this is research computing after all: it should have a little bit of cowboy by default!
[jcuff@odyssey ~]$ checkmodule | awk '{print$2" " $3}'| perl -pe 'while(<>){ ($n,$c)=split(/ /,$_,2);($m,$j)=split(/\-/,$n);$k{$m}=$k{$m}+$c;}foreach $h (keys %k){print "$k{$h} $h\n";}' | sort -rn 

QED, we then get a lovely list:
108040 hpc/openmpi
80607 math/matlab
59438 hpc/python
58890 hpc/matlab
49866 hpc/hdf5
46042 hpc/gsl
41291 hpc/intel
40882 hpc/gv
39281 hpc/netcdf
37982 hpc/IDL
29805 math/R
24183 hpc/fftw2
19133 hpc/svn
18334 hpc/xv
17978 hpc/mathematica
17365 hpc/gnuplot
16972 hpc/imagemagick
15314 hpc/fftw3
15196 hpc/numpy
14637 hpc/ds9
14218 hpc/gcc
13302 hpc/git
[SNIP]
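
For anyone who would rather not squint at the one-liner, here is a rough Python sketch of the same totting-up. It is not what we actually ran (that was the perl above). The function name tally_modules is made up for illustration, and the sample rows are just the checkmodule lines for my own account shown earlier, so the totals are not the site-wide numbers. It assumes each row is whitespace separated as user, module, count, and that everything after the first "-" in the module name is a version or variant to be thrown away.

```python
from collections import Counter

# Sample rows copied from the per-user checkmodule output earlier in the post.
# Columns: user, module, count, timestamp.
SAMPLE = """\
Name     Module              Count     Time
jcuff    hpc/rc              19        Thu Jul 19 13:45:53 2012
jcuff    viz/gnuplot-4.5     135       Thu Jul 19 13:45:45 2012
jcuff    hpc/blat-34         150       Thu Jul 19 13:45:45 2012
jcuff    hpc/blastall        146       Thu Jul 19 13:45:45 2012
jcuff    hpc/abyss-1.2.1     242       Thu Jul 19 13:45:45 2012
jcuff    hpc/openmpi-intel   421       Thu Jul 19 13:45:45 2012
jcuff    bio/samtools-0.1.7  127       Thu Jul 19 13:45:45 2012
jcuff    bio/bowtie-0.11.3   150       Thu Jul 19 13:45:45 2012
"""


def tally_modules(lines):
    """Sum load counts per module, ignoring anything after the first '-'.

    Hypothetical helper: assumes rows look like 'user module count ...'.
    """
    totals = Counter()
    for line in lines:
        fields = line.split()
        # Skip blank lines, the header row, and rows without a numeric count.
        if len(fields) < 3 or not fields[2].isdigit():
            continue
        base = fields[1].split("-", 1)[0]  # hpc/abyss-1.2.1 -> hpc/abyss
        totals[base] += int(fields[2])
    # Highest count first, in the spirit of `sort -rn`.
    return totals.most_common()


if __name__ == "__main__":
    for name, count in tally_modules(SAMPLE.splitlines()):
        print(count, name)
```

Unlike the one-liner, this skips the header row explicitly rather than relying on it adding up to zero.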

I guess we do a lot of hard sums here: it's openmpi and matlab all the time. Plus all the cool kids now use python and not a lot of perl: perl5mods was pulled in 3,574 times, compared to 59,438 for python... not an exact science, this, but I'd say that python is getting legs ;-)



And Seppo was right, it has a tail alright ;-)



[any opinions here are all mine, and have absolutely nothing to do with my employer]
(c) 2011 James Cuff