Tuesday, November 19, 2013

Accelerating science?

Catching up on the SC13 news via twitter last night gave me some newfound focus after a rather disappointing day "optimizing" a popular MCMC algorithm in the life sciences:

Finally it looks like we are going to see some really interesting workloads used as HPC benchmarks; it's going to be a whole lot more fun than simply diagonalizing matrices...

Anyway, back to yesterday: I was disappointed, yeah? Well, we are a service shop here in the day job. We have 50,000+ x86 CPUs in our cluster, but no matter, there are always times when it is never enough. We also have a bunch of GPGPU and Phi cards being used for all sorts of science, and until yesterday I'd never really tried them in anger myself. Full disclosure: I do a lot more management these days than coding, but I still think there's enough life in this old dog to have a poke at code and computers every now and then ;-)

So first up - MrBayes. This is a really popular piece of code that also uses the awesome LibHMSBeagle (love the name!). From the website, Beagle is "a high-performance library that can perform the core calculations at the heart of most Bayesian and Maximum Likelihood phylogenetics packages. It can make use of highly-parallel processors such as those in graphics cards (GPUs) found in many PCs."

Fantastic - this should be cake!

So I dived right in and pulled the source and started to target one of our Phi boards:

--enable-phi

if test "$enable_phi" = yes; then
   AC_MSG_ERROR([Intel Phi not supported on this system])
fi

Well, that's kinda cute, eh?
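
In shell terms, the second you actually ask for the Phi, configure just bails. This is my reconstruction of what tripping that check looks like (AC_MSG_ERROR prints a configure error and stops dead), not a verbatim capture:

[jcuff@phi beagle-lib-read-only]$ ./configure --enable-phi
...
configure: error: Intel Phi not supported on this system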

Anyway, the boys and girls over at Beagle are clearly still working on it; no matter, I pushed on. Remember, I'm nowhere near the skill level of these folks - they literally write code like this to optimize MCMC and other methods for the physical hardware. For example, this piece for SSE, extracted from libhmsbeagle/CPU/BeagleCPU4StateSSEImpl.hpp, is the stuff of legend:
/* Loads (transposed) finite-time transition matrices into SSE vectors */
#define SSE_PREFETCH_MATRICES(src_m1, src_m2, dest_vu_m1, dest_vu_m2) \
        const double *m1 = (src_m1); \
        const double *m2 = (src_m2); \
        for (int i = 0; i < OFFSET; i++, m1++, m2++) { \
                dest_vu_m1[i][0].x[0] = m1[0*OFFSET]; \
                dest_vu_m1[i][0].x[1] = m1[1*OFFSET]; \
                dest_vu_m2[i][0].x[0] = m2[0*OFFSET]; \
                dest_vu_m2[i][0].x[1] = m2[1*OFFSET]; \
                dest_vu_m1[i][1].x[0] = m1[2*OFFSET]; \
                dest_vu_m1[i][1].x[1] = m1[3*OFFSET]; \
                dest_vu_m2[i][1].x[0] = m2[2*OFFSET]; \
                dest_vu_m2[i][1].x[1] = m2[3*OFFSET]; \
        }

Hardcore stuff indeed!

Anyway, after a fair amount of poking and prodding I did achieve nirvana: MPI MrBayes running on the Phi! Trust me, it's not as simple as "icc -mmic", and I also had to go forward without the lovely LibHMSBeagle... but it did work! I've talked about Phi ports before; I even went nuts and moved Perl over to an early board last year.
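
For the curious, the build that finally worked looked roughly like the sketch below. I'm reconstructing it from memory, so treat the compiler wrapper and flags as assumptions about my particular icc / Intel MPI setup rather than a recipe:

[jcuff@phi mrbayes_3.2.2]$ ./configure CC="mpiicc -mmic" \
      --enable-mpi=yes --with-beagle=no --enable-sse=no
[jcuff@phi mrbayes_3.2.2]$ make
# (which variable the configure script actually honours for the MPI compiler
#  took some of that poking and prodding; CC above is illustrative)
# the resulting binary only runs on the card itself, so push it and the
# example data over to mic0 and launch it from there with the MIC-side mpirun:
[jcuff@phi mrbayes_3.2.2]$ scp src/mb examples/sceloporus.nex mic0: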

So I used this small example from the MrBayes distribution to have a looksee at some Spiny Lizards (Sceloporus magister): a pretty simple data set of 123 taxa and only 1,606 bases, so a pretty bog standard run. We ran with mcmcp ngen=100000 nruns=4 nchains=4 printfreq=10000 samplefreq=1000; to keep it simple.
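
In practice that just means a mrbayes block at the end of the nexus file, something like this (a reconstruction, not a verbatim copy of the block that ships with the example):

cat >> sceloporus.nex <<'EOF'
begin mrbayes;
   set autoclose=yes nowarn=yes;
   mcmcp ngen=100000 nruns=4 nchains=4 printfreq=10000 samplefreq=1000;
   mcmc;
end;
EOF

With that in place, off we went on the card: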

[jcuff@mic0 examples]$ mpirun -np 16 ../mb ./sceloporus.nex
                            MrBayes v3.2.2 x64

                      (Bayesian Analysis of Phylogeny)

                             (Parallel version)
                         (16 processors available)

              Distributed under the GNU General Public License

               Type "help" or "help " for information
                     on the commands that are available.

                   Type "about" for authorship and general
                       information about the program.


   Executing file "./sceloporus.nex"
   DOS line termination
   Longest line length = 1632
   Parsing file
   Expecting NEXUS formatted file
   Reading data block
      Allocated taxon set
      Allocated matrix
      Defining new matrix with 123 taxa and 1606 characters
      Data is Dna
      Missing data coded as ?
      Gaps coded as -

So off we went with two otherwise identical versions, one for the x86 box and one for the Phi, built with the line below so that the standard non-SSE likelihood calculator is used in both (the "with sse" number further down is a separate SSE build of the x86 version, kept around for reference). I know there are far more hardware threads available on the Phi, but I wanted a quick like-for-like comparison; besides, this is an MPI code, so it is not well suited to the highly threaded architecture of the Phi card, and piling on more ranks actually made things much worse. Anyway, off we go:
[jcuff@phi src]$ ./configure --enable-mpi=yes --with-beagle=no --enable-sse=no
Some slight runtime issues here: the two-socket x86 box managed between 3:35 and 2:51, but the Phi came in at 1 hour and 11 minutes...
[jcuff@box0 examples]$ mpirun -np 16 ../mb.x86 sceloporus.nex 

Gave:
       10000 -- (-12413.627) [...15 remote chains...] -- 0:03:35 (no sse)
       10000 -- (-12416.929) [...15 remote chains...] -- 0:02:51 (with sse)

[jcuff@mic0 examples]$ mpirun -np 16 ../mb.phi sceloporus.nex 

Gave:
       10000 -- (-12375.516) [...15 remote chains...] -- 1:11:42 (no sse)

Now I know "I'M DOING THIS WRONG(tm)" - this is why blogs have comments ;-)

However, first off, it is just not that simple to get a native version of a complex code running, and I know there are many ways to speed up code on the Phi - it is, after all, a very different architecture. I also wondered how our researchers would tackle this: they are life scientists, and they generally go after algorithms that work out of the box so they can get on with the rest of their day jobs. I'm seeing a 23x gap here, and that was too much for someone of my limited skills to close.
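
If I were to have another crack at closing that gap myself, the first thing I'd do is ask the compiler what it is (and isn't) managing to vectorize for the card, since the wide vector units are where most of the Phi's theoretical advantage lives. Purely a sketch, with flags per the icc of the day:

[jcuff@phi mrbayes_3.2.2]$ make clean
[jcuff@phi mrbayes_3.2.2]$ make CFLAGS="-mmic -O3 -vec-report2"
# then stare at which of the likelihood loops report "LOOP WAS VECTORIZED"
# and which don't, before even thinking about restructuring any code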

So I toddled off to a machine with a GPGPU (well, actually four of them, but that's a story for another day) to go and use LibBeagle in anger. It turns out that some of the beagle functions are not all that well documented, but we found them:
MrBayes > showbeagle

   Available resources reported by beagle library:
        Resource 0:
        Name: CPU
        Flags: PROCESSOR_CPU PRECISION_DOUBLE PRECISION_SINGLE COMPUTATION_SYNCH
             EIGEN_REAL EIGEN_COMPLEX SCALING_MANUAL SCALING_AUTO
             SCALING_ALWAYS SCALING_DYNAMIC SCALERS_RAW SCALERS_LOG
             VECTOR_NONE VECTOR_SSE THREADING_NONE

        Resource 1:
        Name: Tesla K20c
        Desc: Global memory (MB): 4800 | Clock speed (Ghz): 0.71 | Number of cores: 2496
        Flags: PROCESSOR_GPU PRECISION_DOUBLE PRECISION_SINGLE COMPUTATION_SYNCH
             EIGEN_REAL EIGEN_COMPLEX SCALING_MANUAL SCALING_AUTO
             SCALING_ALWAYS SCALING_DYNAMIC SCALERS_RAW SCALERS_LOG
             VECTOR_NONE THREADING_NONE

        Resource 2:
        Name: Tesla K20c
        Desc: Global memory (MB): 4800 | Clock speed (Ghz): 0.71 | Number of cores: 2496
        Flags: PROCESSOR_GPU PRECISION_DOUBLE PRECISION_SINGLE COMPUTATION_SYNCH
             EIGEN_REAL EIGEN_COMPLEX SCALING_MANUAL SCALING_AUTO
             SCALING_ALWAYS SCALING_DYNAMIC SCALERS_RAW SCALERS_LOG
             VECTOR_NONE THREADING_NONE

        Resource 3:
        Name: Tesla K20c
        Desc: Global memory (MB): 4800 | Clock speed (Ghz): 0.71 | Number of cores: 2496
        Flags: PROCESSOR_GPU PRECISION_DOUBLE PRECISION_SINGLE COMPUTATION_SYNCH
             EIGEN_REAL EIGEN_COMPLEX SCALING_MANUAL SCALING_AUTO
             SCALING_ALWAYS SCALING_DYNAMIC SCALERS_RAW SCALERS_LOG
             VECTOR_NONE THREADING_NONE

        Resource 4:
        Name: Tesla K20c
        Desc: Global memory (MB): 4800 | Clock speed (Ghz): 0.71 | Number of cores: 2496
        Flags: PROCESSOR_GPU PRECISION_DOUBLE PRECISION_SINGLE COMPUTATION_SYNCH
             EIGEN_REAL EIGEN_COMPLEX SCALING_MANUAL SCALING_AUTO
             SCALING_ALWAYS SCALING_DYNAMIC SCALERS_RAW SCALERS_LOG
             VECTOR_NONE THREADING_NONE

So let's compare these two:

MrBayes > set usebeagle=yes beagledevice=gpu
[jcuff@gpu01 examples]$
real    1m37.645s
user    1m36.917s
sys     0m0.154s

[1]-  Done                    time ../src/mb sceloporus.nex > out.x86

[jcuff@gpu01 examples]$
real    2m19.238s
user    1m53.320s
sys     0m19.802s

[2]+  Done                    time ../src/mb sceloporus.nex.gpu > out.gpu
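
(For reference, those are the two timed jobs exactly as they appear in the job-control output above; the only difference between the inputs, as best I can reconstruct it, is the usebeagle/beagledevice toggle shown at the MrBayes prompt, carried in the .gpu copy of the file.)

[jcuff@gpu01 examples]$ time ../src/mb sceloporus.nex > out.x86 &
[jcuff@gpu01 examples]$ time ../src/mb sceloporus.nex.gpu > out.gpu &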

Well - once again for this workload we are clearly not "Accelerating science".

I do have to say that, on the other hand, I am really looking forward to watching the development of libraries such as LibHMSBeagle. For example, once those folks get AVX and all the other magic running natively, I'm sure we will see massive increases in speed on all of the accelerator boards. Until things calm down a bit, I'm sticking with bog standard x86 boxes and a halfway decent interconnect. Remember, this is also a rather simple code base at only ~200k lines of code...
[jcuff@phi mrbayes_3.2.2]$ wc -l beagle-lib-read-only/* src/* | grep total
206,791 total

Just imagine what some of the behemoths we deal with daily would be like! It actually takes me back to the head of this post. It is a real shame that we have collectively chased the "linpack" dream, because based on all of yesterday's digging, poking and prodding, the highest performing box on the planet right now:
Site:          National Super Computer Center in Guangzhou
Manufacturer:  NUDT
Cores:         3,120,000
Rmax:          33,862.7 TFlop/s
Rpeak:         54,902.4 TFlop/s
Power:         17,808.00 kW
Memory:        1,024,000 GB
Interconnect:  TH Express-2
OS:            Kylin Linux
Compiler:      icc
Math Library:  Intel MKL-11.0.0
MPI:           MPICH2 with a customized GLEX channel

for this particular workload (and I'm going to argue for many others) probably could not actually help our researchers. We are seeing more and more accelerated systems in the Top500 - there are at least four in the top 10 announced yesterday, for example! These represent huge investments, and based on the simple back-of-the-envelope example I present here, we are going to need a whole new class of researchers and scientists to program them if we are to get the same amount of awesome science out of them that we do out of the "bog standard" kit.

Postscript: our Phi run also kinda did this after a few iterations:
APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
Command exited with non-zero status 11

and our CUDA run with mpirun -np 8 also caused a machine with 1TB of DRAM (!) to do this:
Nov 18 15:40:31 kernel: mb invoked oom-killer: gfp_mask=0x82d4
Nov 18 15:40:31 kernel: Out of memory: Kill process 4282 sacrifice child
Clearly there is much work still left to do, folks!

In the meantime I shall embrace our new HPCG overlords!

Quote: "For example, the Titan system at Oak Ridge National Laboratory has 18,688 nodes, each with a 16-core, 32 GB AMD Opteron processor and a 6GB Nvidia K20 GPU[2]. Titan was the top- ranked system in November 2012 using HPL. However, in obtaining the HPL result on Titan, the Opteron processors played only a supporting role in the result. All floating-point computation and all data were resident on the GPUs. In contrast, real applications, when initially ported to Titan, will typically run solely on the CPUs and selectively off-load computations to the GPU for acceleration."

:-)


[any opinions here are all mine, and have absolutely nothing to do with my employer]
(c) 2011 James Cuff