Thursday, December 13, 2012

careful, there's only so much DRAM and too many idiots in the world!

So I could not quite work out why our code (a super complicated astrophysics and math problem that Paul understands really well) was totally failing in my hands on the Phi... It was causing the card to hard fault after a few iterations of OpenMP/MPI hybrid steps...
[root@mic01 tmp]$ mpirun -host mic0 -np 122 /tmp/mhdtvd
APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)

So it turns out that even with clever cards, given daft people on the end of them (me), you can totally OOM the "box" rather spectacularly! The one below did not also take out the service processor; my first one needed some judicious use of the *ahem* "power button"... I really ought not be allowed around newfangled technology, but malloc() has always been a challenge for me. Hehehe.
[ 1301.580549] SLUB: Unable to allocate memory on node -1 (gfp=0x20)
[ 1301.580549] node 0: slabs: 43, objs: 2193, free: 0
[ 1301.875055] mhdtvd invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0
[ 1301.875080] mhdtvd cpuset=/ mems_allowed=0
[ 1301.875099] Pid: 12883, comm: mhdtvd Tainted: G    W  2.6.34.11-g65c0cd9 #2
[ 1301.875111] Call Trace:
[ 1301.875146]  [] ? cpuset_print_task_mems_allowed+0x91/0x9c
[ 1301.875171]  [] dump_header.isra.6+0x65/0x17d
[ 1301.875190]  [] ? ___ratelimit+0xb7/0xd8
[ 1301.875207]  [] oom_kill_process.isra.7+0x3e/0x102
[ 1301.875224]  [] __out_of_memory+0x12d/0x144
[ 1301.875241]  [] out_of_memory+0xa5/0xf5
[ 1301.875258]  [] __alloc_pages_nodemask+0x47e/0x5e4
[ 1301.875278]  [] ? ____pagevec_lru_add+0x12d/0x143
[ 1301.875312]  [] handle_mm_fault+0x24f/0x6a2
[ 1301.875334]  [] ? sched_setaffinity+0xe8/0xfa
[ 1301.875365]  [] do_page_fault+0x242/0x25a
[ 1301.875383]  [] page_fault+0x1f/0x30
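
For the record, you do not need a fancy hybrid code to reproduce these fireworks; a trivial allocator will do. Below is a minimal sketch (a hypothetical toy, not our real code). Assuming a card with the 8 GB of GDDR5 the early parts ship with, 122 ranks each grabbing a "modest" 100 MB is already north of 12 GB, so the oom-killer never stood a chance:

/* oom_toy.c - hypothetical toy, not our real code: grab memory in
 * 100 MB chunks and touch it until the card cries. Build with:
 *   icc -mmic oom_toy.c -o oom_toy
 * then run a copy (or 122 of them...) on mic0. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const size_t chunk = 100UL * 1024 * 1024;   /* 100 MB per grab */
    size_t total = 0;

    for (;;) {
        char *p = malloc(chunk);
        if (p == NULL) {                        /* the polite failure... */
            printf("malloc gave up after %zu MB\n", total >> 20);
            return 1;
        }
        memset(p, 1, chunk);   /* touch the pages; with overcommit the
                                  oom-killer usually gets us right here */
        total += chunk;
        printf("holding %zu MB\n", total >> 20);
    }
}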

It is nice that you can check logs as easily as this. I'm sure there will be a splunk plugin at some point. You can also see below that I obey the "oh, that failed, maybe just try running it again" school of systems development... I tried quite a few times before I understood what I was doing wrong... ;-)

[root@mic01 tmp]$ ssh mic0 dmesg | grep kill

[ 1301.875055] mhdtvd invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0
[ 1301.875207] [] oom_kill_process.isra.7+0x3e/0x102
[ 1302.223134] Out of memory: kill process 12869 (pmi_proxy) score 927878696 or a child
[ 1348.584321] mhdtvd invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0
[ 1348.584458] [] oom_kill_process.isra.7+0x3e/0x102
[ 1348.864501] Out of memory: kill process 13141 (pmi_proxy) score 514225072 or a child
[ 1348.903676] mhdtvd invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0
[ 1348.903843] [] oom_kill_process.isra.7+0x3e/0x102
[ 1349.194894] Out of memory: kill process 13141 (pmi_proxy) score 511653966 or a child
[ 1401.777290] mhdtvd invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0
[ 1401.777435] [] oom_kill_process.isra.7+0x3e/0x102
[ 1402.017363] Out of memory: kill process 13383 (pmi_proxy) score 264334911 or a child
[ 1402.034030] mhdtvd invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0
[ 1402.034176] [] oom_kill_process.isra.7+0x3e/0x102

BTW, we now run this pretty much all the time during development ;-)


Suffice it to say, Paul is now running the rest of the tests... I'm not to be trusted ;-) Oh, and see what else happens when Paul uses the card rather than me: yep, that's 100% utilization baby!


mic0 (cores):
   Device Utilization: User:   99.91%,   System:   0.09%,   Idle:   0.00%
   Per Core Utilization (61 cores in use)
      Core #1:   User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #2:   User:  99.72%,   System:   0.28%,   Idle:   0.00%
      Core #3:   User:  99.72%,   System:   0.28%,   Idle:   0.00%
      Core #4:   User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #5:   User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #6:   User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #7:   User:  99.72%,   System:   0.28%,   Idle:   0.00%
      Core #8:   User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #9:   User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #10:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #11:  User:  99.44%,   System:   0.56%,   Idle:   0.00%
      Core #12:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #13:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #14:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #15:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #16:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #17:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #18:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #19:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #20:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #21:  User:  99.72%,   System:   0.28%,   Idle:   0.00%
      Core #22:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #23:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #24:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #25:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #26:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #27:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #28:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #29:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #30:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #31:  User:  99.72%,   System:   0.28%,   Idle:   0.00%
      Core #32:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #33:  User:  99.72%,   System:   0.28%,   Idle:   0.00%
      Core #34:  User:  99.72%,   System:   0.28%,   Idle:   0.00%
      Core #35:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #36:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #37:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #38:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #39:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #40:  User:  99.72%,   System:   0.28%,   Idle:   0.00%
      Core #41:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #42:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #43:  User:  99.72%,   System:   0.28%,   Idle:   0.00%
      Core #44:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #45:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #46:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #47:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #48:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #49:  User:  99.72%,   System:   0.28%,   Idle:   0.00%
      Core #50:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #51:  User:  99.72%,   System:   0.28%,   Idle:   0.00%
      Core #52:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #53:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #54:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #55:  User:  99.44%,   System:   0.56%,   Idle:   0.00%
      Core #56:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #57:  User:  99.72%,   System:   0.28%,   Idle:   0.00%
      Core #58:  User:  99.44%,   System:   0.56%,   Idle:   0.00%
      Core #59:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #60:  User:  99.72%,   System:   0.28%,   Idle:   0.00%
      Core #61:  User: 100.00%,   System:   0.00%,   Idle:   0.00%


Wednesday, December 12, 2012

of #phi sockets and web servers

More fun with Phi... this time to see if we can attach to sockets natively. I decided to use Nigel Griffiths' awesome tutorial as a starter. As an aside, it is a very cool httpd tutorial if you ever wanted to know how web servers work but were afraid to ask.

OK, first up, let's build on the host...
[root@mic01 nweb]# icc -O3 -mmic nweb23.c -o nweb

Yup - OK, start up on the Phi directly:
[root@mic01-mic0 nweb]# ./nweb 80 ./

Test from the build host, connecting to the IP for the Phi:
[root@mic01 nweb]# curl mic0
<HTML>
<TITLE>nweb
</TITLE>
<BODY BGCOLOR="lightblue">
<H1>nweb Test page</H1>
<IMG SRC="nigel.jpg">
<p>
Not pretty but it should prove that nweb works :-)
<p>
Feedback is welcome to Nigel Griffiths nag@uk.ibm.com
</table>
</BODY>
</HTML>

So it works - the TCP stack is complete.
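
In case "attach to sockets natively" sounds grander than it is, here is a minimal sketch of the same trick (mine, not Nigel's nweb): a bare BSD-socket responder, cross-built the same way and run straight on the card. The port number and the reply string are arbitrary choices of mine:

/* hello_sock.c - a toy responder, not Nigel's nweb. Cross-build with:
 *   icc -mmic hello_sock.c -o hello_sock
 * and run it natively on mic0 (port 8080 is arbitrary). */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    const char *msg = "HTTP/1.0 200 OK\r\n\r\nhello from the phi\n";
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof addr);
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(8080);

    if (bind(srv, (struct sockaddr *)&addr, sizeof addr) < 0) {
        perror("bind");
        return 1;
    }
    listen(srv, 16);

    for (;;) {                       /* one request at a time, nothing fancy */
        int c = accept(srv, NULL, NULL);
        if (c < 0)
            continue;
        write(c, msg, strlen(msg));
        close(c);
    }
}

A curl mic0:8080 from the host should get the hello back, same idea as nweb above.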

p.s. I did run "ab" - you probably would not want this to be a production webserver. Hehehe. Warning: this post, like the last, was more of a fun exercise in "is it possible" and getting familiar with the system - what works, what is silly, etc.

Round these parts, "general purpose" often does mean more than monster LAPACK or STREAM results. We have a whole combination of things we need to run quickly and effectively, and having a good IP stack is one of the parts. Although we are never going to be a top 500 shop (we leave that to the big boys), these last few posts were basically just for fun and discovery.

Do not panic, team Intel - we do have real science on the go behind the scenes; I'm not really planning on running perlcgi ;-).

What has been pretty astounding is how "x86" the Phi really is; Tommy was 100% right!

I do hope Joe Curley isn't reading this - it may bust the sphygmomanometer ;-)


Tuesday, December 11, 2012

more #phi, this time PERL on PHI

You can't do this on a GPGPU... We used this prime number generator as a simple example (each thread filters out multiples of its own prime and spawns a child thread for the next prime it finds):
#!/usr/bin/perl -w
# prime-pthread, courtesy of Tom Christiansen

use strict;
use threads;
use Thread::Queue;

my $stream = new Thread::Queue;
my $kid    = new threads(\&check_num, $stream, 2);

for my $i ( 3 .. 50 ) {
    $stream->enqueue($i);
}

$stream->enqueue(undef);
$kid->join;

sub check_num {
    my ($upstream, $cur_prime) = @_;
    my $kid;
    my $downstream = new Thread::Queue;

    while (my $num = $upstream->dequeue) {
        next unless $num % $cur_prime;    # multiple of our prime: drop it
        if ($kid) {
            $downstream->enqueue($num);   # pass the survivor down the sieve
        } else {
            print "Found prime $num\n";
            $kid = new threads(\&check_num, $downstream, $num);
        }
    }

    $downstream->enqueue(undef) if $kid;
    $kid->join if $kid;
}

How to port this to a GPGPU? Well, you're gonna need a version of this (more later):
[root@mic01-mic0 test]# ../bin/perl -v | grep mic
This is perl 5, version 16, subversion 2 (v5.16.2) built for mic-linux

Here we are, running native threaded Perl right on the MIC...
[root@mic01-mic0 test]# ../bin/perl ./thread.pl
Found prime 3
Found prime 5
Found prime 7
Found prime 11
Found prime 13
Found prime 17
Found prime 19
Found prime 23
Found prime 29
Found prime 31
Found prime 37
Found prime 41
Found prime 43
Found prime 47

Done.

Simple, eh?


Read on for a quick how-to on a native Xeon Phi (MIC) build of Perl. You need cross tools and some symlinks, but that's it to run any native Perl goodness... there are better ways to do this, but I was in a hurry...

Your new "compiler" will look like this:
[jcuff@mic0 perl-5.16.0]$ cat /usr/bin/mic-linux-gcc
#!/bin/bash

icc -mmic "$@"

Make some symlinks:
ln -s /usr/bin/readelf /usr/bin/mic-linux-readelf
ln -s /usr/bin/objdump /usr/bin/mic-linux-objdump
ln -s /usr/bin/ranlib /usr/bin/mic-linux-ranlib

OK, now finally set up and build:
source /opt/intel/bin/iccvars.sh intel64

./configure --prefix=/n/sw/mic -Dcc=mic-linux-gcc -Dusethreads --target=mic-linux --host-cc=icc

Again - I'm not going to talk about benchmarks... but if a PCI card can run Perl in multiple threads, it sure makes life easier for many, especially our users... Special thanks to the folks who put this page together for the help getting started.


Friday, December 7, 2012

Fee foo #phi fum!

So many levels of awesome since I was at SC last year:

http://blog.jcuff.net/2011/11/disruptive-things-spotted-so-far-at.html

Those first pictures are now real product. Check it out... it is a real living thing... I'm not going to discuss benchmarks or other details until this is fully public, but I am happy to say, as James said, you can do a lot with this stuff! Have to thank Brian on our Intel account team for helping get this into our hands. You can learn more here.

Turns out Tommy was also right :-)

[jcuff@mic01 intro_sampleC]$ source /opt/intel/bin/iccvars.sh intel64

[jcuff@mic01 intro_sampleC]$ make mic
icc -openmp -c sampleC00.c -o sampleC00.o
icc -openmp -c sampleC01.c -o sampleC01.o
icc -openmp -c sampleC02.c -o sampleC02.o
icc -openmp -c sampleC03.c -o sampleC03.o
icc -openmp -c sampleC04.c -o sampleC04.o
icc -openmp -c sampleC05.c -o sampleC05.o
icc -openmp -c sampleC06.c -o sampleC06.o
icc -openmp -c sampleC07.c -o sampleC07.o
icc -openmp -c sampleC08.c -o sampleC08.o
icc -openmp -c sampleC09.c -o sampleC09.o
icc -openmp -c sampleC09_callee.c -o sampleC09_callee.o
icc -openmp -c sampleC10.c -o sampleC10.o
icc -openmp -c sampleC11.c -o sampleC11.o
icc -openmp -c sampleC12.c -o sampleC12.o
icc -openmp -c sampleC13.c -o sampleC13.o
icc -openmp -c sampleC14.c -o sampleC14.o
icc -openmp -c sampleC_driver.c -o sampleC_driver.o
icc sampleC00.o sampleC01.o sampleC02.o sampleC03.o sampleC04.o sampleC05.o sampleC06.o sampleC07.o sampleC08.o sampleC09.o sampleC09_callee.o sampleC10.o sampleC11.o sampleC12.o sampleC13.o sampleC14.o sampleC_driver.o -openmp -o intro_sampleC.out
....
Build complete
Run : intro_sampleC.out


[jcuff@mic01 intro_sampleC]$ ./intro_sampleC.out
Samples started
Checking for Intel(R) MIC Architecture (Target CPU) devices...

Number of Target devices installed: 1

Offload sections will execute on: Target CPU (offload mode)

PASS Sample01
PASS Sample02
PASS Sample03
PASS Sample04
PASS Sample05
PASS Sample06
PASS Sample07
PASS Sample08
PASS Sample09
PASS Sample10
PASS Sample11
PASS Sample12
PASS Sample13
PASS Sample14
Samples complete
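
For flavour, the offload sections those samples exercise boil down to something like this - a minimal sketch of my own (not Intel's actual sample source), using the same #pragma offload model: the marked block runs on the card, and scalars used inside it are copied back automatically (inout by default):

/* offload_sum.c - my toy, not Intel's sample code. Build with:
 *   icc -openmp offload_sum.c -o offload_sum */
#include <stdio.h>

int main(void)
{
    int i, sum = 0;

    /* this block is shipped to the card; with no MIC device present
     * the runtime falls back to running it on the host */
    #pragma offload target(mic)
    {
        #pragma omp parallel for reduction(+:sum)
        for (i = 1; i <= 100; i++)
            sum += i;
    }

    printf("sum = %d\n", sum);   /* 5050 */
    return 0;
}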


Yep, the samples work; let us try a spot of math...
[jcuff@mic01 LEO_tutorial]$ ./tbo_sort.out

C/C++ Tutorial: Offload Demonstration

Checking for Intel(R) MIC Architecture (Target CPU) devices...

Number of Target devices installed: 1

Unsorted original values...first twenty (20) values:
Evens and Odds:
1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18 19 20

Sorted results...first ten (10) values each:
Evens: 2 4 6 8 10 12 14 16 18 20
Odds : 1 3 5 7 9 11 13 15 17 19
Primes: 2 3 5 7 11 13 17 19 23

Q.E.D.


[any opinions here are all mine, and have absolutely nothing to do with my employer]
(c) 2011 James Cuff