Thursday, December 13, 2012

careful, there's only so much DRAM and too many idiots in the world!

So I could not quite work out why our code (it is a super complicated astrophysics and math problem that Paul understands really well) was totally failing in my hands on the Phi... It was causing the card to hard fault after a few iterations of OpenMP/MPI hybrid steps...
[root@mic01 tmp]$ mpirun -host mic0 -np 122 /tmp/mhdtvd
APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
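For the record, 122 ranks on one card is an awful lot of mouths to feed: assuming the usual 8 GB of GDDR5 on a 61-core card (my back-of-the-envelope number, not a spec sheet), that works out to well under 70 MB per rank before the card's little uOS and the MPI runtime take their cut.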

So it turns out that even on clever cards, with daft people on the end of them (me), you can totally OOM the "box" rather spectacularly! The one below did not also take out the service processor; my first attempt needed some judicious use of the *ahem* "power button"... I really ought not to be allowed around newfangled technology, but malloc() has always been a challenge for me. Hehehe.
[ 1301.580549] SLUB: Unable to allocate memory on node -1 (gfp=0x20)
[ 1301.580549] node 0: slabs: 43, objs: 2193, free: 0
[ 1301.875055] mhdtvd invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0
[ 1301.875080] mhdtvd cpuset=/ mems_allowed=0
[ 1301.875099] Pid: 12883, comm: mhdtvd Tainted: G    W  2.6.34.11-g65c0cd9 #2
[ 1301.875111] Call Trace:
[ 1301.875146]  [] ? cpuset_print_task_mems_allowed+0x91/0x9c
[ 1301.875171]  [] dump_header.isra.6+0x65/0x17d
[ 1301.875190]  [] ? ___ratelimit+0xb7/0xd8
[ 1301.875207]  [] oom_kill_process.isra.7+0x3e/0x102
[ 1301.875224]  [] __out_of_memory+0x12d/0x144
[ 1301.875241]  [] out_of_memory+0xa5/0xf5
[ 1301.875258]  [] __alloc_pages_nodemask+0x47e/0x5e4
[ 1301.875278]  [] ? ____pagevec_lru_add+0x12d/0x143
[ 1301.875312]  [] handle_mm_fault+0x24f/0x6a2
[ 1301.875334]  [] ? sched_setaffinity+0xe8/0xfa
[ 1301.875365]  [] do_page_fault+0x242/0x25a
[ 1301.875383]  [] page_fault+0x1f/0x30
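Had I thought to look first, the card will happily tell you how much memory it actually has spare before you go and fill it. A minimal sketch, assuming nothing fancier than the /proc interface on the card's little embedded Linux:

# quick look at total/free memory on the card before launching anything big
ssh mic0 cat /proc/meminfo | head -3
ssh mic0 free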

It is nice that you can check the logs as easily as this. I'm sure there will be a Splunk plugin at some point. You can also see below that I follow the "oh, that failed, maybe just try running it again" school of systems development... I tried quite a few times before I understood what I was doing wrong... ;-)

[root@mic01 tmp]$ ssh mic0 dmesg | grep kill

[ 1301.875055] mhdtvd invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0
[ 1301.875207] [] oom_kill_process.isra.7+0x3e/0x102
[ 1302.223134] Out of memory: kill process 12869 (pmi_proxy) score 927878696 or a child
[ 1348.584321] mhdtvd invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0
[ 1348.584458] [] oom_kill_process.isra.7+0x3e/0x102
[ 1348.864501] Out of memory: kill process 13141 (pmi_proxy) score 514225072 or a child
[ 1348.903676] mhdtvd invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0
[ 1348.903843] [] oom_kill_process.isra.7+0x3e/0x102
[ 1349.194894] Out of memory: kill process 13141 (pmi_proxy) score 511653966 or a child
[ 1401.777290] mhdtvd invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0
[ 1401.777435] [] oom_kill_process.isra.7+0x3e/0x102
[ 1402.017363] Out of memory: kill process 13383 (pmi_proxy) score 264334911 or a child
[ 1402.034030] mhdtvd invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0
[ 1402.034176] [] oom_kill_process.isra.7+0x3e/0x102
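Whether the right answer is fewer ranks, more OpenMP threads per rank, or simply a smaller problem per rank, the launch line is the first knob to turn. A sketch of the sort of thing I mean, assuming Intel MPI's hydra launcher; the rank and thread counts (and the -env syntax) are illustrative, not the settings we actually settled on:

# fewer MPI ranks, more OpenMP threads per rank, so each rank's
# working set has a fighting chance of fitting in the card's DRAM
mpirun -host mic0 -np 15 -env OMP_NUM_THREADS 8 /tmp/mhdtvd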

BTW, we now run this check... pretty much all the time during development ;-)
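A minimal sketch of what that looks like in practice, assuming the check in question is just the dmesg grep above left running in a loop on the host:

# keep an eye on the card's kernel log for OOM kills while a job runs
watch -n 10 'ssh mic0 dmesg | grep -i kill'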


Suffice it to say, Paul is now running the rest of the tests... I'm not to be trusted ;-) Oh, and see what else happens when Paul uses the card rather than me: yep, that's 100% utilization, baby!


mic0 (cores):
   Device Utilization: User:   99.91%,   System:   0.09%,   Idle:   0.00%
   Per Core Utilization (61 cores in use)
      Core #1:   User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #2:   User:  99.72%,   System:   0.28%,   Idle:   0.00%
      Core #3:   User:  99.72%,   System:   0.28%,   Idle:   0.00%
      Core #4:   User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #5:   User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #6:   User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #7:   User:  99.72%,   System:   0.28%,   Idle:   0.00%
      Core #8:   User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #9:   User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #10:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #11:  User:  99.44%,   System:   0.56%,   Idle:   0.00%
      Core #12:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #13:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #14:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #15:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #16:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #17:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #18:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #19:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #20:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #21:  User:  99.72%,   System:   0.28%,   Idle:   0.00%
      Core #22:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #23:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #24:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #25:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #26:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #27:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #28:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #29:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #30:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #31:  User:  99.72%,   System:   0.28%,   Idle:   0.00%
      Core #32:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #33:  User:  99.72%,   System:   0.28%,   Idle:   0.00%
      Core #34:  User:  99.72%,   System:   0.28%,   Idle:   0.00%
      Core #35:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #36:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #37:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #38:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #39:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #40:  User:  99.72%,   System:   0.28%,   Idle:   0.00%
      Core #41:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #42:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #43:  User:  99.72%,   System:   0.28%,   Idle:   0.00%
      Core #44:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #45:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #46:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #47:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #48:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #49:  User:  99.72%,   System:   0.28%,   Idle:   0.00%
      Core #50:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #51:  User:  99.72%,   System:   0.28%,   Idle:   0.00%
      Core #52:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #53:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #54:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #55:  User:  99.44%,   System:   0.56%,   Idle:   0.00%
      Core #56:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #57:  User:  99.72%,   System:   0.28%,   Idle:   0.00%
      Core #58:  User:  99.44%,   System:   0.56%,   Idle:   0.00%
      Core #59:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
      Core #60:  User:  99.72%,   System:   0.28%,   Idle:   0.00%
      Core #61:  User: 100.00%,   System:   0.00%,   Idle:   0.00%
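(That per-core listing is the sort of output Intel's micsmc status utility from MPSS produces; I believe the command-line form is roughly micsmc -c for the cores view, but the flag is from memory, so check your own MPSS docs.)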



[any opinions here are all mine, and have absolutely nothing to do with my employer]
(c) 2011 James Cuff