Friday, July 18, 2014

Of style and science...


There are times in your career that you really, really remember.

This was one of those times.

My then head of department, the dearly departed and most wonderful Professor Dame Louise Johnson, wrote this note to my D. Phil. supervisor Geoff back in 1997.  Geoff recently sent me a copy while clearing out space to move into his fabulous new building over in Dundee.

To this day, I love that Louise, who was an absolute scientific powerhouse, said of my research:

"we thought the science was fine"

More importantly though, it is her note about their concern for my writing style that has really stuck with me over the years!

Nowt's changed much! ;-)





Tuesday, June 24, 2014

Ohai Linux! So you are a network switch now...

Decided to see what the fuss was all about surrounding these open source switches. Plus the rocket powered turtle really did pique my interest ;-)


[ http://cumulusnetworks.com and http://onie.org/ ]

I built all of this on a CentOS release 6.5 (Final) box, and I wanted to build everything from source to really see how ONIE worked from the ground up. Don't try this at home kids, there really is no need to damage yourself like this.
git clone https://github.com/onie/onie.git

Needed to add some deps. Finding what was missing was a little painful (much make, fail, fix, repeat), but the list below should be enough for most folks, so you don't have to go through the iterations I did - this is a monster build. I learned a lot here, never having used "realpath" before, for example, or any of the syslinux kit, which is fab!
sudo yum install realpath
sudo yum install gperf
sudo yum install stgit
sudo yum install texinfo
sudo yum install glibc-static
sudo yum install libexpat-devel
sudo yum install python-devel
sudo yum install fakeroot
sudo yum install syslinux syslinux-devel syslinux-extlinux syslinux-perl
sudo ln -s /usr/share/syslinux /usr/lib/syslinux
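
Or, if you prefer, the same set of deps as a single line (exactly the packages above, nothing new):
sudo yum install -y realpath gperf stgit texinfo glibc-static libexpat-devel \
    python-devel fakeroot syslinux syslinux-devel syslinux-extlinux syslinux-perl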

Oh, and get a fresh autoconf if you are on CentOS 6.5:
wget http://ftp.gnu.org/gnu/autoconf/autoconf-latest.tar.gz
tar zxvf autoconf-latest.tar.gz
cd autoconf-2.69/
ls
./configure
make
sudo make install
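
Worth a quick sanity check that the freshly installed autoconf is the one on your PATH before kicking off the build:
autoconf --version | head -1
# should report autoconf (GNU Autoconf) 2.69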

And away we go!
[jcuff@jcair-vm build-config]$ make -j4 MACHINE=kvm_x86_64 all recovery-iso

mkdir: created directory `/home/jcuff/onie/build'
mkdir: created directory `/home/jcuff/onie/build/images'
mkdir: created directory `/home/jcuff/onie/build/download'
mkdir: created directory `/home/jcuff/onie/build/kvm_x86_64-r0'
mkdir: created directory `/home/jcuff/onie/build/kvm_x86_64-r0/stamp'
mkdir: created directory `/home/jcuff/onie/build/kvm_x86_64-r0/initramfs'
==== Getting Linux ====
2014-06-11 14:50:54 URL:http://dev.cumulusnetworks.com/~curt/mirror/onie/linux-3.2.35.tar.xz [65143140/65143140] -> "/home/jcuff/onie/build/download/linux-3.2.35.tar.xz" [1]
linux-3.2.35.tar.xz: OK

wheee! (get a large beverage, this bit takes a while!)
[jcuff@jcair-vm build-config]$ ls -ltra ../build/images/
total 34212
drwxrwxr-x. 7 jcuff jcuff     4096 Jun 11 17:24 ..
-rw-rw-r--. 1 jcuff jcuff  3301792 Jun 11 18:29 kvm_x86_64-r0.vmlinuz
-rw-rw-r--. 1 jcuff jcuff  5284988 Jun 13 11:23 kvm_x86_64-r0.initrd
-rw-rw-r--. 1 jcuff jcuff  8603253 Jun 13 11:23 onie-updater-x86_64-kvm_x86_64-r0
drwxrwxr-x. 2 jcuff jcuff     4096 Jun 13 11:29 .
-rw-rw-r--. 1 jcuff jcuff 17825792 Jun 13 11:30 onie-recovery-x86_64-kvm_x86_64-r0.iso

Make a disk:
[root@jcair-vm onie]# dd if=/dev/zero of=/tmp/onie-x86-demo.img bs=1M count=256
256+0 records in
256+0 records out
268435456 bytes (268 MB) copied, 0.272711 s, 984 MB/s
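
One thing the transcript glosses over: the qemu-kvm line below wants the ISO at /tmp/onie.iso, so (my assumption about what I did at the time) the recovery image built earlier simply gets copied into place first from the onie directory:
cp build/images/onie-recovery-x86_64-kvm_x86_64-r0.iso /tmp/onie.iso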

Spin up the kvm!
[root@jcair-vm onie]# sudo /usr/libexec/qemu-kvm -m 1024 -name onie -boot order=cd,once=d -cdrom /tmp/onie.iso -net nic,model=e1000 -vnc 0.0.0.0:0 -vga std -drive file=/tmp/onie-x86-demo.img,media=disk,if=virtio,index=0 -serial telnet:localhost:9000,server



And you are golden!
ONIE: Starting ONIE Service Discovery
Info: Found static url: file:///lib/onie/onie-updater
ONIE: Executing installer: file:///lib/onie/onie-updater
Verifying image checksum ... OK.
Preparing image archive ... OK.
ONIE: Version       : master-201406241118-dirty
ONIE: Architecture  : x86_64
ONIE: Machine       : kvm_x86_64
ONIE: Machine Rev   : 0
ONIE: Config Version: 1
Installing ONIE on: /dev/vda
Pre installation hook
Post installation hook
Rebooting...

Remove the CD from your config and you can now boot the live version. If everything has worked out, the discovery process will run and you can now ping the UK from the USA...
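
For reference, that second boot is just the same qemu-kvm invocation as before with the CD bits dropped - a sketch, assuming the install landed on the virtio disk as shown above:
sudo /usr/libexec/qemu-kvm -m 1024 -name onie -boot order=c \
  -net nic,model=e1000 -vnc 0.0.0.0:0 -vga std \
  -drive file=/tmp/onie-x86-demo.img,media=disk,if=virtio,index=0 \
  -serial telnet:localhost:9000,server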
ONIE: Rescue Mode ...
Version   : master-201406241118-dirty
Build Date: 2014-06-24T11:40-0400
Info: Mounting kernel filesystems... done.
Info: Mounting LABEL=ONIE-BOOT on /mnt/onie-boot ...
Running demonstration platform init pre_arch routines...
Running demonstration platform init post_arch routines...
Info: Using eth0 MAC address: 52:54:00:2b:63:f6
Info: eth0:  Checking link... up.
Info: Trying DHCPv4 on interface: eth0
ONIE: Using DHCPv4 addr: eth0: 192.168.122.120 / 255.255.255.0
Starting: dropbear ssh daemon... done.
Starting: telnetd... done.
discover: Rescue mode detected.  Installer disabled.

Please press Enter to activate this console. 

ONIE:/ # onie-sysinfo -a
VM-1234567890 52:54:00:2b:63:f6 master-201406241118-dirty 42623 kvm_x86_64 0 x86_64-kvm_x86_64-r0 x86_64 1 gpt 2014-06-24T11:40-0400

ONIE:/ # ping www.ebi.ac.uk
PING www.ebi.ac.uk (193.62.192.80): 56 data bytes
64 bytes from 193.62.192.80: seq=0 ttl=61 time=108.473 ms
64 bytes from 193.62.192.80: seq=1 ttl=61 time=103.824 ms
64 bytes from 193.62.192.80: seq=2 ttl=61 time=103.238 ms
^C
--- www.ebi.ac.uk ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 103.238/105.178/108.473 ms

p.s. for extra twisted points, this is ONIE running on Linux KVM, inside VirtualBox, on a Mac, spanning a pair of different layer three networks... it becomes a little confusing to run commands, but it always makes me chuckle that a Mac laptop is basically a little data center at this point :-)
jcair:~ jcuff$ uname -v

Darwin Kernel Version 13.2.0: Thu Apr 17 23:03:13 PDT 2014; root:xnu-2422.100.13~1/RELEASE_X86_64

jcair:~ jcuff$ ssh -p 2222 root@10.251.211.187 ssh 192.168.122.120 uname -a

Linux onie 3.2.35-onie+ #1 SMP Tue Jun 24 11:30:01 EDT 2014 x86_64 GNU/Linux


Enjoy!

Monday, May 5, 2014

compressing DRAM with ZRAM for fun and profit?

Theory:

Can you use compressed DRAM for science if you don't quite have enough memory?

TL;DR No

I'm going to file this under "Great idea, but my execution is slightly suspect"

Anyway, here's an example setup of compressed swap devices:
[root@jcair-vm ~]# modprobe zram

[root@jcair-vm ~]# mkswap /dev/zram0
Setting up swapspace version 1, size = 104860756 KiB
no label, UUID=58476253-ad5a-4595-9bec-60bd09d76d30

[root@jcair-vm ~]# mkswap /dev/zram1
Setting up swapspace version 1, size = 104860756 KiB
no label, UUID=ed5d0f85-0245-472e-902e-0e94a743cbe0

[root@jcair-vm ~]# swapon -p5 /dev/zram0 
[root@jcair-vm ~]# swapon -p5 /dev/zram1

[root@jcair-vm ~]# swapon -s
Filename                                Type            Size    Used    Priority
/dev/zram0                              partition       104860752       0       5
/dev/zram1                              partition       104860752       0       5
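
One detail I skipped above: depending on your kernel, the zram devices want a device count and a disksize set before mkswap will give you anything this size. A rough sketch of that step, matching the ~100G sizes in the output above (sysfs knob names can vary between kernels, so treat this as an assumption rather than gospel):
modprobe zram num_devices=2
# disksize is taken in bytes here; 100 GiB apiece to line up with the mkswap output
echo $((100*1024*1024*1024)) > /sys/block/zram0/disksize
echo $((100*1024*1024*1024)) > /sys/block/zram1/disksize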

Clearly without the zram setup above, stress fails right out of the gate:
[root@jcair-vm ~]# stress --vm-bytes 2344600024 -m 2 --vm-keep
stress: info: [6063] dispatching hogs: 0 cpu, 0 io, 2 vm, 0 hdd
stress: FAIL: [6063] (415) -- worker 6065 got signal 9
stress: WARN: [6063] (417) now reaping child worker processes
stress: FAIL: [6063] (451) failed run completed in 10s

But running a stress test with a memory allocation much bigger than the host seems to work just fine and dandy once we have our zram swap devices set up as above:
[root@jcair-vm ~]# stress --vm-bytes 2344600024 -m 2 --vm-keep
stress: info: [5383] dispatching hogs: 0 cpu, 0 io, 2 vm, 0 hdd

top - 10:41:17 up 13 days, 22:03,  4 users,  load average: 2.91, 0.87, 0.29
Tasks: 192 total,   4 running, 188 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.1%us, 74.8%sy,  0.0%ni, 11.5%id,  0.0%wa,  0.1%hi, 13.4%si,  0.0%st
Mem:   3923468k total,  3840852k used,    82616k free,     5368k buffers
Swap: 209721504k total,   626964k used, 209094540k free,    36932k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5385 root      20   0 2242m 1.2g  124 R 96.4 31.3   0:48.12 stress
 5384 root      20   0 2242m 1.2g  124 R 84.0 32.0   0:48.12 stress

Yay! So - this looks like it could work!

And so here we go with a genome aligner to see if this works for real. It should be a good test, as it writes real data structures into memory, whereas stress was just doing a block fill. So first up, let's try without enough RAM:
[root@jcair-vm ~]# cat go.sh 
./bowtie2/bowtie2 -x ./hg19 -p 4  <( zcat Sample_cd1m_3rdrun_1_ATCACG.R1.fastq.gz)

[root@jcair-vm ~]# ./go.sh 
Out of memory allocating the ebwt[] array for the Bowtie index.  Please try
again on a computer with more memory.

Error: Encountered internal Bowtie 2 exception (#1)

Command: /root/bowtie2/bowtie2-align-s --wrapper basic-0 -x ./hg19 -p 4 /dev/fd/63 
(ERR): bowtie2-align exited with value 1

Ok, fair enough, so we have a reproducer.

Let's now set up a run with the right amount of physical ram:
[root@jcair-vm ~]# ./bowtie2/bowtie2 -x ./hg19 -p 4 <(cat cuff.fastq) -S out.dat &

7467 root 20 0 3606m 3.3g 1848 S 389.3 58.3 51:37.25 bowtie2-align-s

And we have a result!
[root@jcair-vm ~]# time ./bowtie2/bowtie2 -x ./hg19 -p 4 <(cat cuff.fastq)  -S out.dat 
13558597 reads; of these:
  13558597 (100.00%) were unpaired; of these:
  11697457 (86.27%) aligned 0 times
    545196 (4.02%) aligned exactly 1 time
   1315944 (9.71%) aligned >1 times
13.73% overall alignment rate

Ok, so let's shrink the memory of the machine and see if we can run with zram.

Let's also set the same priority and round robin between physical swap and zram, so each can write/read a block in turn - that should give nicely balanced I/O. The stress test worked, so our theory is that the data and in-memory structures should compress and we ought to get at least a 1:1.5 or 1:2 ratio out of the memory. I settled on a 3G machine with a 3G zram device and some physical swap as well:
[jcuff@jcair-vm ~]$ swapon -s
Filename                                Type            Size    Used    Priority
/dev/zram0                              partition       2947008 614124  1
/dev/dm-1                               partition       4063224 614176  1
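
For completeness, matching priorities are what let the kernel round robin between the two devices; something along these lines does it (device names as per the swapon -s output above):
swapon -p 1 /dev/zram0
swapon -p 1 /dev/dm-1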

When running, it did result in a *much* smaller RES (982m vs 3.3G from the native example):

 2350 root      20   0 3606m 982m 1020 S 20.8 33.3  12:26.74 bowtie2-align-s

Things chugged along, but I was not seeing this ending any time soon, so I truncated the read file dramatically to ca. 5k reads to get a quick comparison between zram alone, a hybrid of zram and swap, and plain old boring swap files.

As you can see below, only "boring old swap" resulted in anything sensible. The zram alone caused some rather spectacular OOM errors and obvious system instability - it was kinda fun though. You can also see below the various configurations we tried out, none of which actually worked, but we are also not totally alone here either.

Oh and: "Just the right amount of memory" - like Goldilocks, that one worked ;-)
Machine with memory too small:          (ERR): bowtie2-align exited

3G zram:                                sshd invoked oom-killer: gfp_mask=0x200da

Hybrid 3G zram + 4G physical swap:      6m 25.285s

Hybrid 500MB zram + 4G physical swap:   1m 51.029s

Regular /dev/dm-1 swap file:            0m 29.741s

Machine with enough ram:                0m 12.698s

In summary... NO PROFIT this time :-(

Still a neat idea - just don't try this at home kids!

Thursday, April 17, 2014

of painting, retirement plans and minimum wage


So my lovely "painting diva by night" Michele Clamp bangs out some epic watercolors...


Michele totally scored today! A great friend of ours bought one of her paintings. For ONE HUNDRED DOLLARS! Tonight we decided to look at how we are going to fund our new found retirement from paintings! Here's the transaction, I kid you not, she literally made ONE HUNDRED DOLLARS!


And here is the lovely (now sold!) "Pig in Clover" in his new rather resplendent frame waiting to go to the CCAE to hang out with his chums in the rest of Michele's exhibition...


I did think at the time that charging ONE HUNDRED BUCKS was a bit steep, especially to a great and close friend of ours, so I asked Michele to pull together the numbers.

It was rather sad as you can see:
Actual Painting                 1 hour
Buying frame                    1 hour
Framing                         1 hour
Ferrying to/from gallery        1 hour

Paint                           $1.00
Paper                           $1.00
Brush wear and tear             $1.00
Frame                           $15.00
                                -------
Sale price                      $100.00
Minus fees to the lovely CCAE   $50.00

Net                             $32.00 

Income @ 4 hours                $8.00 /hour

Which, in the state of MA, is exactly the minimum wage of eight bucks an hour.

Clearly not quite time to retire yet!

Especially given our current sales rate is about one painting every six months, which puts Michele well... yeah best we don't even bother with that math, it would not keep us in adult beverages.

hehe!

Thursday, April 10, 2014

of schools and of school districts

So folks in the USA worry a lot about where to send their kids to school. Entire family decisions are made based upon relocating to the right regions, towns and cities in America so their kids can get "the best education they can afford". It's a very big dealio.

For example, http://zillow.com has huge sections of school data built right into the purchase section for any property. Here, for example, is a $900,000 home in a town called Sudbury in Massachusetts (it's a bit posh, but I wanted to use it as an example - we would never live there!). The yearly council and property taxes for this particular place come in at over $16,000. But check this out for some of the local schools - you can clearly see where the money goes, right?



Anyway so I want to tell you about where I carried out my high school education...

Tulketh High School, Tag Lane, Preston, England.

I was there in the mid to late eighties. Other than white socks being a formal and required part of the male and female uniform, it was not all that bad a place. Sure, the bar was extremely low - it was a pretty poor neck of the woods. At the time we were living in nearby council assisted housing and didn't have two pennies to rub together. But there at Tulketh the teachers (for the most part - I'll get to that later) tried their very best to teach us reprobates the three R's.

I remember fondly our English, Latin and French teachers in particular being absolutely great, and my Chemistry teacher, well, he was the chap who first introduced me to a 480Z... and given my current occupation, I guess you could call what I do now the rest of his history! So a good teaching staff in a pretty shitty location, but with absolute hearts of gold. Certainly there was no $16,000 a year council tax; heck, I imagine at the time you could probably have bought our entire house for that!

For balance, just in case folks think I'm getting all wet and starry eyed, I did unfortunately have a math teacher who, in my later years at Tulketh, helped me achieve a very solid "D"...  I can tell you that "D" looked totally amazing next to all my other "A"s when I later applied to do my A levels! Anyway, I retook the math course and achieved a straight A on the second go round, once I had a teacher who actually taught me the syllabus... but as I said, the bar at Tulketh really was pretty gosh darn low. I hold no grudges - it was utterly amazing I made it to university, to be honest!

Unfortunately for Tulketh, it turned out that things did not get a whole lot better after I left the area either.  I've no background as to why this is, although I do have some personal ideas.  But mainly one.

Teachers are not paid anywhere near enough $ to stay in the profession. 

Couple that with the total reprobates (sorry, pupils) that hung out at our school when I was there, and I can hardly imagine how difficult it must have been to even get up in the morning to go to work...  Some of our classes were merely a study in chaos theory rather than anything approximating education.  I can't imagine that part ever really improved.  Certainly not for the teachers.

For example, the latest performance figures from the BBC, back in 2004, show the school ranked 90th out of a possible 93, with a 15% success rate at GCSE.  That's just the other side of absolutely grim, however you want to do the statistics...



I've not been back to that part of the UK in about 10 years; Google Maps currently shows it to be not doing all that well, and I hear the rest of the building is also all boarded up now:


They appear to have closed the school after attempting to make it a "sports centre of excellence" once it had fully failed at education in 2003 or so, and then, from what I can see, basically gave up on the whole system some time in 2009.

Very, very sad.

And worse still - it very much looks like the new rebooted version of this school, albeit at a new site with a big fancy building, in a fancier postal code, and with huge multimillion pound building investments (but probably with the very same dodgy pupils - remember, I was one ;-)), is still not doing so very well either...  Do we think the teachers all got paid more?  I doubt it.  And yet the report below STILL blames the teachers!  It can't be the twenty five million pound building, it has to be the teachers!

*sigh*

When will they ever learn?  Just pay the bloody teachers!


However in this recent and much more positive news:

"The report addressed suggestions raised including references made to Tulketh High School as a possible alternative to the proposed new secondary school.

A new secondary school is currently in phase two of the proposal, and the report said: “Tulketh High School is closed now and whether it should be retained in abeyance until such time as it might be needed is a matter for the County Council as the Local Education Authority.

“It is, however, an option that should be discussed further.”


Gives me hope that the place where I first learned to program a computer may well be able to dig itself out of the rather nasty corner it is in right now.

And for fun, I'll leave you with a picture of what prefects were given to denote their status (courtesy of Tina Kelly from a Facebook post).  I also remember holding one of these badges with great pride!  It was the only time that this little nerd could get his own back on those big bullies - basically by handing out detention slips!  Ah!  Happy days! :-)


Must say, it does have a slight imperial look to it... and I guess we know how well that kind of thing always works out - right?

Tulketh High, there will always be a place for you in my heart - you are clearly gone, but you will never, ever be forgotten, neither will all the amazing teachers there who helped me on my way!


Thursday, December 5, 2013

Watch out for those Ts and the Cs folks!

This is one for technology folks, support folks and basically anyone in customer service...

Cable Co's have a really, really bad rap these days, and not all of it is warranted, but sometimes the SLA is far larger and more important to defend than any customer relationship. It has actually become rather sad and uncomfortable for all, including the support folks. I'll try and illustrate with a personal example...

We recently moved to a "no glass" area of Massachusetts - awesome new house, but alas only copper in the area. We had FiOS connectivity for over seven years at our old place and had also recently added a TV package to our business line. That addition needed us to move packages, albeit with the same company and the same piece of glass on the wall. As per usual, there was clearly some internal company complexity whereby the same glass that provides packets for TCP needed to be reconfigured to also deliver the TV.

No biggy.

Moving day came and went (we are planning to rent out the old place, but I digress), so I called Verizon to disconnect the service. Should be no muss, no fuss, right? Wrong. We now owed a $190.00 fee to disconnect the service we had had in there for seven years. Seemingly this is because earlier this year we also paid an additional $50.00 per month on top of our existing Internet-only plan to add TV service. According to the big complicated Verizon computer, we are now considered to be under a "new" contract.

All our previous history on time payments was for naught. Poof! All gone!

After getting absolutely nowhere by phone, I took this to the twitters... I've seen people be successful with this whole new modern technology thing, so why not have a crack at it myself? After all, we are talking about one hundred and ninety bucks, and that, my friends, is a fair old number of bud lights :-)


The folks at Verizon did reply, and we had a lovely chat. At one point it really did look like we were going to get somewhere, but in the end we did not. We did have a really nice chat nevertheless; I drank tea while typing, it was lovely.


I really ought to have searched the internet before attempting this clearly futile exercise. Seemingly this type of thing happens an awful lot. I should have also checked my own weblog: different story, similar ending.

I guess at this point, I'll just pay up and remind our future tenants to check any contracts, and much like the airlines say at the end of flights, I'll remind them "that they too have choices in who they fly with".

It's all a bit of a shame really, and having seen the recent South Park satirical send up, it is only going to get way, way worse for all of us.


Tuesday, November 19, 2013

Accelerating science?

Catching up on the SC13 news via twitter last night gave me some new found focus after a rather disappointing day "optimizing" a popular MCMC algorithm in the life sciences:

Finally it looks like we are going to see some really interesting workloads used as HPC benchmarks - it's going to be a whole lot more fun than simply diagonalizing matrices...

Anyway, back to yesterday - I was disappointed, yeah? Well, we are a service shop here in the day job. We have 50+ thousand x86 CPUs in our cluster, but no matter, there are always times when it is never enough. We have a bunch of GPGPU and Phi cards being used for all sorts of science, and until yesterday I'd never really tried them in anger myself. Full disclosure: I do a lot more management these days than coding, but I still think there is life in this old dog to have a poke at code and computers every now and then ;-)

So first up - MrBayes. This is a really popular piece of code that also uses the awesome LibHMSBeagle (love the name!). From the website, Beagle is "a high-performance library that can perform the core calculations at the heart of most Bayesian and Maximum Likelihood phylogenetics packages. It can make use of highly-parallel processors such as those in graphics cards (GPUs) found in many PCs."

Fantastic - this should be cake!

So I dived right in and pulled the source and started to target one of our Phi boards:

--enable-phi

if test "$enable_phi" = yes; then
    AC_MSG_ERROR(Intel Phi not supported on this system)
fi

Well that's kinda cute eh?

Anyway, the boys and girls over at Beagle are clearly still working on it; no matter, I pushed on. Remember, I'm nowhere near the skill level of these folks - they literally write code like this to optimize MCMC and others on physical hardware. For example, this piece extracted from libhmsbeagle/CPU/BeagleCPU4StateSSEImpl.hpp for SSE is the stuff of legend:
/* Loads (transposed) finite-time transition matrices into SSE vectors */
#define SSE_PREFETCH_MATRICES(src_m1, src_m2, dest_vu_m1, dest_vu_m2) \
        const double *m1 = (src_m1); \
        const double *m2 = (src_m2); \
        for (int i = 0; i < OFFSET; i++, m1++, m2++) { \
                dest_vu_m1[i][0].x[0] = m1[0*OFFSET]; \
                dest_vu_m1[i][0].x[1] = m1[1*OFFSET]; \
                dest_vu_m2[i][0].x[0] = m2[0*OFFSET]; \
                dest_vu_m2[i][0].x[1] = m2[1*OFFSET]; \
                dest_vu_m1[i][1].x[0] = m1[2*OFFSET]; \
                dest_vu_m1[i][1].x[1] = m1[3*OFFSET]; \
                dest_vu_m2[i][1].x[0] = m2[2*OFFSET]; \
                dest_vu_m2[i][1].x[1] = m2[3*OFFSET]; \
        }

Hardcore stuff indeed!

Anyway, after a fair amount of poke and prod I did achieve nirvana: MPI MrBayes running on the Phi! Trust me it's not just as simple as "icc -mmic", and I also had to go forward without the lovely LibHMSBeagle... but it did work! I've talked about Phi ports before, even went nuts and moved Perl over to an early board last year.
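
For the curious, the native Phi build ended up looking something along these lines - a sketch from memory rather than an exact recipe, with the k1om host triple and the Intel MPI compiler wrapper being the moving parts:
# cross-compile MrBayes natively for the Phi (sketch; assumes Intel compilers and Intel MPI)
./configure --enable-mpi=yes --with-beagle=no --enable-sse=no \
    --host=x86_64-k1om-linux CC=mpiicc CFLAGS="-mmic" LDFLAGS="-mmic"
make
# then copy the resulting mb binary over to the card and run it under the card's own mpirun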

So I used this small example in the MrBayes distribution to do a looksee at some Spiny Lizards (Sceloporus magister). This is a pretty simple dataset with 123 taxa and only 1,606 bases, so a pretty bog standard run. We ran with mcmcp ngen=100000 nruns=4 nchains=4 printfreq=10000 samplefreq=1000; to keep it simple:

[jcuff@-mic0 examples]$ mpirun -np 16 ../mb ./sceloporus.nex
                            MrBayes v3.2.2 x64

                      (Bayesian Analysis of Phylogeny)

                             (Parallel version)
                         (16 processors available)

              Distributed under the GNU General Public License

               Type "help" or "help " for information
                     on the commands that are available.

                   Type "about" for authorship and general
                       information about the program.


   Executing file "./sceloporus.nex"
   DOS line termination
   Longest line length = 1632
   Parsing file
   Expecting NEXUS formatted file
   Reading data block
      Allocated taxon set
      Allocated matrix
      Defining new matrix with 123 taxa and 1606 characters
      Data is Dna
      Missing data coded as ?
      Gaps coded as -

So off we went with two identical versions, built with the configure line below so that the standard non-SSE likelihood calculator is used for division in single precision. I know there are more threads available on the Phi, but I wanted to see some quick comparisons, and because this is an MPI code it is not well suited to the highly threaded architecture on the Phi card - more threads actually make this way worse. Anyway, off we go:
[jcuff@phi src]$ ./configure --enable-mpi=yes --with-beagle=no --enable-sse=no
There were some slight runtime differences: the two-socket box managed between 2:51 and 3:35 minutes (with and without SSE respectively), but the Phi made for one very long 1 hour 11 minutes...
[jcuff@box0 examples]$ mpirun -np 16 ../mb.x86 sceloporus.nex 

Gave:
       10000 -- (-12413.627) [...15 remote chains...] -- 0:03:35 (no sse)
       10000 -- (-12416.929) [...15 remote chains...] -- 0:02:51 (with sse)

[jcuff@mic0 examples]$ mpirun -np 16 ../mb.phi sceloporus.nex 

Gave:
       10000 -- (-12375.516) [...15 remote chains...] -- 1:11:42 (no sse)

Now I know "I'M DOING THIS WRONG(tm)" - this is why blogs have comments ;-)

However, first off, it is just not that simple to get a native version of a complex code running, and I know there are many ways to speed up code on the Phi - it is, after all, a very different architecture. I wondered how our researchers would do this; they are life scientists and generally go after algorithms that work out of the box so they can get on with the rest of their day jobs. I'm seeing a 23x gap here - too much for someone of my limited skills to close.

So I toddled off to a machine with a GPGPU (well, actually four of them, but that's a story for another day) to go use LibBeagle in anger. Turns out that some of the beagle functions are not all that well documented, but we found them:
MrBayes > showbeagle

   Available resources reported by beagle library:
        Resource 0:
        Name: CPU
        Flags: PROCESSOR_CPU PRECISION_DOUBLE PRECISION_SINGLE COMPUTATION_SYNCH
             EIGEN_REAL EIGEN_COMPLEX SCALING_MANUAL SCALING_AUTO
             SCALING_ALWAYS SCALING_DYNAMIC SCALERS_RAW SCALERS_LOG
             VECTOR_NONE VECTOR_SSE THREADING_NONE

        Resource 1:
        Name: Tesla K20c
        Desc: Global memory (MB): 4800 | Clock speed (Ghz): 0.71 | Number of cores: 2496
        Flags: PROCESSOR_GPU PRECISION_DOUBLE PRECISION_SINGLE COMPUTATION_SYNCH
             EIGEN_REAL EIGEN_COMPLEX SCALING_MANUAL SCALING_AUTO
             SCALING_ALWAYS SCALING_DYNAMIC SCALERS_RAW SCALERS_LOG
             VECTOR_NONE THREADING_NONE

        Resource 2:
        Name: Tesla K20c
        Desc: Global memory (MB): 4800 | Clock speed (Ghz): 0.71 | Number of cores: 2496
        Flags: PROCESSOR_GPU PRECISION_DOUBLE PRECISION_SINGLE COMPUTATION_SYNCH
             EIGEN_REAL EIGEN_COMPLEX SCALING_MANUAL SCALING_AUTO
             SCALING_ALWAYS SCALING_DYNAMIC SCALERS_RAW SCALERS_LOG
             VECTOR_NONE THREADING_NONE

        Resource 3:
        Name: Tesla K20c
        Desc: Global memory (MB): 4800 | Clock speed (Ghz): 0.71 | Number of cores: 2496
        Flags: PROCESSOR_GPU PRECISION_DOUBLE PRECISION_SINGLE COMPUTATION_SYNCH
             EIGEN_REAL EIGEN_COMPLEX SCALING_MANUAL SCALING_AUTO
             SCALING_ALWAYS SCALING_DYNAMIC SCALERS_RAW SCALERS_LOG
             VECTOR_NONE THREADING_NONE

        Resource 4:
        Name: Tesla K20c
        Desc: Global memory (MB): 4800 | Clock speed (Ghz): 0.71 | Number of cores: 2496
        Flags: PROCESSOR_GPU PRECISION_DOUBLE PRECISION_SINGLE COMPUTATION_SYNCH
             EIGEN_REAL EIGEN_COMPLEX SCALING_MANUAL SCALING_AUTO
             SCALING_ALWAYS SCALING_DYNAMIC SCALERS_RAW SCALERS_LOG
             VECTOR_NONE THREADING_NONE

So let's compare these two:

MrBayes > set usebeagle=yes beagledevice=gpu
[jcuff@gpu01 examples]$
real    1m37.645s
user    1m36.917s
sys     0m0.154s

[1]-  Done                    time ../src/mb sceloporus.nex > out.x86

[jcuff@gpu01 examples]$
real    2m19.238s
user    1m53.320s
sys     0m19.802s

[2]+  Done                    time ../src/mb sceloporus.nex.gpu > out.gpu

Well - once again for this workload we are clearly not "Accelerating science".

I do have to say that I am, on the other hand, really looking forward to watching developments of libraries such as LibHMSBeagle. For example, once those folks get AVX and all the other magic running natively, I'm sure we will see massive increases in speed with all the accelerator boards. Until then, I'm sticking with bog standard x86 boxes and halfway decent interconnects until things calm down a bit. Remember, this is also a rather simple code base with only 200k lines of code...
[jcuff@phi mrbayes_3.2.2]$ wc -l beagle-lib-read-only/* src/* | grep total
206,791 total

Just imagine what some of the behemoths we deal with daily would be like! Which brings me back to the head of this post. It is a real shame that we have collectively chased the "linpack" dream, because based on all of the digging, poking and prodding yesterday, the highest performing box on the planet right now:
Site:          National Super Computer Center in Guangzhou
Manufacturer:  NUDT
Cores:         3,120,000
Rmax:          33,862.7 TFlop/s
Rpeak:         54,902.4 TFlop/s
Power:         17,808.00 kW
Memory:        1,024,000 GB
Interconnect:  TH Express-2
OS:            Kylin Linux
Compiler:      icc
Math Library:  Intel MKL-11.0.0
MPI:           MPICH2 with a customized GLEX channel

for this particular workload (and I'm going to argue for many others) probably could not actually help our researchers. We are seeing more and more accelerated systems in the top500 - there are at least 4 in the top 10 announced yesterday, for example! These represent huge investments, and based on the simple back-of-the-envelope example I present here, we are going to need a whole new class of researchers and scientists to program these machines to get the equivalent amount of awesome science out of them that we do out of the "bog standard".

Postscript: our Phi run also kinda did this after a few iterations:
APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
Command exited with non-zero status 11

and our CUDA run with mpirun -np 8 also caused a machine with 1TB of DRAM (!) to:
Nov 18 15:40:31 kernel: mb invoked oom-killer: gfp_mask=0x82d4
Nov 18 15:40:31 kernel: Out of memory: Kill process 4282 sacrifice child
Clearly much work still left to do folks!

In the meantime I shall embrace our new HPCG overlords!

Quote: "For example, the Titan system at Oak Ridge National Laboratory has 18,688 nodes, each with a 16-core, 32 GB AMD Opteron processor and a 6GB Nvidia K20 GPU[2]. Titan was the top- ranked system in November 2012 using HPL. However, in obtaining the HPL result on Titan, the Opteron processors played only a supporting role in the result. All floating-point computation and all data were resident on the GPUs. In contrast, real applications, when initially ported to Titan, will typically run solely on the CPUs and selectively off-load computations to the GPU for acceleration."

:-)

Thursday, October 24, 2013

Fun with iDRAC

So I was busy over at the DellXL conference for the last few days with our new friends at GWU HPC. We were all chatting about the issues with iDRAC and access to the console in a true "DevOps" way, i.e. via scripts and automation and no sad pointy clicky webby nonsense. We all agreed that the current "load the OBM website, click 'launch console', download the java jnlp file" dance was not in any way optimal.

After a fair amount of bitching and moaning, I took to PHP and CURL during the meeting to attempt to write a script to automate things... Have to admit, it's the first PHP I've written in years - it was fun. We have 1000's of machines that use iDRAC for access to KVM, so fixing this was important, as my boys and girls complain about it on a pretty much daily basis. Whenever we need access to KVM it is normally because some epic system crash has happened and we are all pretty stressed and animated - the last thing we need is foophy website / java nonsense to deal with!

For the TL;DR crowd, I totally invented the wrong wheel! The boys and girls at GWU had a *much* better idea... but more about that later!

Ok so here's the script I knocked up:
[James-Cuffs-MacBook-Pro]$ cat RemoteConsole.php 
<?php
$u              = "root";
$p              = "calvin";

$host           = $argv[1];
$ha             = explode('.',$host);
$loginUrl       = "https://$host/data/login";
$logoutUrl      = "https://$host/data/logout";
$consoleUrl     = "https://$host/viewer.jnlp($host@0@$ha[0],+Dell+RemoteConsole,+User+root";
$ustring        = "user=".$u."&password=$p";

$ch = curl_init();

curl_setopt ($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt ($ch, CURLOPT_URL, $loginUrl);
curl_setopt ($ch, CURLOPT_POST, 1);
curl_setopt ($ch, CURLOPT_POSTFIELDS, $ustring);
curl_setopt ($ch, CURLOPT_COOKIEJAR, 'cookie.txt');
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);

$store           = curl_exec ($ch);
$xml             = simplexml_load_string($store);
$forwardUrl      = $xml->forwardUrl;
list($junk, $ST) = split('\?',$forwardUrl);
$index           = "https://$host/$forwardUrl";
curl_setopt($ch, CURLOPT_URL, $index);

$content         = curl_exec ($ch);
$mili            = round(microtime(true)*1000);
$consoleUrl      = "$consoleUrl@$mili@$ST)";

curl_setopt($ch, CURLOPT_URL, $consoleUrl);

$content         = curl_exec ($ch);
file_put_contents('console.jnlp', $content);
system("open console.jnlp");
sleep(20);

curl_setopt($ch, CURLOPT_URL, $logoutUrl);
$content         = curl_exec ($ch);
curl_close ($ch); 
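
Usage was nothing more than pointing it at an iDRAC by name (the hostname here is made up):
[James-Cuffs-MacBook-Pro]$ php RemoteConsole.php idrac-node001.example.org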

I was *extremely* excited about this, and demoed it to the whole group at DellXL. Folks were suitably impressed, until Tim Wickberg at GWU said - "oh, you don't need to do any of that at all!". I was slightly taken aback, but listened on as he explained the one simple thing that makes all of this php obsolete...

Turns out that while the careful generation of this millisecond timestamp for the URL looks important, you really don't need it at all. All that scripting above (while an interesting and fun study) is mostly useless.
$mili            = round(microtime(true)*1000);
$consoleUrl      = "$consoleUrl@$mili@$ST)";

All I actually needed to do was edit those darn .jnlp files:
[James-Cuffs-MacBook-Pro]$ egrep "user|pass" console.jnlp 
   user=2114738097
   passwd=2007905771

And replace those "one time" numeric integer password/username combinations! You only need to hard code those to be root/calvin!! Doh! Now you can quickly edit the .jnlp to swap the hostname, and username/pw combination and you are golden.
[James-Cuffs-MacBook-Pro]$ egrep "user|pass" console.jnlp 
   user=root
   passwd=calvin

You have an eternally working file to always launch access to any OBM/iDRAC in your environment - so simple.
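
If you want to go one step further, a tiny wrapper that stamps a saved template with whichever host you are after does the trick. This one is entirely my own invention - console.jnlp.template and the HOSTGOESHERE marker are hypothetical names, not anything Dell ships:
#!/bin/sh
# usage: ./console.sh some-idrac-host
sed "s/HOSTGOESHERE/$1/g" console.jnlp.template > console.jnlp
open console.jnlp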

Well done team GWU - this discovery is pretty awesome and will help us script remote console/kvm access to 1000's of machines over in western mass! Many thanks!

p.s. disclaimer - please don't put your iDRAC systems on a network that folks outside of your admin team can access, or your machines will no longer belong to you. We have ours on a "dark network" that only admin staff in RC can access, with 2FA via a special VPN that only reaches the OBM and admin portions of the network! This is why we can use root/calvin with abandon!

Wednesday, October 9, 2013

Of Nobel Prizes...


Wonderful news today, fabulous day to support great science with great colleagues!


Reminds me of my first few days here at Harvard (email below) and my rather epic faux pas in messing up how to refer to CHARMM ;-) This was also the first Infiniband cluster we put in when I arrived. Things sure have grown since those early days when we were scrambling for resources. Martin was right, we did need a whole lot more compute, and now RC has over 55,000 CPUs - we almost have enough horsepower to run some halfway decent molecular dynamics! We are so proud to have been able, in our small way, to support Martin's group and his achievements - a good news day all round!

From:    Martin Karplus 
Date:    Fri, Jun 30, 2006 at 6:04 PM
Subject: Re: dellblades.
To:      James Cuff 


It was good to meet with you.  Once the Dell benchmarking is complete
(perhaps some experts from them or you can do some optimization of
CHARMM on the Dell blades beyond what we have done ourselves),
we should discuss the future.  

In the longer run, there is a need for more computer power and
I am sure we all hope that the FAS will provide some funding.

PS  The CHARMM program refers to the academic version which is
distributed by Harvard University; CHARMm refers to the
commercial version distributed by Accelrys.  Rather than
calling it "charmm", it is important to make this distinction
and for the program we use as CHARMM.

Friday, September 13, 2013

my first steps with OpenFlow...

Hanging out at our CondoOfCondos workshop in Texas...

We are clearly getting a bit old school here:

http://archive.openflow.org/wk/index.php/OpenFlow_Tutorial

I pulled the mininet OVF file from here, and imported it into VirtualBox.


I always set up the following port forwards inside of VirtualBox to point at the VM IP, which you can pull from the console above; it looks like the screen grab below. I used 127.0.0.1 here rather than my host address in case my 802.11X-provided DHCP address here at the University of Texas changes, so I bind to the internal address - remember, as Dorothy said:

"There's no place like 127.0.0.1" ;-)


This allows you to do things like the following, so you can get an xterm on the box easily without having to set up "host based" adapters:
bash-3.2$ ssh -Y -p 2222 mininet@localhost
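
If you would rather not click through the VirtualBox GUI for the forwards, the same rule can be added from the command line; a sketch, where the VM name "mininet-vm" is my assumption - use whatever your import was called, and do it while the VM is powered off:
VBoxManage modifyvm "mininet-vm" --natpf1 "ssh,tcp,127.0.0.1,2222,,22"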

Ok, now we log in from our local machine with our nifty ssh -Y -p 2222 trick, then quickly xauth merge our original user's .Xauthority so we can run commands like mn as root. The last command, xterm h1 h2 h3, brings up terminals on each of the "nodes".
bash-3.2$ hostname
wireless-10-146-144-180.public.utexas.edu

bash-3.2$ ssh -Y -p 2222 mininet@localhost
mininet@localhost's password: 
Welcome to Ubuntu 12.10 (GNU/Linux 3.5.0-17-generic x86_64)

 * Documentation:  https://help.ubuntu.com/
New release '13.04' available.
Run 'do-release-upgrade' to upgrade to it.

Last login: Fri Sep 13 07:11:12 2013 from 10.0.2.2

mininet@mininet-vm:~$ sudo su -

root@mininet-vm:~# xauth merge ~mininet/.Xauthority 

root@mininet-vm:~# mn --topo single,3 --mac --switch ovsk --controller remote
*** Creating network
*** Adding controller
Unable to contact the remote controller at 127.0.0.1:6633
*** Adding hosts:
h1 h2 h3 
*** Adding switches:
s1 
*** Adding links:
(h1, s1) (h2, s1) (h3, s1) 
*** Configuring hosts
h1 h2 h3 
*** Starting controller
*** Starting 1 switches
s1 
*** Starting CLI:

mininet> nodes
available nodes are: 
h1 h2 h3 s1 c0

mininet> xterm h1 h2 h3

Ok cool so we have a running environment - let's have a look at the controller:
mininet@mininet-vm:~$ dpctl show tcp:127.0.0.1:6634
features_reply (xid=0xed316c06): ver:0x1, dpid:1
n_tables:255, n_buffers:256
features: capabilities:0xc7, actions:0xfff
 1(s1-eth1): addr:42:d3:0e:0e:32:f8, config: 0, state:0
     current:    10GB-FD COPPER 
 2(s1-eth2): addr:ca:87:a0:f4:41:5e, config: 0, state:0
     current:    10GB-FD COPPER 
 3(s1-eth3): addr:06:35:30:84:75:c7, config: 0, state:0
     current:    10GB-FD COPPER 
 LOCAL(s1): addr:3a:ae:61:e0:0c:4d, config: 0x1, state:0x1
get_config_reply (xid=0xb4fb7b2b): miss_send_len=0

mininet@mininet-vm:~$ dpctl dump-flows tcp:127.0.0.1:6634
stats_reply (xid=0xec08f0e9): flags=none type=1(flow)


Ok, the first ping test is an epic fail - as you can see above, no flows there be!
mininet> h1 ping -c2 h2
PING 10.0.0.2 (10.0.0.2) 56(84) bytes of data.
From 10.0.0.1 icmp_seq=1 Destination Host Unreachable
From 10.0.0.1 icmp_seq=2 Destination Host Unreachable

--- 10.0.0.2 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1001ms

So let's add our very first "flow"!
mininet@mininet-vm:~$ dpctl add-flow tcp:127.0.0.1:6634 in_port=1,actions=output:2
mininet@mininet-vm:~$ dpctl add-flow tcp:127.0.0.1:6634 in_port=2,actions=output:1

and now we get...
mininet> h1 ping -c2 h2
PING 10.0.0.2 (10.0.0.2) 56(84) bytes of data.
64 bytes from 10.0.0.2: icmp_req=1 ttl=64 time=0.253 ms
64 bytes from 10.0.0.2: icmp_req=2 ttl=64 time=0.045 ms

YAY!

See lotsa ICMP flows!
mininet@mininet-vm:~$ dpctl dump-flows tcp:127.0.0.1:6634

stats_reply (xid=0x3eaa5384): flags=none type=1(flow)

  cookie=0, duration_sec=9s, duration_nsec=302000000s, table_id=0, priority=65535, n_packets=2, n_bytes=196, idle_timeout=60,hard_timeout=0,icmp,in_port=1,dl_vlan=0xffff,dl_vlan_pcp=0x00,dl_src=00:00:00:00:00:01,dl_dst=00:00:00:00:00:03,nw_src=10.0.0.1,nw_dst=10.0.0.3,nw_tos=0x00,icmp_type=8,icmp_code=0,actions=output:3

  cookie=0, duration_sec=9s, duration_nsec=302000000s, table_id=0, priority=65535, n_packets=2, n_bytes=196, idle_timeout=60,hard_timeout=0,icmp,in_port=3,dl_vlan=0xffff,dl_vlan_pcp=0x00,dl_src=00:00:00:00:00:03,dl_dst=00:00:00:00:00:01,nw_src=10.0.0.3,nw_dst=10.0.0.1,nw_tos=0x00,icmp_type=0,icmp_code=0,actions=output:1

  cookie=0, duration_sec=12s, duration_nsec=838000000s, table_id=0, priority=65535, n_packets=2, n_bytes=196, idle_timeout=60,hard_timeout=0,icmp,in_port=2,dl_vlan=0xffff,dl_vlan_pcp=0x00,dl_src=00:00:00:00:00:02,dl_dst=00:00:00:00:00:01,nw_src=10.0.0.2,nw_dst=10.0.0.1,nw_tos=0x00,icmp_type=0,icmp_code=0,actions=output:1

  cookie=0, duration_sec=11s, duration_nsec=836000000s, table_id=0, priority=65535, n_packets=1, n_bytes=98, idle_timeout=60,hard_timeout=0,icmp,in_port=1,dl_vlan=0xffff,dl_vlan_pcp=0x00,dl_src=00:00:00:00:00:01,dl_dst=00:00:00:00:00:02,nw_src=10.0.0.1,nw_dst=10.0.0.2,nw_tos=0x00,icmp_type=8,icmp_code=0,actions=output:2
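
(Should you want to go back to a clean slate, dpctl will also delete the lot for you:)
dpctl del-flows tcp:127.0.0.1:6634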

Kinda more fun, this, once you stand up an actual controller:
mininet@mininet-vm:~$ controller -v ptcp:
Sep 13 07:48:23|00001|poll_loop|DBG|[POLLIN] on fd 4:
Sep 13 07:48:23|00002|rconn|DBG|tcp: entering ACTIVE
Sep 13 07:48:23|00003|vconn|DBG|tcp:127.0.0.1:51691: 

sent (Success): hello (xid=0x3394cdce):

Sep 13 07:48:23|00004|vconn|DBG|tcp:127.0.0.1:51691: 

received: hello (xid=0xe):

Sep 13 07:48:23|00005|vconn|DBG|tcp:127.0.0.1:51691: 

negotiated OpenFlow version 0x01 (we support versions 0x01 to 0x01 inclusive,
peer no later than version 0x01)

Sep 13 07:48:23|00006|vconn|DBG|tcp:127.0.0.1:51691: 

sent (Success): features_request (xid=0xbac80804):

Sep 13 07:48:23|00007|vconn|DBG|tcp:127.0.0.1:51691: 

sent (Success): set_config (xid=0x2c509fae): miss_send_len=128

Sep 13 07:48:23|00008|poll_loop|DBG|[POLLIN] on fd 6:

Sep 13 07:48:23|00009|vconn|DBG|tcp:127.0.0.1:51691: 

received: features_reply (xid=0xbac80804): ver:0x1, dpid:1

n_tables:255, n_buffers:256

features: capabilities:0xc7, actions:0xfff

 1(s1-eth1): addr:42:d3:0e:0e:32:f8, config: 0, state:0
     current:    10GB-FD COPPER 
 2(s1-eth2): addr:ca:87:a0:f4:41:5e, config: 0, state:0
     current:    10GB-FD COPPER 
 3(s1-eth3): addr:06:35:30:84:75:c7, config: 0, state:0
     current:    10GB-FD COPPER 
 LOCAL(s1):  addr:3a:ae:61:e0:0c:4d, config: 0x1, state:0x1

And here is the output of an iperf; I'll leave it here for now:
*** Iperf: testing TCP bandwidth between h1 and h3
waiting for iperf to start up...*** Results: ['3.41 Gbits/sec', '3.41 Gbits/sec']

Looks like I might be all set for the tutorial this afternoon!



[any opinions here are all mine, and have absolutely nothing to do with my employer]
(c) 2011 James Cuff