Friday, May 18, 2012

#filmquotebioinformatics - of busted clusters and software

Much hilarity on twitter over the last day with #filmquotebioinformatics:

https://twitter.com/#!/search/%23filmquotebioinformatics

I had a little crack at it myself, mainly because I would never want to miss an opportunity to wind up Captain Birney with a well-placed reference to Mrs Robinson ;-)


The best thing is looking at just how many "bioinformatics" film quotes reference actual physical kit or hardware/software issues. Michele (@MicheleClamp) and I were walking back from a meeting with folks at HSPH today, pondering the question "what is an informatics core?".

The interesting thing this fun Friday film experiment shows me, at least, is that Research Computing and Informatics are, and have to be, very tightly coupled entities!

In particular, some of the standouts I loved:

Keith Bradnam@kbradnam
We're going to need a bigger hard drive
Mick Watson@BioMickWatson
"you can't handle the data!"
Oliver Hofmann@fiamh
@fredebibs: Remember, with big cluster comes great maintenance activity
Nick Loman@pathogenomenick
It's an error message. It means your 1024 node, 3 month long job sleeps with the fishes
Jason Stajich ‏‏@jasonstajich
The cluster, it goes on forever - My God it's full of jobs.

and my all time fave:
Deanna Church ‏@deannachurch
Go ahead punk, run make.

hehehe! So it turns out computers are an important part of informatics after all - which is a nice thought to have on a Friday afternoon ;-)

Saturday, May 5, 2012

never underestimate the power of a genome pig

This blog post was inspired by:



Modern genome assembly and analysis takes in, and deals with, very large data. My team runs, manages and patches literally petabytes of storage, but sometimes we need less disk and much more thought. It is really hard juggling this amount of DNA on a regular basis - we push this same data through hundreds of algorithms and heuristics on any given day, and every step is just another headache for us all.

However, as a community we also tend to use compression as much as practically possible, and I often forget that gzip can be run in parallel thanks to the most awesome pigz, written by Mark Adler of JPL (I visited their offices, totally amazing!), someone who really cares about data compression! Mars is a long way away, and compression on space missions matters even more - remember we had disk like this back in the day. Mark also helped PNG become part of our culture. Wonderful algorithms!

http://zlib.net/pigz/


Check it out, it is pretty awesome - I *always* forget to use it, and it is totally brilliant!

What follows is an example on a 24 core machine looking at a 29G chunk of DNA - not an awful lot, only 135,419,082 sequences in total (120,721,600 paired-end reads, 14,615,943 single-end reads and 81,539 long 454 reads).

Compared to most of our data sets these days this is tiny, but it is enough data for a quick look. The analysis that follows will take much longer!

[jcuff@sandy]$ du -sh *.fastq

47M     454_Aseg_M.fastq
29G     Aseg_all_pe.fastq
3.4G    Aseg_all_se.fastq
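
As a quick sanity check on those read counts, here is a minimal sketch of how one might count records - this assumes standard four-line FASTQ entries with no wrapped sequence lines, so the record count is simply the line count divided by four:

[jcuff@sandy]$ wc -l Aseg_all_pe.fastq | awk '{print $1/4}'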

Now we compress those fairly hefty files (around 32GB in total)... we use -p24 to make the "pig" run 24-way parallel. Gzip compression is a classic embarrassingly parallel problem - each block of input can be squashed independently - so it works really well with the trivial threading that pigz applies:

[jcuff@sandy]$ time ./pigz-2.2.4/pigz -p24 *.fastq

real     4m    20.8s
user     101m  19.4s
sys      0m    52.7s

And behold, it is a whole lot smaller! ;-) It took only 4 minutes of wall clock time to compress roughly 32GB down to under 10GB; the 101 minutes of user time tell you a single thread would have needed well over an hour and a half (roughly a 23x speedup on the 24 cores).

[jcuff@sandy]$ du -sh *.fastq.gz

13M     454_Aseg_M.fastq.gz
8.6G    Aseg_all_pe.fastq.gz
1.2G    Aseg_all_se.fastq.gz
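
The compressed files drop straight into the rest of the pipeline (velvet reads the .fastq.gz files directly below), and if you ever need the plain FASTQ back, pigz decompresses too - a minimal sketch (note that decompression in pigz is mostly single threaded, so don't expect the same speedup):

[jcuff@sandy]$ ./pigz-2.2.4/pigz -d Aseg_all_pe.fastq.gz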

Now we get to run our velvet:

[jcuff@sandy]$ ../velvet/git/velvet/velveth . 31 -fastq.gz -shortPaired ./Aseg_all_pe.fastq.gz -short ./Aseg_all_se.fastq.gz -long ./454_Aseg_M.fastq.gz

[0.000001] Reading FastQ file ./Aseg_all_pe.fastq.gz
[333.662963] 120721600 reads found.
[333.662996] Done
[333.663040] Reading FastQ file ./Aseg_all_se.fastq.gz
[374.494176] 14615943 reads found.
[374.494207] Done
[374.494293] Reading FastQ file ./454_Aseg_M.fastq.gz
[375.022046] 81539 reads found.
[375.022100] Done
[375.022205] Reading read set file ./Sequences;
[400.671892] 135419082 sequences found
[558.247869] Done
[558.247938] 135419082 sequences in total.
Directory in inputSequenceArrayIntoSplayTableAndArchive .
Filename 1 in inputSequenceArrayIntoSplayTableAndArchive .
Filename 2 in inputSequenceArrayIntoSplayTableAndArchive ./Roadmaps
Outfile ./Roadmaps
[558.248432] Writing into roadmap file ./Roadmaps...
[627.369394] Inputting sequences...
[627.369458] Inputting sequence 0 / 135419082
[649.689662] Inputting sequence 1000000 / 135419082
[676.686462] Inputting sequence 2000000 / 135419082

[SNIP]

[4226.565974] Inputting sequence 133000000 / 135419082
[4250.713730] Inputting sequence 134000000 / 135419082
[4275.043031] Inputting sequence 135000000 / 135419082
[4290.560832]  === Sequences loaded in 3663.191519 s
[4290.560961] Done inputting sequences
[4290.560968] Destroying splay table
[4292.089251] Splay table destroyed

[jcuff@sandy]$ ../velvet/git/velvet/velvetg . -ins_length 200 -read_trkg yes
[0.000000] Reading roadmap file ./Roadmaps

Update: a special mention too for pbzip2, the parallel block-sorting bzip2 compressor!

http://compression.ca/pbzip2/

via our good pal Roman Valls
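
I have not re-run the benchmark above with it, but usage is much the same idea - a minimal sketch, assuming pbzip2 is installed and you want all 24 cores (decompression is only parallel for files that were compressed with pbzip2 in the first place):

[jcuff@sandy]$ pbzip2 -p24 Aseg_all_pe.fastq
[jcuff@sandy]$ pbzip2 -d -p24 Aseg_all_pe.fastq.bz2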





[any opinions here are all mine, and have absolutely nothing to do with my employer]
(c) 2011 James Cuff