Saturday, May 5, 2012

never underestimate the power of a genome pig

Modern genome assembly and analysis takes in, and deals with, very large data. My team runs, manages and patches literally petabytes of storage, but sometimes we need less disk and much more thought. Juggling this amount of DNA on a regular basis is really hard - we push the same data through hundreds of algorithms and heuristics on any given day, and every step is another headache for us all.

However, as a community we also tend to use compression as much as practically possible, and I often forget that gzip can be run in parallel thanks to the most awesome pigz, written by Mark Adler of JPL (I visited their offices, totally amazing!), a man who really cares about data compression. Mars is a long way away, and compression matters even more on space missions - remember we had disks like this back in the day. Mark also helped PNG become part of our culture. Wonderful algorithms!

http://zlib.net/pigz/


Check it out, it is pretty awesome - I *always* forget to use it, and it is totally brilliant!

What follows is an example on a 24 core machine looking at a 29G chunk of DNA - not an awful lot, only 135,419,082 sequences in total: 120,721,600 paired-end reads, 14,615,943 single-end reads, and a small pile of 454 reads.

Compared to most of our data sets these days this is tiny, but it is enough to have a quick look at. The analysis that comes after will take much longer!

[jcuff@sandy]$ du -sh *.fastq

47M     454_Aseg_M.fastq
29G     Aseg_all_pe.fastq
3.4G    Aseg_all_se.fastq
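
Before squashing anything, it doesn't hurt to sanity check the read counts. A quick and dirty way (assuming well-formed FASTQ with exactly four lines per record - not something to trust blindly) is just to count lines:

[jcuff@sandy]$ awk 'END { print NR/4, "reads" }' Aseg_all_se.fastq    # should agree with the 14,615,943 velvet reports below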

Now we compress those fastq files, including that fairly hefty 29G one... we use -p24 to make the "pig" run 24-way parallel. Compression here is a classic embarrassingly parallel problem: pigz chops the input into blocks and compresses them on separate threads, so it scales really well:

[jcuff@sandy]$ time ./pigz-2.2.4/pigz -p24 *.fastq

real     4m    20.8s
user     101m  19.4s
sys      0m    52.7s
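
The nice thing is you rarely need to decompress again: plenty of tools (velvet included, as you will see below) read the .gz files directly, and for the ones that do not you can always stream through unpigz (which ships with pigz) or plain old zcat. Decompression does not parallelise anywhere near as well, but it is rarely the bottleneck. A quick example of peeking at a compressed file:

[jcuff@sandy]$ unpigz -c Aseg_all_se.fastq.gz | head -4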

And behold, it is a whole lot smaller! ;-) It took only 4 minutes to compress nearly 34GB into 10GB - the 101 minutes of user time tells you the same job would have taken over an hour and a half on a single thread.

[jcuff@sandy]$ du -sh *.fastq.gz

13M     454_Aseg_M.fastq.gz
8.6G    Aseg_all_pe.fastq.gz
1.2G    Aseg_all_se.fastq.gz
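
Paranoia pays off with data this size, so before anything gets cleaned up I like to test the archives. pigz takes the same -t flag as gzip - it only proves the compressed streams are intact, not that your reads are any good:

[jcuff@sandy]$ ./pigz-2.2.4/pigz -t *.fastq.gz && echo "archives look fine"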

Now we get to run our velvet:

[jcuff@sandy]$ ../velvet/git/velvet/velveth . 31 -fastq.gz -shortPaired ./Aseg_all_pe.fastq.gz -short ./Aseg_all_se.fastq.gz -long ./454_Aseg_M.fastq.gz

[0.000001] Reading FastQ file ./Aseg_all_pe.fastq.gz
[333.662963] 120721600 reads found.
[333.662996] Done
[333.663040] Reading FastQ file ./Aseg_all_se.fastq.gz
[374.494176] 14615943 reads found.
[374.494207] Done
[374.494293] Reading FastQ file ./454_Aseg_M.fastq.gz
[375.022046] 81539 reads found.
[375.022100] Done
[375.022205] Reading read set file ./Sequences;
[400.671892] 135419082 sequences found
[558.247869] Done
[558.247938] 135419082 sequences in total.
Directory in inputSequenceArrayIntoSplayTableAndArchive .
Filename 1 in inputSequenceArrayIntoSplayTableAndArchive .
Filename 2 in inputSequenceArrayIntoSplayTableAndArchive ./Roadmaps
Outfile ./Roadmaps
[558.248432] Writing into roadmap file ./Roadmaps...
[627.369394] Inputting sequences...
[627.369458] Inputting sequence 0 / 135419082
[649.689662] Inputting sequence 1000000 / 135419082
[676.686462] Inputting sequence 2000000 / 135419082

[SNIP]

[4226.565974] Inputting sequence 133000000 / 135419082
[4250.713730] Inputting sequence 134000000 / 135419082
[4275.043031] Inputting sequence 135000000 / 135419082
[4290.560832]  === Sequences loaded in 3663.191519 s
[4290.560961] Done inputting sequences
[4290.560968] Destroying splay table
[4292.089251] Splay table destroyed

[jcuff@sandy]$ ../velvet/git/velvet/velvetg . -ins_length 200 -read_trkg yes
[0.000000] Reading roadmap file ./Roadmaps
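
For completeness, the whole dance above really is just three commands, so if you find yourself doing this a lot a tiny wrapper is all it takes. A sketch (the paths, the k-mer of 31 and the insert length of 200 are simply what I happened to use here, not recommendations):

#!/bin/bash
set -e                                   # bail out if any step fails
THREADS=24

# squash the raw fastq in parallel
time pigz -p${THREADS} *.fastq

# build the hash index straight from the gzipped reads
../velvet/git/velvet/velveth . 31 -fastq.gz \
    -shortPaired ./Aseg_all_pe.fastq.gz \
    -short ./Aseg_all_se.fastq.gz \
    -long ./454_Aseg_M.fastq.gz

# then the graph building / assembly stage
../velvet/git/velvet/velvetg . -ins_length 200 -read_trkg yes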

Update: a special mention for pbzip2, the parallel block sorting bzip2 compressor!

http://compression.ca/pbzip2/

via our good pal Roman Valls
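
Same idea, different algorithm: pbzip2 splits the file into blocks and hands them out to threads, and takes a similar -p flag for the processor count. I have not raced it against pigz on this box, so treat this as a sketch rather than a benchmark - bzip2 usually squeezes a bit harder but burns more CPU doing it:

[jcuff@sandy]$ time pbzip2 -p24 Aseg_all_se.fastq
[jcuff@sandy]$ pbzip2 -d -p24 Aseg_all_se.fastq.bz2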





[any opinions here are all mine, and have absolutely nothing to do with my employer]
(c) 2011 James Cuff