Thursday, February 24, 2011

on the train ride in...

Here's the setup with 5 x 30MB files:
for ((c=1; c<=5;c++)) do dd if=/dev/urandom of=file.$c bs=1024k count=30; done
30+0 records in
30+0 records out
31457280 bytes (31 MB) copied, 4.17672 s, 7.5 MB/s
The question: change one argument to make this faster...
time find . -type f | xargs gzip 

real    0m7.218s
user    0m6.940s
sys     0m0.250s

and the "answer":
time find . -type f | xargs -n 1 -P 5 gzip 

real    0m2.178s
user    0m7.030s
sys     0m0.170s
A very subtle "hpc" use of xargs ;-)
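
Why the "-n 1" matters: left alone, xargs packs all five filenames into a single gzip invocation, so "-P 5" has nothing to run side by side. Here is a rough sketch of the same trick in a slightly more defensive form. The scratch directory, the smaller 1MB files, the nproc lookup and the null-delimited filenames are my assumptions for illustration, not part of the original run:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical scratch directory so the sketch never touches real files.
workdir=$(mktemp -d)
cd "$workdir"

# Smaller files than the 30MB ones above, just to keep the run short.
for c in 1 2 3 4 5; do
    dd if=/dev/urandom of="file.$c" bs=1024k count=1 2>/dev/null
done

# One gzip per core; fall back to 5 (the value used above) if nproc is missing.
njobs=$(nproc 2>/dev/null || echo 5)

# -print0 / -0 keep filenames with spaces or newlines intact.
# ! -name '*.gz' stops find from picking up output that gzip has just written.
# -n 1 hands each gzip exactly one file; -P runs up to $njobs of them at once.
find . -type f ! -name '*.gz' -print0 | xargs -0 -n 1 -P "$njobs" gzip

ls
```

Dropping "-n 1" from the last pipeline should put you back at a single gzip process working through the files one after another.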

*update* We were all obsessing in our group Jabber chat over how to do this with "one" argument. That extra "-n 1" you need to get parallel xargs to run caused quite the stir. So I thought a bit more and did one better: here is an example that changes basically zero "arguments" ;-)
time find . -type f | parallel gzip 

real    0m3.033s
user    0m7.050s
sys     0m0.250s
brought to you by the power of:
wget
tar jxvf parallel-20110205.tar.bz2
cd parallel-20110205/
./configure
make -j 8 install
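
For a quick sanity check on the timings above, here is a small sketch that turns the three "real" times into speedup factors. The variable names and the speedup helper are made up for this example; only the three numbers come from the runs in this post:

```shell
#!/usr/bin/env bash
# Wall-clock ("real") times in seconds, copied from the three runs above.
serial=7.218
xargs_p=2.178
gnu_parallel=3.033

# bash has no floating point arithmetic, so awk does the division.
speedup() {
    awk -v base="$1" -v t="$2" 'BEGIN { printf "%.2f\n", base / t }'
}

echo "xargs -n 1 -P 5: $(speedup "$serial" "$xargs_p")x"
echo "parallel:        $(speedup "$serial" "$gnu_parallel")x"
```

That works out to roughly 3.3x for the xargs version and 2.4x for parallel. Note that the "user" time stays at about 7 seconds in all three runs: the total CPU work is the same, it is just spread over more cores.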

O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.

[any opinions here are all mine, and have absolutely nothing to do with my employer]
(c) 2011 James Cuff