Wednesday, April 20, 2011

velocity in research computing *really* matters

This morning we saw a ticket in RT from Mario Juric, who is doing some amazing work on our cluster as a Hubble Fellow over at the Center for Astrophysics. Mario had found a problem with 2.7.1, our production version of Python.

Mario's request was super simple (stuff we see all the time):
I'm using Python 2.7 installed on Odyssey. Two days ago I've managed to
trace a serious bug in the implementation of one of its components that
caused our codes to (on very rare occasions) produce erroneous results.
The bug has been reported upstream as:

I'd really appreciate if this could be done. Otherwise, we'd have to build
a separate version of Python (and numpy, and scipy, ...) just to get this
fix in.

Mario also gave us the full patch details: where to find information about the error, and what type of error it is. Here is the ticket history from Python HQ:
Date                    User          Action  Args
2011-04-20 06:39:51     rhettinger    set     status: open -> closed
2011-04-20 00:19:26     python-dev    set     messages: + msg134110
2011-04-20 00:01:12     python-dev    set     messages: + msg134109
2011-04-19 21:08:45     durban        set     nosy: + durban
2011-04-19 20:58:09     amaury        set     status: closed -> open
2011-04-19 20:26:37     mjuric        set     messages: + msg134096
2011-04-19 19:08:53     rhettinger    set     status: open -> closed
2011-04-19 18:15:34     python-dev    set     messages: + msg134085
2011-04-19 16:55:00     python-dev    set     nosy: + python-dev
2011-04-19 09:34:04     rhettinger    set     messages: + msg134027
2011-04-19 08:46:24     rhettinger    set     assignee: rhettinger
2011-04-19 08:28:24     mjuric        create 

This was all based on Mario's original filing:
The implementation of OrderedDict.__reduce__() in Python 2.7.1 is not thread safe because of the following four lines:

tmp = self.__map, self.__root
del self.__map, self.__root
inst_dict = vars(self).copy()
self.__map, self.__root = tmp

If one thread is pickling an OrderedDict while another accesses it, a race condition occurs if the accessing thread accesses the dict after self.__map and self.__root have been deleted, and before they've been set again (above).
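
To make the window in Mario's filing concrete, here is a minimal Python sketch. It is not the real OrderedDict code: `ToyOrdered`, `reduce_unsafe`, `reduce_safe`, `in_window` and `probe` are hypothetical names made up for illustration, and single-underscore attributes stand in for the name-mangled `__map` and `__root`. The `in_window` callback stands in for an unlucky thread switch, so the race shows up every time rather than on very rare occasions. The safe variant is just one possible way to close the window, assuming a copy-then-prune approach is acceptable.

```python
import threading


class ToyOrdered(object):
    """Hypothetical stand-in for the internals quoted in the filing."""

    def __init__(self):
        self._map = {}    # stands in for self.__map
        self._root = []   # stands in for self.__root

    def reduce_unsafe(self, in_window=None):
        # Same shape as the four quoted lines: the attributes are
        # briefly deleted from the live object, then put back.
        tmp = self._map, self._root
        del self._map, self._root
        if in_window is not None:
            in_window()   # simulates another thread running right here
        inst_dict = vars(self).copy()
        self._map, self._root = tmp
        return inst_dict

    def reduce_safe(self):
        # Copy first, then prune the private keys from the copy only,
        # so the live object is never left half-dismantled.
        inst_dict = vars(self).copy()
        for key in ("_map", "_root"):
            inst_dict.pop(key, None)
        return inst_dict


def probe(obj):
    """Ask a second thread whether it can currently see obj._map."""
    seen = []
    reader = threading.Thread(
        target=lambda: seen.append(hasattr(obj, "_map")))
    reader.start()
    reader.join()
    return seen[0]


d = ToyOrdered()
d.extra = 1   # an ordinary instance attribute that should be pickled

observed = []
state = d.reduce_unsafe(in_window=lambda: observed.append(probe(d)))

print(observed)                   # what the reader thread saw mid-reduce
print(state)                      # the instance dict that would be pickled
print(d.reduce_safe(), probe(d))  # same state, attributes never removed
```

In the unsafe version the reader thread finds `_map` missing while the other thread is mid-way through the reduce; the safe variant only ever edits its own private copy.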

Our own ticket flow, shown below, picks up right after Raymond Hettinger closed out the ticket at Python HQ on 20th April at 6:39am. At this point I started thinking about it some more, and realized that this was not an ordinary patch-a-piece-of-code kind of ticket...
Wed Apr 20 08:56:20 2011 Mario Juric  - Ticket created 
Wed Apr 20 09:10:52 2011 Chris Walker - Taken 
Wed Apr 20 09:19:12 2011 Chris Walker - Correspondence added
Wed Apr 20 09:36:59 2011 Mario Juric  - Correspondence added 

Seriously, one conversation: [broken] -> [fixed?] -> [yeah fixed!]


So what is important here? Why am I making so much fuss about a patch, the type of which happens every single day in the open source world and in our research community?

1: our researchers are amazing. Finding that level of threading issue deep inside a code base is not trivial!

2: open source community bug tracking blows proprietary release cycles out of the water!

3: the open source community takes issues seriously, they are not messing about!

4: if you are not fast, proactive and aware in research computing you are dead! the science community has no need for you - they can do all this IT stuff themselves - remember they can find galaxies and collide particles and annotate the human genome - sticking a few computers together is just cake and biscuit to these boys and girls!

Think about how long such a thread-safety change to a core library would take to be approved in any production enterprise IT shop. There is no chance that such seat-of-the-pants patching would be considered either sensible or practical on a production floor. However, in RC this is exactly what we need to do.

So that is why research computing is different, at least to me and my team!

(c) 2018 James Cuff