Another way to implement MapReduce in Python is to use the Hadoop Streaming jar (http://hadoop.apache.org/common/docs/r0.20.1/streaming.html). You basically write the mapper and reducer as Python scripts and ship them out with the job in a tarball. The scripts just need to read input from stdin and write to stdout. When you kick off the job from the shell, you pass arguments to the streaming jar telling it to invoke your Python scripts for the map and reduce tasks.
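To make that concrete, here is a minimal word-count sketch in the Streaming style (the script names, paths, and jar location are illustrative): the mapper and reducer are plain Python scripts that read stdin and write tab-separated key/value lines to stdout, and the streaming jar wires them into the job.

#!/usr/bin/env python
# mapper.py -- minimal word-count mapper for Hadoop Streaming
import sys

for line in sys.stdin:
    for word in line.split():
        sys.stdout.write("%s\t1\n" % word)   # emit "word<TAB>1"

#!/usr/bin/env python
# reducer.py -- Streaming sorts the mapper output by key before the reducer sees it
import sys

current, count = None, 0
for line in sys.stdin:
    word, _, n = line.rstrip("\n").partition("\t")
    if word != current:
        if current is not None:
            sys.stdout.write("%s\t%d\n" % (current, count))
        current, count = word, 0
    count += int(n)
if current is not None:
    sys.stdout.write("%s\t%d\n" % (current, count))

# kicked off with something along these lines (exact jar path varies by install):
# hadoop jar hadoop-streaming.jar \
#     -input /data/in -output /data/out \
#     -mapper mapper.py -reducer reducer.py \
#     -file mapper.py -file reducer.py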
This is pretty cool, but be aware that he is using marshal to send function objects to the remote systems, so you are marrying yourself to its restrictions as well: http://docs.python.org/library/marshal.html
It'll work great for simple use cases, but since he also does nothing to capture dependencies, your map and reduce functions had better stick to standard Python objects, or to objects you are sure will be installed on the remote systems (see the marshal round trip sketched below).
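For anyone who hasn't poked at marshal directly, here is a rough sketch of what that round trip looks like (the names are illustrative, not the project's actual API):

import marshal, types

def mapper(record):
    return record.upper()

# "sender" side: marshal can serialize the code object, but not the
# function's globals, default arguments, or closure
blob = marshal.dumps(mapper.__code__)

# "receiver" side: must run the same CPython version, because the marshal
# format and the bytecode it contains are version-specific
code = marshal.loads(blob)
rebuilt = types.FunctionType(code, globals(), "mapper")
print(rebuilt("hello"))   # -> HELLO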
Yes, that would make sense. The site I got this from (http://the.taoofmac.com) mentions using PyPy on all machines to run it (which is effectively Python 2.7.1 at this point), but then again, you have to match versions very closely when you're using Hadoop and HDFS anyway... Python is actually easier to deploy in matched runtimes across a cluster :)
Unfortunately, Python doesn't let you serialize a function's code through pickle or other non-marshal serializers. The only cross-version solution is to have Python read the source code and send that (rather than the bytecode) over the wire, which seems much more brittle to me.
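Here is a minimal sketch of that source-based alternative, assuming the function lives in a real file (inspect can't recover source typed at an interactive prompt) and doesn't close over anything you aren't also shipping:

import inspect

def mapper(record):
    return record.strip().lower()

source = inspect.getsource(mapper)   # plain text, Python-version-agnostic

# receiver side: rebuild the function by exec'ing the source
namespace = {}
exec(source, namespace)
rebuilt = namespace["mapper"]
print(rebuilt("  HELLO  "))          # -> hello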
Oh yeah, if you run this script under different Python versions on the two ends, you will probably hit mysterious failures: the marshal format and the CPython bytecode it carries change between versions.
In some limited testing I did, I was able to serialize a function on 64-bit Linux and have it execute on 32-bit Windows. I even went crazy and wrote a function using ctypes against the Win32 API on Linux, then sent it as a string to be loads()ed by code on Windows... and that worked.
(This isn't really Python being awesome so much as bytecode languages working as designed... but as far as I know this all works by happy coincidence, unlike, say, Java, where the bytecode is designed to be cross-platform...)
Pickle can't represent a function's code; it only records a reference to the function by module and name. Pickle is higher level, and (I theorize) its authors realized the quandary you can work yourself into with function serialization. You could use pickle, you would just need a more complicated client installation, since the same code has to already exist on the client.
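A quick illustration of that point: pickle stores an ordinary function as a reference (module plus name) rather than as code, and refuses lambdas outright, so the receiving side still needs the same module importable.

import pickle, pickletools

def mapper(record):
    return record.upper()

blob = pickle.dumps(mapper)
pickletools.dis(blob)   # the payload is just a pointer to __main__.mapper, no bytecode

try:
    pickle.dumps(lambda x: x + 1)
except Exception as e:
    print("lambda:", e)   # PicklingError: no importable name to point at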
Additionally, since clients are directly executing code that the server provides, it would be nice (in a finished product, of course) to see some validation of the code blobs before executing them.
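One possible shape for that (entirely hypothetical, not something this project does today): have the server sign each marshaled blob with a shared secret and make workers verify the HMAC before loads()ing it. That only proves the blob came from your server, of course, not that the code inside is safe.

import hashlib, hmac, marshal, types

SECRET = b"shared-cluster-secret"   # assumption: distributed to workers out of band

def sign(blob):
    return hmac.new(SECRET, blob, hashlib.sha256).digest() + blob

def verify_and_load(payload):
    mac, blob = payload[:32], payload[32:]          # sha256 digest is 32 bytes
    expected = hmac.new(SECRET, blob, hashlib.sha256).digest()
    if not hmac.compare_digest(mac, expected):
        raise ValueError("code blob failed validation; refusing to execute")
    return types.FunctionType(marshal.loads(blob), globals())

def mapper(record):
    return record.upper()

payload = sign(marshal.dumps(mapper.__code__))
print(verify_and_load(payload)("ok"))   # -> OK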
I'd love to see one of these Python MapReduce approaches get some traction; until then I expect most companies will keep following Facebook's approach of putting data into HDFS/Hive and then using Python, or whatever, to parse the output of HiveQL.