I really want to use Python for a distributed computation project of mine, but I've been running into serious performance bottlenecks around object serialization. Like many concurrent Python frameworks, Pulsar uses `pickle` to serialize data across processes, and for all but the smallest data structures, workers end up spending 95+% of their time (de)serializing data.
I work with video data, so I just ran these rough benchmarks on my MacBook Pro:
import cPickle as pickle  # yes, I'm on Python 2.7. Sue me.
import numpy as np
a = np.random.rand(800, 600, 3)  # one 800x600 px RGB frame
%timeit pickle.dumps(a)  # %timeit magic from IPython
1 loops, best of 3: 855 ms per loop
It takes nearly a second to serialize a not-so-huge `numpy` array, which makes it very difficult to do any sort of soft real-time analysis.
This is a huge pain, and (very sadly for this Python aficionado) suggests that Python might be the wrong language for this kind of work. Any suggestions?
You should use pickle's highest protocol. Here are the numbers on my machine:
In [5]: %timeit pickle.dumps(a)
1 loops, best of 3: 724 ms per loop
In [6]: %timeit pickle.dumps(a, protocol=pickle.HIGHEST_PROTOCOL)
100 loops, best of 3: 12.4 ms per loop
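For reference: the default is protocol 0, which serializes to ASCII for backward compatibility, while HIGHEST_PROTOCOL is 2 on Python 2.x and uses a binary format that handles large buffers far more efficiently. A minimal sketch of the same comparison:

import cPickle as pickle
import numpy as np

a = np.random.rand(800, 600, 3)

# Protocol 0 (the default) is ASCII-based for backward compatibility;
# protocol 2 (= HIGHEST_PROTOCOL on 2.x) is binary and much faster
# for large binary payloads like numpy arrays.
slow = pickle.dumps(a)
fast = pickle.dumps(a, protocol=pickle.HIGHEST_PROTOCOL)
assert np.all(pickle.loads(fast) == a)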
(That is my humorous way of saying that protocol 2 has been available since Python 2.3, which is rather old by now, so HIGHEST_PROTOCOL seems like a desirable trade-off for most.)
Yeah, what the hell? Even 2.5 is way too old, I generally only support 2.6+ nowadays. If something breaks compatibility with 2.5, no problem there. Hell, Python 3 has been out for years!
No, I'm agreeing with you. I want to transition to 3.4 completely (mypy looks amazing), but there are some small Django libraries (third-party libs) that aren't yet compatible :(
Mypy is indeed amazing, I can't get enough of it. Throw in asyncio and pathlib and you get the reason why I could never go back.
You might look into porting the Django libs yourself if you have the skill to do so; 2to3 often gets you really far. If you don't, I'd definitely recommend opening a ticket on the project page. I've done that with a few libs I use, and for a couple of them the maintainers had simply forgotten about them and got around to the conversion just because I asked.
I did that already, I just have to find some time to do the conversion. The maintainer was kind enough to assist.
How do you use MyPy? I'm particularly worried about two things:
1) If I'm building a library, I can't have MyPy as a requirement, but I would still like to use it for type checking. Is there a way to omit the import when distributing the library?
2) Can it only check parts of an application? Maybe I have a big Django app and don't want it to static-check Django and all the other imports every time, for example.
Thanks for your reply! There aren't preprocessors, that's true, but I think the way MyPy does things is a bit unnecessary. They could have avoided the typing module and just used bare annotations, and made MyPy a static type checker. Still, this just avoids the dependency, which isn't such a big deal.
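For what it's worth on 1): annotations don't make MyPy a runtime dependency at all; the typing module lives on PyPI (and in the stdlib from 3.5 on), and MyPy only looks at the annotations when you run it separately. A minimal sketch:

from typing import List

def chunk(items: List[int], size: int) -> List[List[int]]:
    # These annotations are ordinary runtime objects; nothing here
    # imports or requires MyPy itself. The checker only reads them
    # when you invoke it, e.g. `mypy mylib/`.
    return [items[i:i + size] for i in range(0, len(items), size)]

On 2): as far as I know you can pass the checker just your own files or packages rather than your whole dependency tree; how aggressively it follows imports is configurable.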
For comparison, I get the same sort of speed on my system:
%timeit p = pickle.dumps(a)
1 loops, best of 3: 1.14 s per loop
But `tostring` and `fromstring` run in a fraction of the time:
%timeit s = a.tostring()
1000 loops, best of 3: 1.13 ms per loop
%timeit b = np.fromstring(s).reshape(800, 600, 3)
1000 loops, best of 3: 1.57 ms per loop
np.all(a == b)
True
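One caveat: `tostring` drops the dtype and shape, so they have to travel alongside the raw bytes. A rough sketch of a round trip that carries them along (the helper names are made up):

import numpy as np

def fast_dumps(arr):
    # tostring() yields only the raw buffer, so ship dtype and shape too.
    return arr.dtype.str, arr.shape, arr.tostring()

def fast_loads(dtype, shape, raw):
    return np.fromstring(raw, dtype=dtype).reshape(shape)

a = np.random.rand(800, 600, 3)
b = fast_loads(*fast_dumps(a))
assert np.all(a == b)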
You'd allocate memory in the parent process and then you'd write to that memory from within the child processes. The Python function you're "multiprocessing" won't necessarily need to return anything.
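A rough sketch of that pattern with the stdlib multiprocessing module (the shape and dtype here are just for illustration):

import multiprocessing as mp
import numpy as np

SHAPE = (800, 600, 3)

def worker(buf, shape):
    # Re-wrap the shared buffer as a numpy array: no copy is made,
    # and the frame data itself is never pickled.
    frame = np.frombuffer(buf, dtype=np.float64).reshape(shape)
    frame *= 0.5  # in-place writes are visible to the parent

if __name__ == '__main__':
    # RawArray allocates the memory once, in the parent process.
    buf = mp.RawArray('d', SHAPE[0] * SHAPE[1] * SHAPE[2])
    frame = np.frombuffer(buf, dtype=np.float64).reshape(SHAPE)
    frame[:] = np.random.rand(*SHAPE)
    p = mp.Process(target=worker, args=(buf, SHAPE))
    p.start()
    p.join()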
Pulsar is written 100% on top of asyncio. In this respect it is not dissimilar to Tornado or Twisted, which also use an event loop for their asynchronous implementation (albeit with their own event loop and Future classes).
However, Pulsar is built on top of the standard library, with all the benefits that brings, especially in view of the changes coming in Python 3.5 (https://www.python.org/dev/peps/pep-0492/).
The actor model in Pulsar refers to the parallel side of the asynchronous framework, and this is where Pulsar differs from Twisted, for example. In Pulsar, each actor (think of a specialised thread or process) has its own event loop, so any actor can run its own asynchronous server.
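This isn't Pulsar's actual API, just the shape of the idea: each worker process owns a private event loop and can therefore host its own asynchronous server:

import asyncio
import multiprocessing as mp

def actor_main(port):
    # Each "actor" process creates and owns its own event loop,
    # independent of every other actor's loop.
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)

    async def handle(reader, writer):
        data = await reader.read(1024)
        writer.write(data.upper())
        await writer.drain()
        writer.close()

    loop.run_until_complete(asyncio.start_server(handle, '127.0.0.1', port))
    loop.run_forever()

if __name__ == '__main__':
    for port in (8001, 8002):
        mp.Process(target=actor_main, args=(port,)).start()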
Tornado is an asynchronous web framework; Pulsar is not. You can take any web framework and run it on a Pulsar WSGI application, and you can use Pulsar to create any other kind of socket application, not just HTTP.
I am not sure Pulsar would come anywhere near the speed/throughput of a truly concurrent system like the JVM, Go, Haskell, or Erlang. I don't see benchmarks posted on Pulsar's website, but I came across this discussion [1] where aphyr and KiranDave discuss why Clojure/Erlang's model of explicit concurrency (which enables parallelism) is superior to NodeJS's (implicit concurrency via libuv, and multi-process parallelism via clusters).
From reading the "design" section of the website, it looks to me like Pulsar is an attempt to replicate NodeJS (?), and by extension cannot compete with languages that have truly concurrent runtimes?
Pulsar can be configured to use either processes or threads. If you use processes, you pay the IPC penalty, just like NodeJS does. If you use threads, you pay the GIL penalty.
So yeah, it won't be as fast as BEAM or the JVM either way.
Oh, another one. Doesn't the fact that there is a new concurrent framework for Python nearly every single month point to the existence of a rather large pachyderm in the room?
Looks like an actor framework, built directly on top of stdlib asyncio, and designed to use multiprocessing first instead of Tornado's cooperative multitasking.