Pulsar: Concurrent framework for Python (quantmind.github.io)
95 points by kermatt on Aug 6, 2015 | hide | past | favorite | 29 comments



I really want to use Python in a distributed computation project of mine, but I've been running into serious performance bottlenecks with regards to object serialization. Like many concurrent Python frameworks, Pulsar uses `pickle` to serialize data across processes, and for all but the smallest data structures, workers end up spending 95+% of their time (de)serializing data.

I work with video data, so I just ran these rough benchmarks on my MacBook Pro:

    import cPickle as pickle  # yes, I'm on Python 2.7.  Sue me.
    import numpy as np
    
    a = np.random.rand(800, 600, 3)  # 800x600 px, 3 RGB channels

    %timeit pickle.dumps(a)  # %timeit magic from IPython
    1 loops, best of 3: 855 ms per loop

It takes nearly a second to serialize a not-so-huge `numpy` array, which makes it very difficult to do any sort of soft real-time analysis.

This is a huge pain, and (very sadly for this Python aficionado) suggests that Python might be the wrong language for this kind of work.

Any suggestions?


You should use the highest protocol of pickle; here are the numbers on my machine:

    In [5]: %timeit pickle.dumps(a)
    1 loops, best of 3: 724 ms per loop

    In [6]: %timeit pickle.dumps(a, protocol=pickle.HIGHEST_PROTOCOL)
    100 loops, best of 3: 12.4 ms per loop
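For context on why the protocol matters so much: protocol 0 (the old Python 2 default) escapes the buffer into an ASCII representation, while protocol 2 and up embed the raw bytes. A rough sketch (Python 3 spelling, array size borrowed from the parent comment):

```python
import pickle

import numpy as np

a = np.random.rand(800, 600, 3)

# Protocol 0 writes an ASCII-escaped form of the buffer; protocol 2+
# stores the raw bytes, which is why it is so much faster.
slow = pickle.dumps(a, protocol=0)
fast = pickle.dumps(a, protocol=pickle.HIGHEST_PROTOCOL)

assert len(fast) < len(slow)  # the binary pickle is also far smaller
assert np.array_equal(pickle.loads(fast), a)
```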


Oh wow! This is more than I dared hope for, many thanks!


What's the trade off here?


Backwards compatibility. Anything pickled with HIGHEST_PROTOCOL will be unreadable in anything below Python 2.3.


It might be worth noting that if you pickle data in Python 3 then it actually has protocol versions which aren't compatible with Python 2: https://docs.python.org/3/library/pickle.html#data-stream-fo....
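If cross-version readability matters, the usual workaround is to pin protocol 2 explicitly, since it is the newest protocol Python 2 understands. A minimal sketch:

```python
import pickle

# Protocol 2 (added in Python 2.3) is the newest protocol that Python 2
# can read, so pinning it keeps pickles portable across 2.x and 3.x.
data = {"frames": [1, 2, 3]}
blob = pickle.dumps(data, protocol=2)

assert pickle.loads(blob) == data
```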


So my Ford Model T won't be able to read it?

(That is my humorous way of saying that Python 2.3 is rather old and therefore HIGHEST_PROTOCOL seems like it would be a desirable trade off for most.)


Yeah, what the hell? Even 2.5 is way too old, I generally only support 2.6+ nowadays. If something breaks compatibility with 2.5, no problem there. Hell, Python 3 has been out for years!


Hey, I'm with you. I use 3.4 on almost everything, and 2.7 at worst anywhere else. I weep for those stuck using Python 2.2.


No, I'm agreeing with you. I want to transition to 3.4 completely (mypy looks amazing), but there are some small Django libraries (third-party libs) that aren't yet compatible :(


Mypy is indeed amazing, I can't get enough of it. Throw in asyncio and pathlib and you get the reason why I could never go back.

You might look into trying to port the django libs by yourself if you have the skill to do so. 2to3 often gets you really far. If you don't, I'd definitely recommend opening a ticket on the project page. I've done that with a few libs I use, and for a couple of them the maintainers just totally forgot about them and got around to doing the conversion just because I asked.


I did that already, I just have to find some time to do the conversion. The maintainer was kind enough to assist.

How do you use MyPy? I'm particularly worried about two things:

1) If I'm building a library, I can't have MyPy as a requirement, but would still like to use it for the types checks. Is there a way to omit the import when distributing your library?

2) Can it only check parts of an application? Maybe I have a big Django app and don't want it to static-check Django and all the other imports every time, for example.


Sorry it took so long for me to get back to you; I only check HN from time to time.

1. No, afaik there's no way to omit the import. Sadly there's little in the way of macros or preprocessors in the python world at this point.

2. I'm fairly sure there are ways to use module stubs, but all my mypy work has dealt with the stdlib, which presents no issue.


Thanks for your reply! There aren't preprocessors, that's true, but I think the way MyPy does things is a bit unnecessary. They could have avoided the typing module and just used bare annotations, and made MyPy a static type checker. Still, this just avoids the dependency, which isn't such a big deal.


For comparison, I get the same sort of speed on my system:

    %timeit p = pickle.dumps(a)
    1 loops, best of 3: 1.14 s per loop

But tostring and fromstring run in a fraction of the time.

    %timeit s = a.tostring()
    1000 loops, best of 3: 1.13 ms per loop

    %timeit b = np.fromstring(s).reshape(800, 600, 3)
    1000 loops, best of 3: 1.57 ms per loop

    np.all(a == b)
    True
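One caveat with that approach: tostring()/tobytes() dumps only the raw buffer, so dtype and shape have to travel out of band (the reshape and implicit float64 above are doing exactly that). A sketch of a round trip that carries them explicitly, using tobytes, the newer spelling of tostring:

```python
import numpy as np

a = np.random.rand(800, 600, 3)

# The raw buffer alone carries no metadata, so keep dtype and shape
# alongside it (or prepend them in a small header).
raw, dtype, shape = a.tobytes(), a.dtype, a.shape

b = np.frombuffer(raw, dtype=dtype).reshape(shape)

assert np.array_equal(a, b)
```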


Check out the performance of these serializers:

https://github.com/eishay/jvm-serializers/wiki

Although they're mostly limited to Java, I know at least one (Avro) can be used in Python.


Avro seems pretty nifty, indeed. I'll see how far I get using higher protocol versions with pickle, but I'm bookmarking this for a later date.

Thanks!


Maybe you can use shared memory. You could use raw mmap.mmap(), or use the sharedctypes multiprocessing submodule (https://docs.python.org/2/library/multiprocessing.html#modul...) .

You'd allocate memory in the parent process and then you'd write to that memory from within the child processes. The Python function you're "multiprocessing" won't necessarily need to return anything.
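A minimal sketch of that pattern (the function and variable names here are invented, and the "processing" is a stand-in): the parent allocates a shared buffer, each worker views it as a numpy array and writes its rows in place, so the frame itself is never pickled:

```python
import multiprocessing as mp

import numpy as np

def fill_rows(shared, shape, start, stop):
    # Re-view the shared buffer as an ndarray inside the child process.
    frame = np.frombuffer(shared, dtype=np.float64).reshape(shape)
    frame[start:stop] = 1.0  # stand-in for real per-row processing

if __name__ == "__main__":
    shape = (800, 600, 3)
    shared = mp.RawArray("d", int(np.prod(shape)))  # raw float64 buffer

    half = shape[0] // 2
    procs = [
        mp.Process(target=fill_rows,
                   args=(shared, shape, i * half, (i + 1) * half))
        for i in range(2)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

    result = np.frombuffer(shared, dtype=np.float64).reshape(shape)
    assert result.sum() == float(np.prod(shape))
```

Only the small (buffer handle, shape, bounds) tuple crosses the process boundary; the pixel data stays put.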


Pass the data by compressed, binary file using something like bcolz or h5py?
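The same idea can be sketched with just the stdlib: compress the raw buffer before handing it off. (Random data barely compresses; real video frames with lots of structure do much better, and bcolz/h5py add chunking and fast codecs on top.)

```python
import zlib

import numpy as np

a = np.random.rand(800, 600, 3)

# Ship the buffer compressed; dtype and shape still travel out of band.
packed = zlib.compress(a.tobytes(), 1)  # level 1: fastest
b = np.frombuffer(zlib.decompress(packed), dtype=a.dtype).reshape(a.shape)

assert np.array_equal(a, b)
```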


This is interesting; I'm also looking for an Actor framework for Python to do work with video.

It would be good if this could use nanomsg and its zero-copy mechanism instead of just raw sockets.


Pulsar is 100% written on top of asyncio. In this respect it is not dissimilar to Tornado or Twisted, which also use an event loop for their asynchronous implementation (each with its own event loop and Future classes).

However, pulsar is built on top of the standard lib with all the benefits that that brings, especially in view of the changes in python 3.5 (https://www.python.org/dev/peps/pep-0492/).

The actor model in pulsar refers to the parallel side of the asynchronous framework. This is where pulsar differs from twisted for example. In pulsar each actor (think of a specialised thread or process) has its own event loop. In this way any actor can run its own asynchronous server for example.

Tornado is an asynchronous web framework; pulsar is not. You can use any web framework and run it on pulsar's WSGI application, and you can use pulsar to create any other socket application, not just HTTP.


I am not sure whether Pulsar could come anywhere near the speed/throughput of a truly concurrent system like the JVM, Go, Haskell, or Erlang. I don't see benchmarks posted on Pulsar's website, but I came across this discussion [1] where aphyr and KiranDave discuss why Clojure/Erlang's model of explicit concurrency (which enables parallelism) is superior to NodeJS's (implicit concurrency via libuv, and multi-process parallelism via clusters).

From reading the "design" section on the website, it looks to me like Pulsar is an attempt to replicate NodeJS (?) and by extension can't compete with languages that have truly concurrent runtimes?

[1] https://news.ycombinator.com/item?id=4306241


Great link - thanks for posting!

Pulsar can be configured to use either processes or threads. If you use processes, you pay the IPC penalty, just like NodeJS does. If you use threads, you pay the GIL penalty.

So yeah, it won't be as fast as BEAM or the JVM either way.


Is this related to the fibers/channels/actor framework by the same name for Clojure?

https://github.com/puniverse/pulsar


Good question. The name is the same and the goals of both projects are similar. The implementation is very different, though.


Oh, another one. Doesn't the fact that there is a new concurrent framework for Python nearly every single month point to the existence of a rather large pachyderm in the room?



How does it compare/contrast with Tornado?


Looks like an actor framework built directly on top of stdlib asyncio, designed to use multiprocessing first rather than Tornado's cooperative multitasking.



