
> Programs that just divide up work across processes are much easier to write without introducing obscure bugs due to the lack of atomicity.

You often don't even need to do this yourself. GNU parallel is the way to go for dividing work up amongst CPU cores. Why reinvent the wheel?

I agree with you that threads are talked about way more than they should be. It's like all programmers learn this one simple rule: to be fast you have to be multi-threaded. It's really not the case. There is also massive confusion amongst programmers on the difference between concurrency and parallelism. I sometimes ask applicants to describe the difference and few can. Python is fine at concurrency if that's all you want to do.
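To illustrate the concurrency-vs-parallelism point: a minimal sketch showing that Python threads are fine for concurrency, i.e. overlapping I/O-style waits, even though the GIL prevents CPU parallelism. The `wait_io` helper and the 0.2 s delay are made up for the illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def wait_io(_):
    time.sleep(0.2)  # stands in for an I/O wait, which releases the GIL
    return 1

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(wait_io, range(4)))
elapsed = time.perf_counter() - start

# The four 0.2 s waits overlap, so the total is roughly 0.2 s, not 0.8 s:
# that's concurrency. A CPU-bound loop in wait_io would see no speedup.
```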




> GNU parallel is the way to go for dividing work up amongst CPU cores. Why reinvent the wheel?

Because most problems are not the embarrassingly parallel kind suitable for use with GNU parallel. For example, any problems that require some communication between the individual tasks.
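For example, a task shape GNU parallel can't express: workers that exchange intermediate results through shared queues. A rough sketch with the stdlib (the `worker` function and the squaring workload are invented for illustration; the "fork" start method is assumed, as on Linux — "spawn" platforms would need a `__main__` guard):

```python
import multiprocessing as mp

def worker(in_q, out_q):
    # Pull items until a sentinel arrives, pushing results back.
    while True:
        item = in_q.get()
        if item is None:
            break
        out_q.put(item * item)

ctx = mp.get_context("fork")
in_q, out_q = ctx.Queue(), ctx.Queue()
procs = [ctx.Process(target=worker, args=(in_q, out_q)) for _ in range(2)]
for p in procs:
    p.start()
for n in range(5):
    in_q.put(n)
for _ in procs:
    in_q.put(None)  # one sentinel per worker
results = sorted(out_q.get() for _ in range(5))
for p in procs:
    p.join()
print(results)  # [0, 1, 4, 9, 16]
```

The queues are the part GNU parallel has no equivalent for: it can fan work out to independent commands, but it gives the tasks no channel to talk to each other.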


I don't think the parent was proposing reinventing the wheel, Python has straightforward process parallelism support in the 'multiprocessing' library and for Python that's generally a better idea than GNU Parallel, IMO.
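Something like this is all it takes — a hedged sketch of stdlib process parallelism with `multiprocessing.Pool` (the `square` workload is invented; the "fork" start method is assumed, as on Linux, which is why no `__main__` guard is shown):

```python
from multiprocessing import get_context

def square(n):
    return n * n

ctx = get_context("fork")
with ctx.Pool(processes=4) as pool:
    # Work is divided across worker processes, no external tool needed.
    results = pool.map(square, range(10))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```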


The advantage of GNU parallel is it's a standard tool that works for any non-parallel process. This has all the usual advantages of following the Unix principle.


> that works for any non-parallel process

No, it doesn't. Only for processes where you can trivially split the input and concatenate the outputs. Try using GNU parallel to sort a list of numbers, or to compute their prefix sum – it's not possible, and those are even simpler use cases than most of what you'll encounter in practice.


Oh come on. It should be obvious that I'm talking about the processes that can be split up in that way. Those problems are so common that someone literally wrote GNU parallel to solve them.


> I'm talking about the processes that can be split up in that way

No, you weren't. You said: "[...] GNU parallel [...] works for any non-parallel process" (emphasis mine)

> Those problems are so common that someone literally wrote GNU parallel to solve them.

As part of my job I write multi-threaded, parallel programs all the time, and in all those years only a single problem would have been feasible to parallelize with GNU parallel; but since I was using Rust, it was trivial to do the parallelization right there in my code without having to resort to an outer script/binary that calls GNU parallel on my program.


> Try using GNU parallel to sort a list of numbers,

`parsort` is part of GNU Parallel.


... and it uses a manually implemented post-processing step. You can't just run the sort program with GNU parallel and expect to get a fully sorted list.
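The shape of that post-processing step, sketched in Python: sorting the chunks in parallel is the easy part, but a k-way merge is still needed afterwards to get one fully sorted output. (The `sort_chunk` helper, the toy data, and the even/odd split are invented for illustration; the "fork" start method is assumed.)

```python
import heapq
from multiprocessing import get_context

def sort_chunk(chunk):
    return sorted(chunk)

data = [5, 3, 8, 1, 9, 2, 7, 4, 6, 0]
chunks = [data[0::2], data[1::2]]  # naive split across two workers

ctx = get_context("fork")
with ctx.Pool(2) as pool:
    sorted_chunks = pool.map(sort_chunk, chunks)

# Concatenating the chunks would NOT give a sorted list;
# the merge is the step GNU parallel alone cannot do for you.
merged = list(heapq.merge(*sorted_chunks))
print(merged)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```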


> Try using GNU parallel to sort a list of numbers, [...] – it's not possible,

Yet it clearly is possible, so your blanket statement is clearly wrong.

`parsort` is a simple wrapper, and this really goes for many uses of GNU Parallel: you need to prepare your data for the parallel step and post-process the output.

Maybe you originally meant to say: "Only for processes where you can preprocess the input and post-process the outputs."


Why would you use GNU parallel if you have to implement your own non-trivial pre- or post-processing logic anyway? Just spawn the worker processes yourself.

GNU parallel is great if you have, e.g., a bunch of files, each of which needs to be processed individually, like running awk or sed over it. Then you can just plop parallel in front and get a speedup for free. That's not what parsort does.


> GNU parallel is the way to go for dividing work up amongst CPU cores. Why reinvent the wheel?

We’re not talking about writing scripts to run on your laptop. We’re talking about code written for production applications. Deploying GNU parallel to production nodes / containers would be a major change to production systems that may not be feasible, and even if it is, it would come with a high cost in terms of added complexity, maintenance, and production troubleshooting.


I used to use GNU parallel to run big data tasks on supercomputers. There's nothing special about "production". It's all just computers.


That's actually what I'm doing a lot of the time. Or even just bash: for i in $(seq "$threadcount"); do pypy my.py "$i/$threadcount" & done; wait


We’re not talking about how to write scripts to run on your laptop, we’re talking about production systems.




