So I've actually tried to use the futures package. While it's very clean for certain types of tasks, there are a few problems that I think are inherent to the way R deals with its parallel packages (which the futures package is built on top of).
Futures is great for tasks where you have some kind of task workflow like:
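two independent tasks whose results feed into a final combining step. Because then you can just do something like this (a minimal sketch; `task_a()`, `task_b()`, and `combine()` are hypothetical stand-ins):

```r
library("future")
plan(multisession)

## Each %<-% creates a future that starts evaluating right away
## in a background R session.
a %<-% task_a()
b %<-% task_b()

## Reading the values blocks until the corresponding future resolves.
result <- combine(a, b)
```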
And boom, you can have two tasks running in parallel and everything "just works." It's extremely nice to use thanks to R's promises capability.
Where it falls down is when you try to load up a bunch of futures at once... I'm not clear on the implementation details, but from what I can tell every parallel task is assigned a "port" on your system, and if there is a port conflict (or the OS doesn't "release" (?) the port quickly enough), tasks simply die with an inscrutable error.
I've found that it's necessary to 1. ensure that only one "set" of parallel tasks is running at a time, and 2. create a central "port registry" and manually assign ports randomly within nonoverlapping ranges for parallel tasks. It's straightforward but frustrating to do.
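The workaround looked roughly like this (a sketch, not my exact code; `parallel::makePSOCKcluster()` does accept a `port` argument, but the registry and the ranges here are made up):

```r
library("parallel")

## Hypothetical "port registry": give each set of workers its own
## nonoverlapping port range so concurrent clusters can't collide.
port_ranges <- list(set1 = 11000:11499, set2 = 11500:11999)

## Pick a random port from this set's range and build the cluster on it.
port <- sample(port_ranges$set1, 1)
cl <- makePSOCKcluster(4, port = port)
result <- parLapply(cl, 1:4, function(i) i^2)
stopCluster(cl)
```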
Finally (and I don't know if the futures package has been updated since I tried it out last year), it doesn't work on Windows, which is a problem for many R users.
As for not working on Windows, have you tried `plan(multisession)` instead of `plan(multicore)`? The latter will never work on Windows due to its lack of forkability, as mentioned in this vignette.
That ports thing is interesting; could that bug be specific to the `multicore` plan?
It's been over a year since I dug into this problem, but I believe the ports issue is inherent to all R parallel code, at least on Linux. Peeking at the futures code, multisession is built on the "clustering" type of parallelism, which means that it is vulnerable to this issue.
I haven't encountered the problem on Mac but that's because my laptop is generally unable to create enough parallel jobs to saturate the number of available ports.
Yes and no. Multicore/multisession is not the only option for parallelism in R. The default approach is to use fork() on *nix and... hell, I don't use Windows, so I don't really care how it works there. But there are other options.
Depending on your use case it may be fixable. What topology are you using where this is breaking down, and could the synchronization/communication be queued with something like rredis or rrqueue?
All of this is assuming you want to use R for your work, as opposed to an external dependency injection framework (or some crap like Torque).
Using R is basically a requirement for my work (biology) unless I felt like reimplementing all the packages I depend on, which I don't want to do since I want to graduate sometime this decade.
The multicore strategy calls mcparallel under the hood, which uses the forking strategy (which is why it doesn't work on Windows). I don't recall exactly where in the source code I figured this out (again, this was a year ago), but it suffers from the same problem: there is a limited number of communication channels (what I generically call "ports") for all interprocess communication in R, which includes fork-based parallelism. The problem is that R does not let you tell it which port to communicate across for forked processes. The only advantage of the fork method in R, as far as I can tell, is that you can share memory and don't have to copy stuff around.
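To show what that looks like at the `parallel` level (a minimal sketch of the forking primitives themselves, not multicore-futures code):

```r
library("parallel")

## mcparallel() forks the current R process; the child shares memory
## copy-on-write, so nothing is serialized up front.
job1 <- mcparallel(sum(rnorm(1e6)))
job2 <- mcparallel(sum(runif(1e6)))

## mccollect() waits for the forked children and gathers their results.
results <- mccollect(list(job1, job2))
```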
My topology was a set of 3 analyses (which didn't depend on each other), each of which had to be run in replicate 1000 times (those didn't depend on each other either), and each task had two independent subtasks that were combined to form a final result. I was running these on a server with 48 cores.
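In future terms, the shape was roughly this (a sketch; `analysis1` through `analysis3`, `subtask_a()`, `subtask_b()`, and `merge_results()` are hypothetical stand-ins for my actual code):

```r
library("future")
library("listenv")
plan(multisession)

analysis_fns <- list(analysis1, analysis2, analysis3)  # hypothetical
results <- listenv()
for (a in seq_along(analysis_fns)) {
  for (rep in 1:1000) {
    ## Each replicate is an independent future; inside it, the two
    ## subtasks could themselves be futures (a nested topology).
    results[[sprintf("a%d_rep%d", a, rep)]] %<-% {
      x <- subtask_a(analysis_fns[[a]])
      y <- subtask_b(analysis_fns[[a]])
      merge_results(x, y)
    }
  }
}
```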
I'm sure there are other more heavyweight things that I could do to fix it, like using custom packages or setting up external software, but this was for a one-off analysis for a paper. It had to be reproducibly run (hence the desire for Windows compatibility) but as long as I could run it, it didn't matter if it was the best architecture.
I'm also working in biology (at the UCSF Cancer Center), and one of the reasons the future package exists in the first place is that we needed a way to process a large number of microarrays, RNA/DNA sequencing samples, and Hi-C data (and I'd like to do everything from the R prompt). We have a large compute cluster available for this (sorry @apathy, but we're using TORQUE; we hope to move to Slurm soon).
Now our R script for sequence alignment basically looks like:
```r
## Use nested futures where the first layer is resolved
## via the scheduler and the second using multiple
## cores / processes on each compute node.
library("future.BatchJobs")
library("listenv")
plan(list(batchjobs_torque, multiprocess))

fastq <- dir(pattern = "[.]fq$")
bam <- listenv()
for (ii in seq_along(fastq)) {
  fq <- fastq[ii]
  bam[[ii]] %<-% {
    bam_ii <- listenv()
    for (chr in 1:24) {
      bam_ii[[chr]] %<-% DNAseq::align(fq, chromosome = chr)
    }
    as.list(bam_ii)
  }
}
```
You ran into IPC process limits. The usual way around this is just to run the tasks across a bunch of nodes, using mclapply or OpenMP (if running C library calls) on each.
R is not beautiful, but it can be coerced into getting things done if need be. Any program spawning processes in user space will hit IPC limits meant to prevent fork bombs. Either you live with it or you write threaded (ugh) or OpenMP (yay) hooks (typically via Rcpp) to sidestep this.
The reason I asked about topology and machines is that I'm one of the people who pestered Ripley to include parallel support on Windows at all. My graduate adviser wanted to run one of my analyses on Windows and I wanted to run several million of them on the cluster. So I bitched until BDR fixed it. This does not usually work...
Well done - and thanks for pushing for this (and to BDR for implementing this and many, many other things)! I didn't know this history. R users and developers have so many people to thank, and much of it comes down to the heroic work of a few.
From my reading of the docs, "multisession" is supposed to work on Windows, but not "multicore". For cross-platform coding, one can use the "multiprocess" option, which results in multicore on platforms that support it, and multisession otherwise.
Author here. Thanks for this illustration of futures in R.
About one port per future: this is the case for futures that use the 'multisession' backend, which is a localhost + PSOCK version of the much more general 'cluster' future type. PSOCK cluster futures are in turn built on top of 'parallel::makePSOCKcluster()', which launches a set of R processes/sessions that communicate with the main R session via sockets.
The 'multicore' futures do _not_ use the above, but instead rely on process forking (part of the 'parallel' package), just like 'parallel::mclapply()'. R doesn't support forking on Windows, which is why multicore futures fall back to synchronous processing there. With multicore there is no usage of ports.
BTW, as of a few months ago, the future package provides the 'multiprocess' future type, which is basically just a convenient alias for 'multicore' with a fallback to 'multisession', so that plan('multiprocess') provides parallel processing for everyone.
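So in practice, picking a backend is a one-liner (a sketch; the comments just summarize the above):

```r
library("future")

plan(multisession)  # PSOCK workers on localhost; communicates over sockets/ports
plan(multicore)     # forked processes; no ports, but not available on Windows
plan(multiprocess)  # alias: multicore where supported, multisession otherwise

## The expression below runs on whichever backend was last set.
x %<-% { Sys.getpid() }
x
```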
When the first multisession future is created, it also triggers the setup of all the background R sessions (via 'parallel::makePSOCKcluster()'). If there are issues with not being able to create all workers, then there should be an informative error at this point.
However, other than by mistakenly interrupting/terminating the background R processes or the main R process while they communicate with each other, I actually haven't experienced any sudden deaths. Of course, a worker can always die for any of the reasons an R session can core dump.
Oh, I should not forget to say that both multicore and multisession futures try to play very nicely with the machine settings. For instance, they will not use more cores than are available on the machine (unless you force them to). They will also adapt to whatever number of cores you are assigned by job schedulers such as Slurm, SGE and TORQUE/PBS. That is, if you only ask for two cores but the machine has 48, the future package will only use two. Also, if you use nested or recursive futures, you won't by mistake spawn off a tree of background processes - it'll stick with what your main R process had available in the first place.
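You can see what the package will use via `availableCores()` (a real function exported by the future package; the scheduler scenario in the comment is hypothetical):

```r
library("future")

## Respects R options, environment variables set by Slurm/SGE/TORQUE/PBS,
## and otherwise falls back to the detected number of cores.
availableCores()
#> e.g. 2 on a 48-core node if the scheduler granted the job only 2 cores
```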
I have occasionally experienced broken communications with PSOCK workers, but as far as I've been able to tell, this has always been when either my main or my background R sessions were interrupted, e.g. hitting Ctrl-C at the "wrong time", causing the socket communication to become incomplete (maybe it's possible to add protection against this - I don't know).
@jonchang, if you experience "tasks [that] simply die with an inscrutable error" again, could you please report it at https://github.com/HenrikBengtsson/future/issues? I'd like to gather as much info as possible on such cases, and maybe I can add some additional protection to the future package (or propose fixes to the parallel package).
@jonchang, I actually developed the future package almost solely on my Windows notebook and then tested on Linux and macOS - if it didn't work for you on Windows, you must have hit a bad build. Please, give it another try. I'm trying very hard not to leave Windows users behind.