So I've actually tried to use the futures package. While it's very clean for certain types of tasks, there are a few problems that I think are inherent to the way R deals with its parallel packages (which the futures package is built on top of).
Futures is great for tasks where you have some kind of task workflow like:
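Something like this minimal sketch (task_a(), task_b(), and combine_results() are made-up placeholders for whatever your workflow actually does):

library("future")
plan(multiprocess)               # pick a backend; multicore/multisession also work

a %<-% task_a()                  # starts evaluating in a background process right away
b %<-% task_b()                  # starts in a second background process
result <- combine_results(a, b)  # blocks only when the two values are actually needed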
And boom, you can have two tasks running in parallel and everything "just works." It's extremely nice to use thanks to R's promises capability.
Where it falls down is when you try to load up a bunch of futures at once... I'm not clear on the implementation details, but from what I can tell every parallel task is assigned a "port" on your system, but if there is a port conflict (or the OS doesn't "release" (?) the port quickly enough) tasks simply die with an inscrutable error.
I've found that it's necessary to 1. ensure that only one "set" of parallel tasks is running at a time, and 2. create a central "port registry" and manually assign ports randomly within non-overlapping ranges for parallel tasks. It's straightforward but frustrating to do.
Finally (and I don't know if the futures package has been updated since I tried it out last year), it doesn't work on Windows, which is a problem for many R users.
As for not working on Windows, have you tried `plan(multisession)` instead of `plan(multicore)`? The latter will never work on Windows due to its lack of forkability, as mentioned in this vignette.
That ports thing is interesting; could that bug be specific to the `multicore` plan?
It's been over a year since I dug into this problem, but I believe that the ports issue is inherent to all R parallel code, at least on Linux. Peeking at the future package's code, multisession is built on the "cluster" type of parallelism, which means that it is vulnerable to this issue.
I haven't encountered the problem on Mac but that's because my laptop is generally unable to create enough parallel jobs to saturate the number of available ports.
Yes and no. Multicore/multisession is not the only option for parallelism in R. The default approach is to use fork() on *nix and... hell, I don't use Windows, so I don't really care how it works there. But there are other options.
Depending on your use case it may be fixable. What topology are you using where this is breaking down, and could the synchronization/communication be queued with something like rredis or rrqueue?
All of this assuming you want to use R for your work, as opposed to an external dependency injection framework (or some crap like Torque).
Using R is basically a requirement for my work (biology) unless I felt like reimplementing all the packages I depend on, which I don't want to do since I want to graduate sometime this decade.
The multicore strategy calls mcparallel under the hood, which uses the forking strategy (hence why it doesn't work on Windows). I don't recall exactly where in the source code I figured this out (again, this was a year ago), but it suffers from the same problem: there are a limited number of communication channels (what I generically call "ports") for all interprocess communication in R, which includes fork-based parallelism. The problem is that R does not allow you to tell it which port to communicate across for forked processes. The only advantage the fork method has in R, as far as I can tell, is that you can share memory and don't have to copy stuff around.
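For reference, here's a minimal sketch of the forking primitive that the multicore strategy wraps (the expressions are just placeholders; this is *nix-only since it relies on fork()):

library("parallel")
## mcparallel() forks the current R session; each child evaluates its
## expression and sends the serialized result back to the parent.
job1 <- mcparallel({ Sys.sleep(1); 1 + 1 })
job2 <- mcparallel({ Sys.sleep(1); 2 + 2 })
mccollect(list(job1, job2))   # blocks until both children have finished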
My topology was a set of 3 analyses (independent of each other), each of which had to be replicated 1000 times (the replicates were also independent), and each replicate had two independent subtasks that were combined to form a final result. I was running these on a server with 48 cores.
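For concreteness, that layout maps onto futures roughly like this (run_subtask_a(), run_subtask_b(), and combine() stand in for the real analysis code):

library("future")
library("listenv")
plan(multisession, workers = 48)       # one worker per core on the server

results <- listenv()
idx <- 0
for (analysis in 1:3) {
  for (rep in 1:1000) {
    idx <- idx + 1
    results[[idx]] %<-% {
      a <- run_subtask_a(analysis, rep)  # the two independent subtasks
      b <- run_subtask_b(analysis, rep)
      combine(a, b)                      # combined into the final result for this replicate
    }
  }
}
results <- as.list(results)  # creating more futures than workers simply blocks until one frees up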
I'm sure there are other more heavyweight things that I could do to fix it, like using custom packages or setting up external software, but this was for a one-off analysis for a paper. It had to be reproducibly run (hence the desire for Windows compatibility) but as long as I could run it, it didn't matter if it was the best architecture.
I'm also working in biology (at the UCSF Cancer Center), and one of the reasons why the future package exists in the first place is that we needed a way to process a large number of microarrays, RNA/DNA sequencing samples, and HiC data (and I'd like to do everything from the R prompt). We have a large compute cluster available for this (sorry @apathy, but we're using TORQUE, though we hope to move to Slurm soon).
Now our R script for sequence alignment basically looks like:
## Use nested futures where the first layer is resolved
## via the scheduler and the second using multiple
## cores / processes on each of the compute nodes.
library("future.BatchJobs")
library("listenv")
plan(list(batchjobs_torque, multiprocess))

fastq <- dir(pattern = "[.]fq$")
bam <- listenv()
for (ii in seq_along(fastq)) {
  fq <- fastq[ii]
  bam[[ii]] %<-% {
    bam_ii <- listenv()
    for (chr in 1:24) {
      bam_ii[[chr]] %<-% DNAseq::align(fq, chromosome = chr)
    }
    as.list(bam_ii)
  }
}
You ran into IPC process limits. The usual way around this is just to run the tasks across a bunch of nodes, using mclapply or OpenMP (if running C library calls) on each.
R is not beautiful, but it can be coerced into getting things done if need be. But any program running processes in user space will hit IPC limits intended to prevent fork bombs. Either you live with it or you write threaded (ugh) or OpenMP (yay) hooks (typically via Rcpp) to sidestep this.
The reason I asked about topology and machines is that I'm one of the people who pestered Ripley to include parallel support on Windows at all. My graduate adviser wanted to run one of my analyses on Windows and I wanted to run several million of them on the cluster. So I bitched until BDR fixed it. This does not usually work...
Well done - and thanks for pushing for this (and for BDR to implement this and many, many other things)! I didn't know this history. R users and developers have so many people to thank, and so much of it comes down to the heroic work of a few.
From my reading of the docs, "multisession" is supposed to work on Windows, but not "multicore". For cross-platform coding, one can use the "multiprocess" option, which results in multicore on platforms that support it, and multisession otherwise.
Author here. Thanks for this illustration of futures in R.
About one port per future: This is the case for futures that use the 'multisession' backend, which is a localhost + PSOCK version of the much more general 'cluster' future type. PSOCK cluster futures are in turn built on top of 'parallel::makePSOCKcluster()', which launches a set of R processes/sessions that communicate with the main R session via sockets.
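In other words, these two setups end up doing essentially the same thing (a small sketch; the worker count is arbitrary):

library("future")
## multisession: future creates the localhost PSOCK workers for you
plan(multisession, workers = 4)

## cluster: you create the PSOCK cluster yourself and hand it to future
cl <- parallel::makePSOCKcluster(4)
plan(cluster, workers = cl)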
The 'multicore' futures do _not_ use the above, but instead rely on process forking (part of the 'parallel' package), just like 'parallel::mclapply()'. R doesn't support forking on Windows, which is why multicore futures fall back to synchronous processing on Windows. With multicore there is no usage of ports.
BTW, as of a few months ago, the future package provides the 'multiprocess' future type, which basically is just a convenient alias for 'multicore' with fallback to 'multisession', so that plan('multiprocess') provides parallel processing everywhere.
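So in user code the cross-platform choice boils down to something like this sketch:

library("future")
plan(multiprocess)      # multicore (forking) where supported, multisession otherwise
supportsMulticore()     # FALSE on Windows, so multisession workers are used there
x %<-% Sys.getpid()     # evaluated in a background process either way
x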
When the first multisession future is created, it also triggers the setup of all the background R sessions (via 'parallel::makePSOCKcluster()'). If there are issues with not being able to create all workers, then there should be an informative error at this point.
However, other than interrupting/terminating the background R processes or the main R process by mistake while they communicate with each other, I actually haven't experienced any sudden deaths. Of course, a worker can always die for any of the reasons that can make a regular R session core dump.
Oh, I should not forget to say that both multicore and multisession futures try to play very nice with the machine settings. For instance, they will not use more cores than are available on the machine (unless you force them to). They will also adapt to whatever number of cores you are assigned by job schedulers such as Slurm, SGE and TORQUE/PBS. That is, if you only ask for two cores but the machine has 48, the future package will only use two. Also, if you use nested or recursive futures, you won't by mistake spawn off a tree of background processes - it'll stick with what your main R process had available in the first place.
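Concretely, something like this sketch (the scheduler variables mentioned are examples of what availableCores() inspects):

library("future")
## On a 48-core node where the scheduler granted only two cores
## (e.g. SLURM_CPUS_PER_TASK=2, PBS_NUM_PPN=2 or NSLOTS=2),
## availableCores() reports 2 rather than 48.
availableCores()
plan(multiprocess)
nbrOfWorkers()   # the number of workers the current plan will actually use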
I have occasionally experienced broken communications with PSOCK workers, but as far as I've been able to tell, this has always been when either my main or my background R sessions have been interrupted, e.g. hitting Ctrl-C at the "wrong time", causing the socket communication to become incomplete (maybe it's possible to add protection against this - I don't know).
@jonchang, if you experience "tasks [that] simply die with an inscrutable error" again, could you please report it at https://github.com/HenrikBengtsson/future/issues? I'd like to gather as much info as possible on such cases, and maybe I can add some additional protection to the future package (or propose fixes to the parallel package).
@jonchang, I actually developed the future package almost solely on my Windows notebook and then tested on Linux and macOS - if it didn't work for you on Windows you must have hit a bad build. Please, give it another try. I'm trying very hard not to leave Windows users behind.
I actually hacked together an almost functional system similar to this a few years ago. It used the same primitives, delayedAssign and parallel::mcparallel, to implement parallel-evaluated lazy promises. It was nearly useful, but I couldn't get it to work when passing one promised value to another lazy expression, presumably because only the process that forked a subprocess can read the value it returns, so the second forked process can't evaluate the promise. It looks like this package solves that problem by forcing evaluation of any implicit futures before passing them to another future. I'm definitely interested in trying this out.
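A rough reconstruction of that kind of hack, for anyone curious (the expression is just a placeholder):

library("parallel")
## Fork a child to start evaluating the expression right away...
job <- mcparallel({ Sys.sleep(2); 42 })
## ...but only collect the result lazily, when `x` is first touched.
delayedAssign("x", mccollect(job)[[1]])
## Other work here overlaps with the child process.
x   # forces the promise; only the process that forked `job` can collect its result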
Now, with this in hand, can we have, for instance, a multithreaded (or otherwise parallel) web server or even a REST API for R?
Talking from the practical perspective, the biggest problem with wide adoption of R is the problem of integration. Sometimes you just want to have a single module in R, and the rest of the system in some other technology. I know there are ways to do it, but not without quite high technical debt. On the other hand having native microservice-like integration could probably help.
"A very flexible framework for building server side logic in R. The framework is unoppinionated when it comes to how HTTP requests and WebSocket messages are handled and supports all levels of app complexity; from serving static content to full-blown dynamic web-apps. Fiery does not hold your hand as much as e.g. the shiny package does, but instead sets you free to create your web app the way you want."
I'm pretty sure it's blocking, though; R doesn't really have good concurrency primitives. However, you can launch multiple R processes behind nginx and be able to serve N requests concurrently. Running this in an autoscaling group on AWS provides enough resilience in practice.
Consider code like x <- compute_some_value(); y <- compute_other_value(), followed by something that combines the two results. compute_some_value() and compute_other_value() are independent and both of them take a long time to run, so they would benefit from running in parallel. However, actually running them in parallel is tricky, because most parallel interfaces in R are modelled after lapply, running a single function on multiple elements of a list, and this doesn't fit that mold. You could parallelize it manually using primitives such as parallel::mcparallel and delayedAssign, but you don't get error handling/propagation, and your code gets super messy with the implementation details of your parallelization strategy. And if you do parallelize it and then someone else calls your code in parallel, now you get too many parallel processes and risk running out of memory and ending up in swapping hell.
The bottom line is that code such as the above generally just doesn't get parallelized, because the only way of doing so (as far as I know) requires pointing several guns at your foot. So this package looks very interesting and useful to me, and I also think it provides a good set of primitives with which to implement yet another "multi-backend parallel lapply" package with advantages over the others, such as doing its best to ensure consistent behavior across the different "backends".
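For comparison, here's roughly what the futurized version of that pattern looks like (compute_some_value() and compute_other_value() are the same hypothetical functions as above, and combine() is a stand-in for whatever uses both results):

library("future")
plan(multiprocess)            # nested use won't over-subscribe: inner futures run sequentially

x %<-% compute_some_value()   # both start immediately, in parallel
y %<-% compute_other_value()
result <- combine(x, y)       # an error in either future is re-thrown here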
(Edit: Also see jonchang's comment along similar lines.)
Thank you for explaining this; I was trying to see how this would be useful. Could this be used to do parallel data loads (like reading from a CSV and a database at the same time)?
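Something along these lines is what I have in mind (a sketch; the CSV file, SQLite database and table are made up, and the database connection is opened inside the future because connections generally can't be shared across processes):

library("future")
plan(multiprocess)

csv_data %<-% read.csv("measurements.csv")
db_data  %<-% {
  con <- DBI::dbConnect(RSQLite::SQLite(), "samples.db")
  on.exit(DBI::dbDisconnect(con))
  DBI::dbGetQuery(con, "SELECT id, value FROM samples")
}
merged <- merge(csv_data, db_data, by = "id")   # both loads run concurrently up to this point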
Can someone explain in a few sentences (or point me to a place where I can get the answer) what the benefit is of using R over some functional language like Haskell, OCaml or F#?
Which use-cases does R shine in over those languages?
Is the popularity driven by R the language or its libraries?
Libraries. If you want to do statistics in any other language, you have to reimplement 80% of the diagnostics and tests by hand and a good portion of methods and algorithms (no, not just esoteric ones).
While obviously the answer is libraries and data types (DataFrame is a very powerful semantic, replicated later in Python, Julia, Scala/Spark), R definitely IS a functional language:
- closures, lambdas
- higher-order functions
- a lot of R processing is map/fold-like (even though it's lapply instead of map) - see the small sketch below
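A tiny illustration of those points:

make_counter <- function() {             # a closure: the inner function captures `n`
  n <- 0
  function() { n <<- n + 1; n }
}
counter <- make_counter()
counter(); counter()                     # 1, then 2

squares <- lapply(1:5, function(x) x^2)  # lapply as map, with an anonymous function
total <- Reduce(`+`, squares)            # Reduce as fold; total is 55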
I took a course recently in ML that focused on Support Vector Machines (SVM), Decision Trees, and Neural Nets. The professor chose R because of the libraries.
SVMs, Decision Trees, and Neural Nets are non-trivial to write libraries for. In order to produce similar libraries for any other language, you need either a good grounding in math (linear algebra, PDEs, and projective geometry at the least) or a well-developed ability to translate from one language - R source, for instance - to the language of your choice without completely understanding what you're translating. Regardless of your background, it will take a significant amount of time and focus to accomplish it.
Our professor mentioned that Python may have libraries available now. He likes Python in general. In his research, he uses R simply due to familiarity.