
Using R is basically a requirement for my work (biology) unless I felt like reimplementing all the packages I depend on, which I don't want to do since I want to graduate sometime this decade.

The multicore strategy calls mcparallel under the hood, which uses forking (hence why it doesn't work on Windows). I don't recall exactly where in the source code I figured this out (again, this was a year ago), but it suffers from the same problem: there is a limited number of communication channels (what I generically call "ports") for all interprocess communication in R, and that includes fork-based parallelism. The problem is that R does not let you tell it which port to use when communicating with forked processes. The only advantage of the fork method in R, as far as I can tell, is that the processes share memory, so you don't have to copy everything around.
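
For reference, a minimal sketch of that fork-based API (mcparallel/mccollect from the parallel package; POSIX only, and the workload here is just a placeholder):

  library("parallel")

  x <- rnorm(1e6)
  ## mcparallel() forks the current R process; the child sees 'x'
  ## through copy-on-write memory rather than an explicit copy.
  job <- mcparallel(sum(x))
  mccollect(job)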

My topology was a set of 3 analyses (independent of each other), each of which had to be run in 1000 replicates (also independent of each other), and each task had two independent subtasks whose results were combined into a final result. I was running these on a server with 48 cores.
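
For concreteness, that topology could be sketched roughly like this with the future.apply package (the subtask functions below are placeholders, not the actual analysis code):

  library("future.apply")
  plan(multisession, workers = 48)   ## multisession also works on Windows

  ## Placeholder subtasks standing in for the real analyses.
  subtask_one <- function(analysis, rep) rnorm(10)
  subtask_two <- function(analysis, rep) rnorm(10)

  ## 3 independent analyses x 1000 independent replicates,
  ## each replicate combining two independent subtasks.
  grid <- expand.grid(analysis = 1:3, rep = 1:1000)
  results <- future_lapply(seq_len(nrow(grid)), function(i) {
    a <- subtask_one(grid$analysis[i], grid$rep[i])
    b <- subtask_two(grid$analysis[i], grid$rep[i])
    mean(c(a, b))   ## combine the two subtasks into one result
  })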

I'm sure there are other more heavyweight things that I could do to fix it, like using custom packages or setting up external software, but this was for a one-off analysis for a paper. It had to be reproducibly run (hence the desire for Windows compatibility) but as long as I could run it, it didn't matter if it was the best architecture.




I'm also working in biology (at the UCSF Cancer Center), and one of the reasons the future package exists in the first place is that we needed a way to process a large number of microarrays, RNA/DNA sequencing samples and Hi-C data (and I'd like to do everything from the R prompt). We have a large compute cluster available for this (sorry @apathy, we're still on TORQUE but hope to move to Slurm soon).

Now our R script for sequence alignment basically looks like:

  ## Use nested futures where first layer is resolved
  ## via the scheduler and the second using multiple
  ## cores / processes on each of the compute node.
  library("future.BatchJobs")
  plan(list(batchjobs_torque, multiprocess))

  fastq <- dir(pattern = "[.]fq$")
  bam <- listenv()
  for (ii in seq_along(fastq)) {
     fq <- fastq[ii]
     bam[[ii]] %<-% {
       bam_ii <- listenv()
       for (chr in 1:24) {
         bam_ii[[chr]] %<-% DNAseq::align(fq, chromosome = chr)
       }
       as.list(bam_ii)
     }
  }
The future.BatchJobs package (https://cran.r-project.org/package=future.BatchJobs), which enhances the future package, uses the BatchJobs package framework as its backend.


You ran into IPC process limits. The usual way around this is just to run the tasks across a bunch of nodes, using mclapply or OpenMP (if running C library calls) on each.
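
Within a single node, that might look like the following (mclapply forks, so this is POSIX only; the task function is just a placeholder):

  library("parallel")

  task <- function(i) sqrt(i)   ## placeholder unit of work
  res <- mclapply(1:1000, task, mc.cores = detectCores())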

R is not beautiful, but it can be coerced into getting things done if need be. Any program spawning processes in user space will hit IPC limits that exist to prevent fork bombs. Either you live with it or you write threaded (ugh) or OpenMP (yay) hooks (typically via Rcpp) to sidestep this.
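
A minimal sketch of the Rcpp/OpenMP route, assuming an OpenMP-capable compiler toolchain; the threads live inside a single C++ call, so no extra R processes or IPC channels are involved:

  library("Rcpp")

  sourceCpp(code = '
    // [[Rcpp::plugins(openmp)]]
    #include <Rcpp.h>
    #ifdef _OPENMP
    #include <omp.h>
    #endif

    // [[Rcpp::export]]
    Rcpp::NumericVector par_sqrt(Rcpp::NumericVector x) {
      int n = x.size();
      Rcpp::NumericVector out(n);
      // OpenMP threads, not forked R processes, do the work here.
      #pragma omp parallel for
      for (int i = 0; i < n; i++) out[i] = std::sqrt(x[i]);
      return out;
    }
  ')

  par_sqrt(as.numeric(1:10))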

The reason I asked about topology and machines is that I'm one of the people who pestered Ripley to include parallel support on Windows at all. My graduate adviser wanted to run one of my analyses on Windows and I wanted to run several million of them on the cluster. So I bitched until BDR fixed it. This does not usually work...


Well done - and thanks for pushing for this (and to BDR for implementing this and many, many other things)! I didn't know this history. R users and developers have so many people to thank, alongside a lot of heroic work by a few.



