> The time it takes to allocate a chunk is linear in its size which is quite slow. I am not sure why that is, but maybe it is faster using OpenCL instead of Cuda? What were your timings for allocating and disposing raw memory plotted against size?
I must admit I've never systematically measured it before, so I got curious and wrote this program: http://lpaste.net/356490

The output on an NVIDIA K40 GPU is this:

1 bytes; average: 382us; min: 364us; max: 501us
2 bytes; average: 376us; min: 364us; max: 478us
4 bytes; average: 372us; min: 364us; max: 474us
8 bytes; average: 375us; min: 364us; max: 405us
16 bytes; average: 373us; min: 364us; max: 407us
32 bytes; average: 375us; min: 365us; max: 404us
64 bytes; average: 375us; min: 364us; max: 585us
128 bytes; average: 376us; min: 367us; max: 396us
256 bytes; average: 376us; min: 364us; max: 395us
512 bytes; average: 373us; min: 364us; max: 425us
1024 bytes; average: 371us; min: 365us; max: 421us
2048 bytes; average: 372us; min: 364us; max: 472us
4096 bytes; average: 372us; min: 365us; max: 411us
8192 bytes; average: 371us; min: 364us; max: 394us
16384 bytes; average: 371us; min: 364us; max: 403us
32768 bytes; average: 374us; min: 365us; max: 554us
65536 bytes; average: 372us; min: 364us; max: 407us
131072 bytes; average: 371us; min: 365us; max: 393us
262144 bytes; average: 372us; min: 364us; max: 394us
524288 bytes; average: 372us; min: 364us; max: 405us
1048576 bytes; average: 373us; min: 364us; max: 482us
2097152 bytes; average: 371us; min: 364us; max: 391us
4194304 bytes; average: 371us; min: 364us; max: 393us
8388608 bytes; average: 380us; min: 363us; max: 487us
16777216 bytes; average: 372us; min: 364us; max: 474us
33554432 bytes; average: 371us; min: 365us; max: 391us
67108864 bytes; average: 373us; min: 349us; max: 593us
134217728 bytes; average: 372us; min: 365us; max: 399us
268435456 bytes; average: 372us; min: 365us; max: 410us
536870912 bytes; average: 376us; min: 364us; max: 473us

(100 runs for every size.) This is a fairly pessimistic measurement, as I am even launching a kernel and doing a (single) write to the memory block - it is unlikely that any delayed/lazy allocation is going on. There does not seem to be any significant correlation between memory size and allocation cost. Nor should there be - it's just bookkeeping, after all.
> You should go a bit further and allow embedding C code for performance oriented users.
This would have the same effect as embedding assembly into C code, i.e. it basically makes everything nearby off-limits for optimisation. It's even worse in Futhark than for a C compiler, since Futhark depends much more heavily on automatic program transformation. It may be possible to permit very simple near-scalar C functions, but at that point I'm not sure it's worth it any more. Futhark is a perfectly capable scalar language.
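To make "very simple near-scalar C functions" concrete: the embeddable candidates would be pure, branch-light scalar helpers like the one below (a hypothetical example of mine, not an actual Futhark feature). Anything with loops over memory, pointers, or side effects would block transformations like fusion, which is the point being made above.

```c
#include <stdint.h>

/* A pure, loop-free scalar helper of the sort a user might want to embed:
   counts the set bits in a 32-bit word (the classic bit-twiddling popcount).
   Being side-effect free, it is the only kind of C code an aggressively
   transforming compiler could plausibly treat as a black-box scalar op. */
static inline uint32_t popcount32(uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    return (x * 0x01010101u) >> 24;
}
```

Even for something this simple, though, the function body is opaque to the host compiler's optimiser, which supports the argument that a capable scalar sublanguage is the better trade.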
> That is interesting. Can Futhark be used as a library apart from Haskell (in which it is written)?
I think we are talking about different things. If you mean can the Futhark compiler be used as a library, then no. Futhark is very much an ahead-of-time language, and not suitable for run-time code generation. In fact, Futhark is a bit of a counter-reaction to the run-time-code-generating embedded languages that were popular in the FP community some years ago. I think the results we have obtained (both in performance and language ergonomics) have validated this approach, but of course it also means we have to make some sacrifices.
What Futhark can do is generate (ahead-of-time) library code that can be used by other languages, without paying the setup/teardown cost (or copying to/from GPU) every time you call a Futhark function from some other language. This works well enough to support real-time applications like particle toys[0] or webcam filters[1], although low latency isn't really what Futhark is built for.
Have you considered looking at something like Obsidian[2]? It is a much lower level functional language that directly exposes GPU concepts. While Obsidian is Haskell-specific, there is a successor called FCL[3] that is standalone. Unlike Futhark, these languages do not depend on aggressive and expensive compiler transformations, and therefore may be a better fit for embedding.
> It does not seem that there is any significant correlation between memory size and allocation cost. Nor should there be - it's just bookkeeping, after all.
I guess this explains your lack of concern about memory pooling. I am going to have to do some research to see whether OpenCL does pooling natively, but I can assure you that this is not the behaviour on CUDA devices. I can't get your example to compile right now, but I'll look into it later.

You are also right that it is strange that allocations are so slow on CUDA.

But if it turns out that OpenCL is doing pooling behind the scenes, this is going to be an issue when you decide to write a CUDA backend, because the performance profile will start to be dominated by allocations.
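By "pooling" I mean something like the following sketch: freed blocks are cached on a free list keyed by (rounded-up) size, so repeated allocations of the same size skip the expensive driver call. This is my own illustration with malloc/free standing in for cuMemAlloc/cuMemFree, not a description of what either runtime actually does.

```c
#include <stdlib.h>

/* Minimal size-bucketed memory pool.  Blocks are rounded up to powers of
   two; pool_free never returns memory to the underlying allocator, so the
   expensive allocation is paid at most once per size class. */
#define POOL_BUCKETS 64

typedef struct block {
    struct block *next;   /* free-list link, valid while the block is free */
    size_t size;          /* rounded-up payload size */
} block;

static block *pool[POOL_BUCKETS];

static int bucket_of(size_t size) {
    int b = 0;
    while (((size_t)1 << b) < size) b++;   /* round up to a power of two */
    return b;
}

void *pool_alloc(size_t size) {
    int b = bucket_of(size);
    if (pool[b]) {                         /* fast path: reuse cached block */
        block *blk = pool[b];
        pool[b] = blk->next;
        return blk + 1;
    }
    block *blk = malloc(sizeof(block) + ((size_t)1 << b));
    if (!blk) return NULL;
    blk->size = (size_t)1 << b;
    return blk + 1;                        /* slow path: real allocation */
}

void pool_free(void *p) {
    block *blk = (block *)p - 1;
    int b = bucket_of(blk->size);
    blk->next = pool[b];
    pool[b] = blk;                         /* cached, not given back */
}
```

The trade-off is fragmentation: memory held in one bucket is unavailable to the others, which is presumably part of why a runtime might not do this by default.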
> This would have the same effect as embedding assembly into C code, i.e. it basically makes everything nearby off-limits for optimisation.
This is true, but that would also be the case if you were using an FFI. Being able to embed C code would shorten that path a bit by not requiring the user to compile a code fragment to a .dll before invoking it.
> In fact, Futhark is a bit of a counter-reaction to the run-time-code-generating embedded languages that were popular in the FP community some years ago.
Hmm... I had never heard about this. But then again, I haven't been programming that long, relatively speaking.
> Have you considered looking at something like Obsidian[2]? It is a much lower level functional language that directly exposes GPU concepts. While Obsidian is Haskell-specific, there is a successor called FCL[3] that is standalone. Unlike Futhark, these languages do not depend on aggressive and expensive compiler transformations, and therefore may be a better fit for embedding.
I can't find any documentation for Obsidian, but there is a paper for FCL in its GitHub repository, so I'll look at that.
[0]: https://github.com/Athas/diving-beet
[1]: https://github.com/nqpz/futcam
[2]: https://hackage.haskell.org/package/Obsidian
[3]: https://github.com/dybber/fcl