
> Do you think some of the pain would be alleviated if Futhark had a simple FFI that would allow calling hand-written primitives when they exist?

Yeah, definitely. You should go a bit further and allow embedding C code for performance-oriented users. I do not foresee using it myself, but somebody is going to need it eventually, I guarantee it. I can envision some better alternatives than C, such as the language I am working on, but right now it is still incomplete and will not be ready for some time.

> first, memory allocations on GPUs are not unusually slow

The time it takes to allocate a chunk is linear in its size, which is quite slow. I am not sure why that is, but maybe it is faster using OpenCL instead of CUDA? What were your timings for allocating and disposing raw memory, plotted against size?

> If used as a library, a Futhark program does not dispose everything once it stops running, but only once the library is unloaded.

That is interesting. Can Futhark be used as a library apart from Haskell (in which it is written)?

> For example, if a Futhark function returns an array, that array still lives on the GPU, and if you use it as an argument to another Futhark function, there will have been no traffic (except bookkeeping stuff) between CPU and GPU.

But still, Futhark will probably not be able to optimize away all the intermediates. And more to the point, some programs like neural nets do in fact accumulate intermediates by necessity. If a particular Futhark program is run multiple times, the memory in those intermediates should be held in a pool.

> Second, I don't see why an interpreter would be any better at managing memory than the current Futhark runtime system.

It could potentially allow a memory pool to be shared among multiple Futhark programs.
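To make the pooling idea concrete, here is a minimal, hypothetical sketch of a size-bucketed pool. All names are illustrative; `raw_alloc`/`raw_free` stand in for the real driver calls (e.g. `cuMemAlloc`/`cuMemFree`), and rounding to powers of two is just one possible bucketing policy:

```python
# Hypothetical sketch of a size-bucketed memory pool; raw_alloc/raw_free
# stand in for the real driver calls (e.g. cuMemAlloc/cuMemFree).
class Pool:
    def __init__(self, raw_alloc, raw_free):
        self.raw_alloc = raw_alloc
        self.raw_free = raw_free
        self.free_lists = {}  # rounded-up size -> list of cached blocks

    @staticmethod
    def round_up(size):
        # round to the next power of two so freed blocks are interchangeable
        return 1 << max(size - 1, 0).bit_length()

    def alloc(self, size):
        size = self.round_up(size)
        cached = self.free_lists.get(size)
        if cached:
            return cached.pop()      # reuse a cached block: no driver call
        return self.raw_alloc(size)  # miss: pay the real allocation cost

    def free(self, block, size):
        # return the block to the pool instead of to the driver
        self.free_lists.setdefault(self.round_up(size), []).append(block)

    def drain(self):
        # hand everything back to the driver, e.g. when the library unloads
        for blocks in self.free_lists.values():
            for block in blocks:
                self.raw_free(block)
        self.free_lists.clear()
```

The point is that once a pool like this sits between the programs and the driver, repeated runs that allocate same-sized intermediates hit the cache and never touch the driver at all.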

The idea is not to turn Futhark into an interpreted language per se, it would still be a compiled language, but to instead add an extra layer that would allow easier communication with other languages. I do not have a concrete vision of how this should be done.

It goes back to what you mentioned about using Futhark as a library. I am expecting a negative answer, that it can only be used as a library from Haskell, but if you were to go more in the direction I am suggesting, then instead of making backends for C#, Java, and such, you could make something that allows Futhark to be used as a library.

Thinking back to some of the C examples that I have seen, I do not think users will appreciate having the massively bloated code files needed to compile the stuff dumped into their project folders by Futhark.

Not to mention, C# will need to be compiled to C, which will result in more temporary files. Futhark as an embedded language could take responsibility for managing all of that.




> The time it takes to allocate a chunk is linear in its size, which is quite slow. I am not sure why that is, but maybe it is faster using OpenCL instead of CUDA? What were your timings for allocating and disposing raw memory, plotted against size?

I must admit I've never systematically measured it before, so I got curious and wrote this program: http://lpaste.net/356490

The output on an NVIDIA K40 GPU is this:

  1 bytes; average: 382us; min: 364us; max: 501us
  2 bytes; average: 376us; min: 364us; max: 478us
  4 bytes; average: 372us; min: 364us; max: 474us
  8 bytes; average: 375us; min: 364us; max: 405us
  16 bytes; average: 373us; min: 364us; max: 407us
  32 bytes; average: 375us; min: 365us; max: 404us
  64 bytes; average: 375us; min: 364us; max: 585us
  128 bytes; average: 376us; min: 367us; max: 396us
  256 bytes; average: 376us; min: 364us; max: 395us
  512 bytes; average: 373us; min: 364us; max: 425us
  1024 bytes; average: 371us; min: 365us; max: 421us
  2048 bytes; average: 372us; min: 364us; max: 472us
  4096 bytes; average: 372us; min: 365us; max: 411us
  8192 bytes; average: 371us; min: 364us; max: 394us
  16384 bytes; average: 371us; min: 364us; max: 403us
  32768 bytes; average: 374us; min: 365us; max: 554us
  65536 bytes; average: 372us; min: 364us; max: 407us
  131072 bytes; average: 371us; min: 365us; max: 393us
  262144 bytes; average: 372us; min: 364us; max: 394us
  524288 bytes; average: 372us; min: 364us; max: 405us
  1048576 bytes; average: 373us; min: 364us; max: 482us
  2097152 bytes; average: 371us; min: 364us; max: 391us
  4194304 bytes; average: 371us; min: 364us; max: 393us
  8388608 bytes; average: 380us; min: 363us; max: 487us
  16777216 bytes; average: 372us; min: 364us; max: 474us
  33554432 bytes; average: 371us; min: 365us; max: 391us
  67108864 bytes; average: 373us; min: 349us; max: 593us
  134217728 bytes; average: 372us; min: 365us; max: 399us
  268435456 bytes; average: 372us; min: 365us; max: 410us
  536870912 bytes; average: 376us; min: 364us; max: 473us

(100 runs for every size.) This is a fairly pessimistic measurement, as I am even launching a kernel and doing a (single) write to the memory block - it is unlikely there is any delayed/lazy allocation going on. It does not seem that there is any significant correlation between memory size and allocation cost. Nor should there be - it's just bookkeeping, after all.
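(In case the paste goes away: the shape of the measurement loop can be sketched in Python, with a host-side `bytearray` allocation plus a single write standing in for the GPU allocation and one-write kernel. This is a hypothetical reconstruction of the methodology only, not the original program, and it is scaled down to 1 MiB rather than 512 MiB.)

```python
# Hypothetical reconstruction of the measurement loop: N runs per size,
# reporting average/min/max in microseconds. A bytearray allocation plus a
# single write stands in for the GPU allocation and one-write kernel.
import time

def time_alloc(size, runs=100):
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        buf = bytearray(size)  # stand-in for cuMemAlloc/clCreateBuffer
        buf[:1] = b"\x01"      # touch the block, like the single-write kernel
        t1 = time.perf_counter()
        samples.append((t1 - t0) * 1e6)
    return sum(samples) / len(samples), min(samples), max(samples)

# 1 byte up to 1 MiB, doubling each step (the original went up to 512 MiB)
for exp in range(21):
    size = 1 << exp
    avg, lo, hi = time_alloc(size)
    print(f"{size} bytes; average: {avg:.0f}us; min: {lo:.0f}us; max: {hi:.0f}us")
```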

> You should go a bit further and allow embedding C code for performance-oriented users.

This would have the same effect as embedding assembly into C code, i.e. it basically makes everything nearby off-limits for optimisation. It's even worse in Futhark than for a C compiler, since Futhark depends much more heavily on automatic program transformation. It may be possible to permit very simple near-scalar C functions, but at that point I'm not sure it's worth it any more. Futhark is a perfectly capable scalar language.

> That is interesting. Can Futhark be used as a library apart from Haskell (in which it is written)?

I think we are talking about different things. If you mean can the Futhark compiler be used as a library, then no. Futhark is very much an ahead-of-time language, and not suitable for run-time code generation. In fact, Futhark is a bit of a counter-reaction to the run-time-code-generating embedded languages that were popular in the FP community some years ago. I think the results we have obtained (both in performance and language ergonomics) have validated this approach, but of course it also means we have to make some sacrifices.

What Futhark can do is generate (ahead-of-time) library code that can be used by other languages, without paying the setup/teardown cost (or copying to/from GPU) every time you call a Futhark function from some other language. This works well enough to support real-time applications like particle toys[0] or webcam filters[1], although low latency isn't really what Futhark is built for.

[0]: https://github.com/Athas/diving-beet

[1]: https://github.com/nqpz/futcam

Have you considered looking at something like Obsidian[2]? It is a much lower level functional language that directly exposes GPU concepts. While Obsidian is Haskell-specific, there is a successor called FCL[3] that is standalone. Unlike Futhark, these languages do not depend on aggressive and expensive compiler transformations, and therefore may be a better fit for embedding.

[2]: https://hackage.haskell.org/package/Obsidian

[3]: https://github.com/dybber/fcl


> It does not seem that there is any significant correlation between memory size and allocation cost. Nor should there be - it's just bookkeeping, after all.

I guess this explains your lack of concern about memory pooling. I am going to have to do some research to see whether OpenCL does pooling natively, but I can assure you that this is not the behavior on CUDA devices. I can't really get your example to compile right now, but I'll look into it later.

Also, you are right, it is strange that allocations are so slow on CUDA.

But if it turns out that OpenCL is doing pooling behind the scenes, this is going to be an issue for you when you decide to do a CUDA backend, because the performance profile will start to be dominated by allocations.

> This would have the same effect as embedding assembly into C code, i.e. it basically makes everything nearby off-limits for optimisation.

This is true, but that would also be the case if you were using an FFI. Being able to embed C code would shorten that path a bit by not requiring the user to compile a code fragment to a .dll before invoking it.
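For a sense of what that longer FFI path looks like, here is a minimal sketch of calling a precompiled C routine through dynamic loading, from Python via `ctypes`. The C math library stands in for a hand-written primitive that the user has compiled to a .dll/.so; the `libm.so.6` fallback name is a Linux-specific assumption:

```python
# Minimal FFI illustration: load a compiled C library and call into it.
# libm stands in for a hand-written primitive compiled to a .dll/.so.
import ctypes
import ctypes.util

# find_library may return None on minimal systems; fall back to the
# common Linux soname (an assumption of this sketch).
libm = ctypes.CDLL(ctypes.util.find_library("m") or "libm.so.6")

# declare the C signature: double sqrt(double)
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

print(libm.sqrt(2.0))
```

Embedding the C source directly, as suggested above, would let the compiler drive this compile-and-load step itself instead of leaving it to the user.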

> In fact, Futhark is a bit of a counter-reaction to the run-time-code-generating embedded languages that were popular in the FP community some years ago.

Hmm... I never heard about this. But then again, relatively speaking, I haven't been programming that long.

> Have you considered looking at something like Obsidian[2]? It is a much lower level functional language that directly exposes GPU concepts. While Obsidian is Haskell-specific, there is a successor called FCL[3] that is standalone. Unlike Futhark, these languages do not depend on aggressive and expensive compiler transformations, and therefore may be a better fit for embedding.

I can't find any documentation for Obsidian, but there is a paper for FCL in the GitHub repository, so I'll look at that.



