> Do you think some of the pain would be alleviated if Futhark had a simple FFI that would allow calling hand-written primitives when they exist?
Yeah, definitely. You should go a bit further and allow embedding C code for performance-oriented users. I do not foresee using it myself, but somebody is going to need it eventually, I guarantee it. I can envision better alternatives than C, such as the language I am working on, but it is still incomplete and will remain so for some time.
> first, memory allocations on GPUs are not unusually slow
The time it takes to allocate a chunk is linear in its size, which is quite slow. I am not sure why that is; maybe it is faster using OpenCL instead of CUDA? What were your timings for allocating and disposing raw memory, plotted against size?
> If used as a library, a Futhark program does not dispose everything once it stops running, but only once the library is unloaded.
That is interesting. Can Futhark be used as a library from languages other than Haskell (in which it is written)?
> For example, if a Futhark function returns an array, that array still lives on the GPU, and if you use it as an argument to another Futhark function, there will have been no traffic (except bookkeeping stuff) between CPU and GPU.
But still, Futhark will probably not be able to optimize away all the intermediates. More to the point, some programs, like neural nets, accumulate intermediates by necessity. If a particular Futhark program is run multiple times, the memory for those intermediates should be held in a pool.
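To make the pooling idea concrete, here is a minimal sketch of the kind of allocator being suggested: freed blocks are cached in a size-keyed free list and handed back on the next request of the same size, instead of going to the real allocator every time. This uses host `malloc`/`free` as a stand-in; a GPU version would wrap `cudaMalloc`/`cudaFree` the same way. All names and the bucketing scheme are illustrative, not anything Futhark actually does.

```c
#include <stdlib.h>

#define POOL_BUCKETS 64

/* A freed block doubles as a free-list node. */
typedef struct block { struct block *next; size_t size; } block;

static block *pool[POOL_BUCKETS];

/* Crude size-class hash; a real pool would round sizes up to classes. */
static size_t bucket_of(size_t size) {
    return (size / 4096) % POOL_BUCKETS;
}

/* Reuse a cached block of exactly this size if one exists,
 * otherwise fall back to the underlying allocator. */
void *pool_alloc(size_t size) {
    block **head = &pool[bucket_of(size)];
    block *prev = NULL;
    for (block *p = *head; p; prev = p, p = p->next) {
        if (p->size == size) {
            if (prev) prev->next = p->next;
            else *head = p->next;
            return p;
        }
    }
    return malloc(size < sizeof(block) ? sizeof(block) : size);
}

/* Instead of freeing, cache the block for the next same-sized request. */
void pool_free(void *mem, size_t size) {
    block *p = mem;
    p->size = size;
    p->next = pool[bucket_of(size)];
    pool[bucket_of(size)] = p;
}
```

The point of the sketch is the second call: a program that allocates the same intermediate buffers on every run pays the backing allocator's cost only once.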
> Second, I don't see why an interpreter would be any better at managing memory than the current Futhark runtime system.
It could potentially allow a memory pool to be shared amongst multiple Futhark programs.
The idea is not to turn Futhark into an interpreted language per se; it would still be a compiled language, but with an extra layer that allows easier communication with other languages. I do not have a concrete vision of how this should be done.
It goes back to what you mentioned about using Futhark as a library. I am expecting a negative answer, that it cannot be used as a library from anything other than Haskell, but if you were to go further in the direction I am suggesting, then instead of making backends for C#, Java, and such, you could make something that allows Futhark to be used as a library from any of them.
Thinking back to some of the C examples I have seen, I do not think users will appreciate Futhark dumping the massively bloated code files needed to compile everything into their project folders.
Not to mention, a C# backend would presumably still compile through C, which would result in even more temporary files. Futhark as an embedded language could take responsibility for managing all of that.
> The time it takes to allocate a chunk is linear in its size which is quite slow. I am not sure why that is, but maybe it is faster using OpenCL instead of Cuda? What were your timings for allocating and disposing raw memory plotted against size?
I must admit I've never systematically measured it before, so I got curious and wrote this program: http://lpaste.net/356490
(100 runs for every size.) This is a fairly pessimistic measurement, as I am even launching a kernel and doing a (single) write to the memory block - it is unlikely there is any delayed/lazy allocation going on. It does not seem that there is any significant correlation between memory size and allocation cost. Nor should there be - it's just bookkeeping, after all.
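In case the lpaste link is no longer reachable, the methodology described above (a sweep of sizes, 100 runs each, a single write to the block to defeat any lazy allocation) can be sketched roughly as follows. This stand-in times host `malloc`/`free`; reproducing the actual GPU measurement would mean swapping in `cudaMalloc`/`cudaFree` or `clCreateBuffer`/`clReleaseMemObject` plus a trivial kernel launch. The function name and constants are illustrative.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdlib.h>
#include <time.h>

/* Average microseconds per allocate + touch + free at a given size. */
static double time_alloc_us(size_t size, int runs) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < runs; i++) {
        volatile char *mem = malloc(size);
        if (!mem) return -1.0;
        mem[0] = 1;            /* single write: defeat delayed allocation */
        free((void *)mem);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 +
            (t1.tv_nsec - t0.tv_nsec)) / 1e3 / runs;
}
```

Calling `time_alloc_us` for sizes from, say, 1 KiB up to 64 MiB and plotting the results against size is the experiment in question; a flat line supports the "it's just bookkeeping" claim, a rising one the linear-cost claim.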
> You should go a bit further and allow embedding C code for performance oriented users.
This would have the same effect as embedding assembly into C code, i.e. it basically makes everything nearby off-limits for optimisation. It's even worse in Futhark than for a C compiler, since Futhark depends much more heavily on automatic program transformation. It may be possible to permit very simple near-scalar C functions, but at that point I'm not sure it's worth it any more. Futhark is a perfectly capable scalar language.
> That is interesting. Can Futhark be used as a library apart from Haskell (in which it is written)?
I think we are talking about different things. If you mean can the Futhark compiler be used as a library, then no. Futhark is very much an ahead-of-time language, and not suitable for run-time code generation. In fact, Futhark is a bit of a counter-reaction to the run-time-code-generating embedded languages that were popular in the FP community some years ago. I think the results we have obtained (both in performance and language ergonomics) have validated this approach, but of course it also means we have to make some sacrifices.
What Futhark can do is generate (ahead-of-time) library code that can be used by other languages, without paying the setup/teardown cost (or copying to/from GPU) every time you call a Futhark function from some other language. This works well enough to support real-time applications like particle toys[0] or webcam filters[1], although low latency isn't really what Futhark is built for.
Have you considered looking at something like Obsidian[2]? It is a much lower level functional language that directly exposes GPU concepts. While Obsidian is Haskell-specific, there is a successor called FCL[3] that is standalone. Unlike Futhark, these languages do not depend on aggressive and expensive compiler transformations, and therefore may be a better fit for embedding.
> It does not seem that there is any significant correlation between memory size and allocation cost. Nor should there be - it's just bookkeeping, after all.
I guess this explains your lack of concern about memory pooling. I am going to have to do some research to see if OpenCL does pooling natively, but I can assure you that this is not the behavior on CUDA devices. I can't get your example to compile right now, but I'll look into it later.
Also, you are right that it is strange that allocations are so slow on CUDA.
But if it turns out that OpenCL is doing pooling behind the scenes, this is going to be an issue when you decide to do a CUDA backend, because the performance profile will start to be dominated by allocations.
> This would have the same effect as embedding assembly into C code, i.e. it basically makes everything nearby off-limits for optimisation.
This is true, but that would also be the case with an FFI. Being able to embed C code would just shorten the path a bit by not requiring the user to compile a code fragment to a .dll before invoking it.
> In fact, Futhark is a bit of a counter-reaction to the run-time-code-generating embedded languages that were popular in the FP community some years ago.
Hmmm... I never heard about this. But then again, relatively speaking, I haven't been programming that long.
> Have you considered looking at something like Obsidian[2]? It is a much lower level functional language that directly exposes GPU concepts. While Obsidian is Haskell-specific, there is a successor called FCL[3] that is standalone. Unlike Futhark, these languages do not depend on aggressive and expensive compiler transformations, and therefore may be a better fit for embedding.
I can't find any documentation for Obsidian, but there is a paper for FCL in the GitHub repository, so I'll look at that.