
> Do you think some of the pain would be alleviated if Futhark had a simple FFI that would allow calling hand-written primitives when they exist?

Yeah, definitely. You should go a bit further and allow embedding C code for performance-oriented users. I do not foresee using it myself, but somebody is going to need it eventually, I guarantee it. I can envision some better alternatives than C, such as the language I am working on, but right now it is still incomplete and will not be ready for some time.

> first, memory allocations on GPUs are not unusually slow

The time it takes to allocate a chunk is linear in its size, which is quite slow. I am not sure why that is, but maybe it is faster using OpenCL instead of CUDA? What were your timings for allocating and disposing raw memory, plotted against size?

> If used as a library, a Futhark program does not dispose everything once it stops running, but only once the library is unloaded.

That is interesting. Can Futhark be used as a library apart from Haskell (in which it is written)?

> For example, if a Futhark function returns an array, that array still lives on the GPU, and if you use it as an argument to another Futhark function, there will have been no traffic (except bookkeeping stuff) between CPU and GPU.

But still, Futhark will probably not be able to optimize away all the intermediates. And more to the point, some programs like neural nets do in fact accumulate intermediates by necessity. If a particular Futhark program is run multiple times, the memory in those intermediates should be held in a pool.

> Second, I don't see why an interpreter would be any better at managing memory than the current Futhark runtime system.

It could potentially allow a memory pool to be shared among multiple Futhark programs.
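To make the pooling idea concrete, here is a minimal, hypothetical sketch of a size-bucketed pool. All names are illustrative; `raw_alloc`/`raw_free` stand in for the real driver calls (e.g. `cuMemAlloc`/`cuMemFree`), and rounding to powers of two is just one possible bucketing policy:

```python
# Hypothetical sketch of a size-bucketed memory pool; raw_alloc/raw_free
# stand in for the real driver calls (e.g. cuMemAlloc/cuMemFree).
class Pool:
    def __init__(self, raw_alloc, raw_free):
        self.raw_alloc = raw_alloc
        self.raw_free = raw_free
        self.free_lists = {}  # rounded-up size -> list of cached blocks

    @staticmethod
    def round_up(size):
        # round to the next power of two so freed blocks are interchangeable
        return 1 << max(size - 1, 0).bit_length()

    def alloc(self, size):
        size = self.round_up(size)
        cached = self.free_lists.get(size)
        if cached:
            return cached.pop()      # reuse a cached block: no driver call
        return self.raw_alloc(size)  # miss: pay the real allocation cost

    def free(self, block, size):
        # return the block to the pool instead of to the driver
        self.free_lists.setdefault(self.round_up(size), []).append(block)

    def drain(self):
        # hand everything back to the driver, e.g. when the library unloads
        for blocks in self.free_lists.values():
            for block in blocks:
                self.raw_free(block)
        self.free_lists.clear()
```

The point is that once a pool like this sits between the programs and the driver, repeated runs that allocate same-sized intermediates hit the cache and never touch the driver at all.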

The idea is not to turn Futhark into an interpreted language per se, it would still be a compiled language, but to instead add an extra layer that would allow easier communication with other languages. I do not have a concrete vision of how this should be done.

It goes back to what you mentioned about using Futhark as a library. I am expecting a negative answer, that it can only be used as a library from Haskell, but if you were to go more in the direction I am suggesting, then instead of making backends for C#, Java, and such, you could make something that allows Futhark to be used as a library.

Thinking back to some of the C examples that I have seen, I do not think users will appreciate having the massively bloated code files needed to compile the stuff dumped into their project folders by Futhark.

Not to mention, C# will need to be compiled to C, which will result in more temporary files. Futhark as an embedded language could take responsibility for managing all of that.




> The time it takes to allocate a chunk is linear in its size, which is quite slow. I am not sure why that is, but maybe it is faster using OpenCL instead of CUDA? What were your timings for allocating and disposing raw memory, plotted against size?

I must admit I've never systematically measured it before, so I got curious and wrote this program: http://lpaste.net/356490

The output on an NVIDIA K40 GPU is this:

  1 bytes; average: 382us; min: 364us; max: 501us
  2 bytes; average: 376us; min: 364us; max: 478us
  4 bytes; average: 372us; min: 364us; max: 474us
  8 bytes; average: 375us; min: 364us; max: 405us
  16 bytes; average: 373us; min: 364us; max: 407us
  32 bytes; average: 375us; min: 365us; max: 404us
  64 bytes; average: 375us; min: 364us; max: 585us
  128 bytes; average: 376us; min: 367us; max: 396us
  256 bytes; average: 376us; min: 364us; max: 395us
  512 bytes; average: 373us; min: 364us; max: 425us
  1024 bytes; average: 371us; min: 365us; max: 421us
  2048 bytes; average: 372us; min: 364us; max: 472us
  4096 bytes; average: 372us; min: 365us; max: 411us
  8192 bytes; average: 371us; min: 364us; max: 394us
  16384 bytes; average: 371us; min: 364us; max: 403us
  32768 bytes; average: 374us; min: 365us; max: 554us
  65536 bytes; average: 372us; min: 364us; max: 407us
  131072 bytes; average: 371us; min: 365us; max: 393us
  262144 bytes; average: 372us; min: 364us; max: 394us
  524288 bytes; average: 372us; min: 364us; max: 405us
  1048576 bytes; average: 373us; min: 364us; max: 482us
  2097152 bytes; average: 371us; min: 364us; max: 391us
  4194304 bytes; average: 371us; min: 364us; max: 393us
  8388608 bytes; average: 380us; min: 363us; max: 487us
  16777216 bytes; average: 372us; min: 364us; max: 474us
  33554432 bytes; average: 371us; min: 365us; max: 391us
  67108864 bytes; average: 373us; min: 349us; max: 593us
  134217728 bytes; average: 372us; min: 365us; max: 399us
  268435456 bytes; average: 372us; min: 365us; max: 410us
  536870912 bytes; average: 376us; min: 364us; max: 473us

(100 runs for every size.) This is a fairly pessimistic measurement, as I am even launching a kernel and doing a (single) write to the memory block - it is unlikely there is any delayed/lazy allocation going on. It does not seem that there is any significant correlation between memory size and allocation cost. Nor should there be - it's just bookkeeping, after all.
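(In case the paste goes away: the shape of the measurement loop can be sketched in Python, with a host-side `bytearray` allocation plus a single write standing in for the GPU allocation and one-write kernel. This is a hypothetical reconstruction of the methodology only, not the original program, and it is scaled down to 1 MiB rather than 512 MiB.)

```python
# Hypothetical reconstruction of the measurement loop: N runs per size,
# reporting average/min/max in microseconds. A bytearray allocation plus a
# single write stands in for the GPU allocation and one-write kernel.
import time

def time_alloc(size, runs=100):
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        buf = bytearray(size)  # stand-in for cuMemAlloc/clCreateBuffer
        buf[:1] = b"\x01"      # touch the block, like the single-write kernel
        t1 = time.perf_counter()
        samples.append((t1 - t0) * 1e6)
    return sum(samples) / len(samples), min(samples), max(samples)

# 1 byte up to 1 MiB, doubling each step (the original went up to 512 MiB)
for exp in range(21):
    size = 1 << exp
    avg, lo, hi = time_alloc(size)
    print(f"{size} bytes; average: {avg:.0f}us; min: {lo:.0f}us; max: {hi:.0f}us")
```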

> You should go a bit further and allow embedding C code for performance-oriented users.

This would have the same effect as embedding assembly into C code, i.e. it basically makes everything nearby off-limits for optimisation. It's even worse in Futhark than for a C compiler, since Futhark depends much more heavily on automatic program transformation. It may be possible to permit very simple near-scalar C functions, but at that point I'm not sure it's worth it any more. Futhark is a perfectly capable scalar language.

> That is interesting. Can Futhark be used as a library apart from Haskell (in which it is written)?

I think we are talking about different things. If you mean can the Futhark compiler be used as a library, then no. Futhark is very much an ahead-of-time language, and not suitable for run-time code generation. In fact, Futhark is a bit of a counter-reaction to the run-time-code-generating embedded languages that were popular in the FP community some years ago. I think the results we have obtained (both in performance and language ergonomics) have validated this approach, but of course it also means we have to make some sacrifices.

What Futhark can do is generate (ahead-of-time) library code that can be used by other languages, without paying the setup/teardown cost (or copying to/from GPU) every time you call a Futhark function from some other language. This works well enough to support real-time applications like particle toys[0] or webcam filters[1], although low latency isn't really what Futhark is built for.

[0]: https://github.com/Athas/diving-beet

[1]: https://github.com/nqpz/futcam

Have you considered looking at something like Obsidian[2]? It is a much lower level functional language that directly exposes GPU concepts. While Obsidian is Haskell-specific, there is a successor called FCL[3] that is standalone. Unlike Futhark, these languages do not depend on aggressive and expensive compiler transformations, and therefore may be a better fit for embedding.

[2]: https://hackage.haskell.org/package/Obsidian

[3]: https://github.com/dybber/fcl


> It does not seem that there is any significant correlation between memory size and allocation cost. Nor should there be - it's just bookkeeping, after all.

I guess this explains your lack of concern about memory pooling. I am going to have to do some research to see whether OpenCL does pooling natively, but I can assure you that this is not the behavior on CUDA devices. I can't really get your example to compile right now, but I'll look into it later.

Also, you are right, it is strange that allocations are so slow on CUDA.

But if it turns out that OpenCL is doing pooling behind the scenes, this is going to be an issue for you when you decide to do a CUDA backend, because the performance profile will start to be dominated by allocations.

> This would have the same effect as embedding assembly into C code, i.e. it basically makes everything nearby off-limits for optimisation.

This is true, but that would also be the case if you were using an FFI. Being able to embed C code would shorten that path a bit by not requiring the user to compile a code fragment to a .dll before invoking it.
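For a sense of what that longer FFI path looks like, here is a minimal sketch of calling a precompiled C routine through dynamic loading, from Python via `ctypes`. The C math library stands in for a hand-written primitive that the user has compiled to a .dll/.so; the `libm.so.6` fallback name is a Linux-specific assumption:

```python
# Minimal FFI illustration: load a compiled C library and call into it.
# libm stands in for a hand-written primitive compiled to a .dll/.so.
import ctypes
import ctypes.util

# find_library may return None on minimal systems; fall back to the
# common Linux soname (an assumption of this sketch).
libm = ctypes.CDLL(ctypes.util.find_library("m") or "libm.so.6")

# declare the C signature: double sqrt(double)
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

print(libm.sqrt(2.0))
```

Embedding the C source directly, as suggested above, would let the compiler drive this compile-and-load step itself instead of leaving it to the user.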

> In fact, Futhark is a bit of a counter-reaction to the run-time-code-generating embedded languages that were popular in the FP community some years ago.

Hmm... I never heard about this. But then again, relatively speaking, I haven't been programming that long.

> Have you considered looking at something like Obsidian[2]? It is a much lower level functional language that directly exposes GPU concepts. While Obsidian is Haskell-specific, there is a successor called FCL[3] that is standalone. Unlike Futhark, these languages do not depend on aggressive and expensive compiler transformations, and therefore may be a better fit for embedding.

I can't find any documentation for Obsidian, but there is a paper for FCL in the GitHub repository, so I'll look at that.



