Streaming Combinators and Extracting Flat Parallelism

purpleog · on June 26, 2017

I'm curious to know how Futhark relates in performance to halide -- halide is more specific to image processing, but uses a similar approach (domain-specific language compiled to multiple target platforms): http://halide-lang.org

tibbe · on June 26, 2017

The Data Parallel Haskell work on flattening ran into problems with space blow-up due to replicating arrays to perform flattening. Does Futhank avoid those problems?

Athas · on June 26, 2017

Yes. Since Futhark does not perform full flattening, replication is limited to how much parallelism is exploited, not how much parallelism is available.

(That said, memory explosion is still a common symptom when the compiler misjudges how much parallelism should be exploited.)

fulafel · on June 25, 2017

Has anyone here tried Futhark? Is it ready to use or mostly PL research?

ngws · on June 26, 2017

As someone else pointed out, Futhark can be compiled into a Python program or library that uses PyOpenCL. For some "ecosystem" uses of this, you can look for the 'futhark' tag on GitHub:

https://github.com/search?q=topic%3Afuthark&type=Repositorie...

(Disclaimer: Currently just my own silly projects, but hopefully more will follow.)

DannyBee · on June 25, 2017

These types of things are mostly PL research. Staring at their repo, Futhark is better than most in terms of "readyness", but it's still probably not something you'd want to start programming all your GPU code in :) It lacks a lot of developer friendliness.

Athas · on June 25, 2017

What would be a good starting point for adding more friendliness?

DannyBee · on June 26, 2017

It doesn't even always produce sane error messages right now :)

It also misses a bunch of syntactic sugar that make programming inconsistent from what one would expect in a bunch of cases (they acknowledge this up front).

Past that, in part, getting people to adopt new languages is about either providing them an ecosystem that is amazing to start and trying to convince them to jump ... OR meeting them where they are now (IE building integration into existing editors, tooling, etc).

Usually people try to get people to jump. You'll note that this rarely works and is a very long and slow burn when it does. This is true even when the new ecosystem is huge. The successful jump cases are usually somewhere in the middle, where they are leveraging existing ecosystems but providing something better enough.

Futhark is neither providing a better ecosystem to jump to, or meeting people where they are.

Athas · on June 26, 2017

Insightful and reasonable, thanks! In particular, I agree that the error messages could probably do with a few hints of what to do next. I find Elm's work on user-friendly error messages rather inspiring.

Editor integration would definitely also be interesting. I suspect the best approach would be to implement the language server protocol, as there are too many editors to handle all of them individually.

The hope for Futhark is that a "jump" will be much smaller than for other languages, since it's not really supposed to replace any language, just to augment existing ones. It should feel more like using a library than adding a new language to a project. Of course, there is still a compilation step, but if necessary, that can be crudely circumvented by simply adding the (portable) code generated by the Futhark compiler to source control.

So, the strategy is definitely to meet people where they are - building an entire ecosystem is not really feasible for such a small and specialised language. Most people don't even have the problems that Futhark tries to solve, which is certainly a nontrivial hindrance to usage.

tombert · on June 25, 2017

I would say good tooling really lowers the barrier of entry for these things. Things like good editor integrations (such as being able to see the type under the cursor, or highlight errors inline) and a package manager make things a million times easier for a developer to get started on this. I'm not entirely sure how to do the package management , as that's kind of a religious argument, but it's relatively easy to make basic Vim/NeoVim, Emacs, or Atom integrations, which I think would make a lot more people be willing to check it out.

I am nearly positive that the tooling that came with Go is a pretty big reason for its success.

CyberDildonics · on June 26, 2017

I don't know why this seems to be contentious, tooling is a major issue and driving force behind languages and modern programming. Debuggers, syntax checking editors etc are huge factors in productivity.

abstractcontrol · on June 26, 2017

I do not feel like writing an entire essay on language integration right now - most likely I will open an issue 6-12 months from now on Futhark's Github repo just to talk about it, but let me start with agreeing everything what DannyBee said and adding a few thoughts of my own.

1) Let me just say that language integration is a very serious issue. It is not just between completely different languages, but also between modules written in the same language. Once you start using algebraic datatypes to emulate language features the main language lacks, you essentially step into a dynamic sublanguage and have to deal with the friction caused by crossing module boundaries. This friction is both on the programmer's side – he has to deal with writing boilerplate for crossing the boundaries, and on the computer's side which has to do marshaling which is terrible for performance and just nasty.

2) This friction in the case of Futhark will be magnified manifold as it is a completely different language. Since Futhark is not a general purpose language, a realistic use case for it to be called indirectly directly by other languages who will generate code for it. This is generally how high-speed anything is used today. There is a collection of highly optimized, assembly written routines (such as BLAS) bundled into a library and they are called from very slow high-level languages such as Python.

Futhark today is not fit for such a purpose.

* It would be difficult to partition the program written for it into separate pieces. For very simple programs, at a minimum it will generate 2k lines of code (in the C backend).

* This is compounded by the fact that it does not link to the aforementioned optimized libraries, but generates all the code internally. You could then imagine using Futhark intermittently – calling those fast libraries in the main language and using Futhark for the rest since writing code in Futhark is much more convenient compared to C, but then you would need to partition the program and will immediately run into the code bloat issues in the first bullet point.

* Futhark is a very high level language and takes on all the responsibility for managing memory on itself. That even further clashes with the idea of partitioning the program. Memory allocations are extremely slow on the GPU, and in addition to that, they block the whole device meaning they are not asynchronous.

* A minor point of friction compared to the above is that Futhark supports OpenCL which has minor market share instead of Cuda.

3) Based on the above, I question the current integration strategy by Futhark of making backend for different languages. It currently has a C and a Python backend, and an OpenCL CPU backend, and F# is planned, and you can imagine many different backends…

What might be worth trying instead would be to make Futhark an embedded interpreted language. This is not as crazy as it seems – it would define a natural API point for other languages to access it and the interpreter would be responsible for managing GPU memory. It would be a much better model than disposing everything once the program stops running and would allow for efficient intermingling of multiple Futhark programs that could reside in memory. Right now, that sort of thing would be very high friction.

I do not have much advice on how this could be accomplished and would no doubt require much design work, but I am going to try something like that in my own language at some point. I had this crucial insight when I was trying to use it from the language it was written in and realized that it is actually very difficult.

Scala, Clojure and F# in particular had the master stroke of latching themselves to already established ecosystems. Languages targeting the GPU cannot use that strategy directly and will need to be more inventive.

Athas · on June 26, 2017

Thanks for your response! Do you think some of the pain would be alleviated if Futhark had a simple FFI that would allow calling hand-written primitives when they exist? (With the property that these would be black boxes and not fused with anything else.) There will still be some friction, but it would be lessened.

There are a few things I should correct: first, memory allocations on GPUs are not unusually slow (although copies from CPU to GPU are). Futhark presently stores all memory on the GPU at all times, so CPU<->GPU traffic is very low (this has other problems, however, but it seems the newest GPU hardware has features that can be used to solve this more elegantly).

I've also started cooling a bit on the idea of making a lot of language-specific backends. While convenient, they are not really scalable. What will likely happen is that we will focus on improving the C backend as a target for the FFIs of other languages (since most languages provide convenient ways of calling C libraries). Possibly we'll also keep a few other strategic backends around, like the Python backend for demonstration, and possibly Java and C# backends as they can be targeted by huge (non-C) ecosystems.

Using Futhark as an embedded language is an interesting idea, but I'm not sure it would solve the problems you bring up. First, the compilation technique needed to obtain good performance leads invariably to fairly slow compile times, which makes it a bit more awkward to re-compile on startup. Second, I don't see why an interpreter would be any better at managing memory than the current Futhark runtime system. If used as a library, a Futhark program does not dispose everything once it stops running, but only once the library is unloaded.

For example, if a Futhark function returns an array, that array still lives on the GPU, and if you use it as an argument to another Futhark function, there will have been no traffic (except bookkeeping stuff) between CPU and GPU.

abstractcontrol · on June 26, 2017

> Do you think some of the pain would be alleviated if Futhark had a simple FFI that would allow calling hand-written primitives when they exist?

Yeah, definitely. You should go a bit further and allow embedding C code for performance oriented users. I do not foresee using it, but somebody is going to need it eventually I guarantee it. I can envision some better alternatives than C, such as the language I am working on, but right now it is still incomplete and won't be for some time.

> first, memory allocations on GPUs are not unusually slow

The time it takes to allocate a chunk is linear in its size which is quite slow. I am not sure why that is, but maybe it is faster using OpenCL instead of Cuda? What were your timings for allocating and disposing raw memory plotted against size?

> If used as a library, a Futhark program does not dispose everything once it stops running, but only once the library is unloaded.

That is interesting. Can Futhark be used as a library apart from Haskell (in which it is written)?

> For example, if a Futhark function returns an array, that array still lives on the GPU, and if you use it as an argument to another Futhark function, there will have been no traffic (except bookkeeping stuff) between CPU and GPU.

But still, Futhark will probably not be able to optimize away all the intermediates. And more to the point, some programs like neural nets do in fact accumulate intermediates by necessity. If a particular Futhark program is run multiple times, the memory in those intermediates should be held in a pool.

> Second, I don't see why an interpreter would be any better at managing memory than the current Futhark runtime system.

It could potentially allow memory pool to be shared amongst multiple Futhark programs.

The idea is not to turn Futhark into an interpreted language per se, it would still be a compiled language, but to instead add an extra layer that would allow easier communication with other languages. I do not have a concrete vision of how this should be done.

It goes back to what you mentioned about using Futhark as a library. I am expecting a negative answer that it can be used as a library from anything other than Haskell, but if you were to go more in the direction I am suggesting, instead of making backends for C#, Java and such what you could do is make something that will allow Futhark to be used as a library.

Thinking about to some of the C examples that I have seen, I do not think users will appreciate having massively bloated code files needed to compile the stuff dumped into their projects folders by Futhark.

Not to mention, C# will need to be compiled to C which will result in more temporary files. Futhark as an embedded language could take responsibility for managing all of that.

Athas · on June 26, 2017

> The time it takes to allocate a chunk is linear in its size which is quite slow. I am not sure why that is, but maybe it is faster using OpenCL instead of Cuda? What were your timings for allocating and disposing raw memory plotted against size?

I must admit I've never systematically measured it before, so I got curious and wrote this program: http://lpaste.net/356490

The output on an NVIDIA K40 GPU is this:

1 bytes; average: 382us; min: 364us; max: 501us

2 bytes; average: 376us; min: 364us; max: 478us

4 bytes; average: 372us; min: 364us; max: 474us

8 bytes; average: 375us; min: 364us; max: 405us

16 bytes; average: 373us; min: 364us; max: 407us

32 bytes; average: 375us; min: 365us; max: 404us

64 bytes; average: 375us; min: 364us; max: 585us

128 bytes; average: 376us; min: 367us; max: 396us

256 bytes; average: 376us; min: 364us; max: 395us

512 bytes; average: 373us; min: 364us; max: 425us

1024 bytes; average: 371us; min: 365us; max: 421us

2048 bytes; average: 372us; min: 364us; max: 472us

4096 bytes; average: 372us; min: 365us; max: 411us

8192 bytes; average: 371us; min: 364us; max: 394us

16384 bytes; average: 371us; min: 364us; max: 403us

32768 bytes; average: 374us; min: 365us; max: 554us

65536 bytes; average: 372us; min: 364us; max: 407us

131072 bytes; average: 371us; min: 365us; max: 393us

262144 bytes; average: 372us; min: 364us; max: 394us

524288 bytes; average: 372us; min: 364us; max: 405us

1048576 bytes; average: 373us; min: 364us; max: 482us

2097152 bytes; average: 371us; min: 364us; max: 391us

4194304 bytes; average: 371us; min: 364us; max: 393us

8388608 bytes; average: 380us; min: 363us; max: 487us

16777216 bytes; average: 372us; min: 364us; max: 474us

33554432 bytes; average: 371us; min: 365us; max: 391us

67108864 bytes; average: 373us; min: 349us; max: 593us

134217728 bytes; average: 372us; min: 365us; max: 399us

268435456 bytes; average: 372us; min: 365us; max: 410us

536870912 bytes; average: 376us; min: 364us; max: 473us

(100 runs for every size.) This is a fairly pessimistic measurement, as I am even launching a kernel and doing a (single) write to the memory block - it is unlikely there is any delayed/lazy allocation going on. It does not seem that there is any significant correlation between memory size and allocation cost. Nor should there be - it's just bookkeeping, after all.

> You should go a bit further and allow embedding C code for performance oriented users.

This would have the same effect as embedding assembly into C code, i.e. it basically makes everything nearby off-limits for optimisation. It's even worse in Futhark than for a C compiler, since Futhark depends much more heavily on automatic program transformation. It may be possible to permit very simple near-scalar C functions, but at that point I'm not sure it's worth it any more. Futhark is a perfectly capable scalar language.

> That is interesting. Can Futhark be used as a library apart from Haskell (in which it is written)?

I think we are talking about different things. If you mean can the Futhark compiler be used as a library, then no. Futhark is very much an ahead-of-time language, and not suitable for run-time code generation. In fact, Futhark is a bit of a counter-reaction to the run-time-code-generating embedded languages that were popular in the FP community some years ago. I think the results we have obtained (both in performance and language ergonomics) have validated this approach, but of course it also means we have to make some sacrifices.

What Futhark can do is generate (ahead-of-time) library code that can be used by other languages, without paying the setup/teardown cost (or copying to/from GPU) every time you call a Futhark function from some other language. This works well enough to support real-time applications like particle toys[0] or webcam filters[1], although low latency isn't really what Futhark is built for.

[0]: https://github.com/Athas/diving-beet

[1]: https://github.com/nqpz/futcam

Have you considered looking at something like Obsidian[2]? It is a much lower level functional language that directly exposes GPU concepts. While Obsidian is Haskell-specific, there is a successor called FCL[3] that is standalone. Unlike Futhark, these languages do not depend on aggressive and expensive compiler transformations, and therefore may be a better fit for embedding.

[2]: https://hackage.haskell.org/package/Obsidian [3]: https://github.com/dybber/fcl

abstractcontrol · on June 26, 2017

> It does not seem that there is any significant correlation between memory size and allocation cost. Nor should there be - it's just bookkeeping, after all.

I guess this explains your lack of concern about memory pooling. I am going to have to do some research to see if OpenCL does pooling natively, but I can assure you that this is not the behavior on Cuda devices. I can't really get your example to compile right now, but I'll look into it later.

Also you are right that is strange that allocations are so slow on Cuda.

But if it turns out that OpenCL is doing pooling behind the scenes, this is going to be an issue for you when you decide to do a Cuda backend because the performance profile will start to get dominated by allocations.

> This would have the same effect as embedding assembly into C code, i.e. it basically makes everything nearby off-limits for optimisation.

This is true, but that would also be the case if you were using an FFI. Being able to embed C code would shorten that path a bit by not requiring the user to compile a code fragment to a .dll before invoking it.

> In fact, Futhark is a bit of a counter-reaction to the run-time-code-generating embedded languages that were popular in the FP community some years ago.

Hmmm...I never heard about this. But Then again, relatively speaking I haven't been programming that long.

> Have you considered looking at something like Obsidian[2]? It is a much lower level functional language that directly exposes GPU concepts. While Obsidian is Haskell-specific, there is a successor called FCL[3] that is standalone. Unlike Futhark, these languages do not depend on aggressive and expensive compiler transformations, and therefore may be a better fit for embedding.

I can't find any documentation for Obsidian, but there is a paper for FCL in the Github, so I'll look at that.

jlg23 · on June 26, 2017

Apparently you can write APL and have it transpiled to Futhark[1].

Now to the question whether APL is ready for production use... ;)

[1] http://futhark-lang.org/blog/2016-06-20-futhark-as-an-apl-co...

throwaway7645 · on June 26, 2017

I know that's a joke, but it has been around in production since the 60's, so probably not that bad lol.