Optimization Example: Mandelbrot Set (part 1) (orange-kiwi.com)
74 points by RoyWard on Feb 19, 2024 | 21 comments



Wow, this article opened my eyes to what is possible in optimizing code for a CPU.

I was expecting this article to rehash the usual algorithmic tricks for speeding up Mandelbrot set calculation -- filling in boxes where the four corners already have the same color, checking for repeating iteration cycles to fill in black, etc.

But I was very surprised to learn about how important CPU pipelines are, and therefore about "sheep race optimization" and interleaving, and then about the possibility of using vectorization on CPU. And in the end I'm utterly astonished that adding everything up results in the algorithm being calculated eight times faster. I would never have dreamed that such a level of speed improvement would be possible on such a simple, straightforward mathematical algorithm.

Of course, in the real world everybody today is going to calculate the Mandelbrot set using the GPU (as the author notes -- this is just an exercise). But now it has me tremendously curious as to whether there are similar optimizations that can be applied on the GPU side?

And especially whether those are accessible if done in WebGL/WebGPU? There are quite a number of Mandelbrot set generators available in the browser that use the GPU for calculations (e.g. [1]).

[1] http://mandelset.ru/ - https://github.com/iskandarov-egor/mandelset


On the GPU, vectorization is done for you by the compiler, and the cores are designed to hide latency, so these techniques will do little there. OTOH, you have control over how the code is vectorized (stripes or tiles), and that could make a difference. It might also be worth exploring whether emulating doubles with smaller types pays off; GPUs are slow at 64-bit math.
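
For a flavor of what that emulation looks like, here's a rough sketch of "double-float" arithmetic (a value held as two floats, built from the classic Knuth/Dekker error-free transforms). This is a generic illustration only, not code from any particular GPU library:

  // A value is stored as an unevaluated sum hi + lo of two floats,
  // giving roughly twice the mantissa bits of a single float.
  #include <cmath>

  struct df { float hi, lo; };

  // Error-free sum of two floats (Knuth's TwoSum).
  static df two_sum(float a, float b) {
      float s = a + b;
      float bb = s - a;
      return { s, (a - (s - bb)) + (b - bb) };
  }

  // Error-free product of two floats, using fused multiply-add.
  static df two_prod(float a, float b) {
      float p = a * b;
      return { p, std::fmaf(a, b, -p) };
  }

  static df df_add(df a, df b) {
      df s = two_sum(a.hi, b.hi);
      s.lo += a.lo + b.lo;
      float hi = s.hi + s.lo;              // renormalize
      return { hi, s.lo - (hi - s.hi) };
  }

  static df df_mul(df a, df b) {
      df p = two_prod(a.hi, b.hi);
      p.lo += a.hi * b.lo + a.lo * b.hi;
      float hi = p.hi + p.lo;              // renormalize
      return { hi, p.lo - (hi - p.hi) };
  }

The same building blocks port to GLSL or CUDA; whether the extra operations beat native doubles depends on the card.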

But, as the author already suggested, first check if there is a need to optimize at all.


Thank you! Part 2 is going to largely be a rehash of some of the usual algorithmic tricks, although they do have to be adapted somewhat to using a pipeline.

There are several issues that I can see with trying this level of optimization on a GPU - firstly, given the limited depth available, renders with doubles are going to be pretty fast already (as long as the GPU supports doubles, which some of the older ones or the ones in phones don't), so there is pretty limited bang for the buck there.

Secondly, it is really easy to get good information about CPU microarchitecture and timings (on x86 anyway, thank you Agner Fog), whereas I'm not aware of that level of detail for graphics cards, so they are a bit of a black box. Compounding this, GLSL does compile down to an intermediate code (SPIR-V), but then each graphics card will have that converted to its internal structure in its own opaque way. It's much harder to optimize when all you can do is try things and look at timings. If I get enough bandwidth, I may look into this.


"Are there similar optimizations that can be applied on the GPU side" is a fascinating question.

There are actually two communities trying to write the fastest mandelbrot renderer with very different definitions of fastest. Weirdly, the two communities basically don't acknowledge each other- I don't mean there's animosity, I mean they seem to just not show up in each other's google searches- total lack of mention.

On one side is "What is the fastest way to render the Mandelbrot set in floating point precision," such as the parent article. This is where you might see tricks like filling in boxes, directly checking for repeating cycles, etc., but advanced attempts always seem to get into writing assembly and carefully managing the CPU's multipliers - the OP's article is a wonderful example of optimizing in this community. I don't really know of many high-man-hour attempts from this branch that use GPU rendering, since it's more about delving into the CPU's details and being faster than other codebases under the same constraints than about achieving a real-world task that would otherwise be too slow. The real center of this community is https://benchmarksgame-team.pages.debian.net/benchmarksgame/..., where not only is GPU banned because it would be missing the point, but optimizations like checking whether four corners are all the same color are also verboten, as it's all about how fast the inner loop can spin.

The second community is trying to go deep, deep into the Mandelbrot set- way past where it makes sense to represent the location of a pixel as a floating point number. Here speed is a practical concern. Images like "Evolution of Trees" https://www.deviantart.com/dinkydauset/art/Evolution-of-tree... used to take weeks to render, and it's been a no-holds-barred brawl to reduce that time- of course the GPU and alternative algorithms are allowed. The fastest codebases that I know of are FractalShark https://github.com/mattsaccount364/FractalShark and Kalles Fraktaler 3 https://mathr.co.uk/kf/kf.html#top and both have CUDA kernels, but I don't think that these kernels have been optimized to the point of finding and fixing sheep races.

The reason is twofold- first, the algorithmic progress has been lightning fast in recent years. The biggest breakthrough, bilinear approximation with rebasing, was discovered in 2022- and it (roughly, the actual runtime is currently unknown) takes the time to calculate an image from O(num pixels * num iterations) to O((num pixels * log(num iterations)) + num iterations). As a result, there just hasn't been time since 2022 to optimize the constant factors to the level in OP, especially now that the algorithm is so much more complicated than multiply, add, repeat. Second, with just lightly optimized CUDA for the num pixels log num iterations term, the overall runtime is currently dominated by the term with no dependency on the number of pixels. No one knows how to parallelize that term (at least not to many cores- FractalShark can use three cores I believe), so improvements in the CUDA kernel don't translate to unlocking deeper renders.

The "render deep images fast" community mostly hangs out on fractalforums.org- I highly recommend swinging by, it's a wonderful and active forum (note- heavily moderated to stay on topic- makes hackernews look like the wild west)


Thank you! I had no idea that this had such communities either, and I've now got some links to check out. I was just picking a small problem because I'm job hunting and had some time to show some of what I can do - for this problem I could give decent coverage of some of the optimization possibilities without writing a whole book.

I would fit far more naturally into the second community, as I will normally avoid using fixed point wherever possible. I only went with doubles because it made it so much easier to highlight what can be done with the pipelining.

Part of the attraction of floating point is that a lot more CPU silicon goes into floating point than integer - so for instance AVX2 supports multiplication of 64 bit doubles but not 64 bit integers. Before this post, the last time I looked at the Mandelbrot Set was on a PowerPC750 - the best I could do with 32 bit integers was 22 clock cycles, the best I could do with floating point was (I think) 11 clock cycles for doubles, 9 clock cycles for floats, simply because floating point pipelined and integers didn't.
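
A tiny sketch of what I mean (illustrative only, not code from the article):

  #include <immintrin.h>

  // Four 64-bit double multiplies in a single instruction:
  __m256d mul4_doubles(__m256d a, __m256d b) { return _mm256_mul_pd(a, b); }

  // There is no _mm256_mullo_epi64 without AVX-512DQ, so a 4-wide 64-bit
  // integer multiply has to be synthesized from 32-bit multiplies instead.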


Wow, thank you! I had no idea at all this was such an active area.

When I was a teen I wrote my own renderer that could zoom and pan in real time using DirectX on my 486, back when Fractint was the most popular tool.

So I'm utterly fascinated to see that this is something people still care about! There was a brief period in the 90's when you'd see images of the Mandelbrot set on T-shirts you could buy at the mall, as part of a broader psychedelic trend, but then it seemed to pretty much disappear.


> … benchmarksgame … missing the point

"We ask that contributed programs not only give the correct result, but also use the same algorithm to calculate that result."

In the vain hope that experts would contribute comparable programs written in different languages, so that the performance of different language implementations could be compared.

Let's say that again: compare language implementations — not compare algorithms.

Instead of that vain hope, we're left with transliterations — https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

> not only is GPU banned

"banned"? Really.


Cool article. I think what the author calls "sheep race" is known as loop rotation. This allows you to mask off latency, at the cost of wasted computation at the tail of the loop.


Do you have a reference for that? I might be looking in different places, but my reading suggests that loop rotation is where a conditional is moved to the end to avoid an unconditional branch, sometimes by the compiler. "Sheep race" with its wasted computation is a little different - it's not a transformation it would be legal for a compiler to make, and possibly doesn't have enough use cases to really have its own well known name.

Oddly enough, I've noticed that the compiler generated code will sometimes not do loop rotation, and will have a conditional branch in the middle and an unconditional one at the end. I suspect this is because (a) unconditional branches are often effectively free on modern processors (if all the instructions are cached appropriately), and (b) there are all sorts of subtle instruction layout and ordering choices that a good compiler makes to map things efficiently to the underlying microarchitecture, which means that non-intuitive orderings sometimes have a better cost function.

I did find out that what I called "interleaving" is more properly called "software pipelining".
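
Roughly, the shape of the idea (a much simplified sketch, not the article's actual code) is to iterate two independent pixels in the same loop body, so one pixel's multiplies can execute while the other's are still in flight:

  // Simplified sketch of interleaving / software pipelining: iterate two
  // independent pixels in one loop body so the FP units always have work,
  // instead of each step waiting on the previous step's multiply results.
  // (A real version would also exit early once both pixels have escaped.)
  static void mandel_pair(double cx0, double cy0, double cx1, double cy1,
                          int max_iter, int* out0, int* out1) {
      double x0 = 0, y0 = 0, x1 = 0, y1 = 0;
      int i0 = 0, i1 = 0;
      for (int i = 0; i < max_iter; ++i) {
          double xx0 = x0 * x0, yy0 = y0 * y0;   // pixel 0
          double xx1 = x1 * x1, yy1 = y1 * y1;   // pixel 1, independent work
          if (xx0 + yy0 <= 4.0) ++i0;            // count while still bounded
          if (xx1 + yy1 <= 4.0) ++i1;
          double xt0 = xx0 - yy0 + cx0;
          double xt1 = xx1 - yy1 + cx1;
          y0 = 2.0 * x0 * y0 + cy0;
          y1 = 2.0 * x1 * y1 + cy1;
          x0 = xt0;
          x1 = xt1;
      }
      *out0 = i0; *out1 = i1;
  }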


In VecCore (a small C++ SIMD abstraction library on top of Vc and std::simd), I created some simple examples to show how to use the library to optimize code using SIMD in a somewhat generic way. You can find it on GitHub at https://github.com/root-project/veccore

I have examples for Julia sets and the Mandelbrot set, including an implementation with AVX2 intrinsics.

These days, with std::simd more widely available, there's less of a reason to use VecCore, but the examples may still be educational. I chose Julia sets and the Mandelbrot set since they are perfect examples of simple problems that compilers fail to vectorize on their own.
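
For anyone curious what the intrinsics version boils down to, here is a generic AVX2 sketch (not the VecCore example itself): four pixels per vector, with a per-lane iteration count and a movemask to stop once every lane has escaped:

  #include <immintrin.h>

  // Generic AVX2 sketch: 4 pixels per __m256d, per-lane iteration counts.
  static __m256d mandel_avx2(__m256d cx, __m256d cy, int max_iter) {
      __m256d x = _mm256_setzero_pd(), y = _mm256_setzero_pd();
      __m256d count = _mm256_setzero_pd();
      const __m256d one  = _mm256_set1_pd(1.0);
      const __m256d four = _mm256_set1_pd(4.0);
      for (int i = 0; i < max_iter; ++i) {
          __m256d xx = _mm256_mul_pd(x, x);
          __m256d yy = _mm256_mul_pd(y, y);
          __m256d alive = _mm256_cmp_pd(_mm256_add_pd(xx, yy), four, _CMP_LE_OQ);
          if (_mm256_movemask_pd(alive) == 0) break;            // all lanes escaped
          count = _mm256_add_pd(count, _mm256_and_pd(alive, one));
          __m256d xy = _mm256_mul_pd(x, y);
          x = _mm256_add_pd(_mm256_sub_pd(xx, yy), cx);
          y = _mm256_add_pd(_mm256_add_pd(xy, xy), cy);
      }
      return count;   // per-lane iteration counts, as doubles
  }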


Somewhat related: ispc also uses the Mandelbrot set as a motivating example [0]; they get a 5x speedup from writing the code in almost the same way and compiling it with ispc. C compilers aren't good at autovectorizing outer loops, which is what's required here.

I recently learned, however, that Intel's icx C compiler, based on LLVM, can now autovectorize the outer loop [1]. Upstream Clang sadly still can't; same goes for GCC. (A sketch of the loop shape in question is below the links.)

[0] https://ispc.github.io/example.html

[1] https://godbolt.org/z/jbn1robKz
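
The loop structure in question, roughly (a sketch with made-up parameter names): the parallelism is across the pixel loops, but each pixel's inner escape loop has a data-dependent trip count, which is what most autovectorizers choke on.

  // Sketch: the loop you want vectorized is the per-pixel loop, but the
  // inner escape loop runs a data-dependent number of iterations, so most
  // C compilers won't vectorize across pixels on their own.
  void render(int width, int height, double x0, double y0,
              double dx, double dy, int max_iter, int* image) {
      for (int py = 0; py < height; ++py) {
          for (int px = 0; px < width; ++px) {       // <- vectorize across px
              double cx = x0 + px * dx, cy = y0 + py * dy;
              double x = 0.0, y = 0.0;
              int n = 0;
              while (n < max_iter && x * x + y * y <= 4.0) {  // data-dependent
                  double xt = x * x - y * y + cx;
                  y = 2.0 * x * y + cy;
                  x = xt;
                  ++n;
              }
              image[py * width + px] = n;
          }
      }
  }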


There's also the not-a-CPU-optimization where we skip iteration entirely for the main cardioid and circle ... not in C but here's the shader-esque version:

  ac = vec2(x,y); c2 = dotp(ac,ac); if ((256.0*c2*c2 - 96.0*c2 + 32.0*x - 3.0 > 0.0) && (16.0*(c2 + 2.0*x + 1.0) - 1.0 > 0.0)) { ... do the iterations ... } else { ... it's in the Mandelbrot set ... }
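
And for completeness, a C-ish version of the same test, as a sketch:

  // Points inside the main cardioid or the period-2 circle never escape,
  // so they can be filled in without iterating.
  static int skip_iteration(double x, double y) {
      double c2 = x * x + y * y;
      int outside_cardioid = 256.0 * c2 * c2 - 96.0 * c2 + 32.0 * x - 3.0 > 0.0;
      int outside_circle   = 16.0 * (c2 + 2.0 * x + 1.0) - 1.0 > 0.0;
      return !(outside_cardioid && outside_circle);   // true -> in the set
  }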


Could you drop a link on your site to https://www.orange-kiwi.com/index.xml? Hugo generates that RSS feed by default, but you haven't linked it anywhere. Just a friendly heads up :)


Yes, just add this inside your <head>:

  <link rel="alternate" type="application/rss+xml" title="Orange Kiwi" href="https://www.orange-kiwi.com/index.xml" />


Cool article. Does anyone have any pointers to what the state of the art is for arbitrary precision? Is that even viable, or will it just be a boring, extremely slow exercise long before you reach the zoom levels where 64-bit floats break down?


The state of the art for arbitrary precision is in rapid flux right now. The naive approach is O(pixels * iterations * bits_precision^2), which is just hopeless. K. L. Martin [1] took it to O(iterations * bits_precision^2 + pixels * iterations), which was a massive improvement. I explain my WebGL implementation of that algorithm here [2] and I think it's currently the fastest web implementation for depths up to 2000 bits. However, in 2022 Zhouran [3] dramatically improved the constant factor for the first term, then made a breakthrough in overall complexity, taking it to O(pixels * log(iterations) + iterations * bits_precision^2), which blows the approach I used out of the water for deep images. It's so recent that there's not really a blog post properly explaining it anywhere; you just have to sort through the forum thread. There is a WebAssembly implementation at https://fraktaler.mathr.co.uk/live/latest/ with public source.

1 http://www.science.eclipse.co.uk/sft_maths.pdf

2 https://www.hgreer.com/JavascriptMandelbrot/

3 https://fractalforums.org/fractal-mathematics-and-new-theori...
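
The heart of the Martin approach in [1], as a rough sketch: one reference orbit Z_n is computed once in arbitrary precision (then rounded to doubles), and every other pixel only iterates its small offset from that orbit in ordinary doubles.

  // Sketch of perturbation: z_n = Z_n + eps_n, c = C_ref + delta, so
  //   eps_{n+1} = 2*Z_n*eps_n + eps_n^2 + delta
  // and only eps needs to be iterated per pixel, in plain doubles.
  #include <complex>
  #include <vector>

  static int perturbed_iters(const std::vector<std::complex<double>>& Z, // reference orbit
                             std::complex<double> delta,                 // c_pixel - c_ref
                             int max_iter) {
      std::complex<double> eps = 0.0;
      for (int n = 0; n + 1 < (int)Z.size() && n < max_iter; ++n) {
          eps = 2.0 * Z[n] * eps + eps * eps + delta;          // eps_{n+1}
          if (std::norm(Z[n + 1] + eps) > 4.0) return n + 1;   // full z escaped
          // (real implementations also detect glitches / rebase when eps
          //  grows comparable to Z, which is where the 2022 work comes in)
      }
      return max_iter;
  }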


Thanks for this; I’ve been meaning to work through the ‘perturbation theory’ stuff and implement a deep zoom capable Mandel myself one of these years.

What are the options today for doing this stuff on a GPU? The core of these techniques is to have pre-computed an arbitrary precision orbit, right? Does that mean in practice that the reference orbit is computed on the CPU, and rendering could optionally be done on the GPU? Does anyone compute reference orbits on the GPU?


No one knows how to accelerate computing the reference orbit with a GPU yet- I'm hopeful we'll find a way! Until then, yeah computing the reference orbit on CPU and then sending it to the GPU is the usual approach.


Incredible reply, thanks! The Mandelbrot set is my Hello World when learning new graphics APIs, and I was learning WebGL with it recently. Immediately ran into the precision issues you describe in [2]!


64-bit doubles break down pretty quickly when zooming into the Mandelbrot set: you can zoom in by 2x about 64 times. Doubles only get you about twice the zoom depth of 32-bit floats, which let you zoom 2x about 32 times. I feel like using doubles makes zooming a little better but isn't all that meaningful an improvement over floats.

You can use arbitrary precision, and yes it will be an exercise in patience (both in coding and in waiting for Mandel to render), but there are some really crazy tricks for deep zooming into Mandel that don’t use arbitrary precision, tricks that I know of but don’t understand yet. I’ve seen hints on some of the Mandel deep-zoom videos on YouTube, and also on ShaderToy. BTW I wrote a 128 bit quad-float emulator for ShaderToy, and it seems that it’s somewhere around 100x slower than using floats. It was a very fun exercise.


Are the streams problems good concurrency practice?



