Writing an efficient Vulkan renderer (2020) (zeux.io)
82 points by nabla9 on Jan 30, 2022 | 28 comments



Sadly, it's even more complicated than what's described here. There are a lot of undocumented fast paths and nonorthogonalities between features that you can really only learn about by talking to your friendly neighborhood driver engineer in a backdoor conversation (that only a select few developers have access to). Otherwise, it's easy to miss a lot of spooky action at a distance. Vulkan and DX12 might have been sold as "low-level" APIs, but they didn't (and can't) do much to address the fundamental economic issues that push hardware engineers to build patchwork architectures.

I'm half of the opinion that writing an efficient Vulkan or DX12 renderer hasn't been done and can't be done. If you're not one of the developers shipping a AAA title with enough weight to get attention from the RGBs, you're probably better off going as bindless and gpu-driven as you can. It's not optimal (or entirely possible on some architectures), but it probably has the most consistent behavior across architectures when it's supported.


Question: By gpu-driven do you mean targeting actual GPU specific functionality and then creating a renderer for each manufacturer and card generation you intend to support?

Also, what do you mean by bindless?

Sorry if those are basic questions, but I find graphics programming fascinating as someone who doesn’t get to do it professionally.


Most GPUs are not self-sufficient to the degree that CPUs are. Where on a CPU, I can spawn an arbitrary child process/thread running (almost) any code with customized communication between the threads/processes, on a GPU, the hardware (and/or driver) implements the process loading and scheduling logic, usually restricted to a pattern that suits graphics applications. This means that there's a greater surface area for the graphics APIs to have unintuitive behavior.

GPU-driven rendering relies on some of the limited ability for the application programmer to schedule more work for the GPU _on_ the GPU (example: https://www.advances.realtimerendering.com/s2015/aaltonenhaa...). Bindless rendering refers to the ability for the GPU shader to use resources on the GPU without having to announce to the driver that it intends to use those resources. Essentially, I'm saying that the layers between the application and its shaders are so opaque that for most ordinary people, the most reasonable solution is to use less of them wherever possible so that they can actually intuit the numbers they're seeing through microbenchmarking. In both cases, there's a code complexity and performance penalty, so if you're trying to get peak performance (like a AAA might), then there are good reasons to not do it. This is on top of portability concerns since not all hardware supports all the features that might be needed to do this.
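
To make that a bit more concrete, here's a rough sketch of what those two ideas look like at the Vulkan API level. This is my own illustration, not anything from the linked talk: it assumes a Vulkan 1.2 device with descriptor indexing and drawIndirectCount support, and the function names and sizes are made up.

    // Sketch only: bindless-style descriptor array plus a GPU-driven indirect draw.
    #include <vulkan/vulkan.h>

    // "Bindless": one large, partially-bound texture array bound once.
    // Shaders then index into it (GLSL: texture(textures[nonuniformEXT(i)], uv))
    // instead of binding individual textures per draw.
    VkDescriptorSetLayout createBindlessLayout(VkDevice device, uint32_t maxTextures)
    {
        VkDescriptorSetLayoutBinding binding{};
        binding.binding = 0;
        binding.descriptorType = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER;
        binding.descriptorCount = maxTextures; // e.g. thousands of descriptors
        binding.stageFlags = VK_SHADER_STAGE_ALL;

        VkDescriptorBindingFlags flags =
            VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT |
            VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT;

        VkDescriptorSetLayoutBindingFlagsCreateInfo flagsInfo{};
        flagsInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_BINDING_FLAGS_CREATE_INFO;
        flagsInfo.bindingCount = 1;
        flagsInfo.pBindingFlags = &flags;

        VkDescriptorSetLayoutCreateInfo layoutInfo{};
        layoutInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO;
        layoutInfo.pNext = &flagsInfo;
        layoutInfo.flags = VK_DESCRIPTOR_SET_LAYOUT_CREATE_UPDATE_AFTER_BIND_POOL_BIT;
        layoutInfo.bindingCount = 1;
        layoutInfo.pBindings = &binding;

        VkDescriptorSetLayout layout = VK_NULL_HANDLE;
        vkCreateDescriptorSetLayout(device, &layoutInfo, nullptr, &layout);
        return layout;
    }

    // "GPU-driven": a compute pass has already culled objects and written
    // VkDrawIndexedIndirectCommand records plus a count into GPU buffers,
    // so the CPU issues one call without knowing how many objects survived.
    void recordGpuDrivenDraw(VkCommandBuffer cmd, VkBuffer drawCommands,
                             VkBuffer drawCount, uint32_t maxDraws)
    {
        vkCmdDrawIndexedIndirectCount(cmd, drawCommands, 0, drawCount, 0,
                                      maxDraws, sizeof(VkDrawIndexedIndirectCommand));
    }

The point is that the texture array is bound once and indexed from the shader, and the draw parameters never round-trip through the CPU, so there's much less API surface between the application and the hardware for the driver to reinterpret.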

If you're just trying to get into graphics though, I'd recommend starting with webgpu (where these techniques aren't possible, yet). The API is relatively clean and easy to work with compared to everything else, and I'm assuming performance won't be the primary concern.


> get attention from the RGBs

What is RGB in this context?


Probably red green blue = amd nvidia intel


Ah that makes sense.


My point of view is that outside AAA studios one is better off with middleware, or sticking with Metal, GL and DX 11; they aren't going away any time soon.

Heck, WebGL 2.0 is based on a GL ES 3.0 subset, and WebGPU is going to be an MVP 1.0 currently scheduled for July in Chrome, with plenty of stuff to eventually come years down the pipe, so expect a similar timeframe (measured in years) for large-scale adoption.


Is Godot efficient enough? They are preparing to release their Vulkan-based version.

Vulkan drivers for AMD and Intel are also open source (at least for Linux), so developers can look into how things are implemented without asking for some special attention. Plus driver developers are pretty open to questions.


For most games that people might want to make with Godot, it's probably good enough. Not everything needs to be perf-optimal.

Also, the driver source being open doesn't mean that all their secrets are too.


What do you mean by secrets? Mesa especially uses open development, so decisions about drivers are made in the open.


Even if the driver were a thin layer over DMA talking to the hardware, such that every button being pressed was obvious, the performance characteristics of the buttons the driver pushes are not obvious unless you can see their RTL or otherwise know which combinations of buttons hit the potholes.

Except, we're not in that situation, and even for mostly transparent drivers there can be an opaque firmware blob in between you and the hardware. Without changing the economics, pushing the API lower only pushes the implementation of the opaque behavior lower.


From what I understood, some level of reverse engineering went into ACO, because even with documentation on AMD's ISA some edge cases remain. But I'm not sure how much of that is affected by the firmware blob.

It would be nice to have open firmware, yeah, but is that really preventing efficient drivers from being written or used efficiently? For the most part, how the hardware operates is more or less understood.


Performance, at the level of a single frame, is a complicated multi-system interaction where only a few layers are visible, and small bubbles can cascade into missed frames. None of the IHVs are incentivized to make all layers visible, so I'd argue that how hardware operates is not more or less understood by the application developer unless they have the influence to talk to the IHVs directly.

I'm not saying that the APIs can't be used to get good enough performance, but I am saying that it's a fiendishly hard problem, and even harder if you're not one of the handful to get (more) complete information. Perhaps much more so than the CPU world where microbenchmarking is more mature.


I'd say it would be more of a concern for driver developers than for application developers, because their focus is on making the layer that interacts with the hardware efficient. I haven't seen them talking so far about having that kind of problem with firmware in AMD's case, but maybe you've seen that (I'd be interested in reading about it).

I've seen other kinds of issues described by Nouveau developers, due to Nvidia being nasty and hiding stuff like the ability to reclock the GPU behind the firmware, making it very hard to develop open drivers because they can't even make the GPU run at higher clocks.


Driver developers would like for it to be their concern and their concern only, but this doesn't scale when an application developer needs their app to run on hardware from vendors with different levels of driver quality. Vulkan and DX12 were built with the assumption that application developers would take on more of the burden that the driver developers were doing for them (or not doing, in the case of some vendors). The problem is that hardware vendors will always build a lower level as long as it's economical to do so.

AMD GPUs starting with GCN 1.2 include a microcoded hardware scheduler for virtualizing queues. NVIDIA also has a custom (slightly different functionality) control processor (https://riscv.org/wp-content/uploads/2017/05/Tue1345pm-NVIDI...). NVIDIA is certainly the worst of the two for this, but in both cases they're convenient places to shove proprietary secrets while still keeping the driver open source. But even if you see an open source driver on one platform, you have very little guarantee that similar optimizations are running when you use a closed source driver on an entirely different platform.

But, honestly, open source driver/firmware is a low bar. Some of the nastier secrets are in _hardware_. Things like instruction X halving the throughput of a seemingly unrelated shader only on hardware Y might not be obvious unless someone tells you, and they won't tell you in public because it might give away IP. These are the kinds of things you'd hope to find in microbenchmarks, but the search space at the API level is so large that I doubt anyone could come up with a reasonable model on their own. Smaller (or even larger) devs just can't microbenchmark at the scale needed to find these issues.
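
For context, the usual starting point for this kind of microbenchmarking is GPU timestamp queries around the pass you care about. A minimal sketch (my own, not from any particular renderer; it assumes the queue family supports timestamps, that results are read back after the submission's fence has signaled, and that timestampPeriod comes from VkPhysicalDeviceLimits):

    // Sketch only: timing one GPU pass with Vulkan timestamp queries.
    #include <vulkan/vulkan.h>
    #include <cstdint>

    VkQueryPool createTimestampPool(VkDevice device)
    {
        VkQueryPoolCreateInfo info{};
        info.sType = VK_STRUCTURE_TYPE_QUERY_POOL_CREATE_INFO;
        info.queryType = VK_QUERY_TYPE_TIMESTAMP;
        info.queryCount = 2; // one pair: start and end of the pass

        VkQueryPool pool = VK_NULL_HANDLE;
        vkCreateQueryPool(device, &info, nullptr, &pool);
        return pool;
    }

    void recordTimedPass(VkCommandBuffer cmd, VkQueryPool pool)
    {
        vkCmdResetQueryPool(cmd, pool, 0, 2);
        vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, pool, 0);
        // ... record the draws/dispatches being measured here ...
        vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, pool, 1);
    }

    // Call after waiting on the submission. timestampPeriod converts ticks
    // to nanoseconds; the result is returned in milliseconds.
    double readPassMilliseconds(VkDevice device, VkQueryPool pool, float timestampPeriod)
    {
        uint64_t ticks[2] = {};
        vkGetQueryPoolResults(device, pool, 0, 2, sizeof(ticks), ticks,
                              sizeof(uint64_t),
                              VK_QUERY_RESULT_64_BIT | VK_QUERY_RESULT_WAIT_BIT);
        return double(ticks[1] - ticks[0]) * timestampPeriod * 1e-6;
    }

Even then, this only tells you how long a pass took on one driver/hardware combination, not why, which is the whole problem.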

I don't think hardware vendors are in the wrong for this. Hardware vendors have very understandable motives to guard their IP for competitive and forward-compatibility reasons. Honestly, I think most of the development in graphics APIs has been for the better overall, but that doesn't change that using these APIs is undeniably miserable. With CPUs, we dodge the problem by recommending that people minimize context switches (an IP boundary between different layers of the computing stack) so that most people can assume that most of the code that ran is theirs when they measure things. For the same reasons, I recommend that most people minimize graphics API exposure.


Yeah, from what I've seen, some weird edge cases are often fixed after bug reports, not because there is some sweeping benchmark that catches them. I suppose GPUs are in general more complicated to address than CPUs in this sense.

And it's not always even intentional secrecy; possibly GPU makers themselves don't anticipate every problem in advance. Driver developers just find some of these issues after a known case exposes them and try to work around them.


Efficient enough for what? Not everything is AAA...


AAA is a pretty vague term, really. So indeed, efficient or not efficient for what?


I mean, a basic Godot package is 1/10 the size of a basic Unreal package. And it scales to quite a few objects/triangles. There is an eventual drop-off point, but where is it? Somewhere close to the highest-fidelity 3D games. But there are a ton of factors, and also opportunities for more optimization if you drop into C++. Not sure anyone has truly found the limit, because it's mostly small studios working with it.

Different engines are efficient for different numbers of objects/triangles. Godot is unquestionably the most efficient for basic games.


It might be lacking features to address some more specific use cases, but I mean, from a performance standpoint, is their Vulkan renderer somehow not using the hardware efficiently enough? That was basically my question.

Their long term goal is to compete with Unreal and the like, so I'd assume they aim for efficiency in general.


Slightly off-topic, but I recently saw a very interesting deep dive comparing Vulkan to D3D12:

https://themaister.net/blog/2021/11/07/my-personal-hell-of-t...


The difference between using Vulkan and using OpenGL 3.3/4.0+ Core Profile is quite significant. It's hard for me to be motivated to upgrade existing code.


There is the Zink project to translate OpenGL into Vulkan. That should help maintain legacy cases in scenarios where OpenGL drivers stop being supported.


As someone that only does 3D as a hobby, it is hard to even be motivated to learn it.

GL and DX 11 will outlast my time on this sphere.


A small thread at the time:

Writing an Efficient Vulkan Renderer - https://news.ycombinator.com/item?id=24368353 - Sept 2020 (2 comments)


How about automating this functionality that you're currently doing manually?


I don't think it's so easy to automate. More likely we'll make it a community-collaborative feature.


Click on the "past" link



