
I'm an undergrad and know nothing about GPUs, so forgive me for these questions (I tried reading a bit just now). Where does the tile buffer come into play for GPGPU workloads?

From my simple understanding of it, TBDR GPUs extract performance by tiling and binning primitives and then handing those tiles out for rasterization. The multiple passes allow the GPU to work on both at the same time, kinda like how a CPU pipelines stuff? I thought GPGPU workloads mean the rasterization stage is skipped, so what's the problem with treating it like an IMR?




A modern renderer does rasterization followed by a sequence of post-processing steps that you can think of as compute dispatches (kernels) running one thread per pixel. The reality is often somewhat more complex, but that's a good first approximation.
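To make "one thread per pixel" concrete, here's a minimal CUDA sketch of such a post-processing step. The buffer layout and the tonemapping formula are just illustrative assumptions, not any particular engine's code:

    // One post-processing step as a compute kernel, one thread per pixel.
    // Assumes a row-major float4 RGBA image; the Reinhard-style tonemap is
    // only a placeholder for "some purely per-pixel operation".
    __global__ void tonemap(const float4* in, float4* out, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        float4 c = in[y * width + x];
        out[y * width + x] = make_float4(c.x / (1.f + c.x),
                                         c.y / (1.f + c.y),
                                         c.z / (1.f + c.z),
                                         c.w);
    }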

Those steps tend to be local, so instead of running step 1 on the whole screen, then step 2 on the whole screen, and so on, you could run all steps on the top-left tile (of e.g. 128x128 pixels), then all steps on the next tile, and so on. The downside is that you'll likely have to compute some data in the boundary regions between tiles multiple times. The upside is that the bulk of intermediate data between post-processing steps never has to be written out to and read back from memory (modern render targets are too large to fit into traditional caches, though that may be different with the huge cache AMD has built for their latest GPUs).
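A hedged sketch of what the per-tile idea looks like in compute terms: a 3x3 box blur where each thread block stages its tile plus a one-pixel halo in on-chip shared memory. Halo pixels get fetched by more than one block, which is the redundant boundary work mentioned above. The tile size and image layout are assumptions:

    #define TILE 16

    // Assumes a TILE x TILE thread block. Each block loads its tile plus a
    // 1-pixel halo into shared memory; halo pixels are also loaded by the
    // neighboring blocks (redundant boundary work).
    __global__ void boxBlurTiled(const float* in, float* out, int width, int height)
    {
        __shared__ float tile[TILE + 2][TILE + 2];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;

        // Cooperatively fill the (TILE+2) x (TILE+2) staging area,
        // clamping reads at the image border.
        for (int dy = threadIdx.y; dy < TILE + 2; dy += TILE)
            for (int dx = threadIdx.x; dx < TILE + 2; dx += TILE) {
                int gx = min(max(blockIdx.x * TILE + dx - 1, 0), width - 1);
                int gy = min(max(blockIdx.y * TILE + dy - 1, 0), height - 1);
                tile[dy][dx] = in[gy * width + gx];
            }
        __syncthreads();

        if (x >= width || y >= height) return;

        // 3x3 average around this thread's pixel, read from shared memory only.
        float sum = 0.f;
        for (int dy = 0; dy <= 2; ++dy)
            for (int dx = 0; dx <= 2; ++dx)
                sum += tile[threadIdx.y + dy][threadIdx.x + dx];
        out[y * width + x] = sum / 9.f;
    }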

The same principle can be applied to GPGPU algorithms that have similar locality. This tends to be discussed under the label of "kernel fusion".
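For example, here's a minimal sketch of kernel fusion; the step functions are made up. Two purely local steps can be merged into one launch so the intermediate value stays in registers instead of a buffer in DRAM:

    // Two per-pixel (purely local) steps; names and formulas are illustrative.
    __device__ float stepA(float v) { return v * v; }
    __device__ float stepB(float v) { return sqrtf(v) + 1.f; }

    // Unfused: two launches, 'tmp' is written to and read back from global memory.
    __global__ void runA(const float* in, float* tmp, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) tmp[i] = stepA(in[i]);
    }
    __global__ void runB(const float* tmp, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = stepB(tmp[i]);
    }

    // Fused: one launch, the intermediate never touches global memory.
    __global__ void runAB(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = stepB(stepA(in[i]));
    }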


My understanding is that the rasterization process also happens per tile.

One of the cases where this can cause performance problems is when you want to read the output of a previous render pass. If you want to read arbitrary parts of that output, the buffer probably needs to be written out from tile memory into a memory large enough to hold the whole buffer at once. Furthermore, this also means that all tiles of the previous render pass need to finish before the next pass can run, which limits how much work can be done in parallel.
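The same dependency structure can be shown in compute terms: if pass B may read any pixel of pass A's output, no part of B can start until all of A has finished and its output sits in full-size memory. A hedged CUDA analogue, where a made-up remap table stands in for "read anywhere":

    // Pass A does some placeholder per-element work.
    __global__ void passA(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * 0.5f;
    }

    // Pass B reads an arbitrary element of A's output via 'remap'. Because the
    // read location isn't local, A's entire output must exist in memory before
    // B runs; the two passes can't be merged into per-tile work.
    __global__ void passB(const float* aOut, const int* remap, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = aOut[remap[i]];
    }

    // Host side: stream ordering acts as the full barrier between the passes.
    // passA<<<grid, block>>>(in, aBuf, n);
    // passB<<<grid, block>>>(aBuf, remapIdx, out, n);  // waits for all of passA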



