There’s an idea that’s been bouncing around in my head for a while about a kind of deferred renderer that (as far as I know) hasn’t been done before. I’m sure I’m not the first to have this idea, but recently it crystallized in my head a bit more and I wanted to write a bit about it.
For the last several years, graphics folks have been experimenting with a variety of deferred approaches to rendering—deferred shading, light-prepass, tiled deferred, tiled forward, and so on. These techniques improve both the cleanliness and performance of a renderer by doing some or all of the following:
- Elegantly separate lighting code from “material” code (that is, code that controls the properties of the BRDF at each point on a surface), avoiding shader combination explosion
- Only process the scene geometry once per frame (excluding shadow maps and such)
- Reduce unnecessary shading work for pixels that will end up occluded later
- Reduce memory bandwidth costs incurred in reading, modifying and writing screen-sized buffers
Different rendering methods offer different tradeoffs among these goals. For example, light-prepass renderers reduce the memory bandwidth relative to deferred shading, at the cost of requiring an extra geometry pass.
Deferred texturing is another possible point in the deferred constellation—which as far as I know has not been implemented, though it’s been discussed. The idea here, as the name suggests, is to defer sampling textures and doing material calculations until you know what’s on screen.
Renderers these days use quite a few textures per material: diffuse, normal, specular, gloss, emissive, and so on. And materials can involve substantial amounts of math too, if you’re doing things like compositing layers based on blend maps, or applying modifications for wetness, snow, dust, and the like. It might be worthwhile to avoid doing these operations for pixels that will eventually be occluded.
Beyond that, deferred texturing can also save you from the pressure of squeezing all your material parameters into a G-buffer—a common pain point for deferred shading. With deferred texturing, material and lighting code are in the same shader once again, so you can afford more material parameters and a greater variety of BRDFs, just as with tiled forward shading. But unlike tiled forward shading, deferred texturing only requires the geometry to be processed once per frame.
Keep in mind that this is all theoretical, and I haven’t actually implemented it; however, I don’t think there are any insurmountable obstacles.
First of all, to sample textures later, we’ll need to store the UVs in the G-buffer. We’ll also need to store a material ID, and later use bindless textures so that we can sample whichever textures we want, based on the material ID. For now, bindless texture support means OpenGL and NVIDIA hardware, but I believe it should be possible on current AMD and Intel architectures with driver support. However, there are some caveats here, which I’ll explain later.
What about mip levels, or derivatives? These are needed to get correct mipmapping and anisotropic filtering when we do our deferred texture samples, but I’m not sure it’s necessary to store them explicitly in the G-buffer. It “should” be possible to reconstruct derivatives well enough later, by comparing with adjacent G-buffer pixels. After all, ordinary texture sampling uses quad-uniform UV derivatives, which aren’t all that precise. As long as a cluster of a few neighboring pixels shares the same material, we should be able to filter across them to get good derivatives. We just have to be careful not to filter across material boundaries or UV seams.
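To make this concrete, here’s a toy sketch (Python standing in for shader code) of one way the neighbor-based reconstruction could work. The G-buffer is modeled as a 2D list of (material_id, u, v) tuples; all names here are hypothetical, and UV-seam detection is left out for brevity.

```python
def reconstruct_uv_derivatives(gbuffer, x, y):
    """Estimate (du/dx, dv/dx, du/dy, dv/dy) at pixel (x, y) by one-sided
    differences, refusing to difference across material boundaries."""
    h, w = len(gbuffer), len(gbuffer[0])
    mat, u, v = gbuffer[y][x]

    def diff(fx, fy, bx, by):
        # Prefer the forward neighbor; fall back to the backward one if the
        # forward neighbor is off-screen or lies on a different material.
        if 0 <= fx < w and 0 <= fy < h and gbuffer[fy][fx][0] == mat:
            _, nu, nv = gbuffer[fy][fx]
            return (nu - u, nv - v)
        if 0 <= bx < w and 0 <= by < h and gbuffer[by][bx][0] == mat:
            _, nu, nv = gbuffer[by][bx]
            return (u - nu, v - nv)
        return (0.0, 0.0)  # isolated pixel: no derivative information

    dudx, dvdx = diff(x + 1, y, x - 1, y)
    dudy, dvdy = diff(x, y + 1, x, y - 1)
    return dudx, dvdx, dudy, dvdy
```

In a real implementation this would run per-pixel in the deferred pass, reading the UV and material-ID channels of the G-buffer directly.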
We’ll also need to keep interpolated vertex normals in the G-buffer, since there’s no other way to get that information later. (I’m assuming we’ll do our normal mapping with per-pixel tangent space, based on UV and position derivatives grokked from the G-buffer as mentioned above, so we don’t need to store tangent/bitangent/flip values.)
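For reference, the per-pixel tangent frame can be solved for directly from position and UV derivatives (this is essentially the well-known “cotangent frame” trick). A minimal sketch, assuming the position derivatives have already been reconstructed, e.g. from the depth buffer; vectors are plain tuples and the function name is hypothetical:

```python
def tangent_frame(dpdx, dpdy, duvdx, duvdy):
    """Solve dpdx = T*du1 + B*dv1 and dpdy = T*du2 + B*dv2 for the
    tangent T and bitangent B. dpdx/dpdy are 3-tuples (position
    derivatives); duvdx/duvdy are 2-tuples (UV derivatives)."""
    du1, dv1 = duvdx
    du2, dv2 = duvdy
    det = du1 * dv2 - du2 * dv1
    if det == 0.0:
        return None, None  # degenerate UV mapping at this pixel
    r = 1.0 / det
    T = tuple((dpdx[i] * dv2 - dpdy[i] * dv1) * r for i in range(3))
    B = tuple((dpdy[i] * du1 - dpdx[i] * du2) * r for i in range(3))
    return T, B
```

Together with the interpolated normal, T and B give the basis needed to apply a tangent-space normal map.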
Therefore, the G-buffer layout for deferred texturing might look something like this:
- Vertex normals—2× 16-bit fixed point (use whichever 2-component encoding you like—I personally favor octahedral)
- Material ID—16-bit int. 64K materials should be enough for anybody, right? ;)
- UVs—2× 16-bit fixed point in [0, 1], or maybe [0, 2] to make wrapping simpler. We might need to go to 32-bit for large virtual textures.
- [maybe] LOD, 16-bit float, or alternatively UV derivatives, 4× 16-bit float.
(These are just my best guesses at bit depths and formats that would give you sufficient precision.)
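As an illustration of the normal encoding, here’s a quick Python sketch of the standard octahedral mapping plus 16-bit fixed-point quantization. The helper names are mine, not from any particular library:

```python
import math

def sign_not_zero(x):
    return 1.0 if x >= 0.0 else -1.0

def oct_encode(n):
    """Map a unit normal to two values in [0, 1] via octahedral encoding,
    ready for 16-bit fixed-point storage."""
    x, y, z = n
    s = abs(x) + abs(y) + abs(z)          # project onto the octahedron
    px, py = x / s, y / s
    if z < 0.0:                            # fold the lower hemisphere outward
        px, py = ((1.0 - abs(py)) * sign_not_zero(px),
                  (1.0 - abs(px)) * sign_not_zero(py))
    return (px * 0.5 + 0.5, py * 0.5 + 0.5)

def oct_decode(e):
    """Inverse of oct_encode; returns a unit normal."""
    px, py = e[0] * 2.0 - 1.0, e[1] * 2.0 - 1.0
    z = 1.0 - abs(px) - abs(py)
    if z < 0.0:                            # unfold the lower hemisphere
        px, py = ((1.0 - abs(py)) * sign_not_zero(px),
                  (1.0 - abs(px)) * sign_not_zero(py))
    length = math.sqrt(px * px + py * py + z * z)
    return (px / length, py / length, z / length)

def pack_unorm16(x):
    """Quantize a [0, 1] value to 16-bit fixed point."""
    return max(0, min(65535, int(round(x * 65535.0))))
```

At 2× 16 bits, the round-trip error on the decoded normal is well under a thousandth per component, which should be plenty for shading.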
Here’s something interesting: the G-buffer here is only 80 bits per pixel, in the best case. If you do need to explicitly store LOD, it goes up to 96 bpp; with full derivatives, 144 bpp. With 32-bit UVs and no explicit LOD, it’s 112 bpp. The point is that it can be a reasonably small G-buffer—many deferred rendering pipelines have bigger G-buffers than this. So another possible benefit of deferred texturing is to reduce the G-buffer memory and bandwidth costs.
The rendering pipeline for deferred texturing would then look something like this:
- Render the scene geometry and write the G-buffer, as listed above.
- Do a light-culling pass, just as in tiled deferred or tiled forward methods, using the depth buffer and generating per-tile light lists.
- Do a fullscreen pass in which for each pixel:
- Look up the pixel’s material information (texture handles and so forth) from a giant flat array indexed by material ID.
- Sample all the textures for that pixel’s material, using bindless texturing.
- Do any material computations necessary (compositing layers, applying modifications such as wetness, etc.)
- Apply lighting, using the per-tile light lists output previously.
The light-culling and fullscreen passes here could also be combined into a single pass, if the light lists don’t need to be saved out for later re-use.
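A toy, CPU-side sketch of the per-pixel work in the fullscreen pass (Python standing in for a shader; the material table layout, the texture representation, and the single Lambert light are all hypothetical stand-ins):

```python
def shade_pixel(gbuffer_pixel, material_table, tile_lights):
    """gbuffer_pixel: (material_id, u, v, normal). material_table: flat
    array indexed by material ID. tile_lights: this tile's light list."""
    mat_id, u, v, normal = gbuffer_pixel
    mat = material_table[mat_id]          # "bindless" lookup by material ID

    # Deferred texture sample: point-sample this material's diffuse map.
    tex = mat["diffuse"]
    h, w = len(tex), len(tex[0])
    texel = tex[min(int(v * h), h - 1)][min(int(u * w), w - 1)]

    # Accumulate lighting from the per-tile light list (Lambert only here).
    result = [0.0, 0.0, 0.0]
    for light in tile_lights:
        n_dot_l = max(0.0, sum(normal[i] * light["dir"][i] for i in range(3)))
        for i in range(3):
            result[i] += texel[i] * light["color"][i] * n_dot_l
    return tuple(result)
```

A real version would of course use filtered sampling with the reconstructed derivatives, and whatever BRDF the material calls for.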
So assuming this works, why might you want to render this way? I already mentioned some advantages in passing, but here they are collected:
- You avoid doing the work of texture sampling and material computations for pixels that end up occluded. This could also help when rendering very small primitives, since less work would be wasted on quad overshading.
- If you would have a fat G-buffer otherwise, the deferred texturing G-buffer can be thinner, reducing memory bandwidth costs.
- You don’t have to squeeze all your material parameters into a G-buffer—you can have more complex BRDFs and a greater variety of them, as with tiled forward shading.
- Geometry only needs to be processed once per frame.
There are a number of problems that would need to be solved to make deferred texturing a viable approach.
The biggest issue is this: I cavalierly assumed that, with bindless textures, we’d be able to sample any texture freely per pixel. If you want to be portable, this is not the case. On the AMD GCN architecture, the texture handle can’t be divergent within a wavefront, since it’s stored in scalar registers; on NVIDIA GPUs, divergent texture handles are technically possible, but my guess is that they still don’t perform well (though I haven’t tried). I have no idea about Intel GPUs.
It should be possible to work around this limitation using a tile-based strategy: in a compute shader, gather all the material IDs present in a screen-space tile, then loop over them (uniformly, so no divergence) and issue all the texture lookups for each material together. As far as I know, this should be workable on GCN, but it’s an unusual enough case that I wouldn’t be surprised if you have trouble getting the shader compiler to understand what you’re trying to do!
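In pseudocode terms, the tile loop might look like the following Python sketch. It models the control flow only: in a real compute shader, the inner `if` corresponds to the lanes’ execution mask, and the `material_table[mat_id]` load is the scalar (wavefront-uniform) handle fetch.

```python
def shade_tile(tile_pixels, material_table):
    """tile_pixels: list of (pixel_index, material_id) pairs in one
    screen-space tile. Loop over the materials present uniformly, so the
    texture handle is never divergent; each pixel is shaded only on the
    iteration matching its material."""
    materials_present = sorted({m for _, m in tile_pixels})
    results = {}
    for mat_id in materials_present:       # uniform loop: one material at a time
        handle = material_table[mat_id]    # scalar, wavefront-uniform handle
        for px, m in tile_pixels:
            if m == mat_id:                # only the matching "lanes" execute
                results[px] = handle       # stand-in for texture fetch + shading
    return results
```

The cost of this scheme grows with the number of distinct materials per tile, so it relies on materials being reasonably coherent in screen space.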
If you store your textures in sparse arrays, as suggested by AZDO, then this is slightly less crazy—you don’t have to segregate texture lookups by material ID, but only by the size/format combination used to decide which array to look in. The array slice index can vary freely per pixel, with no consequences (besides cache locality).
Incidentally, one could also consider a version of this technique in which, instead of using bindless textures, we stored all our textures in one giant sparse atlas. That would avoid the divergence issues entirely, but we’d sacrifice hardware UV addressing modes, and all textures would have to be in the same format.
Another possible variation is to do a separate pass per material, operating only on the pixels that use the current material. This could be optimized by generating per-tile material lists, like the per-tile light lists, and using them to restrict each material pass to regions of the screen where that material is present. It’s possible that this might end up even being faster than the bindless approach, due to better texture cache locality or some other microarchitectural issue.
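Building those per-tile material lists would be directly analogous to light culling. A minimal CPU-side sketch (hypothetical names; on the GPU this would be a compute shader reducing over each tile):

```python
def build_tile_material_lists(material_ids, width, height, tile_size):
    """Given the G-buffer's material-ID channel as a row-major list, return
    a 2D grid of per-tile lists of the material IDs present in each tile."""
    tiles_x = (width + tile_size - 1) // tile_size
    tiles_y = (height + tile_size - 1) // tile_size
    tiles = [[set() for _ in range(tiles_x)] for _ in range(tiles_y)]
    for y in range(height):
        for x in range(width):
            tiles[y // tile_size][x // tile_size].add(
                material_ids[y * width + x])
    return [[sorted(s) for s in row] for row in tiles]
```

Each material pass would then be restricted to the tiles whose list contains that material, e.g. via indirect dispatch.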
Arm-twisting bindless textures into doing what we want is the biggest risk here, but there are a few other challenges as well:
- Sampling textures with explicit LODs or derivatives can be a good deal slower than using automatic LODs.
- All the material code and lighting code is in one giant shader, which can lead to a variety of issues—compilation, occupancy, and divergence, to name a few.
- You can only have one UV set per primitive. In the common case that the various textures are all sampled using the same UV set (or transformations of a single UV set), this is fine; but it does make it difficult to use an independent UV set for lightmaps and such.
- Like deferred shading, deferred texturing is not particularly friendly to MSAA. We would need to use methods like simple/complex pixel classification, similar to efficient deferred MSAA.
- Deferred texturing isn’t any more friendly to transparencies than deferred shading, either. In fact, it’s actually less friendly, because in deferred shading at least you can create decal shaders that modify the material properties in the G-buffer—for painting stripes on roads and things of that sort. Deferred texturing doesn’t offer a way to do this. However, at least the per-tile light list generated for deferred texturing can be re-used in a forward pass for transparencies.
I would be remiss if I didn’t mention the recent JCGT paper, The Visibility Buffer: A Cache-Friendly Approach to Deferred Shading. This paper proposes a rendering pipeline that not only defers texturing, but actually defers vertex attribute interpolation as well! It stores only primitive IDs in its G-buffer; then in a later pass, it fetches vertex data, re-runs the vertex shader per pixel (!), finds the barycentric coordinates of each fragment within its triangle, interpolates the vertex attributes, then finally samples all the textures and does the shading work.
The visibility-buffer approach is explicitly designed for mobile devices where bandwidth is a big concern (both for power and performance), so they don’t want to even fetch vertex attributes for triangles that aren’t going to be seen. On the other hand, they pay for it by doing quite a bit of extra per-pixel computation; they also sacrifice the ability to use tessellation or other methods that generate geometry on the fly, since all the vertex data has to be in memory. Still, it’s an interesting idea—and they actually got it to work, and proved a performance win in some cases. It’s also a partial proof of concept for deferred texturing, since some of the implementation overlaps.
Despite all of the problems just mentioned, I think deferred texturing is an interesting enough idea to be worth pursuing a little bit. I may try to build a working demo when I can find the time, but April is shaping up to be a busy month for me, so it may be a while! If anyone else wants to try it, please do and let me know how it goes. Or if you have any other thoughts on this method—including if I missed something obvious and it can’t possibly work—please share in the comments. 🙂