Mesh Shader Possibilities
NVIDIA recently announced their latest GPU architecture, called Turing. Although its headlining feature is hardware-accelerated ray tracing, Turing also includes several other developments that look quite intriguing in their own right.
One of these is the new concept of mesh shaders, details of which dropped a couple weeks ago—and the graphics programming community was agog, with many enthusiastic discussions taking place on Twitter and elsewhere. So what are mesh shaders (and their counterparts, task shaders), why are graphics programmers so excited about them, and what might we be able to do with them?
The GPU Geometry Pipeline Has Gotten Cluttered
The process of submitting geometry—triangles to be drawn—to the GPU has a simple underlying paradigm: you put your vertices into a buffer, point the GPU at it, and issue a draw call to say how many primitives to render. The vertices get slurped linearly out of the buffer, each is processed by a vertex shader, the triangles are rasterized and shaded, and Bob’s your uncle.
But over decades of GPU development, various extra features have gotten bolted onto this basic pipeline in the name of greater performance and efficiency. Indexed triangles and vertex caches were created to exploit vertex reuse. Complex vertex stream format descriptions are needed to prepare data for shading. Instancing, and later multi-draw, allowed certain sets of draw calls to be combined together; indirect draws could be generated on the GPU itself. Then came the extra shader stages: geometry shaders, to allow programmable operations on primitives and even inserting or deleting primitives on the fly, and then tessellation shaders, letting you submit a low-res mesh and dynamically subdivide it to a programmable level.
While these features and more were all added for good reasons (or at least what seemed like good reasons at the time), the accumulation of all of them has become unwieldy. Which subset of the many available options do you reach for in a given situation? Will your choice be efficient across all the GPU architectures your software must run on?
Moreover, this elaborate pipeline is still not as flexible as we would sometimes like—or, where flexible, it is not performant. Instancing can only draw copies of a single mesh at a time; multi-draw is still inefficient for large numbers of small draws. Geometry shaders’ programming model is not conducive to efficient implementation on wide SIMD cores in GPUs, and their input/output buffering presents difficulties too. Hardware tessellation, though very handy for certain things, is often difficult to use well due to the limited granularity at which you can set tessellation factors, the limited set of baked-in tessellation modes, and performance issues on some GPU architectures.
Simplicity Is Golden
Mesh shaders represent a radical simplification of the geometry pipeline. With a mesh shader enabled, all the shader stages and fixed-function features described above are swept away. Instead, we get a clean, straightforward pipeline using a compute-shader-like programming model. Importantly, this new pipeline is both highly flexible—enough to handle the existing geometry tasks in a typical game, plus enable new techniques that are challenging to do on the GPU today—and it looks like it should be quite performance-friendly, with no apparent architectural barriers to efficient GPU execution.
Like a compute shader, a mesh shader defines work groups of parallel-running threads, and they can communicate via on-chip shared memory as well as wave intrinsics. In lieu of a draw call, the app launches some number of mesh shader work groups. Each work group is responsible for writing out a small, self-contained chunk of geometry, called a “meshlet”, expressed in arrays of vertex attributes and corresponding indices. These meshlets then get tossed directly into the rasterizer, and Bob’s your uncle.
(More details can be found in NVIDIA’s blog post, a talk by Christoph Kubisch, and the OpenGL extension spec.)
The appealing thing about this model is how data-driven and freeform it is. The mesh shader pipeline has very relaxed expectations about the shape of your data and the kinds of things you’re going to do. Everything’s up to the programmer: you can pull the vertex and index data from buffers, generate them algorithmically, or any combination.
At the same time, the mesh shader model sidesteps the issues that hampered geometry shaders, by explicitly embracing SIMD execution (in the form of the compute “work group” abstraction). Instead of each shader thread generating geometry on its own—which leads to divergence, and large input/output data sizes—we have the whole work group outputting a meshlet cooperatively. This means we can use compute-style tricks, like: first do some work on the vertices in parallel, then have a barrier, then work on the triangles in parallel. It also means the input/output bandwidth needs are a lot more reasonable. And, because meshlets are indexed triangle lists, they don’t break vertex reuse, as geometry shaders often did.
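To make the model concrete, here is a rough sketch of what a mesh shader can look like in GLSL, using the interface from the NV_mesh_shader OpenGL extension mentioned above. It builds a small procedural ribbon of triangles so the cooperative, two-phase pattern is easy to see; the ribbon itself and the uniform name are invented for illustration, not taken from any shipping sample.

```glsl
#version 450
#extension GL_NV_mesh_shader : require

// One work group emits one meshlet. The output limits below follow the
// 64-vertex / 126-triangle defaults suggested in NVIDIA's blog post.
layout(local_size_x = 32) in;
layout(max_vertices = 64, max_primitives = 126) out;
layout(triangles) out;

// Hypothetical camera transform.
layout(location = 0) uniform mat4 u_viewProj;

void main()
{
    const uint i = gl_LocalInvocationID.x;  // 0..31
    const uint columns = 32u;

    // Phase 1: each thread writes two vertices of a procedural ribbon
    // (two rows of 32 vertices = 64 vertices total).
    float x = float(gl_WorkGroupID.x * (columns - 1u) + i);
    gl_MeshVerticesNV[2u * i + 0u].gl_Position = u_viewProj * vec4(x, 0.0, 0.0, 1.0);
    gl_MeshVerticesNV[2u * i + 1u].gl_Position = u_viewProj * vec4(x, 1.0, 0.0, 1.0);

    // Not strictly needed in this trivial case, but this is where the
    // "vertices, then barrier, then triangles" pattern would synchronize.
    barrier();

    // Phase 2: each thread but the last writes the two triangles joining
    // its column of vertices to the next one (62 triangles total).
    if (i < columns - 1u)
    {
        uint t = 6u * i;
        gl_PrimitiveIndicesNV[t + 0u] = 2u * i + 0u;
        gl_PrimitiveIndicesNV[t + 1u] = 2u * i + 1u;
        gl_PrimitiveIndicesNV[t + 2u] = 2u * i + 2u;
        gl_PrimitiveIndicesNV[t + 3u] = 2u * i + 2u;
        gl_PrimitiveIndicesNV[t + 4u] = 2u * i + 1u;
        gl_PrimitiveIndicesNV[t + 5u] = 2u * i + 3u;
    }

    // One thread records how many triangles this meshlet actually contains.
    if (i == 0u)
        gl_PrimitiveCountNV = 2u * (columns - 1u);
}
```

On the API side there is no vertex buffer binding or index count any more; in OpenGL the draw becomes a call along the lines of glDrawMeshTasksNV(0, numWorkGroups), which simply says how many mesh shader work groups to run.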
An Upgrade Path
The other really neat thing about mesh shaders is that they don’t require you to drastically rework how your game engine handles geometry to take advantage of them. It looks like it should be pretty easy to convert most common geometry types to mesh shaders, making it an approachable upgrade path for developers.
(You don’t have to convert everything to mesh shaders straight away, though; it’s possible to switch between the old geometry pipeline and the new mesh-shader-based one at different points in the frame.)
Suppose you have an ordinary authored mesh that you want to load and render. You’ll need to break it up into meshlets, which have a static maximum size declared in the shader—NVIDIA’s blog post recommends 64 vertices and 126 triangles as a default. How do we do this?
Fortunately, most game engines currently do some form of vertex cache optimization, which already organizes the primitives by locality—triangles sharing one or two vertices will tend to be close together in the index buffer. So, a quite viable strategy for creating meshlets is: just scan the index buffer linearly, accumulating the set of vertices used, until you hit either 64 vertices or 126 triangles; reset and repeat until you’ve gone through the whole mesh. This could be done at art build time, or it’s simple enough that you could even do it in the engine at level load time.
Alternatively, vertex cache optimization algorithms can probably be modified to produce meshlets directly. For GPUs without mesh shader support, you can concatenate all the meshlet vertex buffers together, and rapidly generate a traditional index buffer by offsetting and concatenating all the meshlet index buffers. It’s pretty easy to go back and forth.
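That back-and-forth is simple enough to sketch. Here is the fallback expansion written as a small GLSL compute shader (it could equally be a CPU loop at build or load time); the MeshletDesc layout and buffer bindings are invented for illustration.

```glsl
#version 450

// Fallback path for GPUs without mesh shader support: expand each meshlet's
// local index list into a traditional index buffer, one work group per
// meshlet. The concatenated meshlet vertex buffers already serve as the
// traditional vertex buffer, so only the indices need rewriting.
layout(local_size_x = 32) in;

// Invented meshlet layout: offsets into the concatenated buffers.
struct MeshletDesc {
    uint vertexOffset;   // first entry in the shared vertex buffer
    uint vertexCount;    // <= 64
    uint indexOffset;    // first entry in localIndices[]
    uint triangleCount;  // <= 126
};

layout(std430, binding = 0) readonly  buffer Meshlets      { MeshletDesc meshlets[]; };
layout(std430, binding = 1) readonly  buffer LocalIndices  { uint localIndices[]; };
layout(std430, binding = 2) writeonly buffer GlobalIndices { uint globalIndices[]; };

void main()
{
    MeshletDesc m = meshlets[gl_WorkGroupID.x];

    // A meshlet-local index is relative to the meshlet's own vertices;
    // adding the meshlet's vertex offset turns it into an ordinary index
    // into the shared vertex buffer.
    for (uint j = gl_LocalInvocationID.x; j < 3u * m.triangleCount; j += 32u)
        globalIndices[m.indexOffset + j] = m.vertexOffset + localIndices[m.indexOffset + j];
}
```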
In either case, the mesh shader would be mostly just acting as a vertex shader, with some extra code to fetch vertex and index data from their buffers and plug them into the mesh outputs.
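In GLSL that might look roughly like the sketch below, reusing the invented MeshletDesc layout from the fallback example above; a real mesh would of course carry more attributes than a single position.

```glsl
#version 450
#extension GL_NV_mesh_shader : require

layout(local_size_x = 32) in;
layout(max_vertices = 64, max_primitives = 126) out;
layout(triangles) out;

// Same invented meshlet layout as in the previous sketch.
struct MeshletDesc {
    uint vertexOffset;
    uint vertexCount;    // <= 64
    uint indexOffset;
    uint triangleCount;  // <= 126
};

layout(std430, binding = 0) readonly buffer Meshlets     { MeshletDesc meshlets[]; };
layout(std430, binding = 1) readonly buffer Positions    { vec4 positions[]; };
layout(std430, binding = 2) readonly buffer LocalIndices { uint localIndices[]; };

layout(location = 0) uniform mat4 u_viewProj;

void main()
{
    MeshletDesc m = meshlets[gl_WorkGroupID.x];

    // The "vertex shader" part: each thread transforms a few of the
    // meshlet's vertices.
    for (uint v = gl_LocalInvocationID.x; v < m.vertexCount; v += 32u)
        gl_MeshVerticesNV[v].gl_Position = u_viewProj * positions[m.vertexOffset + v];

    // Copy the meshlet-local index list straight into the output.
    for (uint j = gl_LocalInvocationID.x; j < 3u * m.triangleCount; j += 32u)
        gl_PrimitiveIndicesNV[j] = localIndices[m.indexOffset + j];

    if (gl_LocalInvocationID.x == 0u)
        gl_PrimitiveCountNV = m.triangleCount;
}
```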
What about other kinds of geometry found in games?
Instanced draws are straightforward: multiply the meshlet count and put in a bit of shader logic to hook up instance parameters. A more interesting case is multi-draw, where we want to draw a lot of meshes that aren’t all copies of the same thing. For this, we can employ task shaders—a secondary feature of the mesh shader pipeline. Task shaders add an extra layer of compute-style work groups, running before the mesh shader, and they control how many mesh shader work groups to launch. They can also write output variables to be consumed by the mesh shader. A very efficient multi-draw should be possible by launching task shaders with a thread per draw, which in turn launch the mesh shaders for all the individual draws.
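Here is one way that might be wired up, again as a hedged GLSL sketch with an invented per-draw layout: each task shader thread handles one draw, the work group totals up how many meshlets its draws need, and a small prefix-sum table is passed down so each mesh shader work group can find its draw.

```glsl
#version 450
#extension GL_NV_mesh_shader : require

// One thread per draw. The work group adds up how many meshlets its 32 draws
// need, launches that many mesh shader work groups, and hands them a small
// prefix-sum table. The DrawInfo layout and names are invented.
layout(local_size_x = 32) in;

struct DrawInfo {
    uint firstMeshlet;
    uint meshletCount;
};
layout(std430, binding = 0) readonly buffer Draws { DrawInfo draws[]; };

layout(location = 0) uniform uint u_drawCount;

taskNV out Task {
    uint firstDraw;           // first draw handled by this work group
    uint meshletPrefix[33];   // exclusive prefix sum of the draws' meshlet counts
} OUT;

shared uint s_counts[32];

void main()
{
    uint localDraw = gl_LocalInvocationID.x;
    uint drawIndex = gl_WorkGroupID.x * 32u + localDraw;

    // Out-of-range threads contribute zero meshlets.
    s_counts[localDraw] = drawIndex < u_drawCount ? draws[drawIndex].meshletCount : 0u;
    barrier();

    if (localDraw == 0u)
    {
        // Only 32 elements, so a serial prefix sum is fine here.
        uint total = 0u;
        for (uint d = 0u; d < 32u; ++d)
        {
            OUT.meshletPrefix[d] = total;
            total += s_counts[d];
        }
        OUT.meshletPrefix[32] = total;
        OUT.firstDraw = gl_WorkGroupID.x * 32u;

        // Launch one mesh shader work group per meshlet, across all 32 draws.
        gl_TaskCountNV = total;
    }
}
```

The matching mesh shader declares the same block as taskNV in, then uses its own work group index together with meshletPrefix to work out which draw, and which of that draw's meshlets, it should output.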
If we need to draw a lot of very small meshes, such as quads for particles/imposters/text/point-based rendering, or boxes for occlusion tests / projected decals and whatnot, then we can pack a bunch of them into each mesh shader workgroup. The geometry can be generated entirely in-shader rather than relying on a pre-initialized index buffer from the CPU. (This was one of the original use cases that, it was hoped, could be done with geometry shaders—e.g. submitting point primitives, and having the GS expand them into quads.) There’s also a lot of flexibility to do stuff with variable topology, like particle beams/strips/ribbons, which would otherwise need to be generated either on the CPU or in a separate compute pre-pass.
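As an example of that kind of packing, here is a sketch of particle quads done in a mesh shader, with one billboard per thread; the particle format and uniform names are invented for illustration.

```glsl
#version 450
#extension GL_NV_mesh_shader : require

// 32 threads, one particle each: every thread emits a 4-vertex, 2-triangle
// billboard, so the CPU never has to supply an index buffer at all.
layout(local_size_x = 32) in;
layout(max_vertices = 128, max_primitives = 64) out;
layout(triangles) out;

// Invented particle format: xyz = world-space center, w = radius.
layout(std430, binding = 0) readonly buffer Particles { vec4 particles[]; };

layout(location = 0) uniform mat4 u_viewProj;
layout(location = 1) uniform vec3 u_cameraRight;
layout(location = 2) uniform vec3 u_cameraUp;
layout(location = 3) uniform uint u_particleCount;

layout(location = 0) out PerVertexData { vec2 uv; } v_out[];

void main()
{
    uint i = gl_LocalInvocationID.x;
    uint particleIndex = gl_WorkGroupID.x * 32u + i;

    if (particleIndex < u_particleCount)
    {
        vec4 p = particles[particleIndex];
        vec3 right = u_cameraRight * p.w;
        vec3 up    = u_cameraUp * p.w;

        // Four corners of a camera-facing quad.
        uint v = 4u * i;
        gl_MeshVerticesNV[v + 0u].gl_Position = u_viewProj * vec4(p.xyz - right - up, 1.0);
        gl_MeshVerticesNV[v + 1u].gl_Position = u_viewProj * vec4(p.xyz + right - up, 1.0);
        gl_MeshVerticesNV[v + 2u].gl_Position = u_viewProj * vec4(p.xyz - right + up, 1.0);
        gl_MeshVerticesNV[v + 3u].gl_Position = u_viewProj * vec4(p.xyz + right + up, 1.0);
        v_out[v + 0u].uv = vec2(0.0, 0.0);
        v_out[v + 1u].uv = vec2(1.0, 0.0);
        v_out[v + 2u].uv = vec2(0.0, 1.0);
        v_out[v + 3u].uv = vec2(1.0, 1.0);

        // Two triangles per quad, generated entirely in-shader.
        uint t = 6u * i;
        gl_PrimitiveIndicesNV[t + 0u] = v + 0u;
        gl_PrimitiveIndicesNV[t + 1u] = v + 1u;
        gl_PrimitiveIndicesNV[t + 2u] = v + 2u;
        gl_PrimitiveIndicesNV[t + 3u] = v + 2u;
        gl_PrimitiveIndicesNV[t + 4u] = v + 1u;
        gl_PrimitiveIndicesNV[t + 5u] = v + 3u;
    }

    // Assumes the app launches ceil(u_particleCount / 32) work groups.
    if (i == 0u)
        gl_PrimitiveCountNV = 2u * min(32u, u_particleCount - gl_WorkGroupID.x * 32u);
}
```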
(By the way, the other original use case that, it was hoped, could be done with geometry shaders was multi-view rendering: drawing the same geometry to, say, multiple faces of a cubemap or slices of a cascaded shadow map within a single draw call. You could do that with mesh shaders, too—but Turing actually has a separate hardware multi-view capability for these applications.)
What about tessellated meshes?
The two-layer structure of task and mesh shaders is broadly similar to that of tessellation hull and domain shaders. While it doesn’t appear that mesh shaders have any kind of access to the fixed-function tessellator unit, it’s also not too hard to imagine that we could write code in task/mesh shaders to reproduce tessellation functionality (or at least some of it). Figuring out the details would be a bit of a research project for sure—maybe someone has already worked on this?—and perf would be a question mark. However, we’d get the benefit of being able to change how tessellation works, instead of being stuck with whatever Microsoft decided on in the late 2000s.
New Possibilities
It’s great that mesh shaders can subsume our current geometry tasks, and in some cases make them more efficient. But mesh shaders also open up possibilities for new kinds of geometry processing that wouldn’t have been feasible on the GPU before, or would have required expensive compute pre-passes storing data out to memory and then reading it back in through the traditional geometry pipeline.
With our meshes already in meshlet form, we can do finer-grained culling at the meshlet level, and even at the triangle level within each meshlet. With task shaders, we can potentially do mesh LOD selection on the GPU, and if we want to get fancy we could even try dynamically packing together very small draws (from coarse LODs) to get better meshlet utilization.
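Meshlet-level frustum culling, for instance, might look something like the following task shader sketch, assuming each meshlet has a precomputed bounding sphere; the buffer layout, plane convention, and names are all invented.

```glsl
#version 450
#extension GL_NV_mesh_shader : require

// One thread per meshlet: test a precomputed bounding sphere against the
// view frustum, compact the survivors, and launch mesh shader work groups
// only for those.
layout(local_size_x = 32) in;

// xyz = center, w = radius, in world space.
layout(std430, binding = 0) readonly buffer MeshletBounds { vec4 boundingSpheres[]; };

layout(location = 0) uniform vec4 u_frustumPlanes[6];  // world space, normals pointing inward
layout(location = 6) uniform uint u_meshletCount;

taskNV out Task {
    uint meshletIndices[32];  // which meshlets the launched mesh work groups should draw
} OUT;

shared uint s_survivorCount;

void main()
{
    if (gl_LocalInvocationID.x == 0u)
        s_survivorCount = 0u;
    barrier();

    uint meshletIndex = gl_GlobalInvocationID.x;
    bool visible = meshletIndex < u_meshletCount;
    if (visible)
    {
        vec4 sphere = boundingSpheres[meshletIndex];
        for (int p = 0; p < 6; ++p)
            visible = visible &&
                (dot(u_frustumPlanes[p].xyz, sphere.xyz) + u_frustumPlanes[p].w > -sphere.w);
    }

    // Compact the surviving meshlet indices to the front of the task output.
    if (visible)
    {
        uint slot = atomicAdd(s_survivorCount, 1u);
        OUT.meshletIndices[slot] = meshletIndex;
    }

    barrier();
    if (gl_LocalInvocationID.x == 0u)
        gl_TaskCountNV = s_survivorCount;
}
```

The mesh shader then reads meshletIndices through a matching taskNV in block to find which meshlet it should actually fetch and output, much as in the earlier meshlet-rendering sketch.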
In place of tile-based forward lighting, or as an extension to it, it might be useful to cull lights (and projected decals, etc.) per meshlet, assuming there’s a good way to pass the variable-size light list from a mesh shader down to the fragment shader. (This suggestion from Seb Aaltonen.)
Having access to the topology in the mesh shader should enable us to calculate dynamic normals, tangents, and curvatures for a mesh that’s deforming due to complex skinning, displacement mapping, or procedural vertex animation. We can also do voxel meshing, or isosurface extraction—marching cubes or tetrahedra, plus generating normals etc. for the isosurface—directly in a mesh shader, for rendering fluids and volumetric data.
Geometry for hair/fur, foliage, or other surface cover might be feasible to generate on the fly, with view-dependent detail.
3D modeling and CAD apps may be able to apply mesh shaders to dynamically triangulate quad meshes or n-gon meshes, as well as things like dynamically insetting/outsetting geometry for visualizations.
For rendering displacement-mapped terrain, water, and so forth, mesh shaders may be able to assist us with geometry clipmaps and geomorphing; they might also be interesting for progressive meshing schemes.
And last but not least, we might be able to render Catmull–Clark subdivision surfaces, or other subdivision schemes, more easily and efficiently than is possible on the GPU today.
To be clear, a great deal of the above is speculation and handwaving on my part—I don’t want to mislead you that all of these things are for sure doable with the new mesh and task shader pipeline. There will certainly be algorithmic difficulties and architectural hindrances that will come up as graphics programmers have a chance to dig into this. Still, I’m quite excited to see what people will do with this capability over the next few years, and I hope and expect that it won’t be an NVIDIA-exclusive feature for too long.