Hash Functions for GPU Rendering
Back in 2013, I wrote a somewhat popular article about pseudorandom number generation on the GPU. In the eight years since, a number of new PRNGs and hash functions have been developed, and a few months ago an excellent paper on the topic appeared in JCGT: Hash Functions for GPU Rendering, by Mark Jarzynski and Marc Olano. I thought it was time to update my former post in light of this paper’s findings.
Jarzynski and Olano’s paper compares GPU implementations of a large number of different hash functions along two axes: performance (measured by the time to render a quad evaluating the hash at each pixel) and statistical quality (quantified by the number of failures of TestU01’s “BigCrush” test battery). Naturally, there is quite a spread of results on both axes. Jarzynski and Olano then identify the few hash functions that lie along the Pareto frontier—meaning they are the best choices across the whole spectrum of performance/quality trade-offs.
When choosing a hash function, we might sometimes prioritize performance, and other times might prefer to sacrifice performance in favor of higher quality (real-time versus offline applications, for example). The Pareto frontier provides the set of optimal choices for any point along that balance—ranging from LCGs at the extreme performance-oriented end, to some quite expensive but very high-quality hashes at the other end.
In my 2013 article, I recommended the “Wang hash” as a general-purpose 32-bit-to-32-bit integer hash function. The Wang hash was among those tested by Jarzynski and Olano, but unfortunately it did not lie along the Pareto frontier—not even close! The solution that dominates it—and one of the best balanced choices between performance and quality overall—is PCG. In particular, the 32-bit PCG hash used by Jarzynski and Olano goes as follows:
uint pcg_hash(uint input)
{
    // LCG step
    uint state = input * 747796405u + 2891336453u;
    // RXS-M-XS output permutation
    uint word = ((state >> ((state >> 28u) + 4u)) ^ state) * 277803737u;
    return (word >> 22u) ^ word;
}
This has slightly better performance and much better statistical quality than the Wang hash. It’s fast enough to be useful for real-time work, while also being high-quality enough for almost any graphics use case (assuming you’re not using precomputed blue noise or low-discrepancy sequences). It should probably be your default GPU hash function.
Just to prove it works, here’s the bit pattern generated by a few thousand invocations of the above function on consecutive inputs:
Yep, looks random! 👍
PCG Variants
Incidentally, you might notice that the PCG function posted above doesn’t match that found in other sources, such as the minimal C implementation on the PCG website. This is because “PCG” isn’t a single function, but more of a recipe for constructing PRNG functions. It works by starting with an LCG, and then applying a permutation function to mix around the bits and improve the quality of the results. There are many possible permutation functions, and O’Neill’s original PCG paper provides a set of building blocks that can be combined in various ways to get generators with different characteristics. In particular, the PCG used by Jarzynski and Olano corresponds to the 32-bit “RXS-M-XS” variant described in §6.3.4 of O’Neill. (See also the list of variants on Wikipedia.)
Hash or PRNG?
One of the main points I discussed in my 2013 article was the distinction between PRNGs and hash functions: the former are designed to produce a good distribution within a single stateful stream, but do not necessarily provide a good distribution across streams with consecutive seeds; hash functions, by contrast, are stateless and designed to give a good distribution even for consecutive (or otherwise highly correlated) inputs.
PCG is actually designed to be a PRNG, not a hash function, so it may surprise you to see it being used as a hash here. What gives? Well, apparently PCG is just so good that it works well as a hash function too! ¯\_(ツ)_/¯
It’s worth noting that PCG does support reasonably efficient jump-ahead, owing to the LCG at its core: it’s possible to advance an LCG by $n$ steps in only $O(\log n)$ work using modular exponentiation. However, that is not what Jarzynski and Olano’s code does: it’s not jumping ahead to the $n$th value in a single PCG sequence, but essentially just taking the first value from each of $n$ sequences with consecutive initial states. The fact that this works at all is somewhat surprising, and a testament to the power of permutation functions.
In my previous article, I also recommended that if you need multiple random values per pixel, you could start with a hash function and then iterate either LCG or Xorshift using the hash output as an initial state. You can still do that, using PCG as the initial hash—but it might be just as fast to iterate PCG. The interesting thing about PCG’s design is that only the LCG portion of it actually carries data dependencies from one iteration to the next, and LCGs are super fast. The permutation parts are independent of each other and can be pipelined to exploit instruction-level parallelism when doing multiple iterations.
For completeness, the “PRNG form” of the above PCG variant looks like:
uint rng_state;
uint rand_pcg()
{
    // Advance the LCG, but permute the *previous* state for output
    uint state = rng_state;
    rng_state = rng_state * 747796405u + 2891336453u;
    // Same RXS-M-XS permutation as the hash above
    uint word = ((state >> ((state >> 28u) + 4u)) ^ state) * 277803737u;
    return (word >> 22u) ^ word;
}
That’s about it! Be sure to check out Jarzynski and Olano’s paper for some more tidbits, including a discussion of hashes with multi-dimensional inputs and outputs.