Separate GPU compute kernels vs. a single GPU kernel

(L) [2015/09/30] [szellmann] [Separate GPU compute kernels vs. a single GPU kernel] Wayback!

Hi,
in several papers you can read that it's beneficial for ray tracing (and the like) to separate computations into different GPU kernels: e.g. ray generation, accel traversal, shading, tone mapping, vs. having one single kernel do the whole computation.
Aila / Laine also have separate kernels, e.g. for primary ray generation. I believe that they sort the rays.
In theory, such an approach should be beneficial because you can control the occupancy of each compute kernel individually, but at the cost of having to store intermediate buffers in potentially costly global memory, and at the cost (in my opinion) of less readable code.
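To make the trade-off concrete, here is a minimal host-side C++ sketch of the two structures, not real GPU code. Each loop iteration stands in for one GPU thread and each loop in `render_split` stands in for one kernel launch. The stage functions `generate_ray`, `traverse` and `shade` and the `Ray`/`Hit` structs are hypothetical placeholders, not taken from any renderer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the real pipeline stages.
struct Ray { float origin; float dir; };
struct Hit { float t; };

inline Ray   generate_ray(std::size_t pixel) { return Ray{0.0f, 1.0f + static_cast<float>(pixel)}; }
inline Hit   traverse(const Ray& r)          { return Hit{r.origin + 2.0f * r.dir}; }
inline float shade(const Hit& h)             { return 0.5f * h.t; }

// Single-kernel style: every "thread" runs all stages back to back,
// so intermediate rays and hits never leave registers.
std::vector<float> render_megakernel(std::size_t num_pixels)
{
    std::vector<float> image(num_pixels);
    for (std::size_t i = 0; i < num_pixels; ++i)
        image[i] = shade(traverse(generate_ray(i)));
    return image;
}

// Split style: one "launch" per stage. Rays and hits round-trip through
// intermediate buffers, which would live in global memory on a GPU.
std::vector<float> render_split(std::size_t num_pixels)
{
    std::vector<Ray>   rays(num_pixels);
    std::vector<Hit>   hits(num_pixels);
    std::vector<float> image(num_pixels);
    assert(rays.size() == hits.size() && hits.size() == image.size());

    for (std::size_t i = 0; i < num_pixels; ++i) rays[i]  = generate_ray(i);   // kernel 1
    for (std::size_t i = 0; i < num_pixels; ++i) hits[i]  = traverse(rays[i]); // kernel 2
    for (std::size_t i = 0; i < num_pixels; ++i) image[i] = shade(hits[i]);    // kernel 3
    return image;
}
```

Both functions compute the same image. The split version pays for the two extra buffers and the extra launches, and in return each stage could be tuned (and its occupancy controlled) on its own.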
I've done some experiments on NVIDIA GPUs (1st gen Titan, MacBook Pro mobile GPU). I found that separating the control flow into different kernels had little to no effect on performance. I had primary ray generation separated into its own kernel, and also, a while back, the pixel store/tone mapping code. My more recent experiments were with CUDA and Thrust; the older ones were with OpenCL. I ran those experiments with coherent workloads (primary ray shading only, Whitted ray tracing), and with path tracing on sufficiently complex scenes.
What I actually haven't done so far is separate BVH traversal from shading, because I believe that this would have a huge impact on my API that I'd only be willing to accept if the effort is justified.
I wonder if anyone has done similar experiments and would share his or her experience. In addition, has anyone investigated this on ATI?
Thanks, Stefan

(L) [2015/09/30] [toxie] [Separate GPU compute kernels vs. a single GPU kernel] Wayback!

Opinions still differ on this topic. As an example, Chaos Group is not (yet?) convinced and chose a different way of speeding up their megakernel: [LINK http://dl.acm.org/citation.cfm?id=2668941]
iray on the other hand uses wavefront successfully (see for example [LINK http://on-demand.gputechconf.com/siggraph/2015/presentation/SIG1502-Phil-Miller.pdf])
My personal opinion on this is that both approaches can work well, depending on the complexity of your renderer and a lot of tiny details that will influence the architecture of either the megakernel or the wavefront approach. So there is no silver bullet yet, and the same holds for any kind of sorting approach.
TLDR: You have to do experiments for each specific case at the moment. :/

(L) [2015/09/30] [szellmann] [Separate GPU compute kernels vs. a single GPU kernel] Wayback!

>> toxie wrote: Opinions still differ on this topic. As an example, Chaos Group is not (yet?) convinced and chose a different way of speeding up their megakernel: [LINK http://dl.acm.org/citation.cfm?id=2668941]
Thx, didn't know that paper!
>> toxie wrote: iray on the other hand uses wavefront successfully (see for example [LINK http://on-demand.gputechconf.com/siggraph/2015/presentation/SIG1502-Phil-Miller.pdf])
I've actually seen this talk at SIGGRAPH this year, so I should have remembered. So maybe I should at least also test on Maxwell. I remember him saying that they successfully moved from the megakernel to wavefronts because their megakernel implementation didn't scale to the most recent architecture.
When looking at OptiX (same talk), I think that it was designed from the beginning to be able to do this. The programmer can implement entry points like on_hit, on_miss or so. What the library does in between is (I think) completely opaque. In contrast to that, with Embree the user has full control over the whole algorithm and basically just calls intersect(ray,bvh). I find the latter preferable, provided the impact on performance is not too high.

(L) [2015/10/01] [papaboo] [Separate GPU compute kernels vs. a single GPU kernel] Wayback!

It really depends a lot on your use case, but separating the ray generation program and the gamma correction post process pass from a traversal-and-shading kernel is almost bound to give you zero improvements. The kernels are so small that anything you could gain by having better occupancy is probably lost on kernel launch overhead and memory operations.
Let's just take a step back first and go over why separating your kernels can be a good idea, because without the why, there's no knowing when to do it. [SMILEY :)]
Disclaimer: This is based on CUDA and NVIDIA GPUs. I don't know the AMD architecture.
One reason to separate your mega kernel into smaller kernels is to reduce register pressure, i.e. the number of live variables that the compiler has to fit in registers. This is a good idea, because the fewer registers you use, the more threads you can potentially(!) launch. Additionally, any data that doesn't fit into registers gets spilled to local memory, which lives in off-chip device memory just like global memory, and is expensive to read and write. Due to coalescing that overhead might disappear, but if you're out of registers, then there's a fair chance that you don't have a lot of threads that can be used to hide the latency of a memory fetch.
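As a rough back-of-the-envelope illustration of the register argument, the C++ sketch below estimates how many threads can be resident on one multiprocessor for a given per-thread register count. The limits (65536 registers and 2048 threads per multiprocessor, warps of 32) are assumed values for Kepler/Maxwell-class NVIDIA GPUs; check the numbers for your own hardware. The model deliberately ignores register allocation granularity, shared memory and block-size limits, so treat the result as an upper bound only.

```cpp
#include <algorithm>
#include <cassert>

// Assumed per-multiprocessor limits (Kepler/Maxwell-class values).
constexpr int kRegistersPerSM  = 65536;
constexpr int kMaxThreadsPerSM = 2048;
constexpr int kWarpSize        = 32;

// Upper bound on resident threads per multiprocessor when registers are
// the only limiting resource.
int max_resident_threads(int registers_per_thread)
{
    assert(registers_per_thread > 0);
    int by_registers = kRegistersPerSM / registers_per_thread;
    by_registers -= by_registers % kWarpSize;  // only whole warps can be resident
    return std::min(by_registers, kMaxThreadsPerSM);
}

// Fraction of the maximum thread count that can be resident.
double occupancy(int registers_per_thread)
{
    return static_cast<double>(max_resident_threads(registers_per_thread)) / kMaxThreadsPerSM;
}
```

Under these assumptions a kernel at 32 registers per thread can still fill the multiprocessor, while 64 registers halve the number of resident threads and 128 registers quarter it.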
So, looking at your overall potential kernels:
Ray generation could be register intensive, but usually isn't, and the output is usually so well defined that the compiler shouldn't have any issues detecting which variables are live and which aren't.
Hierarchy traversal is register intensive. A good, hand-tuned traversal kernel will still use about 24-32 registers, which is about as high as you'd want to go.
Shading can be brutally simple or insanely complex. If you have multiple different materials that you all add to a single ubershading kernel, then you could be looking at something really intense both memory- and computation-wise. If you add something like photon mapping into the mix here, then you can get register pressure that will grind your GPU to a halt, because all it's really doing is swapping memory.
Tone mapping is again fairly simple, but you probably want this as a separate pass, just to keep your pipeline modular. [SMILEY :)]
Basically what this all boils down to is that the big gains, if there are any, will come from separating acceleration structure traversal from shading IF you have a complex shading kernel. If all you're doing is a bit of Lambert+Blinn, then I doubt you'll see any performance improvement. (And if you do have these multiple materials in a single kernel, then you can also potentially sort your intersections by the type of material hit to have fewer divergent threads, but then you really need some crazy materials to be able to justify it. [SMILEY :D])
As for results: I haven't completely separated the traversal pass from shading myself yet in my current project. I tried it 6 years back, but back then I had a very, very simple material and all I got was overhead. I think that with my current material library I could gain something. I did move parts of the shading that were really intense and shared by all materials into its own kernel (yes, writing it like that makes it sound horrible with regards to code structure, but it is actually a very clean separation and I can't go into why, unfortunately) and in some cases we saw a speedup of a factor of 5.
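The idea of sorting intersections by material can be sketched on the host in C++ as follows. `HitRecord`, `count_divergent_warps` and `sort_by_material` are hypothetical names made up for this illustration, and a real GPU implementation would use a parallel sort or binning step rather than `std::stable_sort`. The sketch only shows the effect on divergence: a group of 32 consecutive hits is counted as divergent if it contains more than one material type.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct HitRecord {
    int ray_index;    // which ray/pixel this hit belongs to
    int material_id;  // hypothetical tag for the material type that was hit
};

// Counts warp-sized groups of consecutive hits that mix material types,
// i.e. groups whose threads would take different shading branches.
int count_divergent_warps(const std::vector<HitRecord>& hits, std::size_t warp_size = 32)
{
    assert(warp_size > 0);
    int divergent = 0;
    for (std::size_t begin = 0; begin < hits.size(); begin += warp_size) {
        const std::size_t end = std::min(begin + warp_size, hits.size());
        for (std::size_t i = begin + 1; i < end; ++i) {
            if (hits[i].material_id != hits[begin].material_id) {
                ++divergent;
                break;
            }
        }
    }
    return divergent;
}

// Reorders hits so that equal materials are adjacent. The stable sort keeps
// the original ray order within each material.
void sort_by_material(std::vector<HitRecord>& hits)
{
    std::stable_sort(hits.begin(), hits.end(),
                     [](const HitRecord& a, const HitRecord& b) {
                         return a.material_id < b.material_id;
                     });
}
```

For 128 hits cycling through four materials, all four groups of 32 are divergent before sorting and none are afterwards. Whether that gain outweighs the cost of the sort is exactly what needs to be profiled.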
But try it for yourself and profile profile profile. [SMILEY :)]
TLDR: Split hierarchy traversal from shading if your shading kernel is complex. Always profile it though, because it's not a sure thing!

(L) [2015/10/01] [szellmann] [Separate GPU compute kernels vs. a single GPU kernel] Wayback!

Hi papaboo,
thanks for your answer and for sharing the experience - yes, what you say sounds reasonable: identify the sources of register pressure, and separate kernels if two independent tasks use a lot of registers.
I analyzed register usage for the ray generation kernel: it's ~20 registers (I use template code to generate multiple kernels: ray gen by transforming normalized device coordinate rays into camera space with a matrix multiplication, ray gen by using u|v|w, jittered ray gen, ...). The high register usage is why I expected it to be worthwhile to separate it from the rest of the pipeline. But then it's probably not only register pressure. Ray gen needs no branching and can probably be fully pipelined per thread. I also have that ported to SoA now; no improvements in wall clock time from that either.
I'm writing this for a library where the user can decide on his own how complicated shading will be. There are builtin algorithms like Whitted ray tracing or Kajiya path tracing, but the user is rather expected to write his own algorithm and then use intrinsics. We e.g. have tex{1|2|3}D() CPU and GPU intrinsics for texture filtering. With a cubic texture filtering kernel that cannot be separated, register pressure goes insanely high. Everything is templated. The builtin kernels deal with template materials, and the user can pass e.g. a variant for that: generic_material<mirror, plastic, matte, ...>, or simply a single material (e.g. plastic), so shading in general can become arbitrarily complex. On the other hand, if the user knows that there are e.g. no dielectrics, he can completely deactivate the refraction code by not specifying that type of material in the template argument list, so the user decides at compile time how "uber" the shader becomes.
And then the library is not specifically meant for surface rendering. In the beginning I started to write it to consolidate code from a CPU and a GPU scientific volume renderer that we use at our institute. With that there is no BVH traversal involved, so a concept like the OptiX intersect program simply does not apply.
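The compile-time material list can be illustrated with a small C++17 sketch built on `std::variant`. This is not the library's actual implementation: the `generic_material` alias, the material structs, their fields and the `shade` overloads are all hypothetical stand-ins. The point is only that code for a material type is instantiated if and only if that type appears in the template argument list.

```cpp
#include <cassert>
#include <variant>

// Hypothetical material types with toy shading models.
struct matte   { float albedo; };
struct mirror  { float reflectance; };
struct plastic { float diffuse; float specular; };

inline float shade(const matte& m, float cos_theta)   { return m.albedo * cos_theta; }
inline float shade(const mirror& m, float)            { return m.reflectance; }
inline float shade(const plastic& m, float cos_theta) { return m.diffuse * cos_theta + m.specular; }

// Assumed stand-in for a variant-style material: only the listed types
// can be stored, so only their shading code ends up in the kernel.
template <typename... Materials>
using generic_material = std::variant<Materials...>;

// Dispatches to the overload that matches the currently stored material.
template <typename... Materials>
float shade(const generic_material<Materials...>& m, float cos_theta)
{
    assert(cos_theta >= 0.0f);
    return std::visit(
        [cos_theta](const auto& concrete) { return shade(concrete, cos_theta); },
        m);
}
```

With `generic_material<matte, plastic>`, for example, the `mirror` overload is never instantiated through the variant, which mirrors how leaving a material out of the list removes its code from the "uber" shader.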
Not sure how to deal with this. If it's up to the user, maybe some kind of annotation tool would be great:
Code:
// User code
intersect(rays[index], bvh);
#pragma __split_gpu_kernel(buffer[rays], buffer[RNGs])
shade(rays[index], very_complex_material);

But then I'd probably have to do something like pre-parsing the code with LLVM or so.
TLDR: Yes, it sounds reasonable that wavefronts are beneficial with fairly complex shading, but with my code this is unfortunately up to the user of the library.

(L) [2015/10/01] [papaboo] [Separate GPU compute kernels vs. a single GPU kernel] Wayback!

Well that does make profiling a bit tricky. [SMILEY ;)]
Volume rendering and surface rendering can still use more or less the same primitive operations as in OptiX though, as far as I remember. See Wenzel Jakob's thesis for the specifics. But I haven't tried out volume rendering yet, so my memory might be playing tricks on me.
Oh, and combining kernels k1, k2 and k3 into a megakernel doesn't mean that the megakernel will use the combined register count of all three separate kernels, i.e. registers(k1) + registers(k2) + registers(k3) != registers(k1 + k2 + k3). The combined kernel will most likely use fewer registers than that sum, since some variables won't be live for the entire duration of the combined kernel. So while your ray generation kernel may use 20 registers, it shouldn't affect the megakernel by more than the registers used for the variables that need to stay live for ray tracing and shading, which the ray tracing and shading kernels probably need to allocate anyway, since they really do need these variables. [SMILEY ;)]
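A toy liveness model in C++ shows why the register counts don't simply add up. It is only a sketch: real compilers allocate registers in far more sophisticated ways, and `LiveRange`, `add_ranges` and `peak_live` are hypothetical helpers. Each value is live over a range of instruction indices, and the register need is taken to be the largest number of values live at any one instruction.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// A value is live from instruction `first` to instruction `last`, inclusive.
struct LiveRange { int first; int last; };

// Appends `count` values that share the same live range.
void add_ranges(std::vector<LiveRange>& ranges, int count, int first, int last)
{
    assert(count >= 0 && first <= last);
    for (int i = 0; i < count; ++i) ranges.push_back(LiveRange{first, last});
}

// Peak number of simultaneously live values, used here as a stand-in for
// the register count the kernel would need.
int peak_live(const std::vector<LiveRange>& ranges)
{
    int last = 0;
    for (const LiveRange& r : ranges) last = std::max(last, r.last);

    int peak = 0;
    for (int pc = 0; pc <= last; ++pc) {
        int live = 0;
        for (const LiveRange& r : ranges)
            if (r.first <= pc && pc <= r.last) ++live;
        peak = std::max(peak, live);
    }
    return peak;
}
```

As a made-up example, suppose ray generation (instructions 0-9) needs 20 values, 8 of which describe the ray and stay live until the end of shading, traversal (10-19) needs 24 more, and shading (20-29) needs 16 more. As separate kernels that is 20, 32 and 24 registers, 76 in total, but the combined timeline peaks at 32, which is what the traversal stage needs on its own.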

(L) [2015/10/03] [atlas] [Separate GPU compute kernels vs. a single GPU kernel] Wayback!

From my experience, the main motivation to split kernels is to encourage more coherent instructions and memory accesses by performing some sort of coherency operation between kernels (e.g. sorting rays by leaf node after the traversal kernel to coalesce triangle accesses, material evaluation instructions, and texture accesses during the intersection and shading kernels). You definitely want to keep register usage low for things like occupancy, but I feel like proper scoping or reuse would take care of that. I would not split trivial kernels for the sake of having smaller kernels; there needs to be some kind of additional motivation.
