Re: GPU VCM
[2016/06/11] jbikker:
I currently use the microfacet BRDF from "Physically Based Rendering: From Theory to Implementation": Blinn-Phong distribution, Schlick's Fresnel approximation, and the geometry term G from the book.
The material model is from SmallVCM: sampleBRDF() obtains a direction with a probability proportional to the BRDF; evaluateBRDF() returns a probability for a given pair of directions, extended with bidirectional evaluation. This interface is used consistently and greatly helps to reduce code complexity.
The BRDF evaluation is far from branchless though, and involves moving from world to local space and back, as well as a 'pow', a 'sqrt', a 'sin' and a 'cos'. As a consequence, switching to Lambert (with the same interface, of course) speeds things up quite a bit.
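For reference, such an interface can be sketched like this, instantiated for Lambert (illustrative names and signatures, not the actual code; float3 operators as in CUDA's helper_math.h):

```cuda
#include <math_constants.h> // CUDART_PI_F

// Evaluate the BRDF for a given pair of directions. The pdfs of sampling
// wo from wi (forward) and wi from wo (reverse) are both returned, which
// is the 'bidirectional evaluation' the MIS weights need.
__device__ float3 evaluateBRDF( const float3 albedo, const float3 N,
                                const float3 wi, const float3 wo,
                                float& pdfFwdW, float& pdfRevW )
{
    pdfFwdW = fmaxf( 0.0f, dot( N, wo ) ) / CUDART_PI_F;
    pdfRevW = fmaxf( 0.0f, dot( N, wi ) ) / CUDART_PI_F;
    return albedo / CUDART_PI_F;
}

// Sample a direction with probability proportional to the BRDF
// (cosine-weighted for Lambert). T, B, N span the local tangent frame.
__device__ float3 sampleBRDF( const float3 albedo, const float3 T, const float3 B,
                              const float3 N, const float2 r,
                              float3& wo, float& pdfW )
{
    const float s = sqrtf( r.x ), phi = 2.0f * CUDART_PI_F * r.y;
    const float3 local = make_float3( s * cosf( phi ), s * sinf( phi ), sqrtf( 1.0f - r.x ) );
    wo = local.x * T + local.y * B + local.z * N; // local to world space
    pdfW = local.z / CUDART_PI_F;
    return albedo / CUDART_PI_F;
}
```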
[2016/06/13] koiava:
Great results as always [SMILEY :)]
One thing that seems strange to me in the "vcm gpu" video is that the photons are very noticeable.
Directly visible photons correspond to CD.*L paths, which should have a much lower MIS weight than the BDPT paths. Am I right?
Also, in the last two videos, "BDPT with directional regularization" and "directionally regularized light transport", the caustics visible in the mirror (CSDS.*L paths) have much less noise than the CD.*L paths. I'm not familiar with the algorithm itself, but this looks very strange to me.
[2016/06/14] jbikker:
The directly visible photons should have a low MIS weight, but they also carry quite some energy; perhaps that explains their brightness?
I followed SmallVCM quite closely; the renderer converges to the correct image for 'easy' scenes (large light source, so that unidirectional path tracing can find it), so I assumed I got it 'right'.
Regarding the low noise levels in the mirror, and particularly in the caustic: I find that strange too, so I did some experiments:
- In general, the mirrored caustic is not nearly as bright. This turns out to be caused by some clamping I am doing: paths carrying too much energy are normalized, then scaled to the maximum allowed throughput (see the sketch after this list). This greatly reduces fireflies, but introduces bias. Without the clamping, the caustic in the mirror converges slowly, due to the contribution of rare high-energy paths. With the clamping, the bias is significant for this feature.
- In general, the floor viewed via the mirror is not as noisy as the floor seen directly. This effect goes away if I use a diffuse shader; with the microfacet BRDF, the regularized bounces are actually less specular than the unregularized ones, which makes surfaces converge quicker when seen in the mirror. After a few passes this effect goes away (the technique is consistent), but by that time the noise is already low.
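The clamping mentioned in the first point amounts to something like this (a sketch; the parameter name is illustrative, float3 helpers as in helper_math.h):

```cuda
// Clamp a path contribution: normalize by its largest component, then scale
// to the maximum allowed throughput. Kills fireflies, but introduces bias,
// since rare high-energy paths lose energy.
__device__ float3 clampContribution( float3 c, const float maxContribution )
{
    const float m = fmaxf( c.x, fmaxf( c.y, c.z ) );
    if (m > maxContribution) c *= maxContribution / m;
    return c;
}
```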
So... bias is affecting the renderer.
Does the above make sense? I'm trying to get a grasp on proper light transport so I'm not very confident it's all perfect. [SMILEY :)]
I optimized it a bit since posting the video by the way: it runs at roughly twice the speed now. Looks like 16 samples is quite close to 'enough' for interactive rendering. It would also be interesting to see how some basic filtering would do with this kind of input.
[2016/06/14] ingenious:
Nice videos!
I'd recommend using a smaller radius for vertex merging, something close to the projected pixel size, which would remove the nasty photon splotches. With VCM you can most often get away with using a small radius (unlike in PPM), because in the areas where merging is important the photon density is typically high, and in regions where it isn't important other techniques are better (and will be weighted higher). Ideally you'd want to determine the merging radius for every pixel from the pixel footprint at the first non-specular surface. A slight modification to the MIS weight computation is necessary in this case, which I've described in section 8.5.1 of [LINK http://www.iliyan.com/publications/ThesisPhD my thesis]. The nice thing is that this will remove the generally very sensitive world-scale radius parameter from your render settings. If you still want to retain some control, you can have a global radius_scale parameter to multiply the footprint-determined per-pixel radius.
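Concretely, the footprint-based radius could look like this (a rough sketch assuming a pinhole camera; all names are illustrative, and the distance should accumulate along specular chains):

```cuda
// Per-pixel merge radius from the pixel footprint at the first non-specular
// vertex: the footprint grows roughly linearly with the distance travelled.
__device__ float mergeRadius( const float pathDist,     // distance to the vertex
                              const float pixelSpread,  // ~ 2 * tan(fov/2) / resY
                              const float radiusScale ) // optional global control
{
    return radiusScale * pixelSpread * pathDist;
}
```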
Question: On that scene with the mirror, what's the improvement you see with directional regularization over pure BDPT? I'm asking because directional regularization was originally designed to render "impossible" paths from point light sources and is pretty much equivalent to enlarging your light source a bit. And your scene has a decently sized light source already.
[2016/06/15] jbikker:
If I use a radius close to the projected pixel size, the number of photons arriving per pixel is going to be low, right? That would probably lead to fireflies for quite a bit longer than the first 16-32 samples, which is the time frame I'm trying to optimize for.
As for the improvement of directional regularization over BDPT: it's also in this small time frame; the light cast by the mirror shows considerably less noise early on.
[IMG #1 dirreg.png]
This is without any clamping. If I let it run longer, the dirreg actually starts to produce fireflies, I suppose because many near-specular interactions are evaluated.
[IMG #1]: not scraped: /web/20190531001341im_/http://ompf2.com/download/file.php?id=244&sid=621f7c95a76ecb2c82193175dc6d312b
[2016/06/15] ingenious:
>> jbikker wrote: If I use a radius close to the projected pixel size, the number of photons arriving per pixel is going to be low, right? That would probably lead to fireflies for quite a bit longer than the first 16-32 samples, which is the time frame I'm trying to optimize for.
Indeed, this will inevitably increase the noise in vertex merging and in the entire VCM as well, because the variance of vertex merging is inversely proportional to the squared radius.
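In kernel-density terms this is roughly the following (a sketch, for $N$ light subpaths and a uniform disc kernel of radius $r$):

```latex
\hat{L}_{\mathrm{VM}} = \frac{1}{N \pi r^{2}}
  \sum_{i \in \mathrm{disc}(r)} f_s(x,\, \omega_i \to \omega_o)\, C_i ,
\qquad
\mathrm{Var}\big[\hat{L}_{\mathrm{VM}}\big] \propto \frac{1}{N \pi r^{2}} .
```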
>> jbikker wrote: As for the improvement of directional regularization over BDPT: it's also in this small time frame; the light cast by the mirror shows considerably less noise early on.
That's a fair point. I'd be very interested to see how pure BDPT, dirreg and VCM compare with a slightly higher number of samples (without clamping, of course)!
[2016/06/15] jbikker:
Here you go:
VCM, ~4spp: [IMG #1 Image]
VCM, ~32spp: [IMG #2 Image]
VCM, converged: [IMG #3 Image]
BDPT, ~4spp: [IMG #4 Image]
BDPT, ~32spp: [IMG #5 Image]
BDPT, converged: [IMG #6 Image]
dirreg, ~8spp: [IMG #7 Image]
dirreg, ~32spp: [IMG #8 Image]
dirreg, converged: [IMG #9 Image]
VCM versus BDPT, raw difference:
[IMG #10 Image]
dirreg versus BDPT, raw difference:
[IMG #11 Image]
Sample counts are approximate.
[IMG #1]: not scraped: https://web.archive.org/web/20210508222213im_/http://www.cs.uu.nl/docs/vakken/magr/materials/vcm_4spp.png
[IMG #2]: images/b5db186ee37a7d42ff7270bed38e63b112cb128bdf3a7d2e4da02bf0d6353a7a.png
[IMG #3]: images/4c4e1d25366345564213bfd26355991cd5e5416a5f5a131df9b64660fd6b0bdb.png
[IMG #4]: images/ab827b92b4b40d1d2b10bcbf2d6838325513a1ff9ca43b9d8505c5114cf3f0e2.png
[IMG #5]: images/dc192df6d0c30d9d45066ef97e655c33a825ab5e8247cb4f9e06d2ef88e0fd77.png
[IMG #6]: not scraped: https://web.archive.org/web/20210508222213im_/http://www.cs.uu.nl/docs/vakken/magr/materials/bdpt_converged.png
[IMG #7]: not scraped: https://web.archive.org/web/20210508222213im_/http://www.cs.uu.nl/docs/vakken/magr/materials/dirreg_8spp.png
[IMG #8]: images/148e33c86ebd8a6459ec2f14606db3c6b0d4b17e61d3a0a4225bbbe6c4a4573f.png
[IMG #9]: not scraped: https://web.archive.org/web/20210508222213im_/http://www.cs.uu.nl/docs/vakken/magr/materials/dirreg_converged.png
[IMG #10]: not scraped: https://web.archive.org/web/20210508222213im_/http://www.cs.uu.nl/docs/vakken/magr/materials/vcm_vs_bdpt.png
[IMG #11]: not scraped: https://web.archive.org/web/20210508222213im_/http://www.cs.uu.nl/docs/vakken/magr/materials/bdpt_vs_dirreg.png
[2016/06/15] ingenious:
Thanks, very interesting!
[2016/07/06] jbikker:
Made some progress:
Full quality video: [LINK http://www.cs.uu.nl/docs/vakken/magr/materials/bdpt_wavefront.avi]
To be able to use BVHs instead of the hardcoded scene, I implemented wavefront path tracing. It took some time to get it efficient (and working [SMILEY ;)] ), but right now, for the hardcoded scene, the impact is ~40%, which seems reasonable considering the massive I/O to global memory. The benefit of this approach is obviously that occupancy is restored to 100% due to compaction at several points, most notably before BVH traversal starts.
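The overall structure is something like this (an illustrative sketch; kernel and queue names are made up, not the actual code):

```cuda
// One wavefront iteration: each stage is its own kernel, ray/path state
// lives in global memory, and queues are compacted so that every stage
// launches only as many threads as there is actual work.
for (int bounce = 0; bounce < MAXDEPTH; bounce++)
{
    extendPaths<<<blocks, 128>>>( pathState, extensionRays );
    traceExtensions<<<blocks, 128>>>( bvh, extensionRays, hitRecords ); // full occupancy
    shade<<<blocks, 128>>>( hitRecords, pathState, shadowRays, accumulator );
    traceShadows<<<blocks, 128>>>( bvh, shadowRays, accumulator );
    compactPaths( pathState ); // drop terminated paths before the next bounce
}
```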
Next step is getting rid of the remaining hardcoded scene parts.
[2016/07/06] beason:
Very nice! 500M rays/sec is fast.
[2016/07/07] rtpt:
Nice. Is there a source/demo available for tests?
[2016/07/07] jbikker:
Sorry, no, this will be closed source.
[2016/07/07] atlas:
Yikes, reading/writing ray state to global memory every bounce sounds scary, but GPUs never cease to surprise me. I suppose you're also in a good position now for doing complex materials, as long as you sort for coherency like the megakernel paper does. Good work, keep us updated.
[2016/07/07] jbikker:
Yes, I was a bit surprised to get it this fast. I mean, there's an impact from all the I/O, but it's pretty low; for any decent number of triangles, the gains from the improved occupancy are going to outweigh the costs. I noticed before that the wavefront paper is not optimistic enough about this: even a very basic shader benefits from wavefronts, not just complex shaders.
By the way, a big factor was an 'SoA' data layout: e.g. ray origins are stored contiguously, and every read is a float4 read. This way, a single warp read becomes a 32*float4 = 512-byte consecutive read, which coalesces perfectly. Since this works so well, I made the target buffer SoA as well, so writing red components also means writing 32*int = 128 bytes of consecutive memory. Saved another 5%. For light vertices this didn't work, because these are read in random order after the first bounce.
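Schematically, the layout looks like this (a sketch with illustrative names):

```cuda
// SoA ray storage: one float4 array per attribute. A warp reading attribute k
// for 32 consecutive rays touches 32 * sizeof(float4) = 512 contiguous bytes.
struct RayBufferSoA
{
    float4* originDist;   // xyz = origin, w = tmax
    float4* dirPixel;     // xyz = direction, w = pixel index, stored as bits
};

__device__ void loadRay( const RayBufferSoA rays, const int rayIdx,
                         float3& O, float3& D, float& tmax, int& pixelIdx )
{
    const float4 o4 = rays.originDist[rayIdx]; // one coalesced float4 load
    const float4 d4 = rays.dirPixel[rayIdx];   // and another
    O = make_float3( o4.x, o4.y, o4.z ); tmax = o4.w;
    D = make_float3( d4.x, d4.y, d4.z ); pixelIdx = __float_as_int( d4.w );
}
```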
Also important: wavefronts require keeping track of counters (e.g. the number of remaining extension / connection rays). I managed to keep all the counters on the device by using persistent kernels, so nothing ever gets copied to the CPU. Without this, the GTX 980Ti suffered from very low GPU utilization (~45%) compared to the mobile Quadro 4100, which suggests that this would be even worse on a 1080. Now that kernels execute back-to-back, GPU utilization is near-optimal.
I was hoping to put everything in a single kernel, with GPU-wide synchronization between stages ([LINK http://synergy.cs.vt.edu/pubs/papers/xiao-ipdps2010-gpusync.pdf]), but I can't seem to get this working: I keep getting hangs which require a full system restart.
By the way, a quick question for CUDA gurus: I use persistent kernels where the number of threads is simply the number of SMs times the block size (128 in my case), and each block 'fights for food' until the work runs out (as in Aila & Laine, "Understanding the Efficiency..."). The strange thing is that this is more efficient if I start 4 or 8 times as many blocks as there are SMs. I can't figure out the reason for this behavior. Any ideas?
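For reference, the pattern in question looks roughly like this (a sketch of the Aila & Laine-style work fetch; counter handling simplified to one fetch per thread):

```cuda
// Persistent threads: a fixed grid keeps running and 'fights for food' via a
// global atomic counter until the work pool is empty. Both counters live in
// device memory, so no ray count ever has to round-trip to the host.
__device__ int nextWorkItem;  // reset to 0 (e.g. by a tiny kernel) before the pass
__device__ int workItemCount; // written on the device by the producing kernel

__global__ void persistentTrace( /* rays, BVH, hit records, ... */ )
{
    while (true)
    {
        const int rayIdx = atomicAdd( &nextWorkItem, 1 );
        if (rayIdx >= workItemCount) return; // pool exhausted, retire the thread
        // traceRay( rayIdx, ... );
    }
}
```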
[2016/07/07] atlas:
An SM can keep more than one block 'resident' at a time, as long as resources (register usage, shared memory, etc.) allow for it, and the warp scheduler can pull a warp from any of these resident blocks at any given time. So it may be that your kernel permits more than one block to be resident on an SM, in which case using more blocks per SM allows greater latency hiding, since the warp scheduler has more warps to pull from.
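A quick way to verify this is to ask the runtime how many blocks of the kernel fit on one SM and size the grid accordingly (a sketch; persistentTrace stands in for the actual kernel):

```cuda
int device, numSMs, blocksPerSM;
cudaGetDevice( &device );
cudaDeviceGetAttribute( &numSMs, cudaDevAttrMultiProcessorCount, device );
cudaOccupancyMaxActiveBlocksPerMultiprocessor( &blocksPerSM, persistentTrace, 128, 0 );
persistentTrace<<<numSMs * blocksPerSM, 128>>>(); // fill all resident block slots
```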
[2016/07/13] atlas:
I was actually kind of curious about persistent kernels, so I added them to my path tracer and got about a 10% slowdown. I also noticed the same behavior, where using about 4 blocks per SM was optimal.
I think with modern cards it may just be best to throw everything at the hardware scheduler and let it do its thing. I don't know how effective persistent threads are these days, and they also make it more difficult to run across different cards with similar efficiency.
Did you see a performance boost with your implementation of persistent threads?
[2016/07/14] jbikker:
I didn't implement them for a performance boost, but to be able to keep the counters on the GPU. If you produce N shadow rays in one kernel (shading), you either spawn N threads for the next kernel (tracing shadow rays), or you run SMCount*128*4 persistent threads that trace shadow rays. Spawning N threads requires that the host knows N; syncing this info was my primary bottleneck.
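Schematically, the producer side then looks like this (an illustrative sketch; the consuming persistent kernel reads the same counter, so N never leaves the device):

```cuda
__device__ int shadowRayCount; // reset on the device before shading

__global__ void shade( /* hit records, queues, ... */ )
{
    // ... per-path shading work; when a connection is made:
    const int slot = atomicAdd( &shadowRayCount, 1 ); // device-side counter
    // shadowRays[slot] = ...; no host readback needed before tracing them
}
```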
[2016/07/22] atlas:
Just curious: when do you decide to present a frame? Are you trying to pull off a full 8-bounce sample per pixel before you present, or presenting at every bounce? I always struggle with this, because frame display has a bit of overhead, but for interactivity you want to present quite often.
[2016/07/25] jbikker:
I present a frame after doing a full 8-bounce sample for every pixel (although most paths will be shorter due to Russian Roulette; the cap of max length 8 applies to individual light paths, individual eye paths and combined paths, as in SmallVCM). Doing this in a reasonable amount of time is currently not a problem. I just switched to full scenes, for which I get ~150 Mrays/s on the GTX 980Ti. This yields real-time frame rates for a single BDPT sample per pixel (and even for multiple BDPT samples per pixel).
Presenting at every bounce yields biased results for most frames; I'm not sure this would look good. Of course, a cap on the path length (especially at a low value like 8) also introduces bias, but I suspect it is far less noticeable.
Due to Russian Roulette, the performance impact of longer paths is minimal by the way; the only problem is that the buffers get very large.
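For reference, the roulette decision on the path throughput is roughly this (a sketch; float3 helpers as in helper_math.h):

```cuda
// Russian Roulette: the survival probability follows the remaining throughput,
// so long paths are cheap on average while the estimator stays unbiased.
__device__ bool russianRoulette( float3& throughput, const float r /* uniform [0,1) */ )
{
    const float p = fminf( 1.0f, fmaxf( throughput.x, fmaxf( throughput.y, throughput.z ) ) );
    if (r >= p) return false; // terminate the path
    throughput *= 1.0f / p;   // compensate the survivors
    return true;
}
```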
Since the (compacted) buffers are mostly empty for the deeper bounces, it may be possible to allocate for depth = 8 but bounce to depth = 64, with some kind of safety cap in case the RNG decides that every path should reach 64 for a particular frame. Theoretically this situation has a non-zero probability; in practice it should never happen, of course.
Also note that when using CUDA/OpenGL interop the pixel data never leaves the device. Overhead of presenting results is very small that way.
[2016/08/19] atlas:
You mentioned that you use surface-local space to compute your microfacet BRDF. Here's an optimization for converting between the two spaces: this paper describes a faster way to calculate the basis from your surface normal.
[LINK https://www.semanticscholar.org/paper/Building-an-Orthonormal-Basis-from-a-3D-Unit-Frisvad/36dd42635aa896245b039c1334f30b5ab5b4299c/pdf]
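From memory, the construction in the linked paper looks roughly like this (a sketch; a single branch remains for the n.z ≈ -1 singularity, and no normalization is needed when n is a unit vector):

```cuda
// Frisvad-style orthonormal basis from a unit normal n.
__device__ void buildONB( const float3 n, float3& b1, float3& b2 )
{
    if (n.z < -0.9999999f) // grazing the singularity at n = (0,0,-1)
    {
        b1 = make_float3(  0.0f, -1.0f, 0.0f );
        b2 = make_float3( -1.0f,  0.0f, 0.0f );
        return;
    }
    const float a = 1.0f / (1.0f + n.z);
    const float b = -n.x * n.y * a;
    b1 = make_float3( 1.0f - n.x * n.x * a, b, -n.x );
    b2 = make_float3( b, 1.0f - n.y * n.y * a, -n.y );
}
```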
Keep us updated.
[2016/08/31] szellmann:
Edit: it seems like there is no problem with the handedness after all, but rather with me not properly understanding tangent space [SMILEY :)] Sorry for hijacking the thread; this issue is unrelated.
@atlas:
The method in the paper is especially nice because it is SIMD/SoA vector friendly. With conventional methods that are robust, you check whether the auxiliary vector you chose (e.g. (0,1,0)) happens to be exactly the vector about which you construct your basis, then find the min_element of that vec3, add 1 to it and renormalize. Finding min_element is however not SIMD friendly, because you have to unpack the SoA vec3 for that.
I just wanted to mention, as a warning, that the robust methods unfortunately don't guarantee that the basis has a certain handedness, which is however necessary for many tangent space operations. I had up until now always used Hughes and Möller's method and never came across this problem (I used it e.g. for AO rays, where the handedness doesn't matter). I will need to find a way to work around this.
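For comparison, Hughes and Möller's construction as I remember it (a sketch; normalize/cross as in helper_math.h). Which candidate perpendicular is picked depends on the branch, which is where the handedness subtlety creeps in:

```cuda
// Hughes-Moller: pick a perpendicular based on the smaller components of n
// to avoid degeneracy, then complete the frame with a cross product.
__device__ void buildONB_HM( const float3 n, float3& b1, float3& b2 )
{
    b1 = (fabsf( n.x ) > fabsf( n.z ))
        ? make_float3( -n.y, n.x, 0.0f )
        : make_float3( 0.0f, -n.z, n.y );
    b1 = normalize( b1 );
    b2 = cross( n, b1 );
}
```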
Not robust:
[IMG #1 Image]
Naive:
[IMG #2 Image]
Hughes and Möller:
[IMG #3 Image]
[IMG #1]: images/602f0db18fd71a6c97a665db5cd5fef536fed5f58e118aac87d0231b54decc19.png
[IMG #2]: images/cee246f10996775f10397dcb5512162efbc064a44c67b23f0f70e1ce33bdd7b0.png
[IMG #3]: not scraped