VCM GPU implementation (+ some extras)


[2014/06/01] [by MohamedSakr]

I'm trying to port VCM (from the SmallVCM project) to the GPU using CUDA. I've read "Progressive Light Transport Simulation on the GPU: Survey and Improvements".
I have a few questions:
1- in the light-path loop, should I sort on every light-bounce iteration (to avoid divergence within warps)? I've tested sorting, and it takes around 2.6 ms (1 sort and 1 copy_if) per iteration; assuming we have 10 light bounces, this leads to 26 ms for the whole light loop at full-HD resolution (2 million light paths)
2- same as question 1, but for the camera loop :)
3- in the above paper, the BDPT algorithm connects each camera path to a random vertex from the Light Vertex Cache (LVC), while this is commented on in SmallVCM:
                    // For VC, each light sub-path is assigned to a particular eye
                    // sub-path, as in traditional BPT. It is also possible to
                    // connect to vertices from any light path, but MIS should
                    // be revisited.
How should MIS be revisited? Or, in other words, what should I add to the code?
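Regarding question 1, a quick CPU-side sketch of the sort-and-compact step being asked about (names like PathState and materialId are illustrative, not SmallVCM's; std::sort and std::remove_if stand in for thrust::sort_by_key and thrust::copy_if on the device):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical per-path state; a real tracer would also hold throughput,
// origin/direction, RNG state, etc.
struct PathState {
    uint32_t materialId;  // key: which BSDF this path hit on the last bounce
    bool     alive;       // false once the path has terminated
    int      pixel;       // which pixel the path belongs to
};

// One "compact + sort" step per bounce: drop dead paths, then group the
// survivors by BSDF key so adjacent threads (a warp on the GPU) execute
// the same shading code.
void sortAndCompact(std::vector<PathState>& paths) {
    // compaction: keep only live paths (the copy_if part)
    paths.erase(std::remove_if(paths.begin(), paths.end(),
                               [](const PathState& p) { return !p.alive; }),
                paths.end());
    // sort by BSDF key so each material is shaded in a contiguous range
    std::sort(paths.begin(), paths.end(),
              [](const PathState& a, const PathState& b) {
                  return a.materialId < b.materialId;
              });
}
```

Whether the reduced divergence pays for the ~2.6 ms per-bounce overhead is exactly the trade-off discussed below.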
[2014/06/02] [by tomasdavid]

Yo.
ad 1/2) I am not entirely sure what you intend to sort here. I don't think I sort anything in either of the two, but it's been a while since I touched the implementation.
ad 3) The MIS used is derived for connecting to all vertices of a single "companion" path. When you connect to a subset of vertices from a superset of paths, it is possible that the optimal MIS is different. I haven't looked into it too much, but Dietger's thesis/tech report (I cannot find either right now; I am sure jacco will be able to point it out) had some interesting pointers regarding the influence on MIS. Bottom line: it is probably not that important.
[2014/06/02] [by MohamedSakr]

about sorting:
in these loops, "for(;; ++lightState.mPathLength)" and "for(;; ++cameraState.mPathLength)",
I can sort twice per iteration: once for the BSDF (each BSDF interacts in a different way, and there are lots of checks like BSDF.IsDelta(), BSDF.IsValid(), etc.),
and once at the end for scattering samples, so one thread doesn't have to wait for the others to finish all 10 iterations, for example.
after some "theoretical" measurements on the CPU: at full-HD resolution, in the sun scene, 1 CPU thread @ 4 GHz takes around 33 seconds to finish 1 full image iteration.
Let's assume a code efficiency of 50% (we don't use the full 8 instructions/cycle).
My processor at 4 GHz with 6 cores / 12 threads can do 192 GFlops, so 50% is 96 GFlops (with all threads).
So the total work for 1 thread = (96 / 12) * 33 = around 264 GFlop for a full-HD image iteration.
Averaging over 10 ray bounces, 1 bounce iteration (light + camera) requires 26.4 GFlop (if sorted); with 50% divergence this may rise to 52.8 GFlop (if non-sorted).
to sort or not to sort!!:
sorting 2 million elements with thrust on my GTX 780 takes 1.7 ms (about 7.65 GFlop of budget), and copy_if in thrust (to drop terminated rays after each iteration) takes 0.7 ms (about 3.15 GFlop).
So my guess is I will make an adaptive algorithm so that the GPU knows when to sort :D
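The adaptive decision boils down to a one-line cost comparison. A minimal sketch, with the defaults taken from the estimates in this post (the function name and parameterization are made up for illustration):

```cpp
// Back-of-envelope rule for "to sort or not to sort": sorting pays off when
// the work saved by removing divergence exceeds the sorting overhead.
// Defaults are the per-iteration estimates above: ~26.4 GFlop of useful work
// if sorted, ~52.8 GFlop with 50% divergence if not, and ~10.8 GFlop of
// overhead for one sort (7.65) plus one copy_if (3.15).
bool shouldSort(double sortedWorkGflop   = 26.4,
                double unsortedWorkGflop = 52.8,
                double sortOverheadGflop = 7.65 + 3.15) {
    return sortedWorkGflop + sortOverheadGflop < unsortedWorkGflop;
}
```

In a real renderer the divergence estimate would have to come from runtime counters (e.g. how many distinct materials survived the last bounce), since it varies per scene and per bounce depth.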
[2014/06/02] [by tomasdavid]

Ah, for compaction purposes, got it.
I have only fairly simple BRDFs (the most complicated is Ashikhmin-Shirley), so compacting for BRDFs didn't pay off for me.
Also, as stated in the paper, on the newer generations (6xx, and it goes for 7xx as well) compaction doesn't buy you as much as it did on 5xx, and it leads to more complicated code.
Don't forget that even without compaction you do in-place ray regeneration (as per Novák's paper), so the only differences are:
a) code divergence during the regeneration (the non-regenerated paths don't do anything useful)
b) ray divergence, because the freshly regenerated primary rays are not traced together.
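For reference, a minimal CPU simulation of the in-place regeneration scheme being described (a fixed pool of persistent path slots, where a terminated slot immediately pulls the next primary ray from a shared counter; all names are illustrative, and real paths would of course terminate at varying depths):

```cpp
#include <vector>

// One path slot in the persistent pool.
struct Slot {
    int pathId  = -1;  // which camera path occupies this slot (-1 = empty)
    int bounces = 0;
};

// Process `totalPaths` paths with `poolSize` persistent slots, where each
// path lives for exactly `pathLength` bounces (a simplification).
// Returns the total number of bounce steps executed by the pool.
long runPool(int totalPaths, int poolSize, int pathLength) {
    std::vector<Slot> pool(poolSize);
    int  nextPath = 0;  // shared regeneration counter
    int  finished = 0;
    long steps    = 0;
    while (finished < totalPaths) {
        for (Slot& s : pool) {
            if (s.pathId < 0) {               // empty slot: regenerate in place
                if (nextPath < totalPaths) {
                    s.pathId  = nextPath++;
                    s.bounces = 0;
                } else {
                    continue;                 // nothing left; slot idles
                }                             // (this is divergence point a)
            }
            ++s.bounces;                      // one bounce of work
            ++steps;
            if (s.bounces == pathLength) {    // path terminated
                s.pathId = -1;
                ++finished;
            }
        }
    }
    return steps;
}
```

Point b) shows up because the freshly regenerated slots are scattered across the pool, so their primary rays are not traced as a coherent batch.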
[2014/06/03] [by Dietger]

>> tomasdavid wrote:ad 3) The MIS used is derived for connecting to all vertices of a single "companion" path. When you connect to a subset of vertices from a superset of paths, it is possible that the optimal MIS is different. I haven't looked into it too much, but Dietger's thesis/techreport (cannot find either right now, I am sure jacco will be able to point it out) had some interesting pointers regarding the influence on MIS. Bottomline, it is probably not that important.
Shameless self-promotion: [LINK http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.217.7304]
[2014/06/03] [by ingenious]

>> MohamedSakr wrote:3- in the above paper, the BDPT algorithm shows that in the camera Paths loop, he is connecting to random Light Vertex Cache (LVC) while it is commented in SmallVCM
                    // For VC, each light sub-path is assigned to a particular eye
                    // sub-path, as in traditional BPT. It is also possible to
                    // connect to vertices from any light path, but MIS should
                    // be revisited.
how MIS should be revisited "or what should I add to the code in other words"
Ordinary BPT connects the vertices of each eye subpath to the vertices of a single light subpath. Therefore, the number of samples each (s,t) technique takes is 1. The numerator in the balance heuristic weight is thus equal to the pdf of the constructed path.
Now, say you have sampled N light subpaths and stored their vertices whose number is V. You then trace an eye subpath through every pixel, connecting each eye vertex to C randomly chosen light vertices.
In order to derive the MIS weight for a connection, we can reinterpret the above process as follows. Conceptually, you're connecting each eye vertex to the vertices of all N light subpaths. Thus, the number of samples that each (s,t) technique takes is not 1 anymore, but is N. However, the probability for each connection is now C/V. Therefore, the numerator in the balance heuristic weight needs to be multiplied by N * C/V.
Finally, let's consider the special case where C is set to the average light subpath length, which makes roughly as many connections as ordinary BPT. The average light subpath length is V/N. In this special case everything cancels out in the above multiplier, so the MIS weight remains unchanged. In practice, though, C will almost always be a fractional number, which you round to the closest integer. So the "proper" weight will be different, although only very slightly.
In summary, if you pick the number of connections to be equal to the (integer-rounded) average light subpath length, then you can use the traditional MIS weight.
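The correction factor from this derivation is small enough to write down directly (a sketch; the function name is made up):

```cpp
// When each eye vertex is connected to C light vertices chosen uniformly
// from the V vertices stored for N light subpaths, the numerator of the
// balance-heuristic weight gets multiplied by N * C / V. If C equals the
// average light subpath length V/N, the factor cancels to 1 and the
// traditional BPT weight applies unchanged.
double misMultiplier(double N, double C, double V) {
    return N * C / V;
}
```

For example, with N = 10 subpaths storing V = 40 vertices, the average subpath length is 4; choosing C = 4 gives a multiplier of 1, while C = 2 gives 0.5.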
[2014/06/03] [by MohamedSakr]

@Dietger really nice thesis :D . What I see (I may be mistaken, so correct me) is that you put the whole camera loop inside the light loop. Is this safe to do in SmallVCM, given there's photon merging (PPM)? I see the main benefit here being memory consumption; why is it GPU friendly? I sense it should give the same performance as the separate light loop and camera loop.
@ingenious thanks for the clarification :) , I think I will leave the MIS weight as it is for now
[2014/09/06] [by MohamedSakr]

I'm trying to implement VCM in PBRT. So far I have successfully implemented "vertex to vertex connection", "camera vertex shadow ray", and "camera vertex hitting a light (BG or area)".
What I failed at is "light vertex hitting the camera lens": it gives weird results, and I tried everything I could but somehow couldn't figure out what causes the problem.
For some reason, the light vertices which are closer to the camera appear brighter, and the whole image is not balanced.
Here are some results (note: this is just the BDPT; the vertex merging part is not in the results):

[IMG #1 LVtoCameraLens.jpg]
the bad light vertex connection to camera lens


[IMG #2 otherConnections_correct.jpg]
the good  "vertex to vertex connection" , "camera vertex shadow ray"


[IMG #3 room-photon.jpg]
reference image rendered with photon mapping and final gather
[2014/09/06] [by MohamedSakr]

NVM, I solved it :D . The problem was my misunderstanding of the PBRT image film: I was using film->AddSample(), while the correct call is film->Splat() for rays which come from the light and hit the camera lens. The weighting is fine now and the results are correct.
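A simplified model of why the two film calls behave differently (this is not PBRT's actual film code, just the accumulation logic in miniature): AddSample-style filtering divides by the accumulated weight, so a pixel hit by many light-traced rays gets *averaged* instead of growing brighter, while Splat-style accumulation sums contributions and is normalized once by the number of iterations.

```cpp
// Toy film pixel with both accumulation modes.
struct Pixel {
    double sumWeightedL = 0, sumWeight = 0;  // AddSample-style accumulators
    double splat = 0;                        // Splat-style accumulator
};

// Camera-traced samples: filtered weighted average.
void addSample(Pixel& p, double L, double w = 1.0) {
    p.sumWeightedL += w * L;
    p.sumWeight    += w;
}

// Light-traced contributions: just sum them.
void splat(Pixel& p, double L) { p.splat += L; }

// Final pixel value: filtered average plus splats scaled by 1/iterations.
double finalValue(const Pixel& p, int iterations) {
    double avg = p.sumWeight > 0 ? p.sumWeightedL / p.sumWeight : 0.0;
    return avg + p.splat / iterations;
}
```

With AddSample, three light-traced hits of 0.5 on one pixel would still average to 0.5; with Splat over three iterations they correctly sum to 1.5 and normalize to 0.5 per iteration, independent of how many other pixels the light paths reached.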
