(L) [2013/07/17] [spectral] [Optimization strategies] Wayback!Hi there,
There is a long time I work on the GPU and have already play a lot with several optimizations strategies, but would like to discuss with you some optimization subjects.
Sorting / scanning
It seems that one one the main method to "optimize" the processing on the GPU is based on "sorting" (By using sort, scan, etc...) (It is why I have create the CLPP [LINK https://code.google.com/p/clpp/] library in fact)
It is something I have try in the past, by example:
1) Trace a set or rays
2) Sort all the rays based on intersection/no-intersection
3) Do the shading on the set of "intersection"
It allow to improve the 'thread' coherence, but in your experience :
a) We should sort a set of 'ids', and then the shader kernel only acces the sorted set of 'ids' datas. Something like this :
Code: [LINK # Select all]uint pathId = pathsIds[ threadIdx.x ];
float3 normal = paths[pathId]->Ns;
The problem here is that acces to the paths are not coalescents !
b) Should we directly sort (and Copy/Swap) a complete structure in global memory ? This improve the data coalescence ?
Code: [LINK # Select all]float3 normal = paths[threadIdx.x]->Ns;
Dividing in smaller kernels
In your experience, does dividing a kernel in several smaller kernels help in performance ?
In my experience, the problem is that:
a) often we need to save more data in global memory, then read it in the second kernel
b) controlling the kernels by the CPU request some times... and so here we can loose performance
One advantage is that kernels are smallers... and then... we avoid to freeze the GPU/display
How many threads / items ?
In my experience doing processing on the GPU require between 200.000 and 1.000.000 items (or more) to be processed at once (to be optimal),
but in rendering we have some choice to do :
1) We can launch one thread for each pixel, then the number of threads depends of the resolution. It seems not a good approach because for small resolution we will not have a lot thread and on bigger resolution we will use a lot of memory and can freeze the display it the kernels are big. On the other side it can improve coherence and has better visual interactivity !
2) We can have a fixed set of threads (Like in Megakernels Considered Harmful: Wavefront Path Tracing on GPUs [LINK https://mediatech.aalto.fi/~timo/]). The problem there is that :
a) We have less coherence because each thread can correspond to a very different pixel !
b) We have to distribute correctly the threads for each pixel (Using by example a QMC 02-sequence). Imagine I use "path restart" then once a path is completed then I will continue to another pixel, but which one ? (The goal here is to avoid patterns and keep good visual interactivity)
c) Methods like "Randomized Coherent Sampling for Reducing the Perceptual Error of Rendered Images" are harder to implement !
Buckets
Does someone has experience with bucket rendering on the GPU... I have do some test on my side but have'nt see any great improvement [SMILEY :-P]
Thx for your advice & experience
(L) [2013/07/17] [Dade] [Optimization strategies] Wayback!>> spectral wrote:Sorting / scanning
[...]
a) We should sort a set of 'ids', and then the shader kernel only acces the sorted set of 'ids' datas.
I have tried this route but, at least in my tests, it wasn't effective (the cost of sorting the IDs was higher than the performance gained by having less GPU thread divergence, etc.).
 >> spectral wrote:How many threads / items ?
In my experience doing processing on the GPU require between 200.000 and 1.000.000 items (or more) to be processed at once (to be optimal),
I use a set of 65K "path states" (i.e. items). I really never had the need to increase that value (but I can decrease that value, in on order to reduce the memory requires to stores all the states, without loosing too much performance).
(L) [2013/07/17] [spectral] [Optimization strategies] Wayback!>> Dade wrote:I have tried this route but, at least in my tests, it wasn't effective (the cost of sorting the IDs was higher than the performance gained by having less GPU thread divergence, etc.).
It was in my case ... but honestly in OpenCL creating good sorting libraries is harder than in CUDA (Also in CUDA it is also provided and well optimized). Another problem is that we have to create several algorithms... one for each kind of device [SMILEY :-(]
But notice that most of the sorting algorithms are efficient for "at least" 1.000.000 items... not less !!!
 >> Dade wrote:I use a set of 65K "path states" (i.e. items). I really never had the need to increase that value (but I can decrease that value, in on order to reduce the memory requires to stores all the states, without loosing too much performance).
I currently use 200.000 path states... and as explained in the Timo Aila papers... increasing it increase the perf. but not a lot [SMILEY :-P]
(L) [2013/07/17] [tomasdavid] [Optimization strategies] Wayback!ad dividing into smaller kernels:
I now have in the review pipeline a paper that claims that single kernel solution is a good choice.
On 580 it is worse than Dietger's approach (see his HPG2011 paper), but on 680 it is comparable it (seems the L1 caching of global accesses that is missing on Kepler hurts a bit).
Otoh, if you have expensive shaders, Laine now has an aptly named paper: Megakernels Considered Harmful: Wavefront Path Tracing on GPUs.
(That said, I couldn't reproduce their results in my framework).