Optimization strategies


(L) [2013/07/17] [spectral] [Optimization strategies] Wayback!

Hi there,
I have been working on the GPU for a long time now and have already played a lot with several optimization strategies, but I would like to discuss some optimization subjects with you.
Sorting / scanning
It seems that one of the main methods to "optimize" processing on the GPU is based on "sorting" (by using sort, scan, etc.) (this is why I created the CLPP [LINK https://code.google.com/p/clpp/] library, in fact).
It is something I have tried in the past, for example:
1) Trace a set of rays
2) Sort all the rays based on intersection/no-intersection
3) Do the shading on the set of "intersections"
This allows improving 'thread' coherence, but in your experience:
a) Should we sort a set of 'ids', so that the shading kernel accesses the path data only through the sorted set of 'ids'? Something like this:
Code: [LINK # Select all]uint pathId = pathsIds[threadIdx.x];
float3 normal = paths[pathId].Ns;
The problem here is that the accesses to the paths are not coalesced!
b) Or should we directly sort (and copy/swap) the complete structures in global memory? Does this improve data coalescence?
Code: [LINK # Select all]float3 normal = paths[threadIdx.x].Ns;
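To make option (a) concrete, here is a minimal CUDA sketch under assumed names (a hypothetical Path struct and hit-flag array, not any particular renderer): instead of a full sort, the ids of the hit paths are compacted through an atomic counter, so the shading kernel reads a dense id list while the indirect Path reads stay uncoalesced, which is exactly the (a)-vs-(b) trade-off above:
Code: [LINK # Select all]struct Path { float3 Ns; /* ...other per-path data... */ };  // hypothetical layout

__global__ void compactHitIds(const int* hitFlags, int numPaths,
                              unsigned int* hitIds, int* hitCount)
{
    // hitCount must be zeroed before this launch
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPaths) return;
    if (hitFlags[i]) {
        int slot = atomicAdd(hitCount, 1);  // dense output slot (order not preserved)
        hitIds[slot] = i;                   // store only the path id
    }
}

__global__ void shadeHits(const unsigned int* hitIds, int hitCount,
                          const Path* paths)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= hitCount) return;
    unsigned int pathId = hitIds[i];   // coalesced read of the id list
    float3 normal = paths[pathId].Ns;  // gathered (uncoalesced) read of the data
    // ...shading...
}
Note that atomic compaction does not preserve ray order; a scan (prefix-sum) based compaction would, at a somewhat higher cost.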
Dividing into smaller kernels
In your experience, does dividing a kernel into several smaller kernels help performance?
In my experience, the problems are that:
a) often we need to save more data in global memory, and then read it back in the second kernel
b) controlling the kernels from the CPU takes some time... so we can lose performance here
One advantage is that the kernels are smaller... and so... we avoid freezing the GPU/display (see the sketch below).
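For reference, a minimal sketch of such a split, with assumed names: a hypothetical PathState struct lives in global memory between launches, one kernel does only intersection and another only shading. The extra global-memory round trip and the CPU-side launches are the costs from (a) and (b); the benefit is that each launch is short:
Code: [LINK # Select all]struct PathState {   // hypothetical per-path state kept in global memory
    float3 origin, dir;     // current ray
    float3 throughput;      // accumulated path weight
    int    pixelId;         // pixel this path samples
    int    hit;             // result written by the trace stage
    int    terminated;      // set when the path is finished
};

__global__ void traceKernel(PathState* states, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // intersect states[i] against the scene and write states[i].hit
    // (traversal omitted; only the kernel split matters here)
}

__global__ void shadeKernel(PathState* states, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (states[i].hit) {
        // shade, update throughput, generate the next ray
    }
}

void renderIteration(PathState* d_states, int n)
{
    int block = 256;
    int grid = (n + block - 1) / block;
    traceKernel<<<grid, block>>>(d_states, n);  // stage 1: intersections only
    shadeKernel<<<grid, block>>>(d_states, n);  // stage 2: shading only
    // two short launches instead of one megakernel: more global-memory
    // traffic and some launch overhead, but the display is never blocked
    // for the duration of a full megakernel
}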
How many threads / items ?
In my experience, processing on the GPU requires between 200.000 and 1.000.000 items (or more) to be processed at once to be optimal,
but in rendering we have some choices to make:
1) We can launch one thread per pixel, so the number of threads depends on the resolution. This does not seem to be a good approach, because at small resolutions we will not have many threads, and at bigger resolutions we will use a lot of memory and can freeze the display if the kernels are big. On the other hand, it can improve coherence and gives better visual interactivity!
2) We can have a fixed set of threads (like in Megakernels Considered Harmful: Wavefront Path Tracing on GPUs [LINK https://mediatech.aalto.fi/~timo/]). The problems there are that:
a) We have less coherence, because each thread can correspond to a very different pixel!
b) We have to distribute the threads correctly over the pixels (using, for example, a QMC (0,2)-sequence). Imagine I use "path restart": once a path is completed, I continue with another pixel, but which one? (The goal here is to avoid patterns and keep good visual interactivity; see the sketch after this list.)
c) Methods like "Randomized Coherent Sampling for Reducing the Perceptual Error of Rendered Images" are harder to implement!
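As a sketch of option 2 with "path restart", reusing the hypothetical PathState above: a terminated path atomically grabs the next unprocessed pixel, so the pool size stays fixed regardless of resolution. The plain linear scan order shown here is only the simplest choice; a scrambled or QMC-driven permutation of the counter could replace the modulo step to avoid the visible patterns mentioned in (b):
Code: [LINK # Select all]__global__ void restartTerminatedPaths(PathState* states, int numStates,
                                       unsigned int* nextPixel,
                                       unsigned int numPixels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numStates) return;
    if (states[i].terminated) {
        // plain scan order; a scrambled/QMC permutation of the counter
        // could be used here to decorrelate neighbouring threads
        unsigned int p = atomicAdd(nextPixel, 1u) % numPixels;
        PathState fresh = {};       // zero-initialised state
        fresh.pixelId = (int)p;
        // generate the camera ray for pixel p (omitted)
        states[i] = fresh;
    }
}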
Buckets
Does someone have experience with bucket rendering on the GPU? I have done some tests on my side but haven't seen any great improvement [SMILEY :-P]
Thx for your advice & experience
(L) [2013/07/17] [dbz] [Optimization strategies] Wayback!

It all sounds rather complex and time consuming, what you have planned there. Moreover, what works well on one manufacturer's GPUs does not necessarily work well on another manufacturer's GPUs. And optimizations for one generation of GPUs may turn out not to work well on newer generations from the same manufacturer. So will all the effort you put into optimizations pay off in the long run?
(L) [2013/07/17] [Dade] [Optimization strategies] Wayback!

>> spectral wrote:Sorting / scanning
[...]
a) Should we sort a set of 'ids', so that the shading kernel accesses the path data only through the sorted set of 'ids'?

I have tried this route but, at least in my tests, it wasn't effective (the cost of sorting the IDs was higher than the performance gained by having less GPU thread divergence, etc.).
 >> spectral wrote:How many threads / items ?
In my experience, processing on the GPU requires between 200.000 and 1.000.000 items (or more) to be processed at once to be optimal,

I use a set of 65K "path states" (i.e. items). I have really never had the need to increase that value (but I can decrease it, in order to reduce the memory required to store all the states, without losing too much performance).
(L) [2013/07/17] [spectral] [Optimization strategies] Wayback!

>> Dade wrote:I have tried this route but, at least in my tests, it wasn't effective (the cost of sorting the IDs was higher than the performance gained by having less GPU thread divergence, etc.).
It was in my case... but honestly, creating good sorting libraries in OpenCL is harder than in CUDA (in CUDA, well-optimized sorting is already provided). Another problem is that we have to create several algorithms... one for each kind of device [SMILEY :-(]
But notice that most of the sorting algorithms are only efficient from "at least" 1.000.000 items upward... not fewer!!!
>> Dade wrote:I use a set of 65K "path states" (i.e. items). I have really never had the need to increase that value (but I can decrease it, in order to reduce the memory required to store all the states, without losing too much performance).
I currently use 200.000 path states... and as explained in the Timo Aila papers... increasing it increases the performance, but not by a lot [SMILEY :-P]
(L) [2013/07/17] [Zelcious] [Optimization strategies] Wayback!

I currently use 16K items in parallel. I take a slight hit with a GTX 780, but it's worth it since I use an MLT-like algorithm and it converges faster with fewer chains. I'm heavily compute bound, use no resorting, and split overly long traversals to reduce incoherence.
(L) [2013/07/17] [tomasdavid] [Optimization strategies] Wayback!

Regarding dividing into smaller kernels:
I now have a paper in the review pipeline that claims that a single-kernel solution is a good choice.
On a 580 it is worse than Dietger's approach (see his HPG 2011 paper), but on a 680 it is comparable to it (it seems the L1 caching of global accesses that is missing on Kepler hurts a bit).
On the other hand, if you have expensive shaders, Laine now has an aptly named paper: Megakernels Considered Harmful: Wavefront Path Tracing on GPUs.
(That said, I couldn't reproduce their results in my framework.)
(L) [2013/07/18] [Dade] [Optimization strategies] Wayback!

>> Zelcious wrote:I use a mlt like algorithm and it converges faster with fewer chains.
This is a serious problem with MLT and GPUs: my implementation of MLT on the GPU (with 65K chains) looks pathetic when compared to the same MLT code on the CPU (with 12 chains, i.e. 12 threads).
(L) [2013/07/18] [Zelcious] [Optimization strategies] Wayback!

>> This is a serious problem with MLT and GPUs: my implementation of MLT on the GPU (with 65K chains) looks pathetic when compared to the same MLT code on the CPU (with 12 chains, i.e. 12 threads).
I hope multiple-try will remedy some of that. I plan to experiment with some higher-order ensemble MCMC as well.
I have some other ideas but I agree it's a big problem.
(L) [2013/07/18] [Zelcious] [Optimization strategies] Wayback!

Perhaps the solution would be to let the CPU do the MLT and the GPU just look for seed paths. For really, really hard-to-sample scenes, like some of the ones I experiment with, it could perhaps be a winner.
(L) [2013/07/18] [ingenious] [Optimization strategies] Wayback!

Are you saying that many shorter chains are much worse than fewer longer chains? On what scenes? I find it a bit hard to believe that having a larger number of chains is always worse.
