Can't understand Timo Aila's GPU traversal paper


[2013/02/14] [post by shiqiu1105]

I have been reading Aila's paper on GPU ray traversal, but I don't quite understand it.

So basically, to do fast GPU ray tracing, we need to launch enough threads to fill the GPU, and use a while-while loop?

Is he using anything like a QBVH?

Can anyone who understands the paper give a short review? :)
[2013/02/14] [post by shiqiu1105]

What's packet traversal? I thought that was a CPU technique using SIMD.
[2013/02/14] [post by graphicsMan]

Probably you should re-read the paper a few times.  Packet traversal is simply tracing a group of rays together at once.  Even sans wide hardware ALU benefits, it should improve memory access, since the groups of rays traced together will be coherent.  Traditionally, it's been tuned to CPU SIMD, but nothing requires that.
[2013/02/15] [post by shiqiu1105]

>> graphicsMan wrote:Probably you should re-read the paper a few times.  Packet traversal is simply tracing a group of rays together at once.  Even sans wide hardware ALU benefits, it should improve memory access, since the groups of rays traced together will be coherent.  Traditionally, it's been tuned to CPU SIMD, but nothing requires that.
Okay. What does "persistent threads" mean? Launching just enough threads to fill the machine: how do you know how many threads that is?

Does that mean we don't launch the kernel by calling trace<<<PixelX, PixelY>>>()? But then how?
[2013/02/15] [post by shiqiu1105]

I have also downloaded his code. He uses a lot of SOA pointers, which makes it quite hard to read...
[2013/02/15] [post by jbikker]

>> shiqiu1105 wrote:graphicsMan wrote:Probably you should re-read the paper a few times.  Packet traversal is simply tracing a group of rays together at once.  Even sans wide hardware ALU benefits, it should improve memory access, since the groups of rays traced together will be coherent.  Traditionally, it's been tuned to CPU SIMD, but nothing requires that.
Okay. What does persistent threads mean? Launching just enough threads to fill the machine, how do you know how many threads can do this?

Does that mean we don't launch the kernel by calling trace<<<PixelX, PixelY>>>()? But then how?
Persistent threads: determine the number of SMs you have and give each SM one warp of 32 threads. If you then launch a kernel with trace<<<SMCount, dim3(32,1,1)>>>, each SM will be executing exactly 1 warp. That's typically not optimal, so you increase the block count by some constant factor. 8 blocks per SM seems optimal for Kepler in many cases, but it depends on the number of registers that you need: Kepler has 64K 32-bit registers per SM, so launching 8 single-warp blocks per SM leaves 256 registers for each thread. In practice, you either use the occupancy calculator to find the optimal number of blocks per SM, or you hand-tune it for best performance. Note that 2 blocks per SM is the minimum to use Fermi's and Kepler's hardware well, although I found that kernels that are heavily bandwidth-bound sometimes benefit from very small numbers of concurrent threads, to distribute cache misses evenly over the available SMs: for a box filter kernel, it even seems optimal to have 16 threads per block active (so: half-filled warps).

Then, you need special thread code: each thread remains active until all work has been done. This requires some atomics to fetch new jobs:
Code:
do
{
    // one shared slot per warp; thread 0 of the warp grabs a batch of 32 job indices
    volatile int& jobIdxBase = nextJobArray[threadIdx.y];
    if (threadIdx.x == 0) jobIdxBase = atomicAdd( &warpCounter, 32 );
    int idx = jobIdxBase + threadIdx.x;
    if (idx >= totalJobs) return;
    // perform work for task 'idx'
} while (true);
So: thread 0 of each warp basically 'allocates' 32 jobs to work on: [jobIdxBase..jobIdxBase+31]. Individual threads in the warp get their job idx by calculating jobIdxBase + threadIdx.x.
Make sure the global memory variable 'warpCounter' is set to 0 before starting the kernel.

In practice, this approach is rather efficient, even for simple kernels.

- Jacco.
[2013/02/15] [post by tomasdavid]

As far as I remember, the approach does not use packets at all (each ray has its own stack), and the code for Fermi and Kepler does not use persistent threads anymore.
[2013/02/15] [post by spectral]

Hi,

1) This code is able to handle both a packet of rays and a single ray at a time.
2) Persistent threads are no longer faster today.
3) Don't forget that this code is based on a "simple study"; to get the same performance as Timo you have to watch the number of registers used in your kernel. If your code uses too many registers you will not get the same performance.
4) SOA layouts are very important ;-)
5) It is a BVH, not a QBVH (which is more suited to the CPU)... maybe "Dade" has more experience with his SQBVH?

Whatever you don't understand, read and re-read it... it is a simple BVH traversal... with a stack. Try to understand each step, then study the next one.
If you have a specific question, I can answer it...

Regards
[2013/02/15] [post by shiqiu1105]

Thanks for your answers, guys! They helped a lot :D

I do want to know: if persistent threads are not efficient nowadays, what is the efficient approach now?
[2013/02/15] [post by spectral]

The best approach developed by Timo does not use persistent threads! Check the paper ;-)

Also, it depends on what you want to do... for example, whether you want to extend this code to support instances, etc. It may also depend on the video card and the scene!
It is up to you to test... or check the results in the paper.
