Kesen Huang's HPG 2010 papers list is UP !

(L) [2010/06/02] [Shadow007] [Kesen Huang's HPG 2010 papers list is UP !]

Samuli Laine's paper "Restart Trail for Stackless BVH Traversal" is now available from his homepage.
It presents an idea that enables the use of short stacks for BVH traversal (on the GPU), but concludes that, with current hardware and architectures, it is less efficient than full stacks.
I have two observations:
1- This paper seems to me one more notable exception to the general rule that no "fail" is published, but I think it clearly deserves its place in the light, for other researchers to pick up and improve on. I, for one, am glad to see it published and to have had the chance to read it.
2- The (1 bit per level) trail only tells you whether the "first choice" at each intersection was already visited, but forces that choice to be computed again (i.e. intersecting both child nodes and choosing which one should go first). My naive idea is that an easy enhancement would be to store 2 bits per level instead (1 per child). That way, when restarting, one would not need to do the intersections all over again, and could descend the tree again without having to make any choices before reaching the previously unexplored areas...
Is that really naive, or do any of you think there could be some substance to that intuition?
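
To make the 2-bits-per-level suggestion concrete, here is a minimal C++ sketch of one possible encoding. It is not taken from the paper and may not be exactly what the post has in mind: instead of one "visited" bit per child, each level stores a direction bit (which child the current path takes) and a pending bit (whether the sibling still has to be visited), since those two bits are enough to replay the path without redoing the box tests. The 1D interval "BVH", the names `traverseTrail2`, `Counts` and `exampleTree`, and the absence of early ray termination are all assumptions made for brevity.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy 1D "BVH" (an assumption for brevity): every node bounds an interval
// [lo, hi] on the x axis; a node is a leaf when left < 0.
struct Node { float lo, hi; int left, right; };

struct Counts { int boxTests = 0; int replaySteps = 0; int leavesHit = 0; };

// Ray segment [o, o + tmax] against a node's interval; t is the entry distance.
static bool hit(const Node& n, float o, float tmax, float& t) {
    if (n.hi < o || n.lo > o + tmax) return false;
    t = n.lo > o ? n.lo - o : 0.0f;
    return true;
}

// Hypothetical 7-node tree: root 0, inner nodes 1 and 2, leaves 3..6.
static std::vector<Node> exampleTree() {
    return {
        {0.0f, 8.0f, 1, 2},
        {0.0f, 3.0f, 3, 4}, {5.0f, 8.0f, 5, 6},
        {0.0f, 1.0f, -1, -1}, {2.0f, 3.0f, -1, -1},
        {5.0f, 6.0f, -1, -1}, {7.0f, 8.0f, -1, -1},
    };
}

// Stackless traversal with a 2-bit-per-level trail.
// Bit 2d   : direction taken at level d (0 = left child, 1 = right child).
// Bit 2d+1 : the sibling at level d was hit and is still pending.
Counts traverseTrail2(const std::vector<Node>& nodes, float o, float tmax) {
    Counts c;
    std::uint64_t trail = 0;
    int depth = 0;  // number of valid trail levels
    for (;;) {
        // Restart: replay the recorded path from the root. No box tests are
        // needed, but every node on the path is still read for its child index.
        int cur = 0;
        for (int d = 0; d < depth; ++d) {
            bool goRight = (trail >> (2 * d)) & 1;
            cur = goRight ? nodes[cur].right : nodes[cur].left;
            ++c.replaySteps;
        }
        // Ordinary descent through unexplored nodes, recording each decision.
        for (;;) {
            const Node& n = nodes[cur];
            if (n.left < 0) { ++c.leavesHit; break; }
            float tl = 0.0f, tr = 0.0f;
            bool hl = hit(nodes[n.left], o, tmax, tl);
            bool hr = hit(nodes[n.right], o, tmax, tr);
            c.boxTests += 2;
            if (!hl && !hr) break;
            bool goRight = hr && (!hl || tr < tl);  // nearer child first
            std::uint64_t entry = (goRight ? 1u : 0u) | ((hl && hr) ? 2u : 0u);
            assert(depth < 32);  // 2 bits per level in a 64-bit word
            trail = (trail & ~(3ull << (2 * depth))) | (entry << (2 * depth));
            ++depth;
            cur = goRight ? n.right : n.left;
        }
        // "Pop": back up to the deepest level whose sibling is still pending.
        while (depth > 0 && !((trail >> (2 * (depth - 1) + 1)) & 1)) --depth;
        if (depth == 0) return c;
        trail ^= 3ull << (2 * (depth - 1));  // flip direction, clear pending
    }
}
```

On the example tree, a ray starting at 0.5 with `tmax = 10` reaches all four leaves with 6 box tests (the same number a full stack would need) and 5 replay steps. The replay steps are node reads that a stack-based traversal would not perform at all, so this encoding only removes the repeated intersection work, not the repeated node accesses.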

(L) [2010/06/05] [jbarcz1] [Kesen Huang's HPG 2010 papers list is UP !]

>> Shadow007 wrote: Samuli Laine's paper "Restart Trail for Stackless BVH Traversal" is now available from his homepage.
It presents an idea that enables the use of short stacks for BVH traversal (on the GPU), but concludes that, with current hardware and architectures, it is less efficient than full stacks.
I have two observations:
1- This paper seems to me one more notable exception to the general rule that no "fail" is published, but I think it clearly deserves its place in the light, for other researchers to pick up and improve on. I, for one, am glad to see it published and to have had the chance to read it.
Agreed.  It's actually a neat idea, despite the negative results.  IMO though the paper is too light on details.  I'd really like to see some measurements on how much memory bandwidth is saved by going stackless and what the raw performance numbers actually were.
I'm actually not convinced that stackless traversal is really worth the effort.   If you set up the loop to do two child tests in each iteration, then you can avoid manipulating the stack except in cases where both children are visited.  This happens pretty rarely.  Further, if you keep the stack in on-chip shared memory, then it should have a fixed access time, comparable to cache, and the problem should be moot.  I actually don't understand why CUDA raycasters don't tend to do this.  Is it some kind of architectural issue with shared memory or did Nvidia just not put enough on there to get good occupancy?
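
As a rough illustration of the loop described above, the following C++ sketch tests both children in each iteration and touches the stack only when both are hit (push the far child) or when the current path ends (pop). It is a simplified stand-in, not code from any CUDA raycaster: the 1D interval "BVH", the plain local array standing in for an on-chip stack, and the names `traverse`, `Stats` and `exampleTree` are assumptions for the example.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Toy 1D "BVH" (an assumption for brevity): every node bounds an interval
// [lo, hi] on the x axis; a node is a leaf when left < 0.
struct Node { float lo, hi; int left, right; };

struct Stats { int pushes = 0; int leavesHit = 0; };

// Ray segment [o, o + tmax] against a node's interval; t is the entry distance.
static bool hit(const Node& n, float o, float tmax, float& t) {
    if (n.hi < o || n.lo > o + tmax) return false;
    t = n.lo > o ? n.lo - o : 0.0f;
    return true;
}

// Hypothetical 7-node tree: root 0, inner nodes 1 and 2, leaves 3..6.
static std::vector<Node> exampleTree() {
    return {
        {0.0f, 8.0f, 1, 2},
        {0.0f, 3.0f, 3, 4}, {5.0f, 8.0f, 5, 6},
        {0.0f, 1.0f, -1, -1}, {2.0f, 3.0f, -1, -1},
        {5.0f, 6.0f, -1, -1}, {7.0f, 8.0f, -1, -1},
    };
}

// Visits every leaf the ray segment overlaps (no early termination) and
// counts how often the stack is written.
Stats traverse(const std::vector<Node>& nodes, float o, float tmax) {
    Stats s;
    const int kMaxDepth = 32;
    int stack[kMaxDepth];
    int sp = 0;
    int cur = 0;  // the root is assumed to be hit
    while (cur >= 0) {
        const Node& n = nodes[cur];
        if (n.left < 0) {                      // leaf: record it, then pop
            ++s.leavesHit;
            cur = sp > 0 ? stack[--sp] : -1;
            continue;
        }
        float tl = 0.0f, tr = 0.0f;
        bool hl = hit(nodes[n.left], o, tmax, tl);
        bool hr = hit(nodes[n.right], o, tmax, tr);
        if (hl && hr) {                        // only case that pushes
            int nearChild = n.left, farChild = n.right;
            if (tr < tl) std::swap(nearChild, farChild);
            assert(sp < kMaxDepth);
            stack[sp++] = farChild;
            ++s.pushes;
            cur = nearChild;
        } else if (hl) {
            cur = n.left;                      // single hit: no stack traffic
        } else if (hr) {
            cur = n.right;
        } else {
            cur = sp > 0 ? stack[--sp] : -1;   // dead end: pop
        }
    }
    return s;
}
```

On the example tree, a short ray starting at 2.5 with `tmax = 1` reaches its single leaf without any push, while a ray starting at 0.5 with `tmax = 10` overlaps every node and pushes once per inner node (3 pushes for 4 leaves). How rare the two-hit case is in practice depends on the scene and the rays, which this toy setup does not measure.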
>> 2- The (1 bit per level) trail only tells you whether the "first choice" at each intersection was already visited, but forces that choice to be computed again (i.e. intersecting both child nodes and choosing which one should go first). My naive idea is that an easy enhancement would be to store 2 bits per level instead (1 per child). That way, when restarting, one would not need to do the intersections all over again, and could descend the tree again without having to make any choices before reaching the previously unexplored areas...

You could, but you still need to access the nodes to know where 'near' and 'far' are, so if your bottleneck isn't compute, it won't buy you very much.  This is one of the reasons I'm so skeptical of restart approaches.  Even if you save bandwidth by going stackless, you still need to hit the cache a whole bunch more times to get at the nodes.  This is expensive even if your rays are spatially coherent.  It would be telling if the slowdowns they saw were proportional to the number of extra nodes, because that would suggest that they aren't actually bandwidth limited to begin with.

(L) [2010/06/09] [straaljager] [Kesen Huang's HPG 2010 papers list is UP !]

The paper "Architecture Considerations for Tracing Incoherent Rays" by Timo Aila and Tero Karras is finally available: [LINK http://www.tml.tkk.fi/~timo/] Seems very interesting on a quick read through. I like the idea of fixed function hardware for traversal and intersection on the GPU.
>> Architecture Considerations for Tracing Incoherent Rays
This paper proposes a massively parallel hardware architecture for efficient tracing of incoherent rays, e.g. for global illumination. The general approach is centered around hierarchical treelet subdivision of the acceleration structure and repeated queueing/postponing of rays to reduce cache pressure. We describe a heuristic algorithm for determining the treelet subdivision, and show that our architecture can reduce the total memory bandwidth requirements by up to 90% in difficult scenes. Furthermore the architecture allows submitting rays in an arbitrary order with practically no performance penalty. We also conclude that scheduling algorithms can have an important effect on results, and that using fixed-size queues is not an appealing design choice. Increased auxiliary traffic, including traversal stacks, is identified as the foremost remaining challenge of this architecture.

(L) [2010/06/09] [Jaeho] [Kesen Huang's HPG 2010 papers list is UP !]

I also like this paper because it is very clear and interesting. Thank you for your posting.

>> straaljager wrote:The paper "Architecture Considerations for Tracing Incoherent Rays" by Timo Aila and Tero Karras is finally available: [LINK http://www.tml.tkk.fi/~timo/] Seems very interesting on a quick read through. I like the idea of fixed function hardware for traversal and intersection on the GPU.
Architecture Considerations for Tracing Incoherent Rays
This paper proposes a massively parallel hardware architecture for efficient tracing of incoherent rays, e.g. for global illumination. The general approach is centered around hierarchical treelet subdivision of the acceleration structure and repeated queueing/postponing of rays to reduce cache pressure. We describe a heuristic algorithm for determining the treelet subdivision, and show that our architecture can reduce the total memory bandwidth requirements by up to 90% in difficult scenes. Furthermore the architecture allows submitting rays in an arbitrary order with practically no performance penalty. We also conclude that scheduling algorithms can have an important effect on results, and that using fixed-size queues is not an appealing design choice. Increased auxiliary traffic, including traversal stacks, is identified as the foremost remaining challenge of this architecture.
