Prefetching back
(L) [2006/05/07] [Phantom] [Prefetching] Wayback!I know I discussed this topic with tbp a couple of times already, but it's still bugging me:
How can we effectively prefetch data in a tracer?
I'm asking now because of two odd events today:
1. I replaced Wald's original intersection test (involving barycentric coordinates) with Carsten's more intuitive code. It's slightly faster, and quite a bit more accurate. However, it uses a small lookup table. I initialized this time initially for each frame. Once I moved it to the 'engine' constructor, I lost about 3% performance. What the... !?
2. This evening, I added checking for incoherent packets. For a 4x4 packet, this involves no less than 12 movemasks, quite a bit of integer arithmetic and an 'if' statement. Impact on rendering: Nada. Exactly the same speed.
But events appear to be linked to prefetching. Apparently, random events can hurt or benefit performance. But, rather than adding and rearranging code at random, I would like a more focussed approach, but I have no idea where to start. Anyone?
_________________
--------------------------------------------------------------
Whatever
(L) [2006/05/08] [tbp] [Prefetching] Wayback!I can't make sense of your post, i may need more coffee.
Anyway given how cpu work <insert technical gibberish> there's a "minimal distance" for prefetching to be effective and not a hinderance - prefetch instructions need to be decoded etc and you need to compute addresses etc -.
That minimal distance is about 200 cycles, give or take a hundred cycles.
Put another way you need to prefetch 200 cycles in advance, the very data you'll need then.
Say a traversal step takes  ~20 cycles. You'd need to be able to devise which node you're going to need 10 traversal ahead. Obviously computing that would take as much time as getting there.
Unless you make your way around that problem, prefetching isn't what you're looking for.
Cache oblivious layouts are certainly more reasonable (at least for traversal).
Note: Prefetching is a hint, another way to fill in the cache is to access data. I've tried to touch nodes in advance as soon as possible, ie at the start of each traversal as you compute childs force an access. Not a win. But that was a crude try.
(L) [2006/05/08] [playmesumch00ns] [Prefetching] Wayback!Can you point me to where carsten's intersection method is documented? I can't find any reference to it here.
(L) [2006/05/08] [Phantom] [Prefetching] Wayback!Check here:
[LINK http://ompf.org/forum/viewtopic.php?t=9]
That's the canonical list of must-read material; see Carsten's thesis for various intersection algorithms.
_________________
--------------------------------------------------------------
Whatever
(L) [2006/05/08] [playmesumch00ns] [Prefetching] Wayback!Ahh thanks phantom. I'd forgotten about that doc. [SMILEY Smile]
(L) [2006/05/08] [Phantom] [Prefetching] Wayback!While putting the kids to bed I solved my problem with the incoherent packets: The traversal code first checks the sign of the first ray to determine the near and far planes for each axis. As I send the complete packet for each octant, this is now wrong; the octant should be calculated for the first valid ray. I assume this fixes the problem; will try it in a minute or so.
_________________
--------------------------------------------------------------
Whatever
(L) [2006/05/08] [tbp] [Prefetching] Wayback!I've quickly looked at your code, but i'm unsure how many rays you're trying to handle.
Anyway if your problem is to coalesce incoherent rays into sub-groups of coherent rays it can be done (at least for 4 rays) in 1 branchless 'pass' with a bunch of logic + 1 lookup for the final mask per sub group.
back