Anger & Frustration back
(L) [2006/02/14] [UlfJack] [Anger & Frustration] Wayback!We have been discussing this in the "Visuals" forum, where it doesn't really belong.
I have now implemented the SSE kd-tree traversal from Wald's PhD thesis and it is _very_ slow. That may be in part due to non-optimal kd-tree building, but even then there is still a significant gap. And I certainly don't get the 22 times speedup that tbp reported.
The source code is available at [LINK http://www.yvert.de/] once more.
It also contains the cg shader source code that we've been working on for the last week. However, we havn't released the cg compiler yet, so you can only eye the source.
If anyone could give me a hint about what's going wrong, that would be much appreciated.
-- Ulf
(L) [2006/02/14] [Phantom] [Anger & Frustration] Wayback!Have you checked the paper I wrote for Intel? There's full source code for SIMD packet traversal in it.
(L) [2006/02/14] [tbp] [Anger & Frustration] Wayback!That's unsollicited commercial information!
(L) [2006/02/14] [UlfJack] [Anger & Frustration] Wayback!Yeah, I'm sorry, I was a bit with posting that question. Phantoms' traversal code looks very much like my own, except for a few details.
(L) [2006/02/15] [UlfJack] [Anger & Frustration] Wayback!An explanation of why Jaccos code is semantically illegal can be found here:
[LINK http://www.thescripts.com/forum/thread139774.html]
I'm in the process of rewriting it without constructors.
(L) [2006/02/15] [Phantom] [Anger & Frustration] Wayback!UlfJack: Try to ignore tbp's attitude, he's actually both skillful and helpful.
The demo tbp had in mind is at this page:
[LINK http://www.flipcode.com/cgi-bin/fcarticles.cgi?show=65370]
I am using an SAH kd-tree and Wald-style traversal (the demo uses the code from the Intel paper). There are specialized branches for ray packets that don't have equal signs for all rays (a mono ray tracer) and branches for shadow rays (both packet and mono).
The biggest difference I see with your result is (besides performance) the number of intersected triangles. Wald suggests that 2 or 3 intersections per ray is typical; I didn't manage to get that low, 6 or 7 is typical for my kd-tree. This is obviously purely a kd-tree construction issue: Even a very slow ray tracer should not do a lot of intersections if the kd-tree is constructed properly.
To get a decent kd-tree, you should use SAH, and split off empty space. Havran shows the process in great detail. That way, as many steps as possible traverse empty space (which should be efficient in your code), and only a couple of node visits actually require ray/triangle intersections.
Once you've got that right, all tbp's suggestions are applicable:
- Avoid branches;
- Reduce memory access;
- Try to help the cache;
- Use const as much as possible;
- etc.
But really, as long as you get 230 intersections per ray, you're wasting time. You really need to bring this down to 6 or 7 to get a 40x speed improvement.
- Jacco.
(L) [2006/02/15] [tbp] [Anger & Frustration] Wayback!Dear UlfJack, you're barking at the wrong tree.
You've been spoon feeded a traversal that has the potential to dish out x million primaries / s given proper conditions (ie kd-tree quality). No need to question that, really.
It's your job to match such conditions and get whatever compiler you feel like using to spit adequate code.
I've quickly scanned your code as released and noted obvious shortcomings. That's it, i'm not going to fix it for you.
Right now as Jacco pointed, you're benchmarking your intersection code. That's certainly interesting but not what you're going after atm.
And yes i have an attitude problem with ppl not doing their homework.
(L) [2006/02/15] [tbp] [Anger & Frustration] Wayback!One last thing, before getting your hands dirty with packet traversal train yourself on a mono version; you'll need one anyway as a fallback when your packet isn't coherent and it's easier to get right. Plus then you can see where you stand with Ono-sendai's benchmark.
I took the time to find you some (really old) reference for that.
[LINK http://www.gamedev.net/community/forums/topic.asp?topic_id=293020&whichpage=2�]
The kind of codegen you should get.
[LINK http://www.gamedev.net/community/forums/topic.asp?topic_id=293020&whichpage=2�]
(L) [2006/02/15] [Phantom] [Anger & Frustration] Wayback!You can't just add and subtract and divide these numbers. When you are intersecting triangles that you shouldn't intersect, you may be trashing the cache and all kinds of other things that could influence the performance.
And tbp is not suggesting you're stupid, he's suggesting you first fix the things that hurt most. And he is explaining that good speed through a kd-tree is possible; that's the same thing Havran concluded in his thesis: The kd-tree is the most suitable structure for generic scenes. You cannot ignore that. If you can't reproduce that, you shouldn't get angry and frustrated, you should look where you are taking a wrong turn.
_________________
--------------------------------------------------------------
Whatever
(L) [2006/02/16] [Phantom] [Anger & Frustration] Wayback!Glad you're on track!
I just completed the 'Havran style' kd-tree compiler (most basic setup; no proper automatic termination, no empty space cut-off and no proper final node generation, just proper splits without any wrong data in the various linked lists), and it compiles 20k in 2 secs (that's unoptimized code, I think it will be easy to get it down to 1 sec).
Apparently, other methods give faster results, so if you want a really fast compiler, you may wish to investigate 'binning' and so on. What I like so much about the Havran approach however is that it absolutely requires perfect data structures and subdivision. I'm quite sure that I don't get false duplicates and other nasty stuff this time.
About the odd early-out test in Wald's thesis: This is indeed a bug. Check my code for the correct line. I didn't mention that, sorry.. Had me scratching my head for a while too. 'Bugs' like these led me to the conclusion (or suspision) that Wald intentionally made it hard to duplicate his results, but maybe I'm being negative here. He did something similar in his explanation of the triangle acceleration structures: There's missing crucial info there too. While trying to find the correct calculations of the fields of this structure, I found an optimization of his approach, mailed him about it and got completely ignored.
Your new figures look good, but you still have too many intersections per ray, obviously. So the next thing to look at is indeed the tree compiler.
_________________
--------------------------------------------------------------
Whatever
(L) [2006/02/16] [Phantom] [Anger & Frustration] Wayback!Why is everyone mentioning these flat cells everywhere? So far I have simply prevented cells from becoming completely flat, but I am not sure wether or not my traversal code supports flat cells. What pitfalls can I expect when I do allow flat cells?
_________________
--------------------------------------------------------------
Whatever
(L) [2006/02/16] [tbp] [Anger & Frustration] Wayback!NaNs, ugly NaNs everywhere: clipping, scoring, traversal... you name it.
And yes you can dodge that issue via epsilons inflation or something, but there's some potential in their use (like embedding a whole bunch of coplanar triangle into the split plane itself).
(L) [2006/02/16] [Lynx] [Anger & Frustration] Wayback!To throw in some more statistics:
on the sibenik cathedral ( [LINK http://hdri.cgtechniques.com/~sibenik2/] ) my statistics for the tree:
build time: 5.109s
primitives in tree: 80131
leaf nodes: 182771 (empty: 54594 = 29.8702%)
leaf prims: 367717 (4.58895x prims in tree, leaf size:1)
   => 2.86882 prims per non-empty leaf
leaves due to depth limit/bad splits: 50678/61805
clipped triangles: 1295820 (26286 sorted out)
primitives (=triangles so far) tested per ray: 4.9735
That's all "normal" rays, not shadow rays where you can quit early, and only few rays hit the background as it's an interiour rendering.
With the stanford dragon (871306 tris) it however reaches 9.298 tris/ray tested, maybe i'm too conservative on memory usage (leaves are created at <= 3 triangles without further testing), however tree build time is already at 45s...
Hm flat cells...so far (i hope) i don't allow them either...
(L) [2006/03/24] [UlfJack] [Anger & Frustration] Wayback!Some more Anger & Frustration:
I'm back from whatever and I'll post my rantings here about trying to get the kd-compiler to work reliably. Last thing I did was realize that the event list thingy that Havran describes is not as easy as it sounds:
(L) [2006/03/24] [UlfJack] [Anger & Frustration] Wayback!IT WORKS! Yay!
*dances around the table*
Hurray!
(L) [2006/03/28] [tbp] [Anger & Frustration] Wayback!Again massaging boundaries instead of segments leads to various kinds of cruft: larger memory footprint, bad data locality, more nasty corner cases to name a few.
The other day i've tried re-implementing it that way, but i've given up 2 hours later due to my failure to camouflage the implicit pointer gymkhana in a convincing fashion.
Still a compact 76/80 bytes structure on 32 bit platforms (and 128 on 64 bit) per triangle is well worth the effort required to hide/abstract away all the pointer  tinkering uglyness, so i'll try again. Better. Hopefully.
PS: Sorry for the lag in answering that post and the lack of activity lately but i got busy... you know the drill... and Oblivion going gold didn't help [SMILEY Wink]
PPS: Those segments structure would look like
(L) [2006/03/29] [lycium] [Anger & Frustration] Wayback!yup, i can feel oblivion slowing these forums (and indeed the whole world) down ;)
(L) [2006/04/04] [UlfJack] [Anger & Frustration] Wayback!Not here. My machine is not fast enough to handle it.  [SMILEY Sad]
And I don't have enough money for an upgrade.  [SMILEY Crying or Very sad]
(L) [2006/05/06] [tbp] [Anger & Frustration] Wayback!Thread necromancy ftw! eh.
Hmm, no that was about that inner traversal part where you search leaves and have to handle basically 3 culling cases (left, right, both); branches are expensive on modern cpu - even more so when facing SSE & large latency - and there's plenty of options: going with 0, 1, 2 branches. Some make more sense statistically wrt to prediction.
It's not surprising that Carsten is going with 0 branches on a P4 and yours truely with 2 on a K8.
(L) [2006/05/06] [tbp] [Anger & Frustration] Wayback!Well, not exactly 0 as you still need to loop over until you hit a leaf.
3 cases:
1.a) cull left
1.b) cull right
2) visit both
For 1.a & 1.b we only need to update the current node to the proper child, so it's easy to fold those cases. But it's not a win because then the pattern becomes "to cull or not to cull" which is not predicted as efficiently.
So, we're left with the option of either completly removing branches or going for 2 (left, right, fallthrough).
Going branchless.
You notice that in 1.a) & 1.b) we update the node pointer.
In 2) we update the node pointer, update the ray segment and push the complementary part onto the stack.
Then it's easy to remove branches:
. conditionally update the node pointer.
. conditionally update the ray segment.
. always write to the top of stack.
You effectively trade branches for memory bandwidth & constant latency. On some severly crippled & weird chip it's a win [SMILEY Wink]
Carsten thesis has code doing that... that's supposed to be a draft and we were asked to not propagate it, but it's out there [SMILEY Smile]
(L) [2006/05/17] [beason] [Anger & Frustration] Wayback!Thanks for the explanation, tbp.
Sounds like the solution is to upgrade g++, add some consts, drop an if, and pray, but only if your target is the p4 [SMILEY Smile]
I don't think I'm going to bother with any of it, though. Hope that doesn't mean I'll have to "measure in thousands" instead of "millions".
(L) [2006/05/17] [tbp] [Anger & Frustration] Wayback!You don't have to upgrade specificaly for that, it's just that gcc 4.x codegen is much better when dealing with SSE (intrinsics or -mfpmath=sse).
Gcc knows how to conditionaly move floats on the x87, but i've never seen it emiting equivalent sequence for scalar SSE. And i doubt a scalar branchless version can beat a branchy version, even on a P4 (due to the lack of symetry in the SSE instruction set) anyway. But who knows.
To give you an idea a branchy mono ray path should get you a couple Mray/s.
(L) [2006/05/18] [tbp] [Anger & Frustration] Wayback!Yes, that was me. You've blown my cover! Argh.
Those results are like, old. But yeah, the kd-tree version was nearing 2M/s if i remember correctly (or was it 2.1M?) even if my tree was subpar at that time. Suffice to say it was slightly faster. Nick never posted those numbers, or i've never sent them.
Note that this kind of scene/"view" is almost ideal for a BVH as it doesn't excercise partitionning. And there's like 0 coherence [SMILEY Smile]
back