Havran's note on kd building

(L) [2006/02/11] [Phantom] [Havran's note on kd building] Wayback!

My kd-tree compiler will be better than yours. [SMILEY Smile]

Edit: Dude, that's hardcore algorithm stuff. I wrote the sorting stuff, implemented the splitting with a first basic SAH score approach (will add in empty space cutoff and so on later), and now I am testing code that updates the next and prev pointers for the other two axes. Looks like it worked in one go, I'm amazed with myself. [SMILEY Smile]

Final steps are insertion of nodes for straddling tris, and of course recursing / termination. That should be easier than the previous steps.

(L) [2006/02/12] [Phantom] [Havran's note on kd building] Wayback!

I have a bit of a problem, and since everyone is working on the same thing, I'll just explain the issue:

I have no idea where to split the list of potential split plane positions. Sounds silly with SAH and all, but the problem is, where to split EXACTLY. Suppose you have several polygons with the maximum on the x-axis, and several polygons with a minimum on the x-axis, equal to the maximum-x of the other polygons. Something like:

(L) [2006/02/12] [Phantom] [Havran's note on kd building] Wayback!

OK, solved it, placing the solution here for reference for you lazy people. [SMILEY Smile]

During sorting, I now take into account wether a node is a minimum or a maximum of a primitive. Minima will be placed before maxima. This way, I can put a split plane position between a series of minima and maxima that share the same position.

Next, I look for the best split position, using the original code. My code returns a node rather than a split plane position; this saves me another walk over the list, I can now simply cut up the double linked list at this position. I suppose this is the logical way to do things.

This 'split node' becomes the first node of the right branch list. Everything before it is the left branch list. But before it's used, I do the following:

- If the node is a minimum: while { prev node exists, has the same position and is also a minimum: prev node becomes new split node }

- If the node is a maximum: while { next node exists, has the same position and is also a maximum: next node becomes new split node }

This results in perfect lists for all three axes, for the first split of Sponza.

This approach doesn't mess with source data and is fast (although the more expensive 'a < b' test in the sort code bothers me slightly). Tweaking the split node pointer like I described is really fast and shouldn't hurt performance at all.

I hope this saves you guys some time. [SMILEY Smile] I'm moving on to straddling.

(L) [2006/02/12] [Phantom] [Havran's note on kd building] Wayback!

I don't do a second walk to split the list after I calculate the cost. During cost calculation, I keep track of the lowest cost, the node where it occurs and the position. So once the cost calculation is done, I already have the node where I wish to split. The double linked list allows me to do the split itself instantly.

Please check my solution, it appears to work correctly, although I have to admit I'm not completely sure why. I will keep checking once I have recursion in (need straddling stuff first).

Using a list of unique splits is not harder perse, you could simply store a number in each entry describing how many primitives start or end at that location. That's what I did in my original compiler. Problem in that case are the other axes, that probably don't overlap at the same location. You would get different numbers of nodes in each list. That's not something I want. Would be very hard to verify the correctness of the solution.

P.S. unique entries in the list would of course make the sorting substantially faster...

(L) [2006/02/12] [Lynx] [Havran's note on kd building] Wayback!

Very interesting thread...

fpsunflow pointed me to this forum [SMILEY Smile]

I'm really curious on the performance once it's working...

on first impression this list stuff does look a bit memory intensive, but i haven't really racked my brains about it yet (have to learn for some exams too right now...)

i have written a kd-tree for yafray, and building is still somewhat on the slow side too, (although originally i didn't mind so much as yafray + blender was rather heavy on memory usage anyway => people had to avoid large scenes more or less, but Moore will always get you...), so i want to update it sooner or later.

So far i use a binning technique, as described in "Fast Ray Tracing for Modern General Purpose CPU" by Jim Hurley, Alexander Kapustin, Alexander Reshetov, Alexei Soupikov. But only until the number of primitives drops below a certain threshold (then uses the STL O(n*log(n)) sort) to not loose too much precision (on cost calculation, that is). Probably it could be optimized some more too...

Most extreme test i did was the 10 Mio triangle "thai statue" from stanford's repository. It took about 6 minutes (!!) to build on a (dual) Athlon XP 1800+...however, a bit of it may be due to some swapping, it took a bit more memory than the 1GB RAM i have, but CPU usage really only dropped sometimes in the last minute (the HD didn't really suffer until the thing went into actual rendering [SMILEY Twisted Evil] )

The tree only used triangle bounds for splitting (no bound clipping or overlap testing) and "brute force" tested all 3 axis each time. So with a bit tuning make it maybe ~2.5 minutes...

In case someone is interested, i have written C++ code to clip triangles against a AABB, with Sutherland-Hodgman algorithm. So it simply creates a polygon out of the triangle fitting into the bounding box. I don't know if there's a faster method...

oh and MSVC 7.1 does produce wrong code when compiling it with /Ox, which cost me a whole day to figure out how to overcome...and made the code a bit weird, i should undo that again since it didn't help with the compiler bug anyway...

(L) [2006/02/12] [Guest] [Havran's note on kd building] Wayback!

russian_hacker, eh?

Maybe i should check out who you people are first *g*

Btw, about the cost thing, i think i'm confused again...shouldn't maxima be before minima, i.e. compare as smaller?

Then, when going through the array (from low to high), the best cost is "automagically" at the position where all coincident maxima are below and minima are above, because at the series of identical split positions first the number in the upper part decreases (cost becomes smaller) while in the lower they keep the same, until the number in the lower part increases while in the upper they stay the same...?

Tell me i got it wrong, otherwise i think my code is currently missing the optimal split...and PBRT too...arg!

Let me try...gosh! I think...it was wrong all the time!  [SMILEY Shocked]

before:

=== kd-tree stats (30.687s) ===

used/allocated kd-tree nodes: 1909253/2097152 (91.0403%)

primitives in tree: 871414

interior nodes: 954626 / leaf nodes: 954627 (empty: 196326 = 20.5657%)

leaf prims: 4201180 (4.82111x prims in tree, leaf size:3)

   => 5.54025 prims per non-empty leaf

leaves due to depth limit/bad splits: 103554/454512

clipped triangles: 0 (0 bad clips, 0 null clips)

after:

=== kd-tree stats (19.031s) ===

used/allocated kd-tree nodes: 1248109/1572864 (79.3526%)

primitives in tree: 871414

interior nodes: 624054 / leaf nodes: 624055 (empty: 167520 = 26.8438%)

leaf prims: 1486427 (1.70576x prims in tree, leaf size:3)

   => 3.25589 prims per non-empty leaf

leaves due to depth limit/bad splits: 9308/161153

clipped triangles: 0 (0 bad clips, 0 null clips)

can't see any render errors/differences, though i have to admit my renderer does not even have anti-aliasing yet...

one 3rd lower build time, and each triangle is on average referenced in 1.7 nodes rather than 4.8! OUCH!

fpsunflow, i think yours in sunflow might be wrong too, unless i misinterpret java there...

(L) [2006/02/12] [Lynx] [Havran's note on kd building] Wayback!

yikes!

why wasn't i logged in...must've been the browser tab from before the registration [SMILEY Rolling Eyes]

sorry

-edit-

oh and ignore the "clipped" statistics, the code is not there due to the re-design of yafray...

-edit (yet again)-

okay, now i remember why the bound edges were sorted the other way round: primitives with zero width in one projection, because they will disappear in the uppder side before they appear on the lower one...there has to be a nice solution...

(L) [2006/02/14] [Phantom] [Havran's note on kd building] Wayback!

I was just reading the RTNews notes again, and discovered a wealth of info that I can only appreciate now that I am working on a new compiler:

[LINK http://www.acm.org/tog/resources/RTNews/html/rtnv18n1.html#art9]

There's discussions of overall approaches (binning or sorting), reducing the number of nodes to check, reducing the number of axes to check and so on. If you're working on a compiler (right now I know four people including myself who are working on a rewrite) you may want to read this in detail.

(L) [2006/02/14] [Phantom] [Havran's note on kd building] Wayback!

My kd-tree compiler (Havran style) is ready, apart from debugging. I'm taking down small and large bugs one by one. And so I returned to the split plane positioning: I have a case now where a primitive maximum ends up in the left kd-tree node, and the accompanying minimum in the right kd-tree node... Obviously, the primitive is in the same plane as the split plane.

I'll probably go for tbp's wise advice: Cherry picking, unless someone has a better idea?

(L) [2006/02/15] [Lynx] [Havran's note on kd building] Wayback!

Hm somehow we're not really living in the same world *g*

There are those who care about every millisecond but not about the few extra bytes a few pointers for 20k tris need, and those who only need "reasonably" fast tree building up to several million polys, where 2 or even more extra pointers per triangle do matter quite a bit...

fpsunflower: i think i just killed the most ugly bugs in my kd-tree and now do distinguish between left edges, right edges, and edges that are both. So i they compare left < both < right and the difference in overhead as in the stats above are now true without omitting axis aligned triangles...though i think i have to revise the binning code too, on some scenes a few unexpected results occur (final number of left&right prims do not match the ones used to calculate cost), but it still seems not to cause errors...and maybe i should do binning up to the leaves anyway. I've commited the changes to my yaf(a)ray subversion repository. You can get it with:

svn co [LINK https://ssl.little-isp.de/svn/lynx/yafaray/mainline] yafaray

The triangle clipping code (just cleaned it up a little, but really only a little...) is there too (src/yafraycore/triclip.cc).

That code however only stupidly clips to all 6 cube sides and gives the new bounding box. But since each subsequent split of a tree node only changes one of the 6 bound values, it'd safe a lot of time to safe the clipped polygon and add a re-clip method which then only needs to do 1/6th the work...

(L) [2006/02/15] [Phantom] [Havran's note on kd building] Wayback!

Lynx: Be careful with cumulative clipping: You're going to introduce rounding errors. Probably nothing serious, but still something to keep in mind.

About different worlds: I used to work with relatively small scenes (60k-200k), and didn't really care about compile times or memory usage, as I was really just working on a ray tracer. I'm still doing that, but I'm rewriting the compiler since I don't trust the output of my old one. The new one should be faster, but I'm not going to sqeeze out every byte or cycle.

I think once the kd-tree compiler delivers a nice and clean tree (somewhat tweakable please), I'll revisit LMRTA (it will probably just work once the tree is good), and then either GI or realtime rendering of small dynamic scenes. The Q2 demo was VERY inspiring.

Tbp: Could you add that 10M statue to the repository?
_________________
--------------------------------------------------------------

Whatever

(L) [2006/02/16] [Phantom] [Havran's note on kd building] Wayback!

You could definitely use multiple threads to build a kd-tree. Problem is the first split: There's little you can do to split that over multiple threads. But once you have the first split, you can leave each branch to a separate process/cpu/machine on the network.

BTW I once discussed an idea with Havran that I would like to test out someday, but perhaps someone has already done something similar:

It's hard to measure the absolute quality of a kd-tree, especially if you introduce things like empty space clipping and other things besides SAH. However, such a measurement would be very interesting: Suppose you know what the optimal tree is for a given scene (doesn't matter how you got that optimal solution, and in how much time). You could then compare a tree, generated by an algorithm that runs in reasonable time, to this optimal tree, and express it's performance in a percentage of the optimal tree.

Obviously, 'optimal' depends on the application of the tree, and possibly other factors. But suppose you calculate the optimal tree based on SAH, a given traversal cost and a given intersection cost. Optimal in this context means: The absolute lowest cost, not just for a single split, but for all splits. So, the split cost at each level is determined by the cumulative cost of all subsequent splits. This obviously requires a lot of processing, even for a simple scene, but at least, you would have something to compare to, and you would now how far off a simple SAH solution is, and how much any given kludge helps to approach the optimal result.

If we could share kd-trees somehow, we could also build some of these 'optimal trees' (over the weekend, on a fast machine, for instance). That would be quite interesting. [SMILEY Smile]
_________________
--------------------------------------------------------------

Whatever

(L) [2006/02/16] [tbp] [Havran's note on kd building] Wayback!

Those thingies (boundaries) are always paired anyway, if you restrain yourself to what Havran described, because they are really about the segment they represent.

A lot of cruft in the data layout is removed (ie leftOrRight) if you manipulate segments instead of their boundaries. Now if you pack those Data3D you'd get something like a 32 bytes struct and if you treat them by pair in a segment you'll have to waste at least 64 bytes on it because otherwise you wouldn't be able to 'compress' pointers etc... so it's not so much of a win wrt memory footprint.

(L) [2006/02/18] [tbp] [Havran's note on kd building] Wayback!

I now truely remember why i was so reluctant to jump yet again into the kd-tree building business.

There's still, euphemism ahead, quite some due fixage but at least i have a traversable something (even if my traversal gets a bit upset at times - yes, that's another euphemism).

I can't really provide quantitative figures, it's still choke full of integrity checks & other safety nets, but i'm glad that rewrital at least cleared some long standing issues i had with the previous compiler.

Have a nice saturday [SMILEY Smile]

(L) [2006/02/19] [tbp] [Havran's note on kd building] Wayback!

Pfff.

One thing annoys me tho, and i can't quite point the finger at it yet: that shining brand new compiler is (way) faster, more correct, bubbles up voids higher than ever and creates (generally) shorter trees but, clipping issues aside, the old one still produce a faster-to-traverse tree [SMILEY Shocked]

There's a winning heuristic tweak somewhere.

Differential analysis of trees is just so entertaining. /me faints.

(L) [2006/02/19] [tbp] [Havran's note on kd building] Wayback!

I still can't match the quality of the output from previous version, but here's some (very) rough compilation time figures.

For sponza, 66453 triangles, going up to depth 25:

. the original compiler takes 3.136s

. Havran's edition: sort ~35ms then 453ms with no clipping or 1328ms clipping everything everywhere everytime.

For 270k triangles it's like 1.3s vs 20s.

edit: buddha, 1063544 triangles... ~750ms to sort, 15.890625s to compile up to depth 25 with unconditionnal clipping.

That code isn't tuned at all (i've just shut off some debug output to avoid benchmarking the console), and that was done with the crapiest c++ compiler around.

But at least it's getting somewhere.

(L) [2006/02/20] [Phantom] [Havran's note on kd building] Wayback!

I finally got it working. I have been staring all week at two (really) stupid bugs, adding integrity tests, chasing small problems.

It turned out that I was counting the left and right polygon count wrong in the SAH calculations... Dude. Besides, I was using volume instead of area. [SMILEY Smile] Terrible.

Anyway, it is working reasonably well now. I get some artifacts when clipping empty space all the way down to the lowest levels (so I'm not doing it anymore below level 20 or so, while I continue looking for the cause), the tree is twice as large as it used to be, and, most importantly, it performs worse than the previous compiler in terms of raw ray throughput.

There are good points about it too obviously, besides the negative things that probably will improve over time:

- Average intersections per ray went down drastically, indicating that I met my goal of building a better tree for MLRTA;

- MLRTA now indeed helps, I still need to collect some hard figures but even the lego car is faster with it than without it;

- The compiler is much faster than the previous one, although I am not nearly matching the performance some others report. Legocar takes seconds, Sponza almost a minute. I have lots of ideas to make it faster, but still it feels like I am missing something fundamental.

- The code is much cleaner. No nasty shortcuts, just a clean compiler. The only shortcut right now is empty space cutoff, as I test for empty space before I do SAH calculations.

I only got it working last night, so I will spend tonight tweaking and striving for decent output.

One question: How much empty space should be cut off? Right now I cut off the largest slice relative to the dimension for that axis, with a minimum of 10%. I am considering to lower that number to 5%, and increase it as I reach greater depths (i.e., I won't split off a 10% splinter at level 20, but I would cut off 40% if that's possible). Also, while typing this I just realized that 'percentage of dimension for axis' is a bad figure, as this would favour a 0.5 units cut over a 10 units cut for a aabb that has dimensions 1, 80, 80.

Anyway, I'm tweaking instead of 'getting it to work'. [SMILEY Smile]
_________________
--------------------------------------------------------------

Whatever

(L) [2006/02/20] [tbp] [Havran's note on kd building] Wayback!

Ah, got the wolf out of the wood finally [SMILEY Wink]

First, no i'm not sorting 10M segments in 750ms, but 1M (or put another way 2M boundaries):

. it's dispatched (per axis) on x cpu (here 2), hence the aproximate timings.

. i initially missed your linked list sort thing, and i'm using STL's tuned hybrid quicksort (via a thin wrapper that gets inlined blah blah blah).

. as i really sort an array, regenerating it per axis etc, and i dispatch to threads, this tactic is only a win on a large enough triangle soup

If you could give me some figures for your list sorting that would help me decide if i should dump that approach.

Note: i should really use something with less overhead for smaller soups, but obviously i don't care enough atm.

About voids:

I used to integrate such void cuts into the scoring in previous compiler, but like you (and according to the phrasing of the MLRTA paper) i shortcut it and do that just before scoring.

It goes like this:

(L) [2006/02/20] [Phantom] [Havran's note on kd building] Wayback!

Another problem I encountered is this:

I recurse if any node contains more than MAXPRIMSPERLEAF triangles (set to 2 or 3, usually). This morning, I added a test to see if splitting helps: I don't do the recursion if a split means that one of the nodes gets zero triangles (i.e., the split doesn't help immediately). Sadly, this test reduces the ray throughput, apparently some splits to make a difference, just not right away. It would be nice if I could predict that somehow.

Note to self: A split is not just useless if one of the nodes contains zero prims, it is useless (at first sight) if one of the nodes contains just as many prims as the parent. Need to try that when I get home. [SMILEY Smile]

Also, I have problems with the comparison of the cost of not splitting against the best split cost: Cost of not splitting is calculated as prims_in_leaf * area_of_current_node, but when I set this as the initial best score, I again get a lousy tree...

And one last note (also more or less to self): I would like to add a penalty for clipping triangles. After all, if a split plane position means that some triangles end up in both nodes, the memory load increases, so there's a cost to that. I would like to put that in the formula somehow.

One last question: How do you set your cost of traversal and cost of intersection? Right now my cost of intersection is .5, traversal .1.
_________________
--------------------------------------------------------------

Whatever

(L) [2006/02/20] [Phantom] [Havran's note on kd building] Wayback!

I am again going to suggest an article from RTNews, by Eric Haines:

[LINK http://www.acm.org/tog/resources/RTNews/html/rtnv17n1.html#art8]

I am reading it now, just got to the point where he suggests that it's worthwhile to not just keep track of triangle counts, but also triangle area to estimate the cost of a split plane position.

EDIT:

Eric Haines claimes that empty space cut off is worthwhile if (and only if) all primitives are taking up only 1/3rd of the full range. Interesting.
_________________
--------------------------------------------------------------

Whatever

(L) [2006/02/20] [tbp] [Havran's note on kd building] Wayback!

Heh, that was what i've noted too, that 1/3 ratio, in my diagonal read of that really interesting article. I wonder how i've missed it, i thought i had all rtnews scanned by now. I was wrong.

Seems that i'll have to postpone my tinkerings with the split plane until i have digested that paper.

Btw i've never been able to lay hand on the orignal 'J. David MacDonald and Kellogg S. Booth, "Heuristics for Ray Tracing Using Space Subdivision"', even the ACM failed me. Anyone? [SMILEY Smile]

(L) [2006/02/20] [Lynx] [Havran's note on kd building] Wayback!

I am only using a cost ratio, i.e. cost_ratio = traversal_cost/interection_cost (which i just notice is currently 1.2, so traversal more expensive than intersection *lol* memory safer that is...)

So costs become:

leaf cost = ntriangles

new cost = cost_ratio + (belowSA*nBelow + aboveSA*nAbove - empty_bonus)/nodeSA

One important thing seems to not limit the tree depth by a too low hard value (and for ~1Mio tris i consider 25 as "low", although it may be okay for buddah, but put him on a large empty box and try again...) but make sure the function aborts when splitting becomes too expensive (so to me cost clearly includes memory cost). Otherwise if you try to safe some memory you get occasional huge leafs that kill performance, my weird cost ratio was the result of a memory/performance tradeoff...but likely still room for improvements...

I do think i provoke cutoff of too small empty space, i only scale the empty bonus by the percentage of empty space + a little constant. Although new cost should still be above leaf cost, i do allow a bad split to eventually find a (hopefully much) better one in next iterations, but near leaves cutting off <5% and then finding no other split can't pay off i think...neither traversal nor contruction time wise.

On the other side i think a percentage limit also pushes empty cutoff to deeper levels...

And lately it seems the bare tree statistics can look nice, yet the needed intersections per ray vary greatly, and the binning has some unpredictable influence too, only using 256 bins caused an almost unworkable tree on extremely uneven geometry so i tried 1024 and it resolved it...so i more and more believe, the method you are implementing is the only reliable O(n*log(n)) one...

Are you going to reveal some (commented) code snippets?

(L) [2006/02/20] [tbp] [Havran's note on kd building] Wayback!

That depth 25 limit was/is very artificial indeed, and it's just there to avoid that damn thing to get everything swapped out because it diverged.

Plus my older compiler already takes 80s+ wrestling with el Buddha, so even if depth 25 is clearly not enough it will have to do for now [SMILEY Smile]

And those results aren't meant to be accurate in any shape or form, that was just to give some some crude performance figures.

The only problem i see with a compiler more or less on free wheel is that it better be 'correct'. I never could get the old one to be reasonable in all cases. There's more hope this time.

This method may well be the only one 'reliable', there's still some place for faster/simpler methods if compilation times matter more than quality because that one still requires a full frontal sort.

Havran's note on kd building back