Fast fast kd-tree building back

(L) [2006/04/09] [tbp] [Fast fast kd-tree building] Wayback!

I've been waving hands on that topic a tad too much to my taste lately, so i'm going to try to clarify what i really meant.

In his note Havran advises the use of double linked lists, i think that's overkill if careful. Then, like Wald, he proposes to work with boundaries (events in Wald terminology) as the basic unit; again i think that's not wise because such events/boundaries are always paired and we'd be losing that connection.

That leads us to a structure like...

(L) [2006/04/10] [Phantom] [Fast fast kd-tree building] Wayback!

OK, I have a ton of problems even understanding what you wrote. I will begin at the start of your post and quit at 1/3rd. [SMILEY Smile] Tons of quoting ahead:

(L) [2006/04/10] [tbp] [Fast fast kd-tree building] Wayback!

Glad you stopped by to point at the gibberish [SMILEY Smile]

First things first: why the hell do i insist on not losing the connection between events and the underlying triangle.

We are really trying to partition triangles, we track extremae on each axis for that but ultimately we're massaging a whole triangle all the time. If we keep related data together, avoiding unnecessary data duplication or indirection, that's a win both for locality and memory footprint. It also simplifies some phases.

(L) [2006/04/18] [fpsunflower] [Fast fast kd-tree building] Wayback!

I think I understand the main points of your method. Main problem I see is the extension to 64bit archs. Do you have any plans for that ? Or is there a way to force memory allocators to work in the lower 4Gb - or always in the same "segment" - so you can safely chop the top bits? I doubt we'll get a real scene with enough triangles anytime soon anyway (alignement forces only 28 usable bits in the ptr_t - correct ?) Until someone decides to share the Boeing dataset [SMILEY Wink]

lbox_t *clone; is only needed for overlapping triangles correct ? Could you not use a small hash table to keep track of those? In practice it would always be fairly small and that way you'd only pay for the extra bits when you needed them as opposed to once per tri. It messes up your alignment though =)

My compiler is relatively close in how it stores things (ie: tries to be minimal), except it only needs the boundary_t's (which are expressed as packed 64 bit integers). Since I don't use linked lists I use the extra bits of ptr_t to store type (open/close/planar) (2 bits), axis(2 bits) and primitive_id (28 bits). The whole thing can be sorted quite fast (and only once) using radix as you've figured out by now I guess.

The only other memory I need is an array to encode which triangles should go left/right/both at each step. which is just 2 bits per primitive. I don't think I can do it in one pass since I don't have access to the segments - unless I query the triangle itself which would be expensive (?). So I do a total of 3 passes over the splits: cost, categorizing, splitting. All axes are sorted into a single array.

So my total is 48 bytes+2bits per triangle. Of course the arrays grow a little as you go due to overlap, but so do the linked lists. And you can free the array of splits before recursing - so memory usage goes down nicely as your recursion moves to the right of the tree (in practice of course the garbage collection is stupid and hangs on to everything until the last minute).

The main drawback of my method is that it requires allocating new arrays at each step to store the result of the splitting pass. Although you're free to delete the old one right after you've filled up the 2 new ones, there's still that moment when you're using twice the memory that you should be. Clipping is also alot trickier to do - but I've let that aside for now since Wald and co report gains of only ~30% max from clipped kd trees.

Using extra bit-twidling and loosing some speed in how things are fetched/decoded I could actually store the split values only once and only pass around the 32 prim+flag bits which would save some more space. But then clipping becomes impossible.

I hope that made some sense.

(L) [2006/04/18] [Phantom] [Fast fast kd-tree building] Wayback!

I think I'm finally starting to see what you're doing. Holy crap, that's some data structure. [SMILEY Smile]

Allow me to describe it in my own words, so you can verify that I got it right this time, and so subsequent readers have less problems grasping it:

I'm building from the ground up here.

- For a single axis, one could simply build a linked list of events, with three members: float pos, int flags, and a next pointer. 'Flags' contains either 'tri start', 'planar' or 'tri end'. This list could easily be sorted, with pos as the primary key, and flags as secondary so that 'tri end' events happen before 'planar' and 'tri start'.

- Stepping up to "Havran's addendum style": The three axii are now stored in a single structure. For each event, we need three next pointers, one for each axis, so that events can now be linked differently for each axis. The event position is thus also an array of three floats, one for each axis.

- TBP style: This time we pair events. We store a full bounding box (that's 6 floats), and also 6 next pointers. The first three pointers are used when this event is interpreted as a 'tri start' event, the other three if this event is a 'tri end' event. A polygon that is planar on one axis, has two equal pointers for that axis.

Building the list is done as follows:

- First, for each triangle, an event structure is allocated and initialized:

  * Each of the three 'tri start' next pointers is initialized to point to 'this';

  * Each of the three 'tri end' next pointers is initialized to point to the next event that is allocated.

  * If the polygon is planar for a given axis, the 'tri end' pointer is copied to the 'tri start' pointer for that axis.

  * The 6 extremae are set to the extends of the bounding box of the triangle.

At this point, a huge trick is introduced. We can't just let the next pointers point to a full structure instance, since we wouldn't know which side of that structure we want to use. Allow me to clarify that: A 'tri start' event could point to the opposing 'tri end' event of the same primitive, but it could also point to a 'tri start' event of another primitive that overlaps the first primitive. There's no way to know which side we intended to point to. So, we point directly to the 'next' pointer for the desired axis and side.

This introduces a new problem: How do we determine the address of the structure that this pointer belongs to? Assuming the structure is aligned to an address that is a multiple of two, we can simply 'and' out some bits to get the structure base address. The structure proposed by tbp is 64bytes (after padding) on 32bit machines, but it's larger on 64bit machines, so that's a problem.

An alternative is to encode the side the next pointer is pointing to in the pointer, without changing it's address. So, the pointer (AND 0xFFFFFFFE) points to the start of the structure instance, and the lowest bit (the bit that is masked away) determines wether we want the left side or right side next pointer.

At this point, we can treat this list as three separate lists (one for each axis) by following next pointers; only oddity is that each allocated entry basically represents TWO items in the linked list. I.e., if there's no polygon overlap for a particular axis, we would have to follow the next pointer TWICE to get to the next primitive.

- Next, for each axis, this unsorted list is sorted:

  * The list can be treated as a normal linked list, so standard merge sort for linked lists should work. Better sorting algorithms exist (especially radix sort).

The cool part so far is that no pointers are needed to get from one side of a triangle to another side, which is useful in many cases. Also, memory usage is very optimal.

Before I dive into scoring and sublist scoring I would first like to hear 'affirmative' on the above outline.

If the above is correct, I can see why you had problems explaining it. And I also believe you should really write a paper on this approach, as it's vastly different from Wald/Havran's approach and also clearly a lot better, both in speed and in memory usage. I can help you writing it if you wish.
_________________
--------------------------------------------------------------

Whatever

(L) [2006/04/18] [Phantom] [Fast fast kd-tree building] Wayback!

@fpsunflower:

If my interpretation is correct (see above), I don't see how you could apply the same idea to a structure that has no linked lists?
_________________
--------------------------------------------------------------

Whatever

(L) [2006/04/18] [Phantom] [Fast fast kd-tree building] Wayback!

Code is always nice. Is it an O(NlogN) implementation? Perhaps you should post it in a new thread, for easy reference.

BTW, if I grasped tbp's idea correctly, I think it's quite awesome. It's definitely a great way of reducing data and structuring memory access.
_________________
--------------------------------------------------------------

Whatever

(L) [2006/04/18] [Phantom] [Fast fast kd-tree building] Wayback!

@tbp:

A couple of things I would like more info on:

- Clipping: I can't imagine any cases where you can't get away with simply adjusting the aabb for straddling tris (after cloning; the clone will obviously get a different aabb, or rather: The complimentary aabb).

- Sorting: How on earth did you use radix sort to sort this data nightmare?

And I still think we should do a paper. Your approach is significantly faster and better for the cache. It's also non-trivial to deduct from Havran's/Wal's approach. And, a 3x speedup (which comes from a better algorithm in this case, not just a more optimized implementation) is VERY significant considering the fact that SAH kd-trees are only *just* too slow to use on dynamic scenes. This approach makes the grid based approach insignificant. So I do think it deserves a paper. Besides, it would be really cool to have that paper floating around the internet. [SMILEY Smile]
_________________
--------------------------------------------------------------

Whatever

(L) [2006/04/18] [fpsunflower] [Fast fast kd-tree building] Wayback!

I agree with phantom. This would be perfect for a small paper at the upcoming rtrt-fest in utah. I might be able to provide a couple "production" scenes (~1-2 million polys) from the movie we're doing at work. Might not be able to post any pics, but it'll provide some interesting datapoints at least.

We should also run through Wald's animation test scenes and see how close we can get to his grid paper.

Oh and I guess there's a way of multi-threading the whole thing ? [SMILEY Wink] (*sound of tbp's head exploding*)

(L) [2006/04/20] [Phantom] [Fast fast kd-tree building] Wayback!

It's kinda quiet here all of a sudden, and I'm rather busy with the new game academy, but I have managed to do some coding nevertheless. Current status: First iteration of the NlogN compiler now produces the exact same results as the Nlog2N compiler, so it appears to work. List splitting is the next step, but that shouldn't be too hard, I hope.

Here are my data structures:

(L) [2006/04/20] [tbp] [Fast fast kd-tree building] Wayback!

You're an heretic. The One True Nomenclature is where you postfix anything looking remotly like a type with '_t', à la C99.

Not much to comment, so i'll digress into nitpicking. There 2, exclusive, sides so that's one bit not two; if you decide to also pack in a planar flag then that's 2 bits total. Also you can only hide so much details with syntactic sugar: in that pointer jungle there's lots of potential aliasing and/or undue reloads. Thusly the 'restrict' idiom is useful, and also a method - let's call it 'extract' - with a prototype like lbox_t *extract(ptr_t p, sideness_t &side) that produces both the pointer & side bit at once.

You directly cast pointers to 'unsigned long', that's a bit brutal as C++ comes with ptrdiff_t & intptr_t to help with portability. What is more worrysome is your use of constants: 0xffffffff won't fly on a 64bit arch, on the other hand the shorter equivalent ~0 when properly casted will behave (or directly ~3 in your case).

(L) [2006/04/23] [Phantom] [Fast fast kd-tree building] Wayback!

I got working output from my NlogN compiler, but it gets stuck sometimes (iterates to max tree depth on scene6), causing severe artifacts, probably because it sometimes (often?) puts far more than the max of 31 prims in a leaf. Did you encounter the same problem? Could it because of the clipping?

I'm going to implement full clipping first, to see if this is the problem. What I intend to do is this: I will remove data for straddling prims from the lists, then clip the prims and reinsert the events. That should give the same output as the nlog2n compiler. If I still get problems, I must have done something wrong... Problem is, the first couple of splits appear to be identical to the output of the nlog2n compiler, so the basis seems to be right.

And another problem: The mergesort is painfully slow, but I don't yet see how radix sort can be implemented on top of the current data structures... My understanding of radix sort is lacking though, so perhaps I should just read up on that first. Right now, quicksort is vastly outperforming the linked list merge sort. Perhaps I should use an intermediate data structure for sorting, then convert it to linked lists afterwards to gain some performance... Tree builds for scene6 are now 'sub-second' even in debug mode on a slow machine, so that looks promising.
_________________
--------------------------------------------------------------

Whatever

(L) [2006/04/24] [tbp] [Fast fast kd-tree building] Wayback!

With partial clipping, that is with cloning and clamping to the split plane on the winning axis but no 'fixing' on the other 2, on scene6 i get:

compiler_t::compile(): 4 ms

        max_lvl 22

        num_nodes 3083

        num_max_ids 12 (3.98 avg/leaf)

        num_leaves 1542

        num_leaves_empty 364 (23.6 %)

        num_space_cuts 0

        num_ids 4683

... and that seems right; of course it would be better with perfect-clipping.

If you do no clipping, or only partially, you have to be more careful in your scoring otherwise it's easy to get stuck in useless cycles where you always produce the same split. It sounds like that's what you're experiencing.

I'd suggest explicitely checking split candidates that shouldn't get scored: splits should observe cell.min < split < cell.max and that's not warranted if you don't clip. More precisely skip scoring when split <= cell.min and consider the scan for the axis being done if split >= cell.max. That should fix most of your problems.

Some references for radix sort:

[LINK http://en.wikipedia.org/wiki/Radix_sort]

Skip Tardiman's contrived article and code and jump directly into [LINK http://www.stereopsis.com/radix.html]

That's much clearer and faster.

Like i've said somewhere else, i've benched various versions and checked the effect of the number of passes. On my machine that gives: 3, 2, 4 when sorting a simple float array.

Now that doesn't address the fact that we need to produce singly linked lists as the final output. While going for 3 passes is the fastest option (because it fits caches, i guess) it lacks some useful symetry that reduces the requirement in auxilary storage. So i've gone with 2 passes and only need to temporally hold N indices.

That nice for larger sets but certainly have a negative impact on small scenes (heck i have more counters than there's triangles in toxie's q2 scene).

I'll post code for the final pass later on if you will.

EDIT: forgot to say there's a simple method to really see what's happening with those crowded nodes, export all those triangles + cell for visualization (i export a triangulated box + triangles to .obj); then it's pretty clear when you're having clipping issues

(L) [2006/04/24] [tbp] [Fast fast kd-tree building] Wayback!

I've uploaded a scene known to annoy my frail older compiler which somehow gets stuck on it (or takes an innordinate amount of time, i have no patience) and produce crappy tree if handled without perfect clips: [LINK http://ompf.org/forum/viewtopic.php?t=95]

381064 triangles

(L) [2006/04/24] [tbp] [Fast fast kd-tree building] Wayback!

Damn, 1M not-so-trivially-compressible triangles to play with, i'm already drooling.

Not being an artist i've never thought about hairs, cue in jokes about my hair-dresser being dead, but i can see how they could be a major pain to get through a kd-tree once triangulated and even then raytraced efficiently (and that's not even addressing shading). Seems like an interesting problem.

(L) [2006/04/24] [Shadow007] [Fast fast kd-tree building] Wayback!

I'm a lurker here and don't know nothing about Raytracing apart from Phantom's articles, but I've got a lingering question : why should the KD-tree compiler output be a linked list ? why not an array ?

Thank U, and please feel free to ignore the question if it feels to annoying/trivial...

(L) [2006/04/24] [tbp] [Fast fast kd-tree building] Wayback!

Hello lurker,

the output of a kd-tree compiler is a kd-tree [SMILEY Smile]

If you want to acheive compilation in n.log(n) time you have to sort your primitives once and then maintain that order throughout the rest of the process. As clipping against node's cells will introduce redundancy that means you'll face, one way or another, insertions. That's why lists are favored.

(L) [2006/04/24] [Shadow007] [Fast fast kd-tree building] Wayback!

Yes, I should have made myself clearer : I had indeed at least understood that the KDtree compiler output would be a KDTree [SMILEY Smile] , What I now understand is that the Sorting is done as a "pre-processing step for the actual compilation. Have I got that right ?

But do the final KD-tree's leafs need to contain a linked-list of primitives, or is it just the way they are obtained after the compilation (and all the included insertions) ?

Would they not be better arranged in arrays for the actual ray-tracing ?

PS thanks for the quick and clear reply

(L) [2006/04/24] [tbp] [Fast fast kd-tree building] Wayback!

Yes, that initial sort on the whole set can be viewed as preprocessing: if you sort at every step you'll fall in the n.log²(n) category. Now one could argue that building a kd-tree is nothing but sorting as a kd-tree is a structure allowing faster searches (in our case spatial queries) to begin with [SMILEY Wink]

Your other point makes the - false - assumption that you'll be using the same tree used to build at run-time. In fact there's not many compelling reasons, in my opinion, to do that:

. memory required to store a kd-tree is some order of magnitude smaller than the data set it indexes, there would be no point in using such structure otherwise.

. translating/copying such tree is done in linear time.

. the data you need to build a tree have little in common with what you need to traverse it, plus given how tight is the final representation it would be cumbersome to say the least.

. also, the final representation might be quite funky regarding its memory layout.

. in fact while building you don't need to refer to earlier/later nodes, those could be just streamed out to disk.

. but then you also may want to run various passes on the tree once totally built (and might need cross references or auxillary data etc), ie to optimize away silly splits.

So, really, it doesn't matter how your tree is layed out while you build it as you'll cheaply lower the representation before rendering (8 bytes per node, adjacent childs etc).

(L) [2006/04/25] [Phantom] [Fast fast kd-tree building] Wayback!

Good news: I got a fully working nlogn compiler now with full clipping. Adding the clipping was a bit of a pain, by the way. Right now I have many passes, so I need to do a lot of optimizations, but that will have to wait till tomorrow.

List of passes:

(note: Each primitive is represented by a bounding box structure, like tbp suggested. The bounding box has a flag that denotes wether the primitive is completely to the left, completely to the right or straddling)

- First, there is the normal cost calculation. I do the usual walk of all three axii, nothing special.

- Next, I set all bounding boxes to 'straddling'.

- Then, I walk the list again to identify primitives that are completely left or right of the split plane. The remainder is 'straddling' (thanks to the previous pass). Planars on the split plane are added to left or right, depending on the cost calculation.

- Then, I walk the list again to split it to a left and right list. No clipping is performed; straddlers go to both lists (a clone of the bounding box is created to allow clipping in the next pass).

- Next pass: Clipping. For this I first walk the generated left list, then the generated right list. Left and right primitives are ignored, but straddlers are clipped, their associated events are removed and reinserted as they may have changed. I also test for new planars here.

- In a final pass, I store primitives in the newly created kd-tree nodes. This step will be dropped later on, as this needs to be done only for leafs, but right now it was easier to keep the old structures.

That's a total of 6 passes, of which 2 (cost calculation and list splitting) are performed for all three axii. At least one pass can easily be removed, but that still leaves me with 5. And I am ignoring the node deletion/insertion, which also involves walking the lists, thanks to the single linked list structure...

So there's plenty of work to do, but at least it gives correct results. I need to check if it gives the exact same results as the nlog2n compiler, but theoretically, everything is in place.
_________________
--------------------------------------------------------------

Whatever

(L) [2006/04/26] [tbp] [Fast fast kd-tree building] Wayback!

It's quite entertaining to only have one way lists, eh? I hope you enjoy the ride [SMILEY Smile]

I wouldn't expect an exact match if i were you, for instance on the same arch i get a slight diff (a couple of nodes generally) between gcc and msvc binaries. That's better than what it used to be but i'm still a bit depressed, i thought i was careful enough; anyway i'm not going to require compilers to be stricter wrt optimizations.

(L) [2006/04/26] [Phantom] [Fast fast kd-tree building] Wayback!

Well when 'different tree' means 'slower tree', I'm not that forgiving. The nlogn compiler is producing results within 2 or 3% of the nlog2n compiler (in terms of rays per second), so it might be a simple system hick-up. Now that I mention this, I just realized I ad an exceptional large set of background tasks last night.

I'll compare node counts today; if it's close enough I'll declare this compiler 'working' and then it will be optimized, at least slightly. I'm not striving for the fastest compiler, but I do want to be able to build larger sets in reasonable time. Besides, this algorithm hell is fun. [SMILEY Wink]
_________________
--------------------------------------------------------------

Whatever

(L) [2006/04/26] [tbp] [Fast fast kd-tree building] Wayback!

I haven't formally done an optimizing pass on my compiler either, yet. I mean i've taken care not to write obviously slow code but that's it. From a cursory dissasemblage, it's a branch jungle out there; i think i can weed out some.

Now what i really ought to do is to profile it proper (and maybe try to collect some meaningful cache hit/miss data).

Hmm.

(L) [2006/04/26] [Phantom] [Fast fast kd-tree building] Wayback!

OK, speed is back to normal now (was indeed caused by background processes). Stats for the nlog2n compiler (Toxie's scene6):

(L) [2006/04/26] [tbp] [Fast fast kd-tree building] Wayback!

Max leaf depth: 52 ???

(L) [2006/04/26] [Phantom] [Fast fast kd-tree building] Wayback!

Ehm yeah. [SMILEY Smile]

That's probably not good, but the point is, it is using the same scoring as the nlog2n compiler, and it's producing virtually the same result. So I'm happy. Improving the compiler means tweaking the scoring; all bookkeeping is working (and still rapidly improving at the moment).

I must say that an nlogn compiler requires a rather daunting amount of hardcore code btw.
_________________
--------------------------------------------------------------

Whatever

(L) [2006/04/26] [tbp] [Fast fast kd-tree building] Wayback!

I think there's a bit more than just sub-optimalities, here is the result side by side from my partially clipping compiler vs your fully clipping one. There's some clipping in that scene (table/chair vs walls etc), so a perfect clipping compiler definitely should have an easier job.

Total node count: 12509    / 3083

Leaf node count:  6255 / 1542

Empty leaf count: 2243 (35.8%) / 364 (23.6 %)

Min leaf depth:   2

Max leaf depth:   52 / 22

Average depth:    20.501358

Other useful stats with no match,

num_max_ids 12 (3.98 avg/leaf) - that is max triangles in a leaf and average triangle/non-empty-leaf -

I mean what my compiler produce is not even remotly good by any standard.

So, something's still a bit fishy in your scoring (i'll put my bet again on useless split cycles of some sort).

Speaking of kd-tree quality metric, i think it's time to talk again about a way to trade our trees so we can bench via rendering while removing our respective renderer bias.

(L) [2006/04/27] [Phantom] [Fast fast kd-tree building] Wayback!

I hit a pretty major snag with my compiler. For clipping, I need to move events around: The 'left' side of a triangle will shift to the right, and the 'right' side of a triangle will shift to the left. Initially, I implemented this as an event deletion/reinsertion, but this is way too slow, even for the fairy forest scene. So I did it slightly different now: First, I simply clip the primitives, and store the clipped extremae in the events, without moving them. When a primitive gets clipped away, I flag the events as 'INVALID'. This is all in O(N), obviously.

Next, I walk the list of events, and remove any INVALID events. This can only be done in a second pass, as clipping is done while walking one axis (always the x-axis in my case), but it invalidates events in random positions on the other axii. The second pass is O(N) though, and single-linked list disasters are prevented by keeping track of a 'next' pointer.

Finally, some events are in the wrong place, because of the clipping. This is fixed by resorting using mergesort. And that's the problem: While mergesort is pretty fast for lists with only a couple of unsorted nodes, it still makes my compiler nlog2n, strictly speaking...

I believe this problem is mainly introduced by the lack of a double linked list: Clipping would normally require mvoing extremae back and forth through the list, without deletion/insertion or resorting. With a single linked list, insertion is expensive.
_________________
--------------------------------------------------------------

Whatever

(L) [2006/04/27] [tbp] [Fast fast kd-tree building] Wayback!

When we start none of triangle's proxies, their aabb, are clipped as the root node by definition contains them all.

Therefore while recursing by splitting those voxels in 2, clipping can only happen against those split planes and nowhere else; on the split axis we'll obtain 2 complementary segments [min, split] & [split, max] for each straddling triangles. On top of that we know exactly where to put each side of those 2 segments: the min/max part won't move and by definition the end of the first segment will appear at the end of the left child node list and the opening of the 2nd segment will appear at the beginning of the right child node list.

(L) [2006/04/28] [tbp] [Fast fast kd-tree building] Wayback!

Ah! But i've never claimed it was preserved for anything but the primitive we're tinkering with.

Obviously when we 'adjust' extrema on U/V we may change inter-primitive order, but that's not an issue at all. What i was looking into is: what can we say regarding this primitive we're clipping and its extrema positions/movements in the 1D space of the list on those axis.

EDIT: nice ascii art btw [SMILEY Razz]

(L) [2006/04/28] [Phantom] [Fast fast kd-tree building] Wayback!

That's easy to say if you have no clipping at all. [SMILEY Smile]
_________________
--------------------------------------------------------------

Whatever

(L) [2006/05/02] [Phantom] [Fast fast kd-tree building] Wayback!

I'm struggling with those huge trees, can't find a solution. One thing I did notice is that my 'average prims per non-empty leaf' count is much lower than yours (tbp): You mentioned 3.98, I get 1.723. Not sure why, not sure if this means I get a better tree either. I do suspect that building a tree that's 3 times larger is a bad thing for performance (I mean, build time).

It's also worrying that I thought I had a straight forward SAH builder, while in reality my scores are so far off.

Here's my current scoring code, perhaps someone could take a quick peek to see if I'm doing something obvious wrong?

(L) [2006/05/02] [fpsunflower] [Fast fast kd-tree building] Wayback!

Your numbers actually seem to be closer to Wald's than ours I think. Probably due to the robust clipping. Given that your trees are of higher quality than what your previous compilers did, chances are you're actually on the right track. I suppose the kd-tree exchange could help validate this. I personally would be less concerned with the tree size and more with the end performance.

Some suggestions though:

* what are travcost/isectcost set to ? These might not be valid anymore. Did you try measuring these via linear regression as xela suggested ? Or just trying lowering isectcost.

* I set maxprimsper leaf to 0 (to be able to clip empty space inside single prim leaves) - thats only going to give you deeper trees though.

* Wald mentions having to do a final optimization pass to "undo" splits that don't actually help much. It might be worthwhile to undo those splits that produced (3/4 in L/R) (Actually it would be good to maybe dump the content of those leaves + where the split occured as geometry so you could look at it and see why its trying to do that)

* Testing for planars go left/right is a bit of a waste of work in my experience. The intel guys suggest always going to the smaller leaf (one of their heuristics in the MLRTA paper). You can double check by counting the number of times this heuristic doesn't agree with the raw cost optimization I guess.

(L) [2006/05/02] [tbp] [Fast fast kd-tree building] Wayback!

I'm not good at multi-tasking, so please excuse inacuracies.

There's a bit of clipping in that scene as i've said earlier. Currently my new compiler only does partial clipping, i've verified the content of the most crowded leaves and as expected there was triangles requiring proper clippage (that is either they were not intersecting the voxel anymore - but their aabb did and i don't remove them because that should be handled via full clipping - or their aabb wasn't a good approximation). As said earlier i was expecting a compiler doing full cliping to behave better in that area.

So it's only fair that your compiler ends up with a lower avg-tri/leaf (because mine can't cut as far as yours).

Some random notes.

I initialize the best cost with the cost of leaf creation. If it's beaten i split, otherwise i create a leaf. That saves a bunch of evaluations.

If you consider that there's only one way to split at a given coordinate (that is you score once and only once, sweeping over all primitives sharing that position):

. you don't need to have planars sorted in a fancy order.

. it's cheap to decide where planars should go.

. less evaluations (pop count and scoring).

I still haven't heard a good reason not to do that.

I've done post-process optimizations, in fact my old compiler still does that, but that never brought any significant speedup (but in dubious corner cases).

(L) [2006/05/02] [Phantom] [Fast fast kd-tree building] Wayback!

You made me feel better. I'm not completely happy though, as someone else (I believe it was sunflower) also posted detailed tree stats, which more or less matched yours. (while replying I noticed that the first reply on my cry for help was by sunflower, stating that he's also not clipping fully, so I think I'm happy now)

About some of your suggestions:

(L) [2006/05/02] [Phantom] [Fast fast kd-tree building] Wayback!

Regarding efficient sorting after clipping:

There are two cases, an event either shifts to the left if it is a minimum ( case A), or to the right, if the event represents a maximum (case B).

Then, we have two cases:

1. The major axis (or 'bestaxis', or 'k'): Case A will end up at the tail of the list, once all polygons that get clipped away are removed. No sorting required. Case B will end up at the head of the list in the same manner.

2. The other axii ('ku' and 'kv'): Minima will be shifted through the next pointer. Maxima need to be reinserted (as we use single linked lists). One optimization is possible: The event will not be inserted before the matching minimum, so the insertion search can start there. This is very bad for very large polygons (e.g., the floor of Toxie's tilted kitchen), but usually, it saves a lot of iterations.

So, two of the four cases require no special attention, two require walking a subset of the event list. This probably beats my current resorting using mergesort.
_________________
--------------------------------------------------------------

Whatever

(L) [2006/05/02] [tbp] [Fast fast kd-tree building] Wayback!

De "Regarding efficient sorting after clipping"... seems like we're tuned. I need to get back to it, make up my mind and settle for an approach.

I didn't spot it initially, so i forgot to say i don't special case flat cells at all...

<time goes by>

... but now that i think of it i'd need to double check my code because now that i think of it i think that flat cells can in fact never happen (which is not what was intented).

I also agree with your analysis, stuffing planars to the proper side makes a difference.

But unlike both of you, i couldn't get any traversal time gain by biasing/tweaking scores either for space cuts or straddling primitives.

BTW in my simplified population count there's no branches (but those required for looping), i'm not saying your 'if' equate to branches, but that part of the compiler tends to be a branch jungle. Eliminating them pays.

I'm not sure i follow your timing comparison but i'm happy you're getting somewhere. That's motivation to get that damn thing done [SMILEY Wink]

(L) [2006/05/02] [Phantom] [Fast fast kd-tree building] Wayback!

Yeah motivation was becoming the main problem. I mean, I'm perfectly willing to spend time on a kd-tree compiler (it was quite a task though), and I would certainly like to have the 'perfect' compiler (i.e., very fast, very accurate and capable of eating huge sets) but right now it looks like you can't have it all. Adding clipping costs a lot of time (it's a pretty expensive operation in one of the passes), but I'd rather have a 'perfect tree'. But I was doubting that 'perfect tree' idea very much after those wildly differing figures. I'm not entirely convinced that clipping triples the node count, but OK...

So I'll try to drop the kd-tree topic now, and move back to fast traversal. I have a very basic tracer, just 300 lines or so, that I want to optimize to perfection. I'm not even sure if I'll ever go back to HQ renders, I'm quite happy for the next couple of months with my rendering testbed. Because of it's size and limited 'feature set', it's very suitable for experimentation.
_________________
--------------------------------------------------------------

Whatever

(L) [2006/05/02] [tbp] [Fast fast kd-tree building] Wayback!

<hand waving sequence>

I haven't formally profiled or gathered meaningful figures, but clipping as expressed herein (that is once you notice all the properties of the clipped bits) shouldn't be that expensive.

One thing to keep in mind is that you should pick the method bearing most fruits for small scale operations; i mean clipping should mostly happen deep down the tree: you'll have large triangles zipping through the whole scene, and you'll have to clip them pretty much right from the start but that's not the bulk of it.

If those large triangles annoy you, i think it could be possible to cheaply detect and triangulate them some more. Hmm, no that doesn't make much sense.

Now i don't agree, having a decently fast & robust clipping compiler isn't the end of the trip. I still have my old n.log²(n) compiler nagging the new one with obscenely faster trees; while what i think is the reason for most of that difference would be a real bitch to implement, i'm not sure it's that intractable.

And that's just what i'm sure about.

(L) [2006/05/02] [Phantom] [Fast fast kd-tree building] Wayback!

Hm so far the thing that helped me most was a pure SAH approach, the only exception being empty space cutoff, and even that barely helps. Clipping does help though. What tricks did you use to make a real difference, and how large do you estimate that difference to be?
_________________
--------------------------------------------------------------

Whatever

(L) [2006/05/03] [tbp] [Fast fast kd-tree building] Wayback!

It's - mostly - the same good old trick i've been talking about for more than a year, where each segment is framed within another very slightly larger segment for scoring; that's exposed as a switch in my old compiler (note that it's orthogonal to clipping): without it old and new compiler are more or less producing the same tree and then once enabled the old compiler reliably produce better trees; the more complex the scene is the more benefits it brings.

Of course it's expensive. While i *know* it brings better trees i'm not sure at all that the reason why it works matches the reason i wrote it. The old compiler isn't known for it robustness or crystal clear code, it's a mess; there's little chance to decide whether or not that framing is efficient because it alleviates a bug or because it's good on its own.

Anyway there's only 2 cases and that doesn't change the fact that the old compiler consistently beat the new one; i just doubt other tweaks to the heuristic (in the old compiler) can justify such a gap.

My plan is to have a robust and efficient perfect-clip-full-SAH compiler and then build upon that sound basis, cuz if one thing is sure it's that there's still significant quality to be gained without venturing into speculative or profile guided compilation.

(L) [2006/05/03] [Phantom] [Fast fast kd-tree building] Wayback!

I did a small test with a non-clipping compiler, i.e. I took out the full clipping and just clip the aabb. After that, I do a full sort, just to be sure. Results:

(L) [2006/05/03] [Phantom] [Fast fast kd-tree building] Wayback!

One random thought for the night:

I already mentioned that it might be worthwhile to sort the input set based on first appearance in the leafs, so the object list gains the following properties:

- Nearby polygons are also nearby in memory;

- Ordering inside a leaf roughly matches ordering in objectlist.

I had another idea: Sorting primitives based on their area. A larger primitive will be hit more often than a smaller one; keeping the larger prims together could improve cache coherency.

I'll try to implement a 'sort experiment' facility asap, I'm really curious if it will help at all.
_________________
--------------------------------------------------------------

Whatever

(L) [2006/05/04] [Shadow007] [Fast fast kd-tree building] Wayback!

Well, I'm not fully sure, I guess it should depend on the quality/representativity of the Traversal/intersections cost metrics ...

If in the raytracer, doing 10 intersections were faster than doing 1 traversal step, the KDtree should be much shallower than if the inverse were true ...

I guess this means that for one raytracer implementation, there would be a perfect KDTree.

For all the raytracers having the same measured Traversal/intersections cost, the Tree would still be perfect.

Does that make sense ?

(L) [2006/05/04] [Phantom] [Fast fast kd-tree building] Wayback!

Shadow: Is this your first post? Welcome!

I guess, for all tracers that have the same intersectioncost / traversalcost ratio, the same potential improvements to a kd-tree should work. Or going one step further: Any tracer that has cheaper traversal than intersection, should benefit from a tree that minimizes the number of primitives in each leaf? ... no, that's not true: If it takes three extra traversal steps just to bring down the number of intersections by one, it's only worthwhile if intersections are at least three times as expensive as traversal steps.

I think however that most compilers/tracers made by people on this forum use more or less the same traversal and intersection code.

While I'm posting, here's a small result from an even smaller experiment conducted today: Reordering primitives based on their appearance in the leafs does not help at all, as far as I can measure, so don't bother. [SMILEY Smile]
_________________
--------------------------------------------------------------

Whatever

(L) [2006/05/04] [Shadow007] [Fast fast kd-tree building] Wayback!

Hi, I posted 2-3 questions a few days ago, trying to understand Havran/Tbp's approach to KdTree compilation.

I was introduced to the Raytracing by your articles on flipcode, and since have only so far looked at the docs. (Wald's work and so on).

One other of my computer interests being Ogre3d (www.ogre3d.org) (mostly as a lurker too), I decided to try to integrate both, making a Raytracing RenderSystem for Ogre3d.

I know fully well that given the lack of experience in both domains, as well as my lack of time, I won't even begin to approach something at least partially finished in any decent time...

Still, hopefully, by the time my RayTraceRenderSystem will be ready for use, the CPU/multi-core CPU/Raytracing research progresses will make it sufficiently efficient to either render the video in real-time, or at least cover partially the screen...

Before then, I'll try to have it capture the scene when the PrintScreen button is pressed, and progressively render the RayTraced output as a background task ...

<handwaving>

In fact, what I'm trying to do is use Ogre3d's RenderSystem API as an alternative to Open-RT and Ogre3d's core as the scenegraph ...

</handwaving>

(L) [2006/05/04] [Shadow007] [Fast fast kd-tree building] Wayback!

Oops, forgot to say I was currently working on my First KDTree compiler.

I guess trying directly with a tbp's style builder isn't the smartest idea I've had in a while ...

(L) [2006/05/04] [Phantom] [Fast fast kd-tree building] Wayback!

Using OGRE might in fact be a really nice way to get the public on rtrt.

About kd-tree compilers: I found it useful to start with Havran's/Wald's latest paper (look in links & papers section, kdtree.pdf is the filename): First do a 'naive' compiler and make sure it works. From there on, either stop and start working on tracing, so you get some output, or move on to the nlog2n compiler (also described in that paper). For the final step, nlogn, ignore the paper and use this thread instead.

Same for the tracer itself: Start with mono ray traversal and triangle intersection, and proceed to packet tracing once you're happy. I'm sure many people on this forum will assist where neccessary. Wald's thesis contains tons of info to get a fast tracer going in no time. (where no-time is defined as 500+ hours of spare time, I'm afraid).
_________________
--------------------------------------------------------------

Whatever

(L) [2006/05/04] [Shadow007] [Fast fast kd-tree building] Wayback!

Yes, That's more or less the approach I was going to take ... except I'll probably at first use a few shortcuts : since you told you'd post the code of your KdTree Compiler by the end of the week, I guess I'll wait for that [SMILEY Smile]

As for the tracer itself, I'll largely improvise around the RayTracer7 project ...

(L) [2006/05/04] [Phantom] [Fast fast kd-tree building] Wayback!

Hm... If you're just interested in the OGRE integration, you are welcome to take a more recent build of my compiler & builder.
_________________
--------------------------------------------------------------

Whatever

(L) [2006/05/04] [Shadow007] [Fast fast kd-tree building] Wayback!

Yes, the fact is I DON'T have 500+ free hours ... so I'll trye to integrate into ogre, and learn all I can in the process.

(L) [2006/05/04] [Phantom] [Fast fast kd-tree building] Wayback!

OK, that will have to wait till Monday, I don't have the stuff on me right now. There will be some syncing to do, as I am working on a very bare tracer right now. A more functional tracer is somewhere on a harddisk, it's quite fast and accepts the output of my current compiler.

I'll send everything over early next week.
_________________
--------------------------------------------------------------

Whatever

(L) [2006/05/04] [Ho Ho] [Fast fast kd-tree building] Wayback!

Could you upload the compiler and tracer somewhere so others could play around with it too?

I would love to have something to try on my 2x3.6G P4. Trying to multithread it should be an interesting thing to do [SMILEY Smile]
_________________
In theory, there is no difference between theory and practice. But, in practice, there is.

Jan L.A. van de Snepscheut

(L) [2006/05/04] [Phantom] [Fast fast kd-tree building] Wayback!

OK, I'll make it a public release. It will be slightly behind current developments, but should still be a good reference.
_________________
--------------------------------------------------------------

Whatever

(L) [2006/05/05] [Shadow007] [Fast fast kd-tree building] Wayback!

We'll be in your debt forever [SMILEY Smile]

(L) [2006/05/05] [Phantom] [Fast fast kd-tree building] Wayback!

No mr Bond, I expect you to port the code to OGRE. [SMILEY Smile]
_________________
--------------------------------------------------------------

Whatever

(L) [2006/05/06] [Phantom] [Fast fast kd-tree building] Wayback!

I'll construct a small demo package so we can compare performance on actual hardware. If Father Christmas is right, and a 2.0 opteron could be compared to 2.8 Ghz P4, then your 2.6 opteron should be significantly faster than my P4. But let's quit the speculation. I'll drop off a package in the morning.
_________________
--------------------------------------------------------------

Whatever

(L) [2006/05/06] [tbp] [Fast fast kd-tree building] Wayback!

Technically it's the early morning, foo [SMILEY Shocked]

For once i'm not awake because of insomnia but cuz i got a brand new nephew, eh.

I'm really not quite sure general benchmarks can be extrapolated to what we do, it's a special workload and one for which i'd speculate your chip is good at.

(L) [2006/05/06] [Phantom] [Fast fast kd-tree building] Wayback!

Congrats on your new nephew. [SMILEY Smile]

Do you have an ftp somewhere that I can use?
_________________
--------------------------------------------------------------

Whatever

(L) [2006/05/06] [tbp] [Fast fast kd-tree building] Wayback!

Thx.

You can use the scene upload thing i've set up, i'll re-upload everything on this host on your call.

(L) [2006/05/06] [Phantom] [Fast fast kd-tree building] Wayback!

That took a while, the site is painfully slow this night.

File uploaded, look for perftest.zip in the scene upload dir.

Instructions: Start as-is for scene6, or edit scene.txt:

On the first line, use scene 11 or 14 for the kitchen or the fairy forrest.

Looking forward to the results!
_________________
--------------------------------------------------------------

Whatever

(L) [2006/05/06] [tbp] [Fast fast kd-tree building] Wayback!

Grr. Values are bouncing around like crazy, and your font is hard to read (i'm mostly blind after looking into too many shiny pix).

Are you using rdtsc by any chance? If so, then you should consider you've already received a slap on the wrist.

(L) [2006/05/06] [tbp] [Fast fast kd-tree building] Wayback!

Forced it to behave. Still a bit bouncy bouncy tho.

That's not the canonical camera - as i remember it - but let's say that i get a peak at about 9.9 fps on scene6.

<hands the speed king crown>

Still i'm a bit surprised by the under performance of your rig.

EDIT: Faery results, 3.4 fps but some problems with geometry (not an aliasing issue)

...clicky...

[IMG #1 ]

EDIT²: The butterfly also has missing triangles. Want me to upload it anyway?

[IMG #1]:

(L) [2006/05/06] [Phantom] [Fast fast kd-tree building] Wayback!

Timing: I'm using GetTickCount.

Geometry problems: These are packets with incoherent direction signs. Didn't bother fixing that; I just took out my zig-zag renderer yesterday (it's way slower than straight tile rendering). Using the zig-zag renderer there are no 'geometry issues'.

Speed crown: I guess you can imagine how nice this is after chasing your tail for almost a year. [SMILEY Wink] I suppose it won't last long?
_________________
--------------------------------------------------------------

Whatever

(L) [2006/05/06] [tbp] [Fast fast kd-tree building] Wayback!

GetTickCount? Gasp.

(L) [2006/05/06] [Phantom] [Fast fast kd-tree building] Wayback!

You're almost 20% behind on scene6. That's about what I claimed for perfect clipping in my compiler. You should catch up as soon as you fix your compiler. That will increase your tree sizes though. [SMILEY Smile]

I'm proceeding to multi core rendering now.

And o yeah I should get some sleep.
_________________
--------------------------------------------------------------

Whatever

(L) [2006/05/06] [Ho Ho] [Fast fast kd-tree building] Wayback!

I got the test to run, sort of.

P4 [LINK mailto:2.8@3.6 2.8@3.6]. I'm not sure about the memory speed, it was a long time ago when I last restarted.

I got 11FPS in scene6. Other two exited with "page fault on write access", probably something wrong with wine. They crashed before they could write to kdlog. Perhaps emulating 32bit windows apps on 64bit Linux is not that simple [SMILEY Smile]

When you got 9FPS at 2.8G and I got 11 at 3.6 that makes FPS increase of ~22% and clock increase about 28%. I assume the program scales linearly with CPU speed, especially as the scene fits to L2. That means wine looses only about 6% compared to native speed. Not too bad [SMILEY Smile]
_________________
In theory, there is no difference between theory and practice. But, in practice, there is.

Jan L.A. van de Snepscheut

(L) [2006/05/06] [tbp] [Fast fast kd-tree building] Wayback!

In order to keep vandals from anhilating my precious ADSL bandwidth...

[LINK http://ompf.org/alpha/bikker/perftest.zip]

Jacco, yell if that's not meant to be published.

(L) [2006/05/06] [Phantom] [Fast fast kd-tree building] Wayback!

No that's fine, although it needs a small fix, it exceeds some array bounds (which probably causes Wine to whine).
_________________
--------------------------------------------------------------

Whatever

(L) [2006/05/06] [tbp] [Fast fast kd-tree building] Wayback!

Fix, upload and then shout in my general direction: "ARUGA! ARUGA! DIVE, DIVE!

(L) [2006/05/22] [Phantom] [Fast fast kd-tree building] Wayback!

Hm, that could work. When a start event is encountered and it's associated prim is clipped, it's easy to remove that particular start event. Removing the matching end event is tricky, as you don't have a prev pointer, as you pointed out. So that would require a nasty loop 'till the end part is found; for an average scene that would be a short list walk (except for some huge prims). Since the new position for both the start and end event will be later than the current position (list pointer is at start event), this should work, but it does require a rather nasty test to the main loop: For each event we process, we need to see if there's anything to be re-inserted; in a worst case scenario this list contains several primitives that all need to be evaluated.

And then there's one more problem: The other axii. We are not walking those lists, so there's no way we can re-insert clipped events there.
_________________
--------------------------------------------------------------

Whatever

(L) [2006/05/22] [Shadow007 not logged] [Fast fast kd-tree building] Wayback!

Thanks for the reply Phantom.

Still, it's not on the major axis that there is a problem, It's on the other ones ...

(L) [2006/05/22] [tbp] [Fast fast kd-tree building] Wayback!

I'm in the process of streamlining my rtrt before making a release and it's taking way more time than envisioned. I've was tinkering with kdlib part the other day and got an older version to compile. Yay.

I'll see if i can polish & carve it out in the next couple of days, because at that point of the discussion implementation details matter.

(L) [2006/05/31] [Shadow007 not logged] [Fast fast kd-tree building] Wayback!

I've got a problem while experimenting with Phantom's KDTree Compiler : Sometimes, when clipping a "not planar" triangle, it gives a flat bounding box (for an axis min[a] == max[a]). Thus when I reinsert that in the sorted list, we have for a given primitive END before START, which messes up the KdTree Generation (and the code as well).

Does anyone have an idea what I should do with the "flattened" primitives ? Should I convert them to PLANARs ? Should I force the Max to be after thebeginning (by adding EPSILON for example) ? Any other Ideas ?

(L) [2006/06/01] [Phantom] [Fast fast kd-tree building] Wayback!

Is that happening with the latest compiler?
_________________
--------------------------------------------------------------

Whatever

(L) [2006/06/01] [Shadow007 not logged in] [Fast fast kd-tree building] Wayback!

It happens with rtStage2_may_9th_2006 ... Were there changes to the next versions ? I thought the changes were mostly to the raytracing ...

(L) [2006/06/01] [Phantom] [Fast fast kd-tree building] Wayback!

Yes that's true. It might explain some of the problems I had with the transformed kitchen scene. I'll look into it; right now I'm exceptionally busy, next week there will be time for ray tracing again.
_________________
--------------------------------------------------------------

Whatever

back