Fast kd-tree construction / adaptive error-bounded heuristic

(L) [2006/08/19] [tbp] [Fast kd-tree construction / adaptive error-bounded heuristic] Wayback!

Couldn't resist quickly poking through the paper, so let's anchor the discussion here.


I have trouble figuring out the details of the "adaptive error-bounded heuristic", so i'll put that aside for now and go on with some remarks.


I don't know why i had the a priori notion of thinking about the scan in terms of multiple AABBs vs samples, when it's all about widening how many samples you can handle at once.

So you need 2 registers to contain the splatted low/high bounds of the AABB, and then 3 registers per 4 sampling locations (position, Cl, Cr); with 16 registers, you can keep 16 samples live and get away with 2 scratch registers. That matches what's stated in the paper. Now, the sampling locations being constant, it should be possible to use comparisons vs memory, reclaim 4 registers and then do at least 20 samples at once, if not 24. I suppose it was tried and didn't work for some reason. Could someone clue me in a bit? [SMILEY Smile]
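For reference, here's a scalar sketch of the scan being discussed: one streaming pass over the primitive AABB extents along one axis, updating left/right counters for a fixed set of sampling locations at once. All names are mine, and the comparison directions are one plausible convention, not the paper's code; the SIMD version would keep `sample`, `cl` and `cr` in registers (3 per 4 samples) plus the two splatted lo/hi registers.

```cpp
constexpr int NS = 8;            // samples held "live" per pass (illustrative)

struct SampleScan {
    float sample[NS];            // constant sampling locations along the axis
    int   cl[NS];                // # of AABBs reaching left of each plane
    int   cr[NS];                // # of AABBs reaching right of each plane
};

// lo/hi are one primitive's AABB extent along the scanned axis;
// call this once per AABB to update all NS counters in a single pass.
inline void scan_aabb(SampleScan& s, float lo, float hi) {
    for (int i = 0; i < NS; ++i) {
        s.cl[i] += (lo < s.sample[i]);   // starts left of the plane
        s.cr[i] += (hi > s.sample[i]);   // ends right of the plane
    }
}
```

The inner loop is branch-free (the comparisons produce 0/1), which is what makes the SSE mapping with per-sample counter registers straightforward.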


There's also some talk about parallelization, more precisely distributing the other axes of the same node to other CPUs. Distributing sub-trees works quite well, has less overhead and no locality issues. Granted, it's not fine-grained enough when handling massive nodes, but it's not mentioned at all. Any reason?


I'd better sleep over it for now [SMILEY Razz]
_________________
May you live in interesting times.

[LINK https://gna.org/projects/radius/ radius] | [LINK http://ompf.org/ ompf] | [LINK http://ompf.org/wiki/ WompfKi]
(L) [2006/08/19] [Shadow007] [Fast kd-tree construction / adaptive error-bounded heuristic] Wayback!

Just a simple question: how is it computed? I guess I was dreaming when the teacher was speaking about it... years ago...
(L) [2006/08/19] [Phantom] [Fast kd-tree construction / adaptive error-bounded heuristic] Wayback!

I had some fun reading that paper. Absolutely great stuff! And simple to implement, which is nice as well. [SMILEY Smile]


Some notes / things I wonder:


- If they are only considering AABBs, the trees will be of lesser quality than trees constructed using full clipping. Did they compare to a full-clipping compiler, or to AABB compilers? If the latter, the quality difference is likely to be more than 3.6%.

- They didn't use parallel processing. Does that mean that a 180k scene could be built in 0.25 / 2, or am I being optimistic here?

- Even 0.125 for a 180k scene is easily beaten by BIH... And still no-one produced a top-quality BIH 'compiler'.

- The balance has become a lot more complicated. BIH is now only about twice as fast as kd/sah, but this fast kd/sah renders slower than the fastest kd/sah, though still faster than BIH... And there is of course still the memory issue. I think the right approach needs to be picked on a case-by-case basis now, which is obviously not nice.

- I really like the fact that this new idea is so easy to implement. I'll give it a shot ASAP.
_________________
--------------------------------------------------------------

Whatever
(L) [2006/08/19] [tbp] [Fast kd-tree construction / adaptive error-bounded heuristic] Wayback!

Even without putting too much credence into the efficiency claims (3.6% etc.), that's a really nice, brutal & KISS-compliant solution to the scoring part of the problem, which is largely orthogonal to the rest, like actually producing a split (modulo the fact that it's faster to do that when everything's sorted).
(L) [2006/08/22] [tbp] [Fast kd-tree construction / adaptive error-bounded heuristic] Wayback!

Before trying weirder stuff i've tried to establish a reference timing to bench against, and i've hit what looks like a wall at 20 cycles per AABB for 8 samples & 32-bit code on my k8. That's when everything fits in caches and AABBs are accessed without any indirection. When doing a full scan of the buddha, it drops to 40 cycles.


I'd be interested to hear what others have come up with on other architectures.
_________________
May you live in interesting times.

[LINK https://gna.org/projects/radius/ radius] | [LINK http://ompf.org/ ompf] | [LINK http://ompf.org/wiki/ WompfKi]
(L) [2006/08/22] [Phantom] [Fast kd-tree construction / adaptive error-bounded heuristic] Wayback!

How does 20 cycles translate into ms for an entire build of, say, a 100k triangle model?
_________________
--------------------------------------------------------------

Whatever
(L) [2006/08/22] [tbp] [Fast kd-tree construction / adaptive error-bounded heuristic] Wayback!

There are too many factors to extrapolate up to entire-build figures (even if that could be a bottleneck), and that's not the point. There are ways to manage more samples at once, so the problem is more about deciding where to put the trade-offs. Hence this attempt to have a fastest/best-case to bench against.
_________________
May you live in interesting times.

[LINK https://gna.org/projects/radius/ radius] | [LINK http://ompf.org/ ompf] | [LINK http://ompf.org/wiki/ WompfKi]
(L) [2006/08/23] [tbp] [Fast kd-tree construction / adaptive error-bounded heuristic] Wayback!

Over lunch, i've played a bit more with the sampling part because of some funny results. Got 2 rules of thumb (for k8 at least): there's a 10-cycle price tag per 4 sampling points (up to a point), and a linear scan of uncacheable data is only 2x more expensive. That last bit makes me question the choice of an indirection vs a copy in the paper.


I can't easily plug that into my framework so now i guess i have to write the rest of the machinery.
_________________
May you live in interesting times.

[LINK https://gna.org/projects/radius/ radius] | [LINK http://ompf.org/ ompf] | [LINK http://ompf.org/wiki/ WompfKi]
(L) [2006/08/23] [tbp] [Fast kd-tree construction / adaptive error-bounded heuristic] Wayback!

As i suspected part of the 10-cycles-per-4-samples barrier on my k8 was due to float/integer mismatches, i pondered an all-integer approach:

- we only need to know about the AABBs' respective ordering.

- we need the full triangle anyway for proper clipping.

- the lexicographical transformation is reversible, if need be.



Trouble is that there are only 2 SSE/MMX integer comparison primitives, eq & gt, and they only come in a signed flavour. Gah.
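For the curious, here's one way (a sketch, not the code above) to make the reversible lexicographic transform play nicely with signed-only compares: IEEE-754 floats are sign-magnitude, so after bit-casting, negative values have to be remapped for the result to order correctly under signed integer comparison, which is all that pcmpgtd gives you.

```cpp
#include <cstdint>
#include <cstring>

// Map a float to a 32-bit signed integer whose signed ordering matches
// the float ordering. Reversible except that -0.0f and +0.0f collapse.
inline int32_t float_to_ordered(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);           // bit-cast, no conversion
    int32_t s = static_cast<int32_t>(u);
    // Positive floats keep their bits (already increasing with value);
    // negative floats become minus their magnitude, so that larger
    // magnitudes sort lower, as they should.
    return (s >= 0) ? s : -(s & 0x7FFFFFFF);
}
```

Done once up front over the AABB extents, all subsequent comparisons can stay in the integer (or MMX) domain.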


Anyway, it works and even a crappy implementation shaves 1 cycle [SMILEY Wink]

Note that it opens up access to the MMX register file without penalty.


PS: the type-mismatch penalty also exists on Intel's latest CPUs.
_________________
May you live in interesting times.

[LINK https://gna.org/projects/radius/ radius] | [LINK http://ompf.org/ ompf] | [LINK http://ompf.org/wiki/ WompfKi]
(L) [2006/08/25] [Michael77] [Fast kd-tree construction / adaptive error-bounded heuristic] Wayback!

I don't know if anyone has implemented the suggested approximation yet, but maybe we could discuss it a little bit. I am still having a little trouble understanding it completely, and especially applying it to BIH (more on that later).


The paper talks about a piecewise quadratic approximation, but I am not sure what exactly this means. My current idea of it is the following:


You have four inputs: CountLeft, CountRight, SALeft and SARight. Using your test samples, you can do a linear interpolation quite nicely between them using


f(x) = f0 + (f1 - f0) / (x1 - x0) * ( x - x0)


with f0 and f1 being the sampled values and x0 and x1 being the split candidates that were sampled. Now, if you do this for all four inputs, you can compute an approximation of the cost function in that segment and find a minimum for that approximation (which is stated as being a possible solution, although not a good one).


~cost(x) = ~countLeft(x)* ~SALeft(x) + ~countRight(x) * ~SARight(x)


So to get up and running, I just tried this with a kind of brute-force approach, approximating the global minimum by taking a lot of samples of this approximated cost function based on 8 initial samples. For some reason, taking the approximated minimum would always result in a tree worse than just taking the minimum of the sampled positions.
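For concreteness, this is a minimal sketch of the segment-wise approximation described above (the `Sample` struct and all names are illustrative, not from the paper): between two sampled split candidates the four inputs are interpolated linearly, so their product-sum, the approximate cost, is quadratic in x on that segment.

```cpp
struct Sample {
    float x;               // sampled split position
    float countL, countR;  // primitive counts left/right of x
    float saL, saR;        // surface areas left/right of x
};

// f(x) = f0 + (f1 - f0) / (x1 - x0) * (x - x0)
inline float lerp(float f0, float f1, float x0, float x1, float x) {
    return f0 + (f1 - f0) / (x1 - x0) * (x - x0);
}

// ~cost(x) = ~countL(x) * ~SAL(x) + ~countR(x) * ~SAR(x), valid on [a.x, b.x]
inline float approx_cost(const Sample& a, const Sample& b, float x) {
    return lerp(a.countL, b.countL, a.x, b.x, x) * lerp(a.saL, b.saL, a.x, b.x, x)
         + lerp(a.countR, b.countR, a.x, b.x, x) * lerp(a.saR, b.saR, a.x, b.x, x);
}
```

Since each term is a product of two linear functions, `approx_cost` is exactly a quadratic on each segment, which is presumably what "piecewise quadratic" refers to.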


So am I totally wrong with this kind of approximation? Or are 8 samples simply not enough, even on smoothly varying scenes like the bunny? Or is the BIH much more difficult to approximate with this approach?


I guess the last one may be true, since the BIH might be more ill-behaved: the SA might change a lot when a single triangle moves from one side of the splitting plane to the other, while this doesn't matter that much in a kd-tree. So while the kd-tree only has the triangle count where large differences between two split positions may occur, the BIH has both the triangle count and the SA as potential trouble spots. So my guess is that computing the local minimum will not help at all; just testing some additional samples in certain areas (which?) will.
(L) [2006/08/25] [tbp] [Fast kd-tree construction / adaptive error-bounded heuristic] Wayback!

It's not clear to me from your description whether or not you're refining the first approximation (done on regularly spaced samples) by attributing/binning more samples in the 2nd round where the error is greatest.
_________________
May you live in interesting times.

[LINK https://gna.org/projects/radius/ radius] | [LINK http://ompf.org/ ompf] | [LINK http://ompf.org/wiki/ WompfKi]
(L) [2006/08/25] [Lotuspec] [Fast kd-tree construction / adaptive error-bounded heuristic] Wayback!

Doing a piecewise linear approximation through your sample points is useless, no?

As far as I can see, you will always end up in one of your sample points, so you need quadratic (or higher-order) interpolation to get some improvement.
(L) [2006/08/25] [knallbunt] [Fast kd-tree construction / adaptive error-bounded heuristic] Wayback!

Hi, I'm new here  [SMILEY Smile]


Wouldn't it be better to calculate the global minimum by taking the minimum of the minima of every piecewise quadratic function (obtained by differentiating the function)? Seems to work quite well for me - or is there a better approach?
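That per-segment derivative idea can be sketched like this (assuming the quadratic coefficients have already been derived from the interpolated inputs; all names are illustrative): on each segment the approximate cost is q(x) = A·x² + B·x + C, its derivative q'(x) = 2A·x + B vanishes at x = -B / (2A), and that stationary point is taken when it's a minimum inside the segment, otherwise the endpoints win. The global candidate is then the minimum over all segments.

```cpp
#include <algorithm>

struct Quadratic { float A, B, C; };

// Horner evaluation of q(x) = A*x^2 + B*x + C.
inline float eval(const Quadratic& q, float x) {
    return (q.A * x + q.B) * x + q.C;
}

// Minimum of q over [x0, x1], using the derivative q'(x) = 2*A*x + B.
inline float min_on_segment(const Quadratic& q, float x0, float x1) {
    float best = std::min(eval(q, x0), eval(q, x1));
    if (q.A > 0.0f) {                        // opens upward: interior minimum possible
        float xs = -q.B / (2.0f * q.A);      // stationary point of q
        if (xs > x0 && xs < x1)
            best = std::min(best, eval(q, xs));
    }
    return best;
}
```

Checking the endpoints as well as the stationary point keeps the result correct when the parabola opens downward or the vertex falls outside the segment.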


I would guess that for the SAH to work well for BIH, one has to change the cost function a bit (perhaps introducing a penalty for overlapping intervals, etc.).
(L) [2006/08/25] [Michael77] [Fast kd-tree construction / adaptive error-bounded heuristic] Wayback!

Yes, I will try that. But for now, I just implemented a refinement step where additional samples are taken to the left and to the right of the previous minimum. This at least gave me a slightly smaller tree, although MRays/s were not improved very much. I will add the binning approach later; I need to do something about the build speed first (currently it takes 280 ms to build the bunny, 516 ms to build a 140k model).
(L) [2006/08/25] [vitamin] [Fast kd-tree construction / adaptive error-bounded heuristic] Wayback!

Hi,


I am also new here [SMILEY Smile]


There is another paper on the topic that does something similar

[LINK http://www.mpi-inf.mpg.de/~popov/publications/fast_kdt/index.html]


It uses 1024 regularly placed samples that come at no extra price and does not try to fit any quadratics.


Maybe you should have a look at it.


The method described there can also easily be extended to BIHs
(L) [2006/08/26] [tbp] [Fast kd-tree construction / adaptive error-bounded heuristic] Wayback!

Here's a warm welcome to our new friend from the cold [SMILEY Smile]


Really interesting paper; i'd need to ruminate on it a bit more and i'm under-caffeinated, but...


There's nothing that forbids massaging stuff in a streaming fashion in AEBH (mostly to get rid of the cache-refill bottleneck); in fact, that's exactly what i'm opting for.


Also, there's only an 8/16-samples-per-pass limit if you insist on having everything in live registers. So you have to ask whether that makes sense.

At least on my k8 the answer is definitely no. I have my fastest 32-bit, 8-sample, no-spill scan (all in live regs) beaten by one with many reloads in between: it's easier to break dependency chains and so on.

Therefore the 8/16-samples-per-pass limitation is artificial and can be arbitrarily tweaked. So now the question really is how much is good enough.


EDIT: i'm a bit puzzled by your numbers for the parallel builds of the Faery Forest scene, because i can build that scene in about the same time with my n.log(n) compiler on 2 CPUs.


EDIT²: i should have stressed that i'm mostly waving my hands here; i don't have a complete solution to substantiate my claims [SMILEY Smile]
_________________
May you live in interesting times.

[LINK https://gna.org/projects/radius/ radius] | [LINK http://ompf.org/ ompf] | [LINK http://ompf.org/wiki/ WompfKi]
