MLRTA implementation
(L) [2006/02/07] [Phantom] [MLRTA implementation] Starting a new topic, as 'started again' is not really inviting.
My current implementation has the following:
- Kd-tree is optimized for MLRTA (empty space cut-off early in the tree);
- Frustum/node clipping and traversal is implemented;
- Code now first looks for a non-empty leaf:
  * Branches are always followed to the left
  * Bifurcation stack is filled (at each branch)
- Once the first leaf is found, it is declared the first candidate EP;
- From there on, the bifurcation nodes are fetched (lifo style);
- Kd-tree traversal starts from scratch from the 'right' pointer of each bifi node;
- When a filled leaf is found, the fetched bifurcation node is the new candidate.
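The left-first search with a bifurcation stack described above can be sketched in plain C. This is a toy version under stated assumptions: the node layout and all names are illustrative (not Phantom's actual code), the frustum clipping is omitted, and the first leaf is assumed non-empty.

```c
#include <stddef.h>

/* Illustrative node: leaves have both children NULL; tri_count > 0 = filled. */
typedef struct Node { struct Node *left, *right; int tri_count; } Node;

/* Does this subtree contain at least one filled leaf? (The real code would
   additionally clip every node against the frustum before recursing.) */
static int has_filled_leaf(const Node *n)
{
    if (!n) return 0;
    if (!n->left && !n->right) return n->tri_count > 0;
    return has_filled_leaf(n->left) || has_filled_leaf(n->right);
}

/* Phase 1: always go left, pushing each branch (bifurcation) on a stack;
   the first leaf reached is the first candidate EP.
   Phase 2: pop bifurcations LIFO and restart from each 'right' pointer;
   whenever that subtree also holds a filled leaf, the popped bifurcation
   becomes the new (shallower) candidate. */
static const Node *find_entry_point(const Node *root)
{
    const Node *stack[64];
    int sp = 0;
    const Node *n = root;
    while (n->left) { stack[sp++] = n; n = n->left; }
    const Node *ep = n;                       /* first candidate EP */
    while (sp > 0) {
        const Node *bifi = stack[--sp];
        if (has_filled_leaf(bifi->right)) ep = bifi;
    }
    return ep;
}
```

If the frustum touches geometry on both sides of a bifurcation, the candidate rises to that bifurcation; if only the deep left leaf is filled, the EP stays at the leaf.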
The frustum is represented by the four corner rays, not planes. This makes calculating the intersection points with the split planes easier. The bounding box for each node is tracked by incrementally updating the scene bounding box using the split plane.
Code is in pure C, but prepared for SIMD.
Final leaf node culling has been implemented, but is disabled: In pure C, this is not worth it.
Code to check for full occlusion by a triangle in a leaf is also in place and working, but this needs extra tests: The leaf node itself must also contain the full frustum, so a large triangle that touches the leaf and obstructs a frustum that also just touches the leaf is not going to be accepted.
That's the current state - And performance is disappointing. Right now, only a couple of nodes are saved for each tile (16x16 pixels), so that's not going to boost performance very much. I can see a couple of reasons:
1. C code: Moving to SIMD would help, but if you shave off only 2 traversal steps out of 30, it's never going to be very fast.
2. Leaf node clipping: Implemented this; didn't cut off many extra nodes.
3. Better kd-tree: I suspect that my compiler is not cutting off empty space from leaves. I'll fix that. This could help substantially.
4. The mysterious remark in the paper: Processing is terminated when the leaf is inside a watertight volume, or if the frustum ends in a leaf and is completely obstructed by one of the primitives. No idea how I could do this efficiently. Besides, the idea of adding occluders by hand is not very attractive.
Next on my list is the SIMD version, so I can at least get my 10% speed boost (important in our friendly competition), and improvements to the kd-tree compiler so the thing actually makes a difference. [SMILEY Smile]
- Jacco.
(L) [2006/02/08] [Phantom] [MLRTA implementation] It's now fully optimized SIMD, and still no luck. The only difference is that now VC2005 won't eat my code anymore; it crashes and advises me to 'simplify the code' (I am not kidding). ICC still compiles it fine though, under VC6. You can't plug the latest ICC into VS2005... Dude.
Tonight I will rewrite my kd-tree compiler, just for fun.
(L) [2006/02/08] [Phantom] [MLRTA implementation] OK, I think I found the problem.
First, this evening it struck me that I can actually test for triangles that fully occlude the frustum, and use the leaf they are in as a starting point, but only if I reach that leaf by traversing the kd-tree near-to-far. So far, I was just taking the left branch all the time, searching for filled leaves. Code changed; this time it always takes the near node first, and it places the far node on the stack.
Next, I put back my frustum/triangle intersection code. It doesn't work at all. And to make things worse, it turns out that all triangles of the first leaf that I hit are BEHIND the frustum. Now that's interesting info. How did traversal end up there?
The answer is simple: I am traversing by starting at the root, projecting intersection points onto the split plane, and testing whether these points are inside the bounding rectangle of the split plane inside the current node. If they are, I follow both branches.
That's wrong. I mean, if the frustum starts inside a node (typical situation for indoor scenes), the projection of the intersection points may very well be inside the aabb of the current node, yet you definitely want to skip the node behind you. I'm now going to figure out how that can be done. [SMILEY Smile]
(L) [2006/02/09] [tbp] [MLRTA implementation] I for one welcome our new MLRTA overlord [SMILEY Smile]
Even with your code under my nose, it's still not working here (properly i mean). As i'm anything but helpful here, i'll shut up for a change.
(L) [2006/02/09] [Phantom] [MLRTA implementation] Hm, I really need that kd-tree rewrite. I did some new visualizations, and it's a disaster. Here's a picture:
[IMG #1 ]
Red squares are visited nodes (this is Sponza, btw), blue nodes are visited leaves that contain triangles, and the white dots are the intersection points of the frustum with the splitting planes.
Couple of observations:
- It's clearly visiting the right nodes, and apparently not too many either. So the clipping of nodes is working, the intersection point calculation is working also.
- The nodes with triangles are huge. I bet there's a ton of empty space in those... That's not good at all. This causes the code to find leaves with triangles all over the place, and that means an EP high in the tree is inevitable. It should have found a single leaf in this particular case; you're looking at the visualization of a ray that goes all the way through the open space from the left side of Sponza to the right.
So the obvious route from here is to stop whining about MLRTA not working and so on and to fix my kd-tree compiler. It obviously sucks and that must hurt rendering in general, not just MLRTA.
[IMG #1]:Not scraped:
https://web.archive.org/web/20061004031141im_/http://www.bik5.com/traversal.jpg
(L) [2006/02/09] [Phantom] [MLRTA implementation] Oh, one other thing: It's now properly ignoring nodes behind the camera. So I did something right today at least.
(L) [2006/02/10] [Guest] [MLRTA implementation] Hi, Phantom,
1. I'd recommend looking more closely at the traversal algorithm: it should always be going from near to far (increasing distance along the direction of rays starting from the origin). It requires analysis of the direction x, y, z signs and storing the correct offset increment for "left" and "right" children before starting the traversal. This implies that all rays in the frustum should be directionally coherent. We are now experimenting with omnidirectional traversal, but it is too early to report any results.
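That sign test can be sketched minimally (node layout and names are my own, for illustration): the index of the near child depends only on the sign of the direction component on the split axis, so for a directionally coherent packet it is computed once per axis before traversal starts.

```c
/* Toy kd-node: child[0] is the lower-coordinate ('left') child. */
typedef struct KdNode { struct KdNode *child[2]; float split; int axis; } KdNode;

/* For dir[axis] >= 0 the lower child is entered first, otherwise the upper
   one. For a packet this requires all rays to share the direction signs. */
static KdNode *near_child(const KdNode *n, const float dir[3])
{
    int far_idx = (dir[n->axis] < 0.0f) ? 0 : 1;
    return n->child[far_idx ^ 1];
}
```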
2. Using 4 corners instead of planes could potentially lead to problems. There are 2 reasons. The first is precision: although mathematically it should look OK (corner rays are extrema of a convex pyramid), in reality rays between the corners on the frustum planes are not exactly following those planes and are not necessarily inside the frustum. The second reason is that the far side of the frustum is being cut by split planes, so the whole construct becomes non-convex and you cannot use corner rays for conservative distance estimation:)
3. Revealing the mystery about watertight and proxy occluders. We were thinking about 2 cases:
a) Watertight: a solid 3D object (we did not use the term solid because one could think it is not applicable to deformable objects, although at any given frame a 3D deformable is a solid from the point of view of rendering); by this we mean a 3D object bounded by a closed 2D manifold without boundary (any topologist here? [SMILEY Smile]). The idea is to mark leaves located inside such objects by setting a one-bit attribute as "occluder". If the whole frustum sticks there then... It requires knowledge about the nature of the object.
b) Proxy occluder: this is more complex but a rather general idea. Assume we have a relatively smooth surface and some voxel (not even necessarily a leaf) cuts off a piece of such surface with no holes in it. The triangles are relatively small (except walls, floors and ceilings, but this is a different story), so you wouldn't get benefits from testing the frustum against each individual triangle. So we instead compute two non-AA planes bounding the piece of surface as tightly as possible (any simple-to-optimize criterion for "tightness" is OK here) and store them together with a tree cell. If the intersection of the frustum with those 2 bounding planes is inside the cell, then...
4. Tree quality is an essential thing. Having many triangles inside some of the leaves is OK and unavoidable (e.g. a triangle fan), but the probability of hitting such a leaf should be infinitesimally small. Usually what a good tree would have is lots of leaves with empty space, lots of leaves with a single triangle, a number of very-very small leaves along common edges (2 triangles in a leaf) and a number of very-very small leaves around common vertices.
These are some basic ideas; we just were not able to describe all the details in the 12-page limit given by the SIGGRAPH paper guidelines.
(L) [2006/02/10] [Phantom] [MLRTA implementation] Hello russian_hacker,
I activated your account manually, and sent you an e-mail about that, but it bounced.
Anyway, you should be able to log in now.
About your tips:
1. I've changed my traversal code to do that. Latest news is that basically I have a bad kd-tree compiler; it visits leaves with triangles all over the place even when I expected the frustum to move through empty space. I'm rewriting the compiler right now.
2. I thought you were using the corner rays yourself? For each visited node, I calculate the intersection points of the four corner rays with the split plane. Then I take the min and max of these points wrt the current axis, and compare this range to the aabb of the current node. This way I should never miss a node. So I am using separate rays, but comparing takes place using the range of values between the intersection points.
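A sketch of how that range test might look, using the two axes that span the split plane (names and layout are illustrative guesses, not Phantom's code). Note the caveat from earlier in the thread: a negative t (split plane behind the ray origins) must be handled separately, or nodes behind the camera get visited.

```c
#include <float.h>

typedef struct { float o[3], d[3]; } Ray;

/* Intersect the four corner rays with the axis-aligned plane 'split' on
   'axis', take the min/max of the hit points on the two remaining axes and
   compare that range against the node's bounds on the plane. Overlap means
   the frustum may reach the child behind the plane. */
static int frustum_hits_plane_rect(const Ray corner[4], int axis, float split,
                                   const float bmin[3], const float bmax[3])
{
    int u = (axis + 1) % 3, v = (axis + 2) % 3;
    float umin = FLT_MAX, umax = -FLT_MAX, vmin = FLT_MAX, vmax = -FLT_MAX;
    for (int i = 0; i < 4; i++) {
        float t  = (split - corner[i].o[axis]) / corner[i].d[axis];
        float pu = corner[i].o[u] + t * corner[i].d[u];
        float pv = corner[i].o[v] + t * corner[i].d[v];
        if (pu < umin) umin = pu;
        if (pu > umax) umax = pu;
        if (pv < vmin) vmin = pv;
        if (pv > vmax) vmax = pv;
    }
    return umin <= bmax[u] && umax >= bmin[u] &&
           vmin <= bmax[v] && vmax >= bmin[v];
}
```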
3. That could probably be done by a preprocessor, although it will be a pretty complicated task. [SMILEY Smile] I see what you mean though.
4. This is quite likely where my problem is. I get an average of 7-10 intersections per ray, which is quite bad compared to Wald's 2 or 3.
Problem with figures from Wald is that you never know which figures are marketing and which figures are real. I tend to believe any figure that I surpassed. [SMILEY Smile]
I am glad that there is so much going on in the field lately. There are some very cool technologies evolving right now.
Thanks for dropping by,
- Jacco.
(L) [2006/02/10] [toxie] [MLRTA implementation] my cup of tea (plz correct me if i'm wrong):   [SMILEY Wink]
i think the whole speed of the implementation is "hidden" in one simple fact:
you don't create rays at all (or only if REALLY necessary = on the lowest subdivision-levels).
you trace planes (=frustums) through the kD-tree that are implicitly given by the image-plane/pixels.
as the traversal with these frustums is extremely simple (paper has super-simple code in it that is used for that),
you can get super-fast to the low levels of the kD. There you 'just' have to intersect the triangles with frustums
which can also be done rapidly for HUGE triangles/planes (see Shaft Culling paper).
So the shading of the results becomes the most expensive part.
But if the rays are placed "random" inside the frustum(s) (because then it's very expensive to split a frustum into 4 smaller ones)
it becomes impossible to use that strategy.
Am i right?   [SMILEY Wink]
(L) [2006/02/10] [Phantom] [MLRTA implementation] Yes, you're right. But since I'm still struggling to get a scene with primary and shadow rays for one or two point lights fast enough to walk around, it looks promising to me. Besides, right now it's forcing me to write a proper kd-tree compiler. I'm relying heavily on Havran's knowledge for that, once again.
(L) [2006/02/11] [russian_hacker] [MLRTA implementation] Hi, toxie,
the answer is yes and no.
You are right, we don't instantiate rays if it is not needed and traverse the frustum through the tree (using planes rather than 4 corners). Once the entry point is found, we instantiate rays and continue traversal from that common node with 16-ray packets using SIMD. But traversing frustums is not that simple if you want to find a deep entry point. That's why we spent so much space in the paper describing various approaches to find and cut away guaranteed non-intersecting nodes. Ultimately, if the entry point is still not good, you would split the frustum into smaller ones and try to find entry points one more time.
I cannot see any big issue if rays are randomly distributed but still inside the frustum. We just find a good entry point for the frustum and then generate as many rays as we need inside that frustum and trace them from the entry point.
For specular rays reflecting from a flat surface you could continue tracing the frustum w/o ray instantiation. For shadow rays for point/spot lights the frustum application is also obvious.
Ambient occlusion and GI is final gathering (all the cost is in there), so frustums should help. Unfortunately, we are now merging that experimental MLRTA codebase with the previous block-based codebase where all the complex shading and GI are, so I cannot report numbers for MLRTA+GI for about a couple of months.
Now, if you think about it, MLRTA is applicable to other bounding constructs rather than just a simple frustum (the simpler the bb-and-construct separation test, the better). We also have not experimented yet with creating a frustum/boundary around existing rays - that could be another MLRTA application area.
So the MLRTA approach has to be used smartly, with fallback to block ray tracing if it doesn't make sense to use it. Ray tracing is not a single function for finding the closest intersection any more. Everybody is convinced that a conventional ray tracer has at least 2 specialized functions: one for intersections and one for occlusion testing. Really good block-based ray tracers (using SIMD) have many more, and nobody complains, because ultimately performance increases to the levels of interactive fps.
Alexei
(L) [2006/02/13] [toxie] [MLRTA implementation] thanx for the great info..
i'm looking forward to seeing performance-snippets for GI, etc. (cause what you have to do actually, is to map
existing rasterizer-approximations to match these ray-shafts (they work on the same principles) as you cannot
point query/sample randomly anymore).
but i doubt the usefulness of creating shafts out of "random" rays and then using MLRTA. I already experimented with
that and (as expected) it's pretty slow as you have to 1.) construct every ray and then do 2.) some kind of sorting on the whole bundle
to build the shafts.
(L) [2006/02/24] [Phantom] [MLRTA implementation] I got it working reasonably well now. There are some scenes that show a very clear advantage for MLRTA over non-MLRTA rendering (~17% performance increase), and this could grow a bit further: First of all, my renderer is suffering (in general) from the lack of support for flat cells. This makes it hard to cut off empty space in corners. Next, I am using a fixed tile size of 16x16 pixels. The algorithm is not very expensive, so fixing this is not going to help much. And finally, I am not using the 'ugly hacks' that russian_hacker proposed: I'm not too fond of adding occluders and bounding surfaces to improve the performance of this algorithm.
What I am observing right now is the following:
- MLRTA works great on huge scenes where you see only a small portion of the total scene. You have a very long 'common traversal sequence' in this case, and MLRTA helps big time.
- This does however require some large occluders. I am testing against leaf triangles to halt traversal, this works fine, but this also means that a huge scene with lots of details in occluders will spoil the fun completely (this is why russian_hacker needs the bounding surfaces, to turn many small tri's into one occluder).
- Reported performance gains (30%-250%) are very optimistic. And, it applies only to those rays that are accelerated by MLRTA. This is hard to do for shadow rays (at least in my setup), so I need to cut possible gains in half (15-125%). It also doesn't apply to the complete process of traversing, intersecting and shading but only to traversal, which takes off a lot of the potential gains.
- When I walk the Sponza balconies, I see a performance boost of about 2%... I have a simple test scene with a chair and some tables (it's in the demo I posted earlier); this one scores 20% tops. But those are huge occluders...
Anyway, the code is in place, it's completely SIMDed (I'm quite sure it's about optimal) and usually it doesn't hurt performance. Then again, it usually doesn't help either. [SMILEY Smile]
I will check again once I got my flat cells. That might help a bit.
_________________
--------------------------------------------------------------
Whatever
(L) [2006/02/25] [Phantom] [MLRTA implementation] BTW, I am now implementing 4x4 packets. The MLRTA dudes mentioned that this is faster than 2x2, and I can think of several reasons why it could be faster and why not, so trying is best. [SMILEY Smile]
(L) [2006/03/03] [Xela] [MLRTA implementation] Hi guys,
Jacco had informed me about this forum a few weeks ago and I thought that it would be a
good idea to jump into the discussions right away; not because I know all the answers,
but simply because I have a bona fide understanding of what these guys who wrote the
MLRTA paper were thinking about [SMILEY Smile]
Then my natural laziness took over. It also was quite interesting to see how the collective
mind works, from initial guesses to some definite answers. Shame on me. As a
penance, I'm going to write a very long posting now [SMILEY Smile]
For the purposes of legal deniability [SMILEY Smile] and to preserve some subtle ambiguity, I'm
choosing the nickname Xela. I'm not going to write individual replies to all relevant
postings, but rather will state my personal take on the important issues in one posting.
1.    Algorithm
Much of the confusion wrt MLRTA is arising, I guess, from my attempt to put all eggs in
one basket. There are 2 completely separate ideas in the algorithm, which somehow got
mangled in the one paper. These are
1.1.    Traversing the kd-tree using a frustum containing all rays in the group.
1.2.    Searching for an entry point (EP) for big groups of rays.
None of these ideas are, of course, new. There are some interesting details though, and
the combination of these may have a positive impact on performance as well.
As a matter of fact, the only reason I used the MLRT acronym was that all the other good
names were already taken (including frustum, pyramidal, and shaft). If I could, I would
use frustum. And here lies a very important difference between MLRTA and Havran's
shafts. Using shafts (== 4 corner rays) you are traversing a tree only to the node in
which the corner rays are parting ways; that is why Havran is calling it the longest
common traversal. MLRTA 1.1, however, is just another traversal algorithm, going all
the way to the relevant leaf nodes. This distinction actually was lost on me during the submission
process, and that is why I corrected myself during the presentation (I also had some
friendly chat with Havran at the conference itself [SMILEY Smile]).
So, 1.1 is just another traversal mechanism for a group of rays. If you have only one ray,
you always know whether it goes through the near, far, or both sub-cells. For a group of rays,
basically there are 2 possibilities:
-    store data about which rays are active at the moment (or compute it on the fly)
-    traverse the group as a whole. This is possible, of course, because there is no harm
     in extra work (ray/tri intersections); you just keep the ones closest to the ray origin.
There are pros and cons for both approaches. In particular, using the first one it is possible to
purge some intersection tests, based on the active mask. For example, if you have a 4x4
group and 4 SSE registers representing row-wise active masks for the 4 columns, one
_mm_movemask_ps op is sufficient to purge ray/tri computations for the whole row, if it
evaluates to 0. BTW, guys, there was some discussion about the benefits of grouping rays as
4x4 or 2x2. There is a significant advantage in using 4x4, but only if you preserve activity
status (pushing and popping it from the stack as well). By using a 4x4 group, anybody
could beat the original Wald paper by at least 2X, even for the same trees.
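The movemask trick just described can be sketched in a few lines (a minimal sketch; the surrounding 4x4 intersector is assumed, not shown):

```c
#include <xmmintrin.h>

/* One SSE register holds the active flags of one four-ray row (a lane of
   all-ones marks an active ray). A single _mm_movemask_ps collects the sign
   bits, so a zero result lets the whole row skip the ray/tri arithmetic. */
static int row_is_active(__m128 row_mask)
{
    return _mm_movemask_ps(row_mask) != 0;
}
```

In the intersector this becomes an early `if (!row_is_active(mask[r])) continue;` before each row's ray/tri computation.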
There is one huge disadvantage in explicitly handling all rays in a very big group
though. You simply will not have enough registers (thanks to you-know-who [SMILEY Smile]). So,
here comes MLRTA, which describes all rays in a group via some geometrical proxy and
traverses this proxy conservatively through the tree. That is, sometimes you can be
absolutely sure that any arbitrary ray within the frustum will never go through one sub-cell.
Then you just ignore this sub-cell (pretty much as a single ray would ignore a non-intersecting
sub-cell).
And yes, Phantom is absolutely right; we're using two axes for these tests. I now revisited
the paper, and somehow it is not that obvious from the text... this submission rush...
However, you should've noticed the words "suppose further" in the text, which meant that the
two axes in the split plane should be used [SMILEY Smile]. With SSE ops, this could be implemented as
2 comparisons per axis (followed by a movemask and a possible early bail-out).
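One possible reading of that "2 comparisons per axis" remark (my guess at the data layout, not the paper's actual code): with the four corner-hit coordinates for one in-plane axis packed into a register, two compares plus movemasks tell you whether all hits fall on the same side of the node's bounds, allowing the child to be culled.

```c
#include <xmmintrin.h>

/* 'u' holds the four corner-hit coordinates on one axis of the split plane.
   If all four are below lo or all four are above hi, the frustum cannot
   touch the node's rectangle on this axis and the child can be culled. */
static int hits_all_outside(__m128 u, float lo, float hi)
{
    int below = _mm_movemask_ps(_mm_cmplt_ps(u, _mm_set_ps1(lo)));
    int above = _mm_movemask_ps(_mm_cmpgt_ps(u, _mm_set_ps1(hi)));
    return below == 0xF || above == 0xF;   /* all four on the same side */
}
```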
This makes the traversal process very effective, but there is also a penalty for this. You may
go into a cell which in fact is not intersecting the frustum (and, accordingly, all
potential rays in it). This could be somewhat alleviated by clipping the frustum against the
leaf box (especially if you are storing the BB of the intersections of all triangles in the leaf with
the leaf's BB, which could be smaller than the leaf's BB itself).
Basically, you could use only 1.1 and forget about the EP search (which could sometimes be
harmful). In fact, there is another (not based on frustum-like processing) group traversal
algorithm in the paper, called interval traversal. It was somehow overlooked in the
discussion despite the fact that a full pseudo-code implementation is given in the paper. It
is good in its own right and beats diligent 4x4 traversal in most cases. IMHO, the interval
algorithm is better suited for relatively small groups, while the frustum is more effective for
bigger groups. In our implementation, we're using the interval algorithm for primary and
coherent shadow groups (after the EP is found). For shadow rays, it is better to cast rays from
the light source as it preserves the common origin and frustum-like concepts as well.
Certainly, talk is cheap, but I completed a required information security training
today and understood that I could not just give you guys the code away [SMILEY Sad]. Seriously, you are
lucky if you are not subject to all this big corporation baloney. Fortunately, there is a very
good remedy. Carsten Benthin (one of the OpenRT guys) has implemented some MLRTA-inspired
approaches in his PhD and he gave me explicit permission to share it with you
folks. The address is [LINK http://graphics.cs.uni-sb.de/~benthin/final.pdf.] Please note though that
the dissertation is not yet defended, so do not distribute it indiscriminately.
Carsten has tons of code snippets in his work, which are sometimes massaged for
readability, but optimization is easy for those fluent in the art [SMILEY Smile]. It is very interesting
reading. It also covers at length issues related to ray coherency, which somehow avoided
discussion in this forum and which are very important.
OK, now back to EP. As implemented in our system, it is basically a conservative
traversal algorithm. That is, if you know for sure that a sub-cell is avoided, you are not
going there. If you somehow could put an upper bound on the distance to intersections, even
better. Then you could use an early termination test during the EP search. This is all related to
all these mysterious "empty occluders" in the paper. If you know for sure that some
empty node is inside an object (that is, you're not supposed to see it), you could use the distance
to this cell as the above-mentioned upper bound. Note that for regular traversal, you will
never reach such a node. You could hit it though during the EP search, as you are not bothered
with regular ray/tri tests. The trouble with "empty occluders" is that finding them is very
hard, because they relate to global scene semantics. It is possible though if normals are
known and objects are watertight. Unfortunately, in most scenes objects are not watertight,
even if they are supposed to be. I also tried an alternative approach of using 2
parallel planes and verifying that any ray intersecting them inside the cell is also
intersecting geometry inside the cell. It works, but the algorithm is ugly and I mean it. This
approach is not reported in the paper but was hinted at during the presentation.
In its simplest form though, the last approach is very easy to implement and use. If you
have axis-aligned triangles completely covering a flat cell, you could use it as a
conservative frustum/geometry test very effectively. Even more, if the texture coordinates for
these triangles are harmonized (basically, mapped to a rectangle), you could substitute
ray/tri tests with ray/AA-rectangle tests. How else do you think we would get 100 fps for
Erw6?
Without these conservative distance checks, the EP search may not result in any significant
improvements. Something from -5% to +30% (frustum or interval traversal is still good
though). With these checks, and if the model permits, the sky is the limit. That explains why
MLRTA shines in indoor architectural scenes like Erw6 or Soda Hall.
Also please note that MLRTA thrives in situations where geometry is tightly localized in
leaf nodes (empty cells occupy most of the space) and flat cells are created for AA
pieces of geometry. Somebody mentioned in the discussion that if a split is not reducing the #
of triangles in at least one sub-cell, it should not be taken. Well, maybe for some traversal
implementations, but my experience shows that you could get better performance if you
sometimes ignore SAH recommendations for leaf creation. It is all about carving empty
space out of the scene. Empty cells are just so nice...
I probably will not talk any more about tree creation, as I'm already on the 3rd page of this
posting... I also will not venture into the intricate issues of bifurcations, even though I'm
proud of introducing this term to the CG crowd [SMILEY Smile] (somebody actually explained it pretty
well in this forum).
2.    Performance
That's where the fun begins. I had heard a few grievances that people could not match the
performance in the paper even if they are only casting rays and writing zeros to the
framebuffer. In this forum, somebody mentioned that 40 fps is a hard limit due to sqrt
calculations per ray. Yeah, right. Think now really, really hard: do you actually need to
normalize primary rays? The answer is no. All you need from traversal/intersection is
1.    the Cartesian coordinates of the intersection point;
2.    the id of the intersected triangle;
3.    the texture coordinates of the intersection.
None of these goals requires normalization. For non-normalized rays, you will find the
distance to the intersection point in units of the directional vector. When multiplied with the
directional vector and added to the ray origin, it will give you the Cartesian coordinates of the
intersection point.
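The point is easy to verify in a few lines (a toy axis-aligned-plane hit, not the paper's intersector): with an unnormalized direction, t comes out in "units of D", yet O + t*D is the same Cartesian hit point a normalized ray would give, with no sqrt anywhere.

```c
typedef struct { float o[3], d[3]; } Ray;

/* Parametric hit point P = O + t*D; valid for any (non-normalized) D. */
static void hit_point(const Ray *r, float t, float p[3])
{
    for (int i = 0; i < 3; i++)
        p[i] = r->o[i] + t * r->d[i];
}

/* t of the axis-aligned plane x = s, measured in units of D. */
static float plane_t_x(const Ray *r, float s)
{
    return (s - r->o[0]) / r->d[0];
}
```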
Normalization is required, however, for most shading operations. If you go to the paper,
you will see that for the Erw6 scene, this results in a ½ performance drop.
There are also other performance-related issues. Some of them are covered intensively in
the literature (and in Carsten's thesis). I will just list some of them in random order, not
sure whether some (or all) of you already know them:
-    use SSE intrinsics.
-    never use _mm_div_ps (but rather reciprocal followed by Newton).
-    Icc helps, especially version 9. It is also the first version which allows you to mix
     __m128 and __m128i almost penalty-free (1 cycle), which is important for some
     shading operations.
-    You are all using the 8-byte format for kd-tree internal nodes, right? Still, there is
     room for improvement here...
-    For C++ aficionados, _vtab kills; for all others, even inline functions (which are
     actually inlined) could let down the performance. This is for reasons I could not
     comprehend. Sadly, but fact.
-    The inner traversal loop is sacred, everything else is not. Still, we are not using
     assembly anywhere, but only intrinsics.
-    Hyper-threading helps (~30%). A dual machine helps even better [SMILEY Smile]. The only way
     to do proper synchronization on IA32 I know is InterlockedCompareExchange
     (for dynamic job allocation, which also helps).
-    Box clipping for a coherent group of rays is just a charm (faster than the regular case).
-    You cannot have just one good thing in your RT. If the intersector and traversal are
     superb, ray casting becomes the problem.
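The _mm_div_ps replacement from the list can be sketched with one Newton-Raphson step, x1 = x0 * (2 - a*x0); as tbp notes further down, this tops out around 22-23 good mantissa bits, fine for normalization and traversal but worth watching in shading.

```c
#include <xmmintrin.h>

/* Reciprocal via _mm_rcp_ps (~12-bit estimate) refined with one
   Newton-Raphson step, much cheaper than the full-precision _mm_div_ps. */
static __m128 rcp_nr(__m128 a)
{
    __m128 x0 = _mm_rcp_ps(a);
    return _mm_mul_ps(x0, _mm_sub_ps(_mm_set_ps1(2.0f), _mm_mul_ps(a, x0)));
}
```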
This is not all that difficult and certainly does not require an army of Russian hackers.
Nor does it require methodical VTuning (in fact, I never got anything interesting out of
VTune. Sorry, guys on the 3rd floor [SMILEY Smile]).
Well, I thought I forgot to mention something important, but it is getting late in the
Valley. Better do it incrementally...
There were some very interesting threads in the
forum, in particular related to kd-tree creation, like adjusting the empty cell threshold, etc.
IMHO, the termination criterion is the key, but it is all a huge topic.
Ciao,
Xela
_________________
[Xela]
(L) [2006/03/03] [tbp] [MLRTA implementation] Thanks for stopping by [SMILEY Wink]
It's very late, and i need to ruminate your post some more (with all neurons) so i'll just make some quick remarks.
Even if this board was initially about realtime raytracing, it has been overrun by swarms of GI posers. That's why there's not much discussion about coherence and so on. These punks show no respect.
Performance chapter.
. I had a postponed ray dir normalization at some point, but it was only a win (obviously) with low coverage. I still don't see the point unless your goal is to render emptiness, and in that case there's faster ways.
. _mm_div_ps: reciprocal + 1 refinement only brings 23 bits of mantissa in the best case; that's not an issue for normalization but maybe otherwise (ie shading).
. Mixing types in a vector (or lying about it) isn't only a compiler issue, but might also annoy the cpu. K8 doesn't like that for sure and there's some penalty to it (not always, but sometimes), and Intel announced that their future chips will have the same kind of requirements; casting requires an op.
. InterlockedCompareExchange isn't the only way, there's plenty of LOCKable atomic primitives on x86.
. I'll use ICC again the day they'll stop ignoring bug reports [SMILEY Smile]
Again thanks for that neat post.
PS: That thesis link is 404 compliant [SMILEY Sad]
(L) [2006/03/03] [fpsunflower] [MLRTA implementation] Best post ever. Thanks Xela for giving us some truth [SMILEY Smile] Hopefully you can enlighten us on kd-tree creation issues one day too [SMILEY Razz]
tbp: the thesis is there, just an extra period in the url. I'm off to read that now
(L) [2006/03/03] [Phantom] [MLRTA implementation] Remove the dot, Neo.
Me also needs time to digest this.
(L) [2006/03/04] [Xela] [MLRTA implementation] Tbp,
(L) [2006/03/04] [chris81] [MLRTA implementation] Could you post a link to Carsten's thesis?
(L) [2006/03/06] [ector] [MLRTA implementation] SSE3 CPUs do not have a "doubled number of registers". All they have is a bunch of horizontal ops, a couple of ops designed for complex arithmetic, some memory barriers and an alternate unaligned move.
x64 CPU:s, though, do have twice the number of registers, both regular and SSE. Of course these are only visible in 64-bit mode.
EDIT: damnit, tbp was faster.
(L) [2006/03/06] [toxie] [MLRTA implementation] Wayback!'kay guys.. Settle down.. [SMILEY Wink]
That's what i actually WANTED to write.. (x64 instead of SSE3)
(L) [2006/03/08] [tbp] [MLRTA implementation] Wayback!0xffc00000, 'indeterminate', is about the only NaN you'll ever get out of an x86, and if I'm not mistaken that magic value was part of a draft of IEEE 754 (and got dropped). Still, it's a valid NaN.
I can't discuss your traversal precisely, I haven't seen it, and those NaNs are a bit more evil than you think and a source of endless fun:
(L) [2006/03/08] [toxie] [MLRTA implementation] Wayback!Xela: Could you please answer my question about the 4x4 traversal? Is it possible to get it fast on plain 32-bit SSE machines, or do you need x64 to get it fast??!
Would be nice to hear just a small comment about that. Thanks!
(L) [2006/03/08] [Phantom] [MLRTA implementation] Wayback!Maybe I can spend a few words on that, as I am experimenting with that right now:
A very quick and dirty implementation of 4x4 packets already matches the performance of my 2x2 path (the path I used so far, it's highly optimized). Looks like it will be easy to beat the 2x2 performance by 10-20%. One area that brought a quick win is the triangle intersection code: There are lots of _mm_set_ps1 ops there, which can all be done once for 16 rays. Similar 'once for all' opportunities seem to exist at other locations.
As soon as I get a solid improvement, 4x4 is going to be my new default path.
_________________
--------------------------------------------------------------
Whatever
(L) [2006/03/08] [Phantom] [MLRTA implementation] Wayback!Here's my code that I pasted in a PM to tbp earlier today, so you know what I'm talking about:
(L) [2006/03/08] [tbp] [MLRTA implementation] Wayback!I forgot to ask what fastxovery was supposed to do.
BTW no need to really understand what's happening: put a breakpoint, alt-8, look if there are some undue reg->stack / stack->reg movements or silly reloads etc, tinker, rinse, repeat.
It's not going to fit the register set anyway, so perhaps you can babysit the compiler a bit (provided it somehow got it wrong); that is if you care about that bit of code.
(L) [2006/03/08] [Phantom] [MLRTA implementation] Wayback!Oh OK, that sounds easy enough. I'll give it a try tonight.
_________________
--------------------------------------------------------------
Whatever
(L) [2006/03/08] [tbp] [MLRTA implementation] Wayback!He said internal nodes [SMILEY Wink]
But then:
a) only internal nodes are problematic to pack into 4 bytes.
b) if internal/leaf sizes don't match, how do you address them?
So, for a 4-byte internal node that's something like 16 bits for the split, 14 bits for the children offset and 2 bits for the axis.
14 bits for offsets doesn't sound very reasonable, but then you can try to re-arrange, I guess, and anyway you can always fall back to the 8-byte style when you create the tree.
Now we have 16 bits for the split, so an obvious option would be fp16, but that would lead to quite some quantization problems, so I bet some fixed-point thingy is the way to go.
I'm not sure the cost to unpack that at traversal time is worth the bandwidth savings tho. Got some perf figures to show us?
(L) [2006/03/09] [Phantom] [MLRTA implementation] Wayback!Over the past few days, I have been struggling to replicate that '2x speedup' for 4x4 packets. I really fail to see how this could ever happen. Right now, my 4x4 path is completely functional and optimized, and it performs *exactly* as the 2x2 path does. I made some progress, but every optimization was easily copied to the 2x2 path. Sometimes I get a slightly better score for primary rays only, but that's really in the 'below 2%' area, where inaccuracies and scene specifics tend to make things noisy.
So... Either my 2x2 path is completely awesome as it is, since it matches what others do with 4x4 packets, or I am missing the point here. Could someone show one of those awesome 4x4 cores?
_________________
--------------------------------------------------------------
Whatever
(L) [2006/03/09] [Phantom] [MLRTA implementation] Wayback!Never mind, I'll try some of Carsten's cores in the morning.
By the way, I noticed one odd thing in his thesis: Each time he pushes the current near value on the stack, he does this using _mm_max_ps. Why is this?
_________________
--------------------------------------------------------------
Whatever
(L) [2006/03/09] [tbp] [MLRTA implementation] Wayback!To properly clip that pushed ray segment to the near cell. But you made me realize that i don't need to clip the far value (i'm doing 2 min/max).
edit: Fooled by my own code, that min isn't pushed.
(L) [2006/03/10] [Phantom] [MLRTA implementation] Wayback!Weird, I have been tracing for over a year now, and I *never* used that max/min construct. I wonder what the consequences are... Anyway, I'll try some of Carsten's cores today to round off my 2x2 vs 4x4 experiments. After that I would like to try the Plücker test for triangles.
_________________
--------------------------------------------------------------
Whatever
(L) [2006/03/11] [ector] [MLRTA implementation] Wayback!Hey,
You guys keep mentioning "Carsten's cores"; are they from his "unreleased" paper with the dead URL?
If so, is there any way I can get my hands on one of your already downloaded copies? [SMILEY Smile]
(L) [2006/03/11] [Phantom] [MLRTA implementation] Wayback!That url is not dead, it just contains one dot too many. I added a (correct) link to the 'links and papers' section; see my post (the first one) for the link.
_________________
--------------------------------------------------------------
Whatever
(L) [2006/03/17] [craouette] [MLRTA implementation] Wayback!While going through the thesis of Carsten Benthin, I found a few problems (or things I didn't understand correctly):
First, I don't understand why using BBs instead of triangles is an approximation. If I remember the Havran thesis correctly, the split has to be computed on primitive bounds only (according to the evaluated axis), and those are bounding boxes, no?
Second, there is a flaw in the kd-tree construction proposed:
Counting the number of splits does not take care of triangles that start and end outside the current node's BB. The cost is then wrong in this case (even worse: if the triangles of this node are enumerated via the splits, some triangles will be missing from the node). Moreover, how is the total cost of the node evaluated? The number of triangles in the node is needed to compute it, and going from split to primitive is not free...
Hope I was clear enough,
craouette
(L) [2006/03/17] [craouette] [MLRTA implementation] Wayback!and it is said that
Number of left primitives = planar + opened
Number of right primitives = number of splits - planar - closed
but this is not true. It should be:
Number of right primitives = number of primitives - planar - closed
where number of primitives = total number of opened + total number of planar = total number of closed + total number of planar,
so a first loop is required before calculating the cost this way.
Am I wrong?
and tbp, do you mean that the SAH requires clipping the primitives at the split positions?
craouette