Starting anew

(L) [2006/03/11] [Phantom] [Starting anew] Wayback!

I am seriously considering building a new tracer, of course using a bit of cut and paste, but still, a completely new one. There are several problems with my current tracer that I would like to address:

- There's far too much global data, which is hard to split into hot and cold data;

- There are too many paths: 2x2, 4x4 and mono;

- Anti-aliasing is far too expensive (as it uses the mono ray path);

- Many paths miss functionality (e.g., no reflections in 4x4 path).

Also, the fallback paths seem to be used way too often, resulting in poor performance.

The basic idea that I would like to implement this time is to generate rays in a regular pattern, accumulating them until a 4x4 packet is full, then rendering that full packet. Extra rays, like anti-aliasing rays, would be added to these packets too, so they no longer degrade overall performance as they do now.
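A minimal sketch of the accumulate-and-flush idea, assuming rays are opaque objects and `trace_packet` is a hypothetical callback standing in for the real packet traversal (neither name comes from the tracer discussed here):

```python
PACKET_SIZE = 16  # 4x4 rays per packet


class PacketAccumulator:
    """Collects rays and hands them to the tracer in groups of PACKET_SIZE."""

    def __init__(self, trace_packet):
        # trace_packet is a hypothetical callback that renders one packet.
        self.trace_packet = trace_packet
        self.pending = []

    def add(self, ray):
        """Queue a ray (primary, anti-aliasing, ...) and trace once a packet is full."""
        self.pending.append(ray)
        if len(self.pending) == PACKET_SIZE:
            self.flush()

    def flush(self):
        """Trace whatever is queued, even if the packet is only partly filled."""
        if self.pending:
            self.trace_packet(self.pending)
            self.pending = []
```

A final `flush()` at the end of the frame is needed so that a partly filled last packet still gets traced; how to pad or mask such a packet is left open here.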

The pattern I had in mind is like this:

(L) [2006/03/11] [tbp] [Starting anew] Wayback!

You are trying to address many unrelated issues there.

If i remember correctly, Wald talks about having a distinct ray generation phase (and i think it's done like that in OpenRT), but when i looked into deferred rendering i concluded once again that there wasn't enough work to be done at that point to turn it into its own pass, and you're much better off working on the rays you've just generated (even if that means, sometimes, that you might do an expensive unbundling etc).

Cuz if you put that into its own pass, you do little computation (some reciprocal, checking signs etc) then store a rather large amount of data (possibly re-arranging it). Later you get back there, read back all that memory (we're talking about at least 2 cache lines per ray) and do some real work on it. I don't think you can justify wasting that much mem bandwidth, even if that gets you more coherency.
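To put a rough number on that, here is a back-of-the-envelope sketch. It assumes 64-byte cache lines and one primary ray per pixel at 1024x1024; both figures are assumptions for illustration, not measurements from any tracer in this thread.

```python
def ray_buffer_traffic_bytes(width, height, lines_per_ray=2, line_bytes=64):
    """Memory traffic of a separate ray generation pass for one frame.

    Every ray is written once by the generation pass and read back once
    by the traversal pass, hence the factor of two.
    line_bytes=64 is an assumed cache line size.
    """
    rays = width * height
    stored = rays * lines_per_ray * line_bytes
    return 2 * stored


# 1024x1024 rays at 2 cache lines each: 128 MiB written plus 128 MiB read back.
frame_traffic_mib = ray_buffer_traffic_bytes(1024, 1024) / (1024 * 1024)
```

Under these assumptions that is 256 MiB of traffic per frame before any traversal work has been done.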

Anti-aliasing, at least in some forms, is easy to adapt to packets as is.

So, i bet, all that massaging would only make sense for secondary rays (i haven't tried going 4x4, maybe odds are different).

(L) [2006/03/11] [Phantom] [Starting anew] Wayback!

Problem with anti-aliasing is that it typically generates lines of extra samples to probe. That doesn't fit in a packet well; so in practise I did those using single rays. I would really like to get rid of mono rays completely (that's basically my main goal here) to reduce code complexity. Gathering primaries is not too complex (I think the scheme I described should work well). From that point, I can either gather shadow rays in the same manner, casting a packet as soon as it is 'full', or I could do deferred shading. Deferring shading in a unified fashion would increase coherency, since it would spawn tons of rays starting at the same light. I could even drop ray origin initialization completely for all rays that start at the same light source. Not a big win, I suppose. [SMILEY Smile]

But again: I want a single code path. That would reduce my current tracer to 1/3rd of its size, which would increase maintainability tremendously.
_________________
--------------------------------------------------------------

Whatever

(L) [2006/03/11] [tbp] [Starting anew] Wayback!

Deferring shading or not is orthogonal; i wasn't saying that it was too complex to have a ray generation pass, but that it was ineffective due to how little work there is to do at that point (that, and the fact that your rays then become somewhat cold).

I only shoot packets for AA, but packet or not it's slow anyway [SMILEY Smile]

(L) [2006/03/13] [tbp] [Starting anew] Wayback!

Some time ago i re-did the ray packet unbundling to mono ray code, so i'm reasonably sure it's about as fast as such a horrible thing can go.

Then i've used the gcc binary because it has by far the fastest mono path.

I've compared that unbundling-to-mono-on-incoherency to a really crude version where, when 2x2 rays are incoherent, i deactivate all but one in turn and do 4 traversals. The crude version is slightly but consistently faster; as there are never many incoherent rays among primaries to begin with, that's significant.

So i guess i now need to tighten that a bit and, while i'm at it, try to group those incoherent rays a bit.

Of course that also means your ray packet path must handle all degenerate cases gracefully [SMILEY Wink]

It turns out my robust ray packet/box intersection test wasn't that robust.
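The crude fallback described in this post can be sketched as below. The coherency test (all rays agreeing on the sign of each direction component) and the `traverse(directions, mask)` callback are assumptions made for illustration; the post does not say which criterion or interface is actually used.

```python
def is_coherent(directions):
    """Assumed criterion: every ray has the same sign on every axis."""
    first = directions[0]
    return all((d[axis] < 0.0) == (first[axis] < 0.0)
               for d in directions for axis in range(3))


def trace_packet_or_split(directions, traverse):
    """Trace a packet; on incoherency, re-traverse with one ray active at a time.

    traverse is a hypothetical packet traversal taking the ray directions
    and an activity mask, returning one result per ray.
    """
    count = len(directions)
    if is_coherent(directions):
        return traverse(directions, [True] * count)
    results = [None] * count
    for active in range(count):
        # Disable all rays but one, so the same packet code path is reused.
        mask = [i == active for i in range(count)]
        results[active] = traverse(directions, mask)[active]
    return results
```

The point of the design is that there is no separate mono ray path at all; the cost is that the packet traversal has to cope with masks where only a single ray is live.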

(L) [2006/03/13] [Phantom] [Starting anew] Wayback!

So basically what you're saying is that in your setup you could just as well remove the mono ray code path and replace it by a packet traversal with 3 rays disabled? That would certainly bring down code size.

BTW one other grief I have with my current tracer is that enabling shadows actually decreases overall ray throughput (from 3.4M/s to 2.6M/s on a PM/1.6GHz).

One thing I was wondering: People that see a 2x speedup using 4x4 packets, how many rays are you casting (rough estimate) on a scene like Sponza (80k?)? I have been aiming for Wald when I started on RTRT, then I aimed for tbp, as he was slightly ahead of me (he's a compiler guru, so that's OK), but right now, I have no idea who's "king of the hill" and how far ahead the "king" is.

(L) [2006/03/13] [toxie] [Starting anew] Wayback!

For a nice rayshooting-only-competition i'd recommend the following:

Convert some scenes to a REALLY easily readable format (= without any texture coordinates, normals, or other kinky stuff). I for myself have a nice collection of raw binary files (= 36 bytes per triangle, no headers, no nothing) if you like some.

Then include an ASCII text or binary file that includes camera position, forward, right and up vectors, and fov.

(Maybe include some reference pictures so anyone could see if their cam-code is okay)

Define a default high resolution (like 1024x1024) and some super-simple shading (like TriNormal*RayDirection).

This way rules are simple and anyone can easily compare their performances.

As performance numbers:

FPS (for mono, 2x2, 4x4, MLRT, whatever and for 1,2,X CPUs/Cores)

Acc.-Data-Structure build (measured from the point of having loaded all tri-data into mem to the first shot ray)

(And maybe memory-usage of Data-Structure?)
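For reference, the proposed "TriNormal*RayDirection" shading could look roughly like this. Taking the absolute value (so triangle winding does not matter) and returning 0 for degenerate triangles are assumptions added here, not part of the proposal.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _length(a):
    return math.sqrt(_dot(a, a))


def shade(v0, v1, v2, ray_dir):
    """Grey level in [0, 1]: |cos| of the angle between face normal and ray."""
    normal = _cross(_sub(v1, v0), _sub(v2, v0))
    scale = _length(normal) * _length(ray_dir)
    if scale == 0.0:
        # Degenerate triangle or zero-length ray: avoid dividing by zero.
        return 0.0
    return abs(_dot(normal, ray_dir)) / scale
```

Because the normal is derived from the three vertices alone, this shading needs nothing beyond the raw triangle data.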

(L) [2006/03/13] [fpsunflower] [Starting anew] Wayback!

I'm in too. I'm curious to see the real gap between my mono Java tracer and fully optimized packet tracers.

We should probably be real clear about the camera angles to expect. Everyone has their own slightly different way to specify fov/aspect ratios. Perhaps a square resolution and a 90deg fov can make this easier?

And maybe include a "correctness" check by rendering a high res image.
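One way to pin the convention down is sketched below, assuming a square image, a fov given as the full angle across the image, and an orthonormal `forward`/`right`/`up` basis. The function name and the pixel-centre sampling are illustrative choices, not something agreed on in this thread.

```python
import math


def primary_ray_direction(px, py, resolution, fov_deg, forward, right, up):
    """Unit direction through the centre of pixel (px, py) of a square image.

    fov_deg is taken as the FULL opening angle; a tracer that treats it as
    the half angle will render a visibly different view.
    """
    half_extent = math.tan(math.radians(fov_deg) * 0.5)
    # Map the pixel centre to [-1, 1], with y pointing up on screen.
    u = ((px + 0.5) / resolution * 2.0 - 1.0) * half_extent
    v = (1.0 - (py + 0.5) / resolution * 2.0) * half_extent
    d = tuple(forward[i] + u * right[i] + v * up[i] for i in range(3))
    length = math.sqrt(sum(c * c for c in d))
    return tuple(c / length for c in d)
```

With a 90 degree fov the image plane at distance 1 spans from -1 to 1, which is why that value makes cross-checking between tracers easy.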

(L) [2006/03/13] [toxie] [Starting anew] Wayback!

So here you go: [LINK http://ainc.de/RTcoreTest/RTcoreTest.zip]

What's inside?

-Six scenes converted to a VERY simple binary triangle (= 3 vertices = 9 floats = 36 bytes) format without header. To read 'em in: filesize/36 = number of triangles, and a simple fread() is all you need.

-Six ascii-text files describing the camera: first 3 numbers = camera-position, next 3 numbers = camera-look-at, last number = index of axis for the "default"/world-up-vector (0 = (1,0,0), 1 = (0,1,0), 2 = (0,0,1)) to create the forward and right vector.
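A loader for both file types might look like the sketch below. Little-endian 32-bit floats, whitespace-separated numbers in the camera file, and the handedness of the cross products are all assumptions; with the opposite handedness the image comes out mirrored.

```python
import math
import os
import struct

TRIANGLE_BYTES = 36  # 3 vertices * 3 floats * 4 bytes


def load_triangles(path):
    """Read a headerless triangle file: filesize / 36 = number of triangles."""
    size = os.path.getsize(path)
    if size % TRIANGLE_BYTES != 0:
        raise ValueError("file size is not a multiple of 36 bytes")
    count = size // TRIANGLE_BYTES
    with open(path, "rb") as f:
        data = f.read()
    # Assumption: little-endian floats, as written on x86 machines.
    floats = struct.unpack("<%df" % (count * 9), data)
    return [tuple(tuple(floats[t * 9 + v * 3: t * 9 + v * 3 + 3])
                  for v in range(3))
            for t in range(count)]


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _normalize(a):
    length = math.sqrt(a[0] * a[0] + a[1] * a[1] + a[2] * a[2])
    return (a[0] / length, a[1] / length, a[2] / length)


def load_camera(path):
    """Read position, look-at and up-axis index; return position and basis."""
    with open(path) as f:
        values = f.read().split()
    position = tuple(float(v) for v in values[0:3])
    look_at = tuple(float(v) for v in values[3:6])
    up_axis = int(float(values[6]))
    world_up = tuple(1.0 if i == up_axis else 0.0 for i in range(3))
    forward = _normalize(_sub(look_at, position))
    # Fails with a division by zero if forward is parallel to world_up.
    right = _normalize(_cross(forward, world_up))
    up = _cross(right, forward)
    return position, forward, right, up
```

Comparing the first rendered frame against the reference pictures is the quickest way to catch a wrong handedness or byte order.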

The included scenes are (hopefully) "free" for our use:

-Menger-Sponge and Clown are made by myself

-Happy Buddha (The Stanford 3D Scanning Repository [LINK http://graphics.stanford.edu/data/3Dscanrep/])

-Kitchen (BART [LINK http://www.ce.chalmers.se/old/BART/])

-Scene6 (Shirley GI Test Scenes)

-Fairy Forest (The Utah 3D Animation Repository [LINK http://www.sci.utah.edu/~wald/animrep/])

(L) [2006/03/13] [toxie] [Starting anew] Wayback!

And some notes:

I've chosen these scenes because they should represent different problems:

Clown and Buddha represent (more-or-less) equally distributed triangles (3D-Scans).

Kitchen is evilly transformed (triangles aren't axis aligned) and also has the teapot-in-a-stadium-problem.

Menger-Sponge is (almost) fully axis-aligned, but features a lot of holes everywhere.

Scene6 is tiny. [SMILEY Wink]

Fairy Forest also has the teapot-in-a-stadium-problem plus a bit of a "real-life" scenario.

(L) [2006/03/13] [lycium] [Starting anew] Wayback!

many thanks for your testscenes toxie :)

however, one thing that doesn't sit well with me is your triangle format: imo, it's wasteful and unrealistic to store them like that. an equally simple format is just 2 ints (numvertices, numtriangles) followed by the binary data: 4*3 bytes per vertex (floats) and 4*3 bytes per triangle (unsigned int indices). i know the extra indirection is no fun for performance, but when you get to having vertex normals, uv co-ords, material infos and such it's really not an option at all to store everything X many times.

[PS. anyone using this format is also in a position to really stress test their k-d tree building/traversal/memmgmt with some programs of mine (procedural mesh generation and a 3dsmax .ASE converter), that i intend to patch up a little bit and release when i'm not dead/busy...]

[PPS. i think the last couple of posts could be moved to the datasets forum?]

(L) [2006/03/13] [lycium] [Starting anew] Wayback!

while i fully agree with you, the specifics of our implementations tell another story: you also need to determine the filesize, that's a good few more ops (fseek, ftell, etc) i don't have to do. so you end up with 1 fread and a bunch of other stuff, i have 3 freads and nothing more: 5 lines of code for reading a file qualifies as SUPERSIMPLE i believe?

and yes, just measuring numbers is fun, but they have to still mean something: no point measuring something different to what you'll actually be using IRL. moreover, i bet the extra order of indirection will hurt the p4s a lot more than other archs (due to the long pipeline), so you're actually hiding important info this way...

(L) [2006/03/13] [lycium] [Starting anew] Wayback!

i guess for the purposes of benchmarking it's better that way then; in any case, it's still possible to keep indices (though you have to do lots of vertex equality tests to extract the info) and shading info alongside this flattened/inline/wald data, probably a good idea for hot/cold separation and deferred shading anyway.

[i'll still use indexing though for filesize and flexibility reasons; it's also a supersimple (tm) conversion to the inline format during tri index loading]
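The conversion from an indexed mesh to the inline 36-bytes-per-triangle layout can be sketched as follows. The exact indexed layout (two little-endian ints, then all vertices as 3 floats each, then all triangles as 3 unsigned ints each) is an assumed reading of the format proposed above, not a confirmed specification.

```python
import struct


def read_indexed_mesh(data):
    """Parse the assumed layout: header, vertex block, index block."""
    num_vertices, num_triangles = struct.unpack_from("<2i", data, 0)
    offset = 8
    coords = struct.unpack_from("<%df" % (num_vertices * 3), data, offset)
    offset += num_vertices * 12
    indices = struct.unpack_from("<%dI" % (num_triangles * 3), data, offset)
    vertices = [coords[i * 3:i * 3 + 3] for i in range(num_vertices)]
    triangles = [indices[i * 3:i * 3 + 3] for i in range(num_triangles)]
    return vertices, triangles


def flatten(vertices, triangles):
    """Expand indices into the inline format: 9 floats (36 bytes) per triangle."""
    out = bytearray()
    for tri in triangles:
        for vertex_index in tri:
            out += struct.pack("<3f", *vertices[vertex_index])
    return bytes(out)
```

Going this way is a straight lookup per index; going from the inline data back to indices requires the vertex equality tests mentioned above.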

(L) [2006/03/13] [toxie] [Starting anew] Wayback!

Results:

Buddha: 2.85 FPS

Scene6: 3.14 FPS

Fairy Forest: 1.96 FPS

Clown: 3.3 FPS

Kitchen: 1.83 FPS

Menger: 1.89 FPS

FOV = 80

Aspect Ratio = 1

Resolution = 1024x1024

Machine: P4HT 2.8GHz, 2GB RAM

Uses OMP and 2x2-SSE to trace rays,

but no OMP and single samples to do the shading.
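To relate these frame rates to the rays-per-second figures quoted earlier in the thread, a small conversion helps. It assumes exactly one primary ray per pixel and no secondary rays, which matches the test setup as described but is still an assumption about the renderer.

```python
def primary_rays_per_second(fps, width=1024, height=1024):
    """Primary ray throughput, assuming one ray per pixel per frame."""
    return fps * width * height


# Frame rates taken from the results listed above.
results_fps = {"Buddha": 2.85, "Scene6": 3.14, "Fairy Forest": 1.96,
               "Clown": 3.3, "Kitchen": 1.83, "Menger": 1.89}
mrays = {name: primary_rays_per_second(fps) / 1e6
         for name, fps in results_fps.items()}
```

At 1024x1024 each FPS corresponds to about 1.05M primary rays per second, so Buddha at 2.85 FPS works out to roughly 2.99M rays/s.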

Pics:

[Images not scraped: buddha.png, clown.png, fairy.png, kitchen.png, menger.png, scene6.png, originally hosted at http://ainc.de/RTcoreTest/]

(L) [2006/03/13] [tbp] [Starting anew] Wayback!

MengerSponge

[LINK http://ompf.org/ray/wip/pix/20060313-01-MengerSponge.png]

This is not a legit entry (i'll have to run soon) cuz:

. had to match camera by hand (there's not much fluctuation in that zone, but still...).

. shading is a direct mapping of triangle id, heh.

. that's with an experimental gcc + patches, no time to check with anything else.

. etc...

3.10 fps (i'm pretty sure the Kray/s stat is bogus), but at least timings are accurate.

Opteron 252, that's 2.6 ghz, 2G of ram (1G per cpu), 1 cpu/thread active for everything, 2x2 packets on linux with gcc 4.2.something.

Of course that was with the old-compiler-that-sucks-ass:

ra2_loader: got 1920000 triangles and then some in a grand total of 0.664 seconds.

...

~> first node done in 0.428 seconds.

...

build::build: 11.092 seconds for 1920000 tri, nodes 1349593, leaves 674797.

build::build: stats: max lvl 24, 2029534 items, 460030 empty leaves ( 68%), 12 max items per leaf.

flat::flatten: allocated 10.297MB for nodes and 7.742MB for triangle ids.
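The figures in this log can be cross-checked with a little arithmetic. The 8 bytes per node and 4 bytes per triangle id used below are inferred from the numbers themselves, not stated anywhere in the log, so treat them as assumptions.

```python
MIB = 1024 * 1024

# Values copied from the log above.
nodes, leaves = 1349593, 674797
items, empty_leaves = 2029534, 460030


def flattened_sizes_mib(node_count, item_count, node_bytes=8, id_bytes=4):
    """Size of the flattened tree in MiB, under assumed per-entry sizes."""
    return node_count * node_bytes / MIB, item_count * id_bytes / MIB


node_mib, id_mib = flattened_sizes_mib(nodes, items)
empty_percent = 100.0 * empty_leaves / leaves
# A binary tree in which every inner node has two children has
# 2 * leaves - 1 nodes in total.
is_full_binary_tree = nodes == 2 * leaves - 1
```

With those sizes the node array comes to about 10.297 MiB and the id array to about 7.742 MiB, matching the log, and 460030 of 674797 leaves is about 68%.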

Tomorrow afternoon/evening i'll try better, and as suggested i'll post some of those links in other sections of that forum (if you will).

(L) [2006/03/13] [Phantom] [Starting anew] Wayback!

One preliminary result from me (updated on March 14th):

Kitchen scene, 4x4 packets, 1.6Ghz Pentium-M, single thread: 2.2fps.

Scene6, same settings and code: 4.2fps.

Preliminary because:

- My 'ra2' loader still has a problem (incredible, I know), which causes the scene to be mirrored. [SMILEY Smile] I get the exact mirrored image of what Toxie showed, so it should not matter much.

- This is my 'redone' tracer. It's 500 lines, including memory pool management, but does nothing more than the stuff required for this test.

- Despite my great plans for avoiding special cases for incoherent rays I'm looking at some gaps now near the spots where packets tend to lose their coherency, so I'm clearly doing something wrong. [SMILEY Smile]

- My kd-tree compiler has problems with some scenes: The fairy scene is the smallest that crashes the compiler (allocating far too many kd-tree nodes), so I will investigate. Buddha also doesn't compile.

Despite that, the 2.2 and 4.2 figures should reflect the performance of my current code quite well.

(L) [2006/03/14] [Phantom] [Starting anew] Wayback!

Toxie: I have to admit that I didn't correct my fov. I suspect that I used a FOV of 90, since I simply shoot rays from a point in space (camera) to a quad in space. Also, my shading is not exactly as you described it. This should of course be completely uniform if we want to compare scores.

Once I get my shading right, I'll post a pretty collage like you did, and I'll post a table with 'scores'. I'll also make a nice package that people can use to gather scores on different hardware; one of the things that is a bit unclear to me is the approximate speed difference between Pentium-Ms and P4s. Also, it might be interesting to compare scores for various compilers and compiler settings. Right now icc9.0 is beating VS2005 all over the place (by a 10-15% margin) even though I have to use icc from VC6, so it's relatively hard to play with the compiler settings (I just enable the 'restrict' keyword and use -O3, that's all atm).

I'm quite happy with your initiative by the way, I was quite stuck and now I am looking at a clear and small challenge. My code has been reduced to 570 lines, with a render core that's a mere 180 lines... Now that's stuff I can experiment with in those small spare hours.

Sadly I have work to do today. [SMILEY Smile]

(L) [2006/03/14] [tbp] [Starting anew] Wayback!

Turned my pix into a link so it doesn't screw the whole post [SMILEY Wink]

Later today i'll add some shading, i had to shortcut some stages due to lack of shading data, so it will be closer to what you guys are really doing.

My camera wasn't reliable, i was getting a view that roughly looked like yours (toxie), but not exactly.

Then with the same compiler version, i'll gather some numbers on win32 as well. So hopefully i'll have some definite numbers regarding gcc4.2 & msvc8 and the relative speedup due to 64bitness.

BTW i've just noticed that Menger sponge isn't composed of 192k tri but 1.92M. Doh. I guess i was fooled by the relative speed at which it got compiled due to the regularity in the data set.

Jacco, there's also the relative speed between a k8 & the rest to put into the equation; certainly not trivial to get right.

PS: Forgot to thank Toxie for the initiative and data set [SMILEY Smile]

(L) [2006/03/14] [tbp] [Starting anew] Wayback!

Here are my official entries.

Win32, 1xopteron 252 active, 2.6ghz, 2G of RAM.

[LINK http://ompf.org/ray/wip/pix/20060314-01-win32-scene6.png]

[LINK http://ompf.org/ray/wip/pix/20060314-02-win32-menger.png]

[LINK http://ompf.org/ray/wip/pix/20060314-03-win32-faery.png]

[LINK http://ompf.org/ray/wip/pix/20060314-04-win32-clown.png]

scene6: 8.25 fps

menger: 2.66 fps

faery: 3.19 fps

clown: 5.70 fps

To be fairer to you guys that was done on win32, that is with a 32 bit binary, with only one cpu active.

Shading is, as prescribed, a dot between the normal and ray direction (note: my normal extraction is fucking slow); incoherent rays (ie on the border) shouldn't get shaded but they are, sue me, that's just laziness.

Toxie, it seems your fov is somehow half of what you described, but not exactly, in any case with fov 80 i get a perfect match with fpsunflower's pictures so i've settled for those views.

On kitchenTransformed.ra2, there are some degenerate triangles:

flat::transfer_wald_triangles: id 4339, N[k] == 0 -> NaNs. shouldn't happen.

My ra2 loader is too brain dead to prune them atm, so that scene will have to wait.

Then it seems my compiler gets a bit stuck on the buddha. I need to investigate further, but it seems to be a miscompile with my current gcc. Meh.

Before adding the dot shading, i benched various versions. Msvc on win32 gives me a binary 25% slower than gcc. Then, there's only a ~5% speed difference between 32 and 64 bit version, both compiled with the same gcc. So i must be doing something wrong. Gah.

On top of that my normal extraction comes with a big penalty, i need to do my homework again. Double gah.

EDIT: i was comparing black spots on faery vs fpsunflower, i have a bit less and they seem to be related to edge aliasing.

(L) [2006/03/15] [toxie] [Starting anew] Wayback!

Okay.. tbp you were right.. FOV was just half.. [SMILEY Smile]

Uploaded the new pictures..

Here are also some updated numbers with corrected FOV.

This time also no more nasty framework and API stuff around it.

Just (more or less) pure RT-performance numbers: =)

Buddha: 4.16 FPS

Scene6: 3.92 FPS

Fairy Forest: 2.55 FPS

Clown: 4.45 FPS

Kitchen: 2.37 FPS

Menger: 2.03 FPS

Makes me feel better now.. [SMILEY Wink]

Machine is still a P4HT 2.8GHz, 2GB RAM..

(L) [2006/03/16] [tbp] [Starting anew] Wayback!

I'm thrilled to announce i've, yet again, replaced this morning the main part of traversal with inline asm (took ages, i'm rusty) for a net speedup of -0.2%. [SMILEY Rolling Eyes]

That freaking gcc is doing horrible things all around, but i couldn't find any easy way to put a stopgap to that madness.

I hate compilers.

(L) [2006/04/11] [Phantom] [Starting anew] Wayback!

I lost track of my own code, so I started on another iteration of the kd-tree compiler. I decided to go for the 'naive' approach, just to see if I missed anything in my previous attempts. Surprise... It's 23% faster. I'm glad I'm doing another rebuild. [SMILEY Wink]

(L) [2006/04/12] [Ho Ho] [Starting anew] Wayback!

I think [LINK http://graphics.cs.uni-sb.de/~woop/rpu/pics/Peoq2-large.jpg this] [LINK http://graphics.cs.uni-sb.de/Publications/2004/SaarCOR/Peoq-15-1024-large.jpg map] is available in demo version too.

Of course you could only provide the tools to convert the .bsp to something nicer so we could convert our own maps if needed [SMILEY Smile]
_________________
In theory, there is no difference between theory and practice. But, in practice, there is.

Jan L.A. van de Snepscheut

(L) [2006/04/12] [fpsunflower] [Starting anew] Wayback!

Thanks! The morning after I write my new compiler =)

I haven't tweaked my new compiler too much yet but I get 1.2s build time (SAH across all 3 axes with robust handling of planar faces).

Old compiler (which tries the axes longest first - and no special handling of planar polys): 0.7s

Still a few orders of magnitude away from the goal I'm afraid [SMILEY Wink]

Now what would be a good camera angle to view this from ?

(EDIT: I should let my processor caches warm up before posting benchmarks ... [SMILEY Razz])

(L) [2006/04/12] [tbp] [Starting anew] Wayback!

[mumbles] 15 ms to sort, 94 ms to compile.

And that silly compiler is going down to lvl 37 [more mumbling]

I'm hitting the resolution limit for that timing method on windoze anyway, i need better instrumentation.

What's your compilation time like, toxie, on this one? 20ms total? 40ms?

I'm still off by a non-negligible factor i'm afraid.

Edit: Better resolution... sorting takes more like 29 ms on its own, i know it sucks anyway, same time to compile. So even if i put the sort phase aside, i'm not quite there yet.
