RtStage2 source release

(L) [2006/05/08] [Phantom] [RtStage2 source release] Wayback!

Release is imminent. I am building a small package now.

The original plan was to release both my 'pretty' tracer (with textures, lights and stuff) and the 'speedking' edition (bare bones tracer), but after looking at the 'pretty' tracer I decided that it is not good enough to release. It just contains too many problems that I fixed recently (esp. memory management issues), which will probably require tons of support should anyone try to build and run it.

The original plan was also to release the current 'speedking' edition, but without the actual tracer (just the kd-tree compiler), but since I am holding on to the 'pretty' tracer, I will release my full current project. It's somewhat 'work in progress', obviously, so I will update it regularly. Right now, it is completely optimized to run Toxie's benchmark scenes, it does not contain code for recursive ray tracing (just first-hit code), no shading, no texturing and so on. It also has quite a bit of 'work in progress' code, like #defines for stuff that I am testing at the moment. It does however show how to implement a very basic 4x4 packet tracer, how to feed it rays, how to shade, how to do intersections (Carsten style), how to do memory management and of course how to build a kd-tree in O(N log N) time.

As soon as the package is up, I'll drop another note here.
_________________
--------------------------------------------------------------

Whatever

(L) [2006/05/08] [Phantom] [RtStage2 source release] Wayback!

File has been uploaded to the repository:

[LINK http://ompf.org/alpha/bikker/rtStage2_may_8th_2006.zip]

(thusly moved)

Some notes, besides the one in my first post:

- Only one scene included: Sponza. Add extra scenes to 'meshes' directory.

- Code compiles under VS2005 and VC6/icc combo; icc is by far the fastest. VS is used for debugging.

- See common.h for settings: MAXTREEDEPTH, PRIMSPERLEAF influence tree generation. TRAVCOST and INTRCOST are self-explanatory, I assume. TREECLIP is used to enable/disable full clipping. REORDERINGPRIMS still works, but doesn't help at all, like I mentioned. REBUILDTREE cannot be disabled, as tree saving/loading is broken atm.

- kdhelp contains all helper functions, the old nlog2n compiler (might still work) and some code backups. You shouldn't need it, but if you are looking for list insertion/deletion and that sort of stuff, it's there. [SMILEY Smile]

- This package is the exact code that was used for the demo I released earlier, except for a couple of things:

  * Intersection code has been changed to Carsten's approach. It's faster and more accurate.

  * Timer has been replaced by tbp's wallclock. Much more stable, and more accurate.

  * Incoherent packets are now detected properly, but not yet handled (code is there, but doesn't work).
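Editor's sketch for the TRAVCOST / INTRCOST settings mentioned above: a minimal example of how two such constants are commonly used in a surface area heuristic when choosing a split plane. The post does not say that common.h uses exactly this formula, and the constant values, `Aabb`, `SplitCost` and `LeafCost` are assumptions made up for illustration.

```cpp
#include <cassert>

// Hypothetical values; the real ones live in common.h and are not quoted in the thread.
static const float TRAVCOST = 1.0f;   // cost of one traversal step
static const float INTRCOST = 1.5f;   // cost of one ray/primitive intersection

struct Aabb { float min[3], max[3]; };

// Surface area of an axis-aligned box.
inline float SurfaceArea(const Aabb& b)
{
    const float dx = b.max[0] - b.min[0];
    const float dy = b.max[1] - b.min[1];
    const float dz = b.max[2] - b.min[2];
    return 2.0f * (dx * dy + dy * dz + dz * dx);
}

// Expected cost of splitting 'node' at 'pos' along 'axis', with nLeft and
// nRight primitives ending up in the two children.
inline float SplitCost(const Aabb& node, int axis, float pos, int nLeft, int nRight)
{
    assert(axis >= 0 && axis < 3);
    Aabb left = node, right = node;
    left.max[axis] = pos;
    right.min[axis] = pos;
    const float invArea = 1.0f / SurfaceArea(node);
    return TRAVCOST + INTRCOST * invArea *
           (SurfaceArea(left) * nLeft + SurfaceArea(right) * nRight);
}

// Cost of not splitting at all and testing every primitive in the node.
inline float LeafCost(int primCount)
{
    return INTRCOST * primCount;
}
```

A builder along these lines would keep splitting while the cheapest `SplitCost` stays below `LeafCost`, subject to limits such as MAXTREEDEPTH and PRIMSPERLEAF.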

Well I suppose that's it for now, if there are questions, just let me know. Small fixes will be posted in this thread, major releases will get a new package upload.

(L) [2006/05/08] [Ho Ho] [RtStage2 source release] Wayback!

No (official) Linux support? I guess I'll have to port it myself then. More fun for me [SMILEY Smile]

I'll get started in around 6 hours after I get home from work. If it is not too much different from the RT articles it shouldn't take too long.

Btw, does it compile with GCC? If it does it will probably be easy to get running in Linux. With previous code most of the time was spent on making it digestible for GCC.

[edit]

I should have read both of the posts. Too bad ICC can't compile in VC mode under Linux.

[edit2]

Have you tested it on a 64bit OS? I hope you have coded it 64bit-safe. If not, it means even more fun for me since I have no 32bit OS installed [SMILEY Razz]
_________________
In theory, there is no difference between theory and practice. But, in practice, there is.

Jan L.A. van de Snepscheut

(L) [2006/05/08] [Phantom] [RtStage2 source release] Wayback!

You shouldn't have too much trouble getting it to run under gcc, I hope. I did obey the for..next scope rules, for example. That said, I have little experience with gcc, so I really shouldn't make any claims. [SMILEY Smile] Keep me posted.

EDIT: If you edit, so do I. 64bit-ness: It's a pure 32bit thingy right now. The kd-tree contains raw pointers, so that's going to be a problem. 64-bitness is also pretty low on my priority list, as I won't have access to a 64-bit machine for the next couple of months. My next laptop (next month) will be dualcore, but not 64bit. Tbp once mentioned some issues for 64bit apps (kd-tree pointers being the most obvious); it shouldn't be too hard to get it working as a 64-bit app, and it should give a small speed boost. I believe tbp mentioned something like a 20% immediate gain, mostly because of the extra registers.

EDIT2: Does anyone know if XP32bit accepts 64bit executables if the host platform is 64bit?

(L) [2006/05/08] [tbp] [RtStage2 source release] Wayback!

Gcc plays it rough with alignments. I haven't looked at the source yet, so i can't say if that's going to hit you in the back.

Glad you've used the wallclock thingy, now i'd need to excise the full thing with proper crossplatformness.

Got to walk the walk, edit time: going for 64 bit is a breeze unless you've been naughty. Then it hurts.

Anyway, you can cross-compile to 32bit provided you have all the required 32bit libs somewhere.

About xp64/32, i dunno. But guess the answer is a big no (that would require some massive emulation).

(L) [2006/05/08] [tbp] [RtStage2 source release] Wayback!

GetTickCount is absolutely horrible, lots of jitter, large grain... Perf counters are a bit better, they generally rely on PIC timers (~3 MHz); the real trouble is the high latency of those calls (i measured them back in the days to be > 5000 cycles on a mono cpu box) and the fact that they aren't that reliable (faulty HAL and so on). On the other hand, cpus with varying frequency aren't an issue.

I prefer to use rdtsc, mostly because then measuring has negligible impact, but you need to be extra extra careful: you can't handle varying freq, multi core/cpu systems are a pain (even more so on xp when the kernel doesn't synchronize them) etc...
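Editor's sketch: for readers porting the timer today, a minimal portable wall clock can be built on `std::chrono::steady_clock`. Note that `std::chrono` arrived in C++11, well after this thread, so this is not the wallclock code discussed here; the class name `WallClock` and its methods are hypothetical.

```cpp
#include <chrono>

// Minimal monotonic stopwatch. steady_clock never jumps backwards, which
// avoids some of the problems listed above for tick counters and rdtsc.
class WallClock
{
    typedef std::chrono::steady_clock Clock;
public:
    WallClock() : m_Start(Clock::now()) {}

    // Restart the measurement.
    void Reset() { m_Start = Clock::now(); }

    // Seconds elapsed since construction or the last Reset().
    double Seconds() const
    {
        return std::chrono::duration<double>(Clock::now() - m_Start).count();
    }
private:
    Clock::time_point m_Start;
};
```

Typical use would be to call `Reset()` before rendering a frame and read `Seconds()` afterwards; the resolution and call overhead depend on the platform's implementation.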

(L) [2006/05/08] [Phantom] [RtStage2 source release] Wayback!

What's the normal way to deactivate rays in a packet? I tried setting tfar to -1 for the rays that should be skipped, but that causes problems that look like mailboxing problems, i.e. black spots near node boundaries. I added a special mask now for ray deactivation that is used inside the intersection test to mask out primitive & distance updates just before they are updated, but this feels like I'm keeping track of rays that are supposed to be inactive for too long...
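Editor's sketch of the mask approach described above, written with plain scalar code for a 4-ray sub-packet so it stays readable; an SSE version would do the same with compare masks and bitwise and/or. `Packet4`, `CommitHits` and the field names are hypothetical, not taken from the release.

```cpp
// One 4-ray sub-packet with per-ray results and an activity mask.
struct Packet4
{
    float    dist[4];   // nearest hit distance found so far
    int      prim[4];   // index of the nearest primitive, -1 = none
    unsigned active;    // bit i set = ray i is still alive
};

// Commit candidate hits, but only for rays that are active, were hit by the
// primitive (hitMask) and found a closer positive distance. Inactive rays
// keep their old distance and primitive, so tfar never has to be abused as
// an "off" flag. Returns the mask of rays that were updated.
inline unsigned CommitHits(Packet4& p, const float t[4], unsigned hitMask, int primIdx)
{
    unsigned updated = 0;
    for (int i = 0; i < 4; i++)
    {
        const unsigned bit = 1u << i;
        if ((hitMask & p.active & bit) && t[i] > 0.0f && t[i] < p.dist[i])
        {
            p.dist[i] = t[i];
            p.prim[i] = primIdx;
            updated |= bit;
        }
    }
    return updated;
}
```

Whether it is cheaper to carry the mask all the way to the update, as here, or to cull inactive rays earlier is exactly the open question in the post.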

(L) [2006/05/08] [Ho Ho] [RtStage2 source release] Wayback!

I just started to port it. So far I have successfully compiled surface.cpp [SMILEY Smile]

I use cbuild for building. [LINK http://awiki.tomasu.org/bin/view/Main/CBUILD]

I wanted to ask which extra libraries I may use. I wouldn't like to use plain X for opening a window and handling input. I would prefer [LINK http://www.talula.demon.co.uk/allegro/ Allegro] but if I must I can use SDL too.

I last used SDL about three years ago to do a really simple test. Opening a window wouldn't probably be a big problem, I'll just have to read some documentation to remember that stuff.

I have used Allegro for the last five years for everything that needed graphics output. It would give all sorts of nice things like bitmap loading (bmp, tga and pcx native, jpg and png as addons), simple thread based timers and keyboard and mouse input. With allegrogl it even has really simple OpenGL support including extension loading. Allegro works cross-platform on windows, Linux and OSX in both 32 and 64bit modes. Also it has DOS support together with some Linux console graphics libraries [SMILEY Very Happy]

[edit]

It seems I need to have some replacements for __forceinline, __int64 and LARGE_INTEGER. I'll find the __forceinline from gcc manual. Does MSVC support intXX_t? If it does I could replace those other two with either int64_t or uint64_t.

For QueryPerformance* I'll probably use gettimeofday replacement.

If anyone thinks there are better alternatives let me know.

[edit2]

Woot!

Now three files out of six are compiled: surface, kdtree and kdhelp. Half way there!

Hopefully the GetTickCount and wallclock_t replacements work the same way as under windows.

[edit3]

A little help here. In raytracer.cpp line 55:

_declspec(align(16))

IData Engine::m_ID[8];

Is it correct if I write it this way:

IData Engine::m_ID[8] __attribute__ ((aligned (16)));
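Editor's sketch: one common way to avoid rewriting every declaration is to hide the compiler-specific spellings behind macros. GCC's documentation shows the `aligned` attribute after the declarator, as in the line above, and it is also accepted in front of a declaration, which lets a single prefix macro serve both compilers. The macro names `ALIGN16` and `FORCEINLINE`, the `int64` typedef, the `IData` layout and `g_ID` are assumptions for illustration; a namespace-scope array is used here instead of the static class member.

```cpp
#include <stdint.h>

#if defined(_MSC_VER)
  #define ALIGN16     __declspec(align(16))
  #define FORCEINLINE __forceinline
#elif defined(__GNUC__)
  #define ALIGN16     __attribute__((aligned(16)))
  #define FORCEINLINE inline __attribute__((always_inline))
#else
  #define ALIGN16     alignas(16)   // standard C++11 fallback
  #define FORCEINLINE inline
#endif

typedef int64_t int64;   // stands in for __int64 in the timer code

// Placeholder layout; the real IData is defined in the release.
struct IData { float dist[4]; int prim[4]; };

// Same declaration text for MSVC and GCC.
ALIGN16 IData g_ID[8];

// True if the address is a multiple of 16, as aligned SSE loads require.
FORCEINLINE bool IsAligned16(const void* p)
{
    return (reinterpret_cast<uintptr_t>(p) & 15) == 0;
}
```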

(L) [2006/05/08] [Phantom] [RtStage2 source release] Wayback!

The __int64 is purely for the timer class.

Please note that besides soloapp.cpp, there should be no platform-specific code anywhere. Soloapp opens a window via the win32 api, and passes a pointer to the tracer. Soloapp.cpp also handles timing, camera positions and some other things.

Btw, even surface.cpp/.h should be pretty much platform independent; it simply encapsulates a 16 or 32 bit linear frame buffer.

(L) [2006/05/08] [Ho Ho] [RtStage2 source release] Wayback!

Ok, now all but soloapp are compiled. Luckily I didn't have to use nearly as many local variables as I did in the legodemo you gave me a while ago.

I'll bike around for half an hour to clear my mind for the last part of porting and then start trying to get it to actually work with 64bit. So far it seems there weren't too many places where you mixed pointers and ints. Hopefully I didn't miss something big and ugly [SMILEY Smile]

(L) [2006/05/08] [tbp] [RtStage2 source release] Wayback!

<stdint.h> has all the types you need in a portable - ehe - way, that's why you see uint32_t etc in my code.

_MM_ALIGN16 comes with the Intel SSE headers and provides portable alignment.

SetColor shouldn't need a temporary, but you don't quote enough code for me to help.

I said i'll post proper code for the wallclock_t thingy, code which works on win32/linux with gcc/msvc etc...

There's no need for sdl, it's bloated and slow, just use some bare X11 with shm extension for fast buffer copy. FreeImage is a simple lib that supports every format under the sun and works on win32 and linux just fine.

I may have forgotten some remarks along the way [SMILEY Razz]

(L) [2006/05/09] [Phantom] [RtStage2 source release] Wayback!

Quick progress report: I got incoherent packets working, Toxie's transformed kitchen is flawless now except for some minor but apparent kd-tree building issues at the kitchen door (now that I'm writing this, I think it's simply exceeding the maximum tree depth, which is erroneously handled by not emitting triangles in my compiler), and performance impact is zero (since only one in every 255 tiles requires splitting, and splitting itself has near zero overhead, average overhead is negligible), so that's great.

Next step is multi-threading, as I will have access to a dual core laptop tomorrow. I'm rapidly switching notebooks atm, so I need to focus on things that I can actually test. [SMILEY Smile] Besides, on my current system, the icc installer doesn't work for some reason, so I can't do a proper fresh demo... Sorry.

About the SVN: It's all fine with me, I could even setup something on my server (I have good experiences with tortoise, might install that), but I won't participate. I can't handle random contributions right now. So if you want to move forward with that, you'll have to maintain your own version, and sync it every now and then with my releases.

(L) [2006/05/09] [Ho Ho] [RtStage2 source release] Wayback!

I coded until 2a.m last night and I got it to compile. Of course that was the easy part compared to what needs to be done [SMILEY Smile]

I'm having trouble with 64bitness in KDTree building. Currently I haven't got past the ebox struct with all of its pointer arithmetic. Could you describe in a few words what the ebox struct does and what its variables hold?

Also you seem to assume certain struct sizes. Unfortunately in 64bit most of the structs and classes are bigger than in 32bit and probably mess up the pointer arithmetics. For example the ebox struct is 24 bytes in 32bit but 36 bytes in 64bit IIRC. Same with every other class where you use longs or pointers.

I think one possibility to solve 32/64bit issues would be to abstract things a bit and use simple 32bit ints as offset from array start instead of the real pointers. They shouldn't cause too much slowdown but quite a lot of code needs to be changed. Tbp, how have you solved it?

I don't think I'll create my own server. Unless you change a lot of code every update I don't think manual syncing would be a big trouble.

(L) [2006/05/09] [Phantom] [RtStage2 source release] Wayback!

Hoho,

Manually syncing should not be an issue. I made several minor changes, but these are all pretty isolated, a simple merge tool should make the job very easy.

64bit stuff: Can't you just port to 32bit linux first? That way, you can start with working code before diving into 64bit stuff...

About using offsets in an array instead of pointers: That should work, but there's one problem: The memory manager currently allocates small blocks of memory on demand, which means that kd-tree nodes are not in a contiguous chunk of memory. This could be fixed by changing the memory manager so that it allocates a new (larger) block each time the existing block appears to be too small (i.e., create larger block, copy old block to large one, delete old one). This would also make the kd-tree saving & loading valid again.

About ebox & EBox: An 'ebox' holds three events, one for each axis, i.e. primitive start or end, plus planar events.

An EBox encapsulates two ebox'es, thus 6 events. Advantage is that some data that is shared among ebox'es needs to be stored only once (e.g., the pointer to the original primitive), and that all data for a single primitive is closely packed together in memory. See tbp's notes for details.

Actual data layout: an ebox has three 'next' pointers, these are used to place the events in a single-linked list. Since the address of each ebox is divisible by four, the lower bits can be used to store two bits. These are used to flag the event as 'primitive start', 'planar' or 'primitive end' event. Other than that, the event needs a position.
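Editor's sketch of the low-bit trick described above: because the address is a multiple of four, the two lowest bits are always zero and can carry the event type. Storing the link as `uintptr_t` rather than a 32-bit integer also keeps it valid on 64-bit targets. The `Event` struct, the enum values and the helper names are hypothetical; the actual ebox layout is in the release.

```cpp
#include <cassert>
#include <stdint.h>

// Hypothetical encoding of the three event kinds in two bits.
enum EventType { EVENT_END = 0, EVENT_PLANAR = 1, EVENT_START = 2 };

struct Event
{
    float     pos;    // position of the event along its axis
    uintptr_t next;   // pointer to the next event, with the type in bits 0..1
};

// Combine a 4-byte-aligned pointer with a 2-bit event type.
inline uintptr_t PackNext(const Event* next, EventType type)
{
    const uintptr_t bits = reinterpret_cast<uintptr_t>(next);
    assert((bits & 3) == 0);   // the trick only works for aligned addresses
    return bits | static_cast<uintptr_t>(type);
}

// Strip the tag bits to recover the real pointer.
inline Event* NextOf(uintptr_t packed)
{
    return reinterpret_cast<Event*>(packed & ~static_cast<uintptr_t>(3));
}

// Read the event type back from the two low bits.
inline EventType TypeOf(uintptr_t packed)
{
    return static_cast<EventType>(packed & 3);
}
```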

The EBox contains, as said, two ebox'es, one for 'minima' along each of the three axes, one for 'maxima'. Besides that, there's a pointer to the original primitive, some flags, and a pointer to a clone: When a primitive is straddling the split plane, it needs to be split, and thus a full EBox is needed in both the left and right child lists. For this, the EBox is first cloned, then clipped to the left and right nodes.

That's all. Everything should be fine with offsets, as long as you change the memory manager slightly.
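Editor's sketch of the offsets-plus-growing-block idea from this exchange: nodes live in one contiguous array and refer to their children by 32-bit index, so the node size is the same in 32-bit and 64-bit builds and the array can be written to disk as is. `std::vector` does the "allocate larger block, copy, delete old" step when it grows. `KdNode`, `NodePool` and the field names are assumptions, not the layout used in the release.

```cpp
#include <cassert>
#include <stddef.h>
#include <stdint.h>
#include <vector>

struct KdNode
{
    float    split;   // split plane position
    uint32_t left;    // index of the left child; the right child sits at left + 1
    uint32_t flags;   // split axis / leaf marker (hypothetical)
};

class NodePool
{
public:
    NodePool() : m_Nodes(1) {}   // index 0 is the root

    // Reserve two adjacent nodes and return the index of the first one.
    // Indices stay valid when the storage grows; references returned by
    // At() do not, so fetch them again after every AllocPair().
    uint32_t AllocPair()
    {
        const uint32_t first = static_cast<uint32_t>(m_Nodes.size());
        m_Nodes.resize(m_Nodes.size() + 2);
        return first;
    }

    KdNode& At(uint32_t index)
    {
        assert(index < m_Nodes.size());
        return m_Nodes[index];
    }

    size_t Size() const { return m_Nodes.size(); }

private:
    std::vector<KdNode> m_Nodes;
};
```

The price is one extra add per child lookup (base address plus index) compared with raw pointers.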

(L) [2006/05/09] [toxie] [RtStage2 source release] Wayback!

first of all: nice release! definitely good to read src code and thus should help a lot of the guys out there to get into the world of rtrt! (your optimized 4x4 traversal ROCKZ btw.! how much faster is it over 2x2??!)

second: bad news: everything works fine with intel c++ 8.x+9.0, but crashes with 9.1 (only when optimizations are enabled though).. Weird thing: The tree setup visualizer works BUT it crashes during rendering, as Scene::GetKdTree()->GetRoot() returns some strange (invalid) address..

Some debugging just showed that the first two packets are traced and the third gets this weird address -> crash..

(L) [2006/05/09] [tbp] [RtStage2 source release] Wayback!

[LINK http://gcc.gnu.org/onlinedocs/gcc-4.1.0/gcc/i386-and-x86_002d64-Options.html#i386-and-x86_002d64-Options]

See -m32 for 32bit codegen.

It's much simpler and efficient, again, to have 2 representations: one when building the tree, another when using it to render stuff.

As said in another thread, in fact you don't need the tree when building - you do that via recursion/stack, so it could be just streamed to disk or whatever. Lowering the representation is cheap, you can tweak the whole tree and it's easy to put it in a contiguous block.

(L) [2006/05/09] [Phantom] [RtStage2 source release] Wayback!

I didn't have to go through any kind of special trouble to create a tree directly in a renderable format. Only thing I need is a postprocessing step, where I replace a linked list of primitives with a pointer in the object list array plus a size (number of prims in that leaf). See 'BuildTriAccels' (this method actually has a bad name now, as TriAccel's are created together with the primitives; this method only does the operation I just described).

Toxie: I didn't have any problems with the intel compiler. Did you use it under Linux? I believe I used the win32/icc 9.0x version.

(L) [2006/05/09] [Ho Ho] [RtStage2 source release] Wayback!

I got it compiling with the -m32 flag and it enters the mainloop too. Unfortunately I can't get any visuals yet because I can't link 32bit programs with the installed 64bit Xlib. Even if I could link it my X code is not fully functional anyway.

I guess I'll make it to render to files until I get 64bit support done or 32bit OS installed. That way people without X installed can run it too [SMILEY Smile]

(L) [2006/05/09] [Ho Ho] [RtStage2 source release] Wayback!

Some of the first images from under 32bit Linux:

[IMG #1 ]

[IMG #2 ]

I'm not sure if that black line is meant to be there. Perhaps it is caused by the PPM exporter* or something I changed in rendering. As can be seen those lines moved when camera moved.

*) I borrowed and modified it from tbp's sphereflake tracer [SMILEY Smile]

As you can see there is definitely something wrong with my wallclock replacement. When I compared the time it took to render and write ten frames with the time it took to render and not write those ten frames I found that it took about 0.37s per frame or ~2.7 FPS @ 1024x1024 on P4 3.6GHz.

[IMG #1]:Not scraped: https://web.archive.org/web/20061004024240im_/http://img520.imageshack.us/img520/60/out3vo.th.png
[IMG #2]:Not scraped: https://web.archive.org/web/20061004024240im_/http://img188.imageshack.us/img188/3306/out94vm.th.png

(L) [2006/05/09] [Phantom] [RtStage2 source release] Wayback!

That frame rate includes writing to a file? Otherwise it would be a bit slow. [SMILEY Smile]

About the black lines: Those are the incoherent packets, i.e. 4x4 tiles where not all ray directions have the same signs. The version you have still simply skips them. And as you move the camera, these lines thus also move. I fixed this today, I just downloaded the new icc (9.1), if all works OK I'll upload a fixed version (that'll be tomorrow).
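Editor's sketch of the coherence test described above: a tile counts as coherent when every ray has the same direction sign on each of the three axes. This scalar version only illustrates the criterion; `Ray`, `SignBits` and `IsCoherent` are hypothetical names, and the release presumably does this on SSE sign masks instead.

```cpp
#include <cassert>

struct Ray { float dx, dy, dz; };   // direction only; origin omitted

// 3-bit code with one bit per axis, set when that component is negative.
inline unsigned SignBits(const Ray& r)
{
    return (r.dx < 0.0f ? 1u : 0u) |
           (r.dy < 0.0f ? 2u : 0u) |
           (r.dz < 0.0f ? 4u : 0u);
}

// True if all rays share the same sign code, i.e. the packet can be
// traversed with one fixed front-to-back child order per axis.
inline bool IsCoherent(const Ray* rays, int count)
{
    assert(count > 0);
    const unsigned first = SignBits(rays[0]);
    for (int i = 1; i < count; i++)
        if (SignBits(rays[i]) != first) return false;
    return true;
}
```

Tiles that fail this test are the ones that show up as black lines when they are skipped; they have to be split or traced another way.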

(L) [2006/05/09] [Phantom] [RtStage2 source release] Wayback!

Toxie: Did you test any scenes besides the KitchenTransformed with the 9.1 compiler? I'm getting crashes on that scene, but on all other scenes, the tracer / compiler works fine. I'll download BoundsChecker tonight to see what's wrong with the kitchen.

I do get very excellent speed by the way, the 9.1 compiler is great. VS2005 integration is very cool too, now I can finally see if PGO helps.

I'm VERY happy with the new intersection code btw, it's far more accurate; fairy forest always had some black spots, but right now, it's flawless. That also means it's faster; tracing the black spots tends to be expensive (as the rays pass through the entire scene).

(L) [2006/05/09] [Phantom] [RtStage2 source release] Wayback!

New release now available. Check here:

rtStage2_may_9th_2006.zip in the scene repository

(tbp, can you move it? tnx)

This package now contains both a working executable (compiled with icc 9.1) and a source release.

Changes:

- Incoherent packet handling;

- VS2005 project files for icc9.1 (rename sln_old to sln if you don't want to use icc);

- Resolution & 'speedtest' define now read from scene.txt.

If you just want to play around, do the following:

- Open scene.txt, select the scene you want (13 = scene6, 14 = fairy forest, feel free to add your own);

- Select resolution (see top of file);

- Uncomment 'speedtest' to get a canonical timing, or comment it to get a nice fly-by;

- Store more obj/mtl combos or ra2 files in the meshes folder.

Source release:

- Sources are in src dir;

- Use winmerge or araxis to do a nice merge, shouldn't be too hard;

- Enjoy. Code works under icc9.1 with all optimizations set to max, except for the kitchen. Please report additional problems. There should be no visible artifacts.

I have a feeling that I am forgetting a ton of things, but it's late here.. Need to catch some sleep.

Oh yeah, one thing I wanted to ask: icc9.1 isn't producing .syn files during instrumented sessions. Some obscure website (intel.com) mentioned something about code not calling any functions, but I don't see what I could be doing wrong. Anyone? And if any Intel guy is reading this: Could Intel support this community with some free icc licenses? Right now I'm basically moving from one evaluation to another.

(L) [2006/05/09] [Ho Ho] [RtStage2 source release] Wayback!

From the looks of it, it shouldn't be too hard to merge my changes with your code. I'll do it sometime tomorrow and when I've got something worth showing I'll upload it.

The biggest thing is window management. Currently I have none but there are quite a few #ifdefs to cut out the windows part from soloapp.cpp. I guess window management should be abstracted away a bit so there wouldn't be the need for mixing X11, winapi and other code together in one file.

(L) [2006/05/10] [Phantom] [RtStage2 source release] Wayback!

The original plan was to put all platform specific stuff in soloapp.cpp, but right now, there's a bit more code there (like moving the camera, loading global settings), so it might need to be split. You could also just exclude soloapp from the linux build, and replace it by your own soloapp_linux.cpp or whatever. Then again, perhaps you can get away with some #ifdef's, so that a single source package works on all machines...

FYI, the icc9.1 compiler complained about the same SetColor( Color(r,g,b) ) that you mentioned, so I fixed it in the latest source code. It was really just a couple of instances, and it's all in init code (file loading, primarily).

(L) [2006/05/11] [Shadow007] [RtStage2 source release] Wayback!

Just a few words to say I've begun to include the source in my Ogre plugin. At the moment, my goal is to overlay the raytraced rendering above the Ogre one. I got that partly working. So far, I'm still only showing the KDtree compilation in front of the image. Remade it this morning to check out the minimal states I need to integrate the next source release.

At the moment, the scene used is still the one provided by Jakko.

The KDTree compilation being quite long (10 minutes in debug), I'll try to get it "split" (by making it non-recursive first) so that I can build the tree over more than 1 frame. (I couldn't find a way to do the ShowWindow thing).

I guess it won't be good for the building length, but it should be OK.

The step after that will be to capture the primitives from the Ogre engine, and add them sequentially to the scene.

Then, I'll have to check allocations/deallocations so that I'm able to avoid memory leaks after compiling/rendering a set of scenes (as opposed to only one rendered for the process's duration).

I've got just one question about the use of static classes/members: is that for performance reasons?

(L) [2006/05/12] [toxie] [RtStage2 source release] Wayback!

Played around with Pluecker in my own source(s) but it didn't work out for me (even with 4x4).

But i can suggest 2 small optimizations to your code:

a) In the tri intersection loop over the 4 ray sub-packets it helps to check if the current sub-packet is valid/not masked (otherwise don't calculate the intersection for this sub-packet)

b) (v0s & v1s & v2s) | ((v0s^0xf) & (v1s^0xf) & (v2s^0xf)) is the same as

(v0s & v1s & v2s) | ((v0s | v1s | v2s)^0xf) (at least if i haven't gone braindead during the last year [SMILEY Wink])

so this fits better to your (v0s | v1s | v2s) < 0xf test.
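Editor's sketch: the identity in (b) follows from De Morgan's law restricted to 4-bit masks, and it is cheap to confirm by brute force over all 16^3 combinations. The reading of v0s/v1s/v2s as 4-bit per-ray sign masks of the three edge tests is an assumption based on the 0xf constants; `MaskFormsAgree` is a hypothetical helper.

```cpp
// Checks that both expressions give the same result for every combination
// of three 4-bit masks. Returns false on the first mismatch.
inline bool MaskFormsAgree()
{
    for (unsigned v0s = 0; v0s < 16; v0s++)
        for (unsigned v1s = 0; v1s < 16; v1s++)
            for (unsigned v2s = 0; v2s < 16; v2s++)
            {
                // Original form: all three bits set, or all three clear.
                const unsigned a = (v0s & v1s & v2s) |
                                   ((v0s ^ 0xf) & (v1s ^ 0xf) & (v2s ^ 0xf));
                // Suggested form: reuses the OR of the three masks.
                const unsigned b = (v0s & v1s & v2s) |
                                   ((v0s | v1s | v2s) ^ 0xf);
                if (a != b) return false;
            }
    return true;
}
```

The second form saves two XORs when `(v0s | v1s | v2s)` is already computed for the early-out test.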

(L) [2006/05/12] [Phantom] [RtStage2 source release] Wayback!

Fresh demo with preliminary multithreading. Some artifacts, vastly improved performance on dual core machines.

[LINK http://ompf.org/alpha/bikker/rtStage2_dual.exe] check file rtStage2_dual.exe (you need to toss it in a dir with the data files that came with earlier releases).

Sources will be available after I fix the last problems.

(L) [2006/05/12] [lycium] [RtStage2 source release] Wayback!

btw, i hate to be a party pooper but when do we get to see the fast ray tracing? there are faster ways to produce these images in software, of which jacco was once a speedking too btw, and being the speedking with shadows+reflections is probably a much nicer title to have. i'd wager tbp is slightly ahead there because he's doing deferred shading+shadowing, which i'm pimping all too often. furthermore, tracing simple gpu-friendly scenes also isn't really what ray tracing is about; we have all this (algo|loga)rithmic power to not touch huge portions of the scene dataset at our disposal, so let's use it to trace some megatris and make the geforces/radeons jealous...

i know it doesn't fit in nicely with the whole ultra-coherent-4x4-packet scheme of things*, but isn't it a more worthwhile pursuit in the end?

* (tho there's always antialiasing, and it's quite trivial to trace coherent shadow rays with the deferred scheme- this can be optimised compared to nearest-intersection too because you only need an intersection, and return basically nothing etc)

(L) [2006/05/12] [tbp] [RtStage2 source release] Wayback!

While i was double checking that 100% cpu(s) usage i've found your bug, you're leaking thread handles. That also means you don't have a 'static' pool of workers... [SMILEY Shocked]

Lycium, we basically have the same box, i really think the difference is in the OS or some other interaction.

Anyway the method Jacco has described also amounts to a frame rate limiter. That's why i was surprised by the 100% usage + mediocre ratio mono/dual.

We've settled for those scenes because they were available, provide something better. You know exactly what i mean, wink wink nudge nudge.

It's already hard to bench raytracing without making it even more intractable by plugging in wildly differing shading.

But you're right, that's not the ground where the battle should happen because at that end we're doomed.

(L) [2006/05/13] [lycium] [RtStage2 source release] Wayback!

ok, seriously the last i have to say on the coherence vs gi issue: the simple fact is that there's no coherence to be gotten from bouncing single random rays around the scene- there's cache loading everywhere. why single rays? you try tracing n stratified rays more than a few bounces (or even in an unbiased manner via russian roulette), and let me know how it scales with antialiasing samples... how many rays do you think you need to spread over a hemisphere to extract coherence? 100? 250? even though i know 100 isn't enough, 100^2+2 rays for just two bounces, with only 1 measly ray per pixel is already going to be outrageously slow, ugly, and there still isn't any coherence in those 100 rays.

edit: i lied, one more thing: reflection rays diverge horribly off surfaces with even moderate curvature (to say nothing of concavity). and then you have refraction... basically, nothing except primary and shadow rays (point / small area- i mentioned this elsewhere on this forum) are coherent, and that's where the bulk of the rays go for ray tracing.

(L) [2006/05/13] [Phantom] [RtStage2 source release] Wayback!

I am not targeting GI with my interactive ray tracer. I have a couple of goals:

- Build a renderer that does things that 3D HW can't (easily) do (e.g.: true reflection/refraction, extreme polygon counts);

- Showing that ray tracing is fundamentally 'better' than rasterizing (imho; because it allows more natural effects in a more straightforward manner);

- And last but not least: Ray tracing is about the only algorithm that has a use for today's extreme processor speeds. I like the speed contest very much.

GI is currently unsuitable for coherent ray packets, I agree completely. That does not render our efforts useless though: A machine like the PS3 might very well outperform its own 3D HW by pure ray tracing, at least in terms of image quality. 3D HW takes endless tweaks and tricks to do what ray tracing does intuitively. The ultimate 'success' would be if NVidia released 3D HW that does ray tracing: Once we start seeing performance of 100M rays/s, developers will begin to love ray tracing. Once we hit the 1B rays/s, rasterizing will be gone, transistor counts will be significantly lower, scalability will be super-easy, and everyone will be crying for faster kd-trees. [SMILEY Smile] Well actually that's probably not true, but the point I'm trying to make is that besides having fun, we're doing some fundamental research I guess.

(L) [2006/05/13] [lycium] [RtStage2 source release] Wayback!

i must apologise for the late edit, but the meat of it:

nothing except eye and (some) shadow rays are coherent; it's not just a gi problem as i may have made it out to be. and this research is long done btw ;) this is why i say banging on the coherency door too much might not be so useful beyond a certain point.

i agree with everything else you say (elegance, logarithmic complexity, nvidia should be in on it, etc)- that's why i'm here :D

(L) [2006/05/13] [Phantom] [RtStage2 source release] Wayback!

You want my phone number? [SMILEY Smile] 10 posts in 30 minutes. [SMILEY Wink]

Shadow rays are just as coherent as first-hit rays. Just like eye rays, shadow rays originate from a single point and hit the same points that the eye rays do.

Reflection rays can be less coherent, but only when there's lots of curvature. In that case, you can still use packets, they will just perform less well. For a mirroring floor, coherency is fine; of course such a floor would also be easy for 3D hardware. However, consider a slightly bulging floor, or a hollow mirror: This is a disaster for 3D HW, while for a ray tracer, it's nothing special (i.e., no special requirements/code needed; no cube maps, no 'render to surface', no aliasing problems and so on...).
_________________
--------------------------------------------------------------

Whatever

(L) [2006/05/13] [Phantom] [RtStage2 source release] Wayback!

Fresh source release:

[LINK ftp://mamoth.ompf.org/ ftp://scene:5c3n3up@mamoth.ompf.org] file rtStage2_may_13th_2006_v2.rar

(note the 'v2', first upload failed halfway)

New in this release:

- No more artifacts

- No more memory leaks

- Double performance for complex scenes (not yet for scene6 though)

- Thread is now kept alive, so no more thread handle leaking.

Works like a charm on my machine. Good enough for now. Next week I'll start adding back functionality.

EDIT: This is purely for dual core. The main thread spawns one extra render thread and suspends it immediately; then it fills the task stack, resumes the helper thread and executes the same code itself (i.e., fetch task - render task - next). As soon as all tasks have been processed, the helper is suspended till the next frame. More cores could easily be added obviously.
_________________
--------------------------------------------------------------

Whatever

(L) [2006/05/13] [Phantom] [RtStage2 source release] Wayback!

What I have now is:

0. (init time) a helper thread is spawned and suspended.

1. First, there is the main thread. This fills the task stack.

2. After that, the main thread starts processing tasks by fetching them from the stack and executing them.

3. The helper thread is woken and also executes step two. One difference: Once it is done, it is suspended again.

4. Should the main thread finish its last task before the helper thread finishes its last task, the main thread waits for the helper to complete.

So there you have your fighting for food, and the docking state. This is what I came up with myself, it requires only one extra thread, no post-init thread creation and only simple thread suspend/resume. There are no special provisions whatsoever for synchronisation, except for the main thread waiting for the last task to complete on the helper thread.
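
The scheme in steps 0 to 4 can be sketched roughly as follows. This is a minimal illustration, not the released code: it uses standard C++ threads instead of Win32 suspend/resume, it spawns the helper per frame for brevity instead of keeping one thread alive, and `Task`, `TaskStack`, `processTasks` and `renderFrame` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical tile task; the tracer's real task type is not shown in the thread.
struct Task { int tileIndex; };

// Shared task list with an atomic cursor, so two threads can "fight for food" safely.
struct TaskStack {
    std::vector<Task> tasks;
    std::atomic<int> next{0};

    // Returns the index of the next unclaimed task, or -1 when none are left.
    int fetch() {
        int i = next.fetch_add(1, std::memory_order_relaxed);
        return i < static_cast<int>(tasks.size()) ? i : -1;
    }
};

// Both threads run this same loop: fetch task, render task, next.
// 'rendered' stands in for the frame buffer; each task writes only its own slot.
void processTasks(TaskStack& stack, std::vector<int>& rendered) {
    for (int i = stack.fetch(); i >= 0; i = stack.fetch())
        rendered[stack.tasks[i].tileIndex] += 1; // placeholder for rendering one tile
}

// Renders one frame with the main thread plus one helper and
// returns how many times each tile was processed.
std::vector<int> renderFrame(int tileCount) {
    assert(tileCount >= 0);
    TaskStack stack;
    for (int t = 0; t < tileCount; ++t) stack.tasks.push_back(Task{t}); // step 1
    std::vector<int> rendered(tileCount, 0);

    std::thread helper(processTasks, std::ref(stack), std::ref(rendered)); // step 3
    processTasks(stack, rendered);  // step 2: main does the same work
    helper.join();                  // step 4: main waits for the helper's last task
    return rendered;
}
```

The only synchronisation is the atomic cursor and the final join, which mirrors the "no special provisions except waiting for the last task" idea. Calling `renderFrame(64)` should give a vector in which every tile was processed exactly once.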

And it seems to work: Not only do I get the full 2x speedup on complex scenes (in fact, it appears to be better than that); the program also runs more stably than in solo mode (timing).

I don't want to use OMP. I have no idea what it does under the hood and how it is implemented by the OS. Besides, by explicitly preparing for threading, I found numerous issues that are also issues under OMP (like, which variables are unique per thread), but I doubt that I would have found them by merely putting some #pragmas around parallel blocks. In fact I think I am better prepared for OMP now that I have at least a basic understanding of how I would do this without the help of OMP. As usual, I might add, I've seen people do terrible things to 3D engines (which happened to be my speciality during my game dev career).
_________________
--------------------------------------------------------------

Whatever

(L) [2006/05/14] [Phantom] [RtStage2 source release] Wayback!

Chill out. I'm not afraid of mistakes, and I'm trying to learn something. Like I said, I don't believe I will learn anything by using OMP.

Can you explain *why* it is a bad idea to let the master do the same work as the worker threads? The reason I did it like this is that this way the master is not waiting, except for the last package. So, it is not spending time polling. What's wrong with that?

You say that at any given time there is only one active thread per execution unit. I reckoned that this is ideal: Minimal task switching overhead.

The only thing that became clear to me from your last reply is that I need to lock threads to execution units.

One thing I didn't understand from one of your previous posts is what you mean by 'docking'. Could you elaborate on that?
_________________
--------------------------------------------------------------

Whatever

(L) [2006/05/18] [Ho Ho] [RtStage2 source release] Wayback!

It has been a while since my last update. I've been busy with other stuff and haven't found much time to work on porting. As I can't work on it during weekends I have to do it after work. Since I'm usually quite tired after it I don't do it as often as I would like to.

Since the last update I got 32bit Gentoo installed and during the last two hours have ported the May 9 version to 32bit Linux together with graphical output. I used the code tbp provided on the first page.

Here is a picture of fairy forest in action.

[IMG #1 ]

I get about 3.4-3.5FPS on average at 3.6GHz and no threading with GCC 4.1. I think there is something missing because the 13th May version runs at 3.9FPS through Wine. Or were there some major updates that made it faster than the one from the 9th?

There is still lots of stuff broken, missing or ugly:

As can be seen, I haven't yet used the portable wallclock. I still use my old one that doesn't work that well but at least it gives the right output to the console.

The unions in raytracer.cpp are not aligned and might give some speed penalty.

There is no threading.

It probably works only on my PC, perhaps on other 32bit Linuxes too.

No 64bit support.

The code looks ugly as hell, especially soloapp.cpp.

It desperately needs UI and threading abstraction.

I'll try to clean up the code a bit and fix things when I get time. Hopefully one nice sunny day I'll catch up with Phantom and can release the code [SMILEY Smile]

Of the next major things, I don't know yet what to do first: threading or 64bit. I think the first might actually be easier unless I decide to port only the renderer and use KDTrees exported from the 32bit version.

Btw, once I get it working I can probably ask around in one Estonian forum for guinea pigs. One guy said he has access to a server with 8 dual-core CPUs that could be used, probably 64bit K8s. As it is a server it can't use any kind of graphical output, but it should be quite nice for testing how well threading scales [SMILEY Smile]
_________________
In theory, there is no difference between theory and practice. But, in practice, there is.

Jan L.A. van de Snepscheut

[IMG #1]:Not scraped: https://web.archive.org/web/20061004024449im_/http://img506.imageshack.us/img506/4397/fairyforest0vh.th.png

(L) [2006/05/19] [Ho Ho] [RtStage2 source release] Wayback!

I have one small request.

Could you please not assume anything [SMILEY Smile]

Especially problematic is pointer arithmetic. If you would use some predefined constants instead of 0xfff...ff then it should get much easier to port between 32<->64bit. Also if you absolutely need to use primitives with certain size try using (u)intXX_t or at least don't use long where you want to have 32bit integers.

Second thing I would like is that you wouldn't hardcode OS specific things like threading, GUI and user input. As these things shouldn't be performance critical it shouldn't be too bad if they would use some kind of abstraction layer.

What is there already you don't have to fix unless you want to. I can fix it myself when given enough time. If you add some new things then please try to use some kind of abstraction. It would make my work much simpler.
_________________
In theory, there is no difference between theory and practice. But, in practice, there is.

Jan L.A. van de Snepscheut

(L) [2006/05/19] [Phantom] [RtStage2 source release] Wayback!

I will have a look, shouldn't be too problematic to switch to int32 and uint32 instead of int and unsigned int, especially since I prefer unsigned int these days (where possible), and uint32 is shorter to type. [SMILEY Smile]

As for the constants: You are referring to the masking operations in the kdtreenode struct? That should really be done in the manner tbp proposed (i.e. 0xffffffff should become -1 if I'm correct; that should work on 64bit with no problems).
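
A minimal sketch of the "-1 instead of 0xffffffff" idea: derive the mask from an unsigned type that is as wide as a pointer, so the same source works on 32bit and 64bit. The packed layout below (two flag bits in the low end of a child address) is an assumption for illustration only; the actual kdtreenode struct is not shown here, and `packNode`, `childOf` and `flagsOf` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical packed node word: the low two bits hold flags,
// the remaining bits hold a 4-byte-aligned child address.
const std::uintptr_t FLAG_MASK = 3;
// All ones except the low two bits, at whatever width a pointer has.
const std::uintptr_t ADDR_MASK = ~FLAG_MASK;

// Combines an aligned child pointer and a small flag value into one word.
std::uintptr_t packNode(const void* child, unsigned flags) {
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(child);
    assert((addr & FLAG_MASK) == 0 && flags <= FLAG_MASK);
    return addr | flags;
}

// Recovers the child pointer by masking off the flag bits.
const void* childOf(std::uintptr_t word) {
    return reinterpret_cast<const void*>(word & ADDR_MASK);
}

// Recovers the flag bits.
unsigned flagsOf(std::uintptr_t word) {
    return static_cast<unsigned>(word & FLAG_MASK);
}
```

Because `ADDR_MASK` is computed from `std::uintptr_t`, nothing in the code mentions a literal like 0xfffffffc, which would silently drop the upper half of a 64bit address.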

Hardcoding: Is that to prevent me from doing that, or did you spot hardcoded OS specifics? I intentionally used a minimalistic class for threading, so it would be easier to convert. The GUI is platform independent, as it plots in the same buffer that the tracer is using. And there is no user input. [SMILEY Wink]

I will try to convert to something that's acceptable to you, otherwise you'll have more problems with each version I release.
_________________
--------------------------------------------------------------

Whatever

(L) [2006/05/20] [Ho Ho] [RtStage2 source release] Wayback!

I've made quite some progress from the last post [SMILEY Smile]

I have a half-working Thread class and the tracer uses it. It's not equivalent to the windows Thread class, some functionality is a bit different. I even get almost 2x speedup with dualcore defined compared to that not being defined. For some reason something brings performance down compared to the 9th May version. There shouldn't be three threads fighting because the main thread sleeps in 10ms intervals.

There are quite a few problems with multithreaded rendering, probably syncing issues. Also I saw a couple of deadlocks. I have no idea what can cause these. Probably something I messed up in Engine::RenderTiles [SMILEY Smile]

Thread class looks like this:

thread.h:

(L) [2006/06/06] [Phantom] [RtStage2 source release] Wayback!

I'm afraid you're correct. It's almost nlogn, but I took the easy path at the end: To reinsert clipped triangles, I decided to simply use a resort. Don't worry about it though, the list is already sorted when mergesort is executed, except for one or two clipped triangles. Mergesort approaches nlogn in that case, if I'm correct.
_________________
--------------------------------------------------------------

Whatever

(L) [2006/06/07] [Shadow007] [RtStage2 source release] Wayback!

Hi, I've finally got a few results, modifying your compiler to be strictly O(N log N). While I can't check the KdTree (no mm_ extensions here --> no raytracing), the KdLog.txt is the same as with the "reference" one, so I guess it's the same result.

Meanwhile, to compile the FairyForest scene (fully clipping), it takes 13s instead of 21s so I guess you're underestimating the complete resort cost ...

I didn't implement (yet) my idea of "remerging" progressively the modified events in one long pass, but resort/remerge the Wald's way. Still, I optimised a few things: I check which events are 'really' modified, and clip as soon as possible to avoid inserting events that will be evaluated as invalid later.

I'll post the compiler modified source when I'm sure it works (ie when I have raytraced a few scenes ...).
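
For readers following along, the sort-then-merge idea can be sketched like this. It is only an illustration under assumptions, not the modified compiler: the `Event` struct, `byPosition` and `mergeEvents` are hypothetical names, and events are reduced to a position plus a triangle index.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Hypothetical split-candidate event: a position along one axis
// plus the index of the triangle that produced it.
struct Event {
    float pos;
    int   tri;
};

inline bool byPosition(const Event& a, const Event& b) { return a.pos < b.pos; }

// 'kept' is the already-sorted list of events that survived the split unchanged;
// 'fresh' holds the few events regenerated by clipping straddling triangles.
// Sorting only 'fresh' costs O(k log k); the merge is a single O(n + k) pass,
// so the full list is never re-sorted.
std::vector<Event> mergeEvents(const std::vector<Event>& kept, std::vector<Event> fresh) {
    assert(std::is_sorted(kept.begin(), kept.end(), byPosition));
    std::sort(fresh.begin(), fresh.end(), byPosition);
    std::vector<Event> out;
    out.reserve(kept.size() + fresh.size());
    std::merge(kept.begin(), kept.end(), fresh.begin(), fresh.end(),
               std::back_inserter(out), byPosition);
    return out;
}
```

With only a handful of clipped triangles per split, k stays small and the per-node cost is dominated by the linear merge.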

(L) [2006/06/13] [Shadow007] [RtStage2 source release] Wayback!

Phantom, I've found a problem in the MemoryManager Init: in the initialization, you create a table of (a_size >> 8) pointers to pointers of 4096 (4100) KdTreeNodes. Could you please explain how that >> 8 was computed? In fact, on my KdTree Compiler, for the Fairy forest scene, I need 710 pointers, while only 680 were allocated. It is a source of bugs ...

While I changed the size >> 8 to a size >> 7, I guess there should be some kind of correct way to compute that value.

Any ideas ?
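
One possible direction, sketched under assumptions: since it is not clear from the thread what a_size measures, the sketch below avoids guessing a shift and instead grows the chunk table on demand, with a rounding-up helper for the case where the node count is known in advance. `NodePool`, `chunksFor` and the stand-in `KdTreeNode` are hypothetical and not taken from the released MemoryManager.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Stand-in for the real node type, which is not shown in the thread.
struct KdTreeNode { unsigned data[2]; };

const std::size_t kChunkNodes = 4096; // nodes per chunk, as mentioned in the post

// Smallest number of chunks that holds 'nodes' nodes (rounds up, unlike a plain shift).
std::size_t chunksFor(std::size_t nodes) {
    return (nodes + kChunkNodes - 1) / kChunkNodes;
}

// Hypothetical pool: nodes live in fixed-size chunks and the chunk table grows
// when needed, so there is no fixed table size to overrun.
class NodePool {
public:
    KdTreeNode* alloc() {
        if (used_ == chunks_.size() * kChunkNodes) // all current chunks are full
            chunks_.push_back(std::unique_ptr<KdTreeNode[]>(new KdTreeNode[kChunkNodes]()));
        KdTreeNode* n = &chunks_[used_ / kChunkNodes][used_ % kChunkNodes];
        ++used_;
        return n;
    }
    std::size_t chunkCount() const { return chunks_.size(); }

private:
    std::vector<std::unique_ptr<KdTreeNode[]>> chunks_;
    std::size_t used_ = 0;
};
```

Node addresses stay valid when the table grows, because only the table of chunk pointers is reallocated, never the chunks themselves.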

(L) [2006/06/16] [Ho Ho] [RtStage2 source release] Wayback!

A little note.

I got someone to run the 13th May Windows version on Conroe and the results are quite interesting. With the default settings (1024x1024 fairy forest) he got 6.4FPS with single core and 11.4FPS with two.

The CPU he used was 2.13GHz@2.76GHz, 4M L2, FSB was 266@345.5MHz, bus speed was 1066@1385MHz. IIRC 2.8GHz P4 had about half that speed [SMILEY Smile]

My work on the tracer has halted for the time being but I hope to get back to it as soon as possible. I'll have a vacation in a couple weeks and I hope to finish it during that.
_________________
In theory, there is no difference between theory and practice. But, in practice, there is.

Jan L.A. van de Snepscheut

(L) [2006/06/19] [Phantom] [RtStage2 source release] Wayback!

Just a small note: I was reviewing my kd-tree traversal code and came up with a small fix. Right now, I initialize 'near' and 'far' for each packet to 0 and 1000. These values are then updated during traversal. I already knew that Wald was suggesting to clip each ray to the scene bounding box, but I omitted this, as my code ran fine without that. However, this obviously makes it far less likely that the first traversal step will skip one of the child nodes. A simple fix looks like this (changes wrt already released code):

(L) [2006/06/19] [toxie] [RtStage2 source release] Wayback!

This is actually obvious. ;)

Sponza gets slower as you are INSIDE the scene and thus your initialization [0,1000] is "correct" (no clipping of the far-value to the SceneBBox really necessary as the scene is closed (except for the hole in the sky) and thus you'll hit a face before you exit the scene). So any additional work (=clipping) results in a slight slowdown.

Legocar is viewed from OUTSIDE the SceneBBox. So if you don't clip your near-value, the first leaves you'll visit will actually NOT BE HIT BY THE RAY AT ALL (and neither will the triangles they contain, of course)! This is because of the nature of the kD-tree traversal which expects that the ray is always clipped to the current Node-Box (decisions which child(ren) to traverse are only based upon near, far and the splitplane-intersection!).
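
As an illustration of the clipping being discussed, here is a minimal scalar sketch of a slab test that tightens [near, far] to the scene box before traversal. It is not the released code (which works on 4x4 packets); `Vec3`, `Box` and `clipToBox` are hypothetical names, and the sketch assumes no direction component is exactly zero.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

struct Vec3 { float x, y, z; };
struct Box  { Vec3 lo, hi; };

// Clips the parametric interval [tNear, tFar] of the ray o + t*d to 'box'.
// Returns false when the ray misses the box entirely, so traversal can be skipped.
// Assumes no direction component is exactly zero; real code has to handle that case.
bool clipToBox(const Vec3& o, const Vec3& d, const Box& box, float& tNear, float& tFar) {
    const float org[3] = { o.x, o.y, o.z };
    const float dir[3] = { d.x, d.y, d.z };
    const float lo[3]  = { box.lo.x, box.lo.y, box.lo.z };
    const float hi[3]  = { box.hi.x, box.hi.y, box.hi.z };
    for (int a = 0; a < 3; ++a) {
        assert(dir[a] != 0.0f);
        float t0 = (lo[a] - org[a]) / dir[a]; // distance to one slab plane
        float t1 = (hi[a] - org[a]) / dir[a]; // distance to the other
        if (t0 > t1) std::swap(t0, t1);       // order them as entry, exit
        tNear = std::max(tNear, t0);
        tFar  = std::min(tFar, t1);
    }
    return tNear <= tFar;
}
```

Starting from near = 0 and far = 1000, a ray that begins outside the box ends up with near at the entry point, which is the case described for Legocar. For a camera inside a closed scene near stays 0, so the call is pure overhead, which matches the Sponza observation.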
_________________
what do you expect to do if you don't know what to do when you've got nothing to do?
