General development strategy

(L) [2006/08/20] [TomPie] [General development strategy] Wayback!

Hi,

I am new to this forum. I have been doing ray tracing for some time now, but not with ultimate speed as the goal. It's very interesting to read all the posts here and I really enjoy the discussions. But let me be a bit hypercritical.

It seems to me that to get a nice and fast ray tracer, we can use C++, non-SIMD code and a good data structure. But to get the last 10% of performance out of the CPU, we need all the "dirty" tricks: SSE intrinsics, deep fiddling with parameters, maybe switching to C (or even assembly), if only to analyze the compiler output. In addition, when you do this for a living, it ties one person down as the expert on "making it fast on platform X". That's a dead-end road. I do not want to deal with all that (don't get me wrong, I like to hack). I would like to see more improvement at the higher level of data structures and algorithms, like BIH etc. There are still large open fields, maybe like using partitions (Matousek) etc.

But maybe I am not deep enough into current implementations. I would like to know the difference between a well-optimized non-SSE, C++-based ray tracer and a highly optimized ray tracer like the ones you guys implement. By difference I mean both the runtime difference and the time needed to implement and test.

Tom

(L) [2006/08/20] [Michael77] [General development strategy] Wayback!

Well, I think one needs to check the value of the different performance tweaks. Of course, the acceleration structure (this includes proper alignment of data) has the most influence on performance. But going from simple C++ to SSE code will also bring you nearly a 4 times speed increase and is not that difficult to implement (although I admit I haven't done it yet). And using a profiler like VTune will of course always be a benefit, if only because it shows you some errors in your C++ code. However, I don't think low-level performance optimizations such as manual prefetching are really worth the trouble. On well-designed systems, the speed increase you can get from them is mostly not large enough to make up for the code complexity. And regarding assembler: in 95% of all cases, a good compiler will write better assembly than you will ever be able to, so there is really no need to use it.

Michael

(L) [2006/08/20] [Phantom] [General development strategy] Wayback!

Some stats from my side:

- SSE2 gave me about 2.5x the speed of 'mono-rays'. I don't know where everyone gets 4x from, but as far as I know, that is simply not realistic.

- I found SSE2 pretty complex at first. It's all fine as long as you work with floats and no branches, but once you need conditional code and a mix of floats and integers, things can get a bit tricky.

- I never used assembler and rarely studied compiler output.

- I did however tweak code and data: Keep data that you write to apart from data that you just read; use as much 'const' as you can in your methods; build your own memory manager; make sure all data is aligned to cache boundaries. This and more obviously requires low-level hardware knowledge.

- C++ is fine. My current tracer is written in C++, although it consists mainly of some monolithic methods (sometimes through forced inlining).

- My core code is less than a thousand lines. This includes traversal code for shadow rays, which is a 200-line duplicate of the primary ray traversal code (with some minor changes).
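
A minimal sketch of the "conditional code and a mix of floats and integers" point from the list above, not taken from any tracer discussed here. It assumes an x86 target with SSE2; `Hit4` and `update_hits` are hypothetical names. The scalar branch `if (t > 0 && t < best) { best = t; id = prim; }` becomes a per-lane mask that is then reused, reinterpreted as integers, to select the primitive index:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (x86 / x86-64 only)

// Hypothetical per-packet hit record: one lane per ray.
struct Hit4 {
    __m128  best;  // closest hit distance found so far
    __m128i id;    // index of the primitive that produced it
};

// Branch-free equivalent of, for each of the four rays:
//   if (t > 0 && t < best) { best = t; id = prim; }
inline void update_hits(Hit4& h, __m128 t, int prim) {
    // Comparisons yield all-ones or all-zeros per lane instead of a branch.
    const __m128 mask = _mm_and_ps(_mm_cmpgt_ps(t, _mm_setzero_ps()),
                                   _mm_cmplt_ps(t, h.best));

    // Float select: (mask & t) | (~mask & best).
    h.best = _mm_or_ps(_mm_and_ps(mask, t), _mm_andnot_ps(mask, h.best));

    // The same bits, viewed as integers, select the primitive index.
    const __m128i imask = _mm_castps_si128(mask);
    h.id = _mm_or_si128(_mm_and_si128(imask, _mm_set1_epi32(prim)),
                        _mm_andnot_si128(imask, h.id));
}
```

Calling `update_hits` once per tested primitive keeps the nearest positive hit in every lane without any per-ray branching; the price is that all four lanes always do the full work.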

I'm not sure how Tbp's radius is performing nowadays, but I used to be 'speedking'. [SMILEY Smile] So the above remarks still allow for a rather well-performing ray tracer.
_________________
--------------------------------------------------------------

Whatever

(L) [2006/08/20] [Phantom] [General development strategy] Wayback!

I assume the original poster is interested in the performance gain of code written in C/C++ without explicit use of intrinsics, compared to code that explicitly does use intrinsics. Whatever the compiler does with regular C/C++ is up to the compiler.

(L) [2006/08/20] [TomPie] [General development strategy] Wayback!

Thanks so far for the answers!

I wanted to find out whether optimizing the last bit is worth it, or whether a somewhat slower but maybe more readable, more maintainable and quicker-to-implement ray tracer is OK, too. One important point has been made by Lotuspec: when it comes to global illumination, many of the current acceleration techniques (at least those relying on coherence) do not work so well.

Tom

(L) [2006/08/20] [tbp] [General development strategy] Wayback!

Comparing decent scalar vs vector SSE (that is, you have put the same amount of time into tweaking both), you should get a 2.5x speedup on this kind of code on current architectures (of course it depends on where exactly your bottleneck is).
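
To make the scalar-versus-vector comparison concrete, here is a small sketch in plain C++ without intrinsics. It is only an illustration: `Vec3`, `Ray4`, `intersect_plane` and `intersect_plane4` are hypothetical names, and a ray/plane test stands in for whatever the real inner loop is. The packet version stores four rays in structure-of-arrays form, which is the layout a 4-wide SSE version would operate on; whether the compiler turns the loop into SSE instructions is up to the compiler.

```cpp
// Hypothetical minimal vector type for the scalar path.
struct Vec3 { float x, y, z; };

inline float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Scalar ("mono-ray") version: distance along the ray o + t * dir
// to the plane dot(n, p) = d. Assumes the ray is not parallel to the plane.
inline float intersect_plane(const Vec3& o, const Vec3& dir,
                             const Vec3& n, float d) {
    return (d - dot(n, o)) / dot(n, dir);
}

// Four rays in structure-of-arrays form: each component of all four
// rays is contiguous, so one 4-wide register could hold ox[0..3], etc.
struct Ray4 {
    float ox[4], oy[4], oz[4];  // origins
    float dx[4], dy[4], dz[4];  // directions
};

// Packet version: the same arithmetic, once per lane.
inline void intersect_plane4(const Ray4& r, const Vec3& n, float d,
                             float t[4]) {
    for (int i = 0; i < 4; ++i) {
        const float no = n.x * r.ox[i] + n.y * r.oy[i] + n.z * r.oz[i];
        const float nd = n.x * r.dx[i] + n.y * r.dy[i] + n.z * r.dz[i];
        t[i] = (d - no) / nd;
    }
}
```

Timing the two paths on the same rays, with equal tuning effort on both, is the kind of comparison meant here.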

If for whatever reason you're not willing to spend time fixing your code, you should at the very least be extra-careful with your data layout and memory issues in general. In most cases that's ugly, not portable and therefore not easily maintainable [SMILEY Smile]

But still pretty crucial.
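
As an illustration of what "being careful with data layout" can look like, here is a minimal sketch. It assumes 64-byte cache lines; the node encoding and the names `KdNode`, `RayState`, `make_inner`, `axis_of` and `child_of` are hypothetical, not the layout of any tracer in this thread. Read-only tree data is packed tightly so that several nodes share a cache line, while data written during traversal lives in its own cache-line-aligned structure.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLine = 64;

// Read-only during traversal. Packed into 8 bytes so that eight nodes
// fit into one 64-byte cache line.
struct KdNode {
    float         split;       // split plane position (inner nodes)
    std::uint32_t axis_child;  // low 2 bits: axis (3 = leaf), rest: child index
};
static_assert(sizeof(KdNode) == 8, "KdNode should stay at 8 bytes");

inline KdNode make_inner(float split, std::uint32_t axis, std::uint32_t child) {
    return KdNode{split, (child << 2) | (axis & 3u)};
}
inline std::uint32_t axis_of(const KdNode& n)  { return n.axis_child & 3u; }
inline std::uint32_t child_of(const KdNode& n) { return n.axis_child >> 2; }

// Written during traversal. Kept apart from the read-only nodes and
// aligned so that one packet's state starts on its own cache line.
struct alignas(kCacheLine) RayState {
    float         t_near[4];
    float         t_far[4];
    std::uint32_t hit_id[4];
};
static_assert(alignof(RayState) == kCacheLine, "RayState must be line-aligned");
```

Note that `alignas` only guarantees this alignment for objects with static or automatic storage; heap allocations need an aligned allocator, which is one reason a custom memory manager tends to come up in this context.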
_________________
May you live in interesting times.

[LINK https://gna.org/projects/radius/ radius] | [LINK http://ompf.org/ ompf] | [LINK http://ompf.org/wiki/ WompfKi]

(L) [2006/08/21] [Shadow007] [General development strategy] Wayback!

I'm trying to get some kind of scanning KdTree building into Phantom's kdtree compiler, so far with poor performance (not optimized though).

I guess it could be partly because I don't know squat about SSE2/SIMD... and thus don't use them.

It seems to me like a good time to start up the learning machine once more [SMILEY Smile]

Could anyone point me to a good tutorial and reference?

(L) [2006/08/21] [tbp] [General development strategy] Wayback!

Conveniently regrouped SSE reference (too lazy to find the real thing): [LINK http://ompf.org/docs/amd/26568.pdf]

The other day I gave the scoring part of the scanner a quick try; I couldn't convince any compiler (msvc, icc, gcc) to produce the code I wanted. The intrinsic approach is pretty hopeless anyway, as you're going to use all the registers and end up fighting with the register allocator.
