ocaml sphereflake raytracer with SAH-based kd-tree back

(L) [2006/12/29] [elihu] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

I have been working on a raytracer in ocaml for some time, and recently decided I would strip out all the experimental stuff I'm working on (which amounts to about 2/3 of the code, and is at the moment relatively useless) and release it as a plain, bare-bones GPLed raytracer, in the hopes it might be useful to someone.

It doesn't do very much; it can only render 3 types of object: a sphere, an unsorted collection of objects (or, in other words, groups of spheres), or a kd-tree sorted collections of objects.  To put this another way, the only interesting thing it can render is an spd sphereflake.

It accepts nff files on standard input, and ignores what it doesn't understand.  I included a couple of sphereflake spd models, just pipe them to the raytracer binary (called main) like so: "cat balls-l4.spd | ./main".

Things that aren't implemented: tranparency, refraction, image file output, transformations, non-uniform textures, anti-aliasing, csg operations, etc...  Specular highlighting is broken.

The kd-tree constructor uses the surface-area heuristic.  (I'm not sure it's entirely correct, but it reduces rendering time by 25% or so versus splitting cells in the middle.)  It doesn't use boundaries of AABBs as split candidates, instead it just tries 10 or so candidates - that's something I could probably improve.

In case anyone is wondering why I'm using ocaml and not C:  ocaml has a more restrictive type system that makes it harder to write buggy code, and has a number of useful features like garbage collection and closures.  This allows me to write new code more easily, and get a good idea if the big O complexity is reasonable.  On the other hand, C is a bit faster and ocaml doesn't support multithreading (at least, not in a way that will let two threads run concurrently on separate processors).

You can find my code here: [LINK http://syn.cs.pdx.edu/~jsnow/raytrace.tar.bz2].  I'm bad at naming things, so my raytracer doesn't have a name.  Suggestions are welcome.  Compiling the program requires ocaml and lablgl (the latter being openGL bindings for ocaml).  My makefile uses "ocamlopt.opt" by default, if this isn't available on your system (ubuntu doesn't include it, whereas fedora does), then change the makefile to use plain "ocamlopt".  I haven't tried compiling on Windows, but I don't forsee any problems.

This isn't the only ocaml sphereflake renderer out there, another written by someone else can be found here: [LINK http://www.ffconsultancy.com/free/ray_tracer/].

edit: My raytracer now has a name (glome) and a webpage.  Get latest versions [LINK http://syn.cs.pdx.edu/~jsnow/glome/].

(L) [2006/12/30] [tbp] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> elihu wrote:...On the other hand, C is a bit faster...
This isn't the only ocaml sphereflake renderer out there, another written by someone else can be found here: [LINK http://www.ffconsultancy.com/free/ray_tracer/].
I'd advise caution when making such claims, unless a factor of 2 or 3 in execution time falls under the 'a bit faster' category. See this [LINK http://groups.google.com/group/comp.graphics.rendering.raytracing/browse_frm/thread/3119086d94a78fa3/3e82fd2ec7e68189?tvc=1 comp.graphics.rendering.raytracing thread] (or start from this [LINK http://groups.google.com/group/comp.graphics.rendering.raytracing/tree/browse_frm/thread/3119086d94a78fa3/0e84198649b291b4?rnum=21&_done=%2Fgroup%2Fcomp.graphics.rendering.raytracing%2Fbrowse_frm%2Fthread%2F3119086d94a78fa3%3F#doc_b2b377d781dbfa87 post]) for example. You may recognize some protagonists [SMILEY ;)]

Other than that, let me welcome you and your sphereflakes.

(L) [2006/12/30] [elihu] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> tbp wrote:elihu wrote:...On the other hand, C is a bit faster...
This isn't the only ocaml sphereflake renderer out there, another written by someone else can be found here: [LINK http://www.ffconsultancy.com/free/ray_tracer/].
I'd advise caution when making such claims, unless a factor of 2 or 3 in execution time falls under the 'a bit faster' category. See this [LINK http://groups.google.com/group/comp.graphics.rendering.raytracing/browse_frm/thread/3119086d94a78fa3/3e82fd2ec7e68189?tvc=1 comp.graphics.rendering.raytracing thread] (or start from this [LINK http://groups.google.com/group/comp.graphics.rendering.raytracing/tree/browse_frm/thread/3119086d94a78fa3/0e84198649b291b4?rnum=21&_done=%2Fgroup%2Fcomp.graphics.rendering.raytracing%2Fbrowse_frm%2Fthread%2F3119086d94a78fa3%3F#doc_b2b377d781dbfa87 post]) for example. You may recognize some protagonists ;)
Other than that, let me welcome you and your sphereflakes.

I was attempting to be deliberately vague about my performance claims; evidently I wasn't quite vague enough. I expect a factor of 2 or 3 difference is probably about right (at least, if these benchmarks are at all reliable [LINK http://shootout.alioth.debian.org/gp4/ocaml.php]) which I suspect is about the same as the difference between a good acceleration structure and a mediocre one. If you compare singlethreaded ocaml to multithreaded C with handcoded SSE, the difference is probably a lot more.

I'm mainly using the raytracer to test algorithms, so as long as it is sufficiently similar to C that its performance bottlenecks are for the most part in the same places I'm happy.

I'd be interested to know how my raytracer compares with the fastest renderers out there. In the interests of idle curiosity, I just tried comparing against povray 3.6.1 on a level 4 spd sphereflake 7381 spheres). Pov takes about 2.5 seconds total, my renderer takes about 0.7 seconds to construct the kd-tree and 2.4 seconds to render. (For pov, I disabled the two triangles that the sphereflake sits on because my renderer can't draw them either.) This is on an athlon 64 3800 (dual core, but that doesn't matter for either program).

(L) [2006/12/30] [tbp] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

I didn't meant to be mean, it's just that when i see reference to both OCaml and that ffconsultancy url my skin itches [SMILEY ;)]

It's quite difficult to compare renderers, there's so many knobs to tweak. Then using spheres as the main primitive isn't typical in so called fast renderers. POV isn't fast by any standard.

The closest thing i have [LINK http://ompf.org/ray/sphereflake/] renders 5.38M spheres (a lvl 8 sphereflake) with diffuse only lighting / shadows and a 2x2 FSAA @1024x1024 in 7.7s, construction included, on a my old 2ghz opteron. But of course that's cheating (sphere BVH).

Fast SAH kD-tree builders, without approximations, can chew thru a million triangle soup in a bunch of seconds. With approximations, it's an order of magnitude faster. Give or take a few order here or there [SMILEY :)]
See:
[LINK http://ompf.org/forum/viewtopic.php?t=19]
[LINK http://ompf.org/forum/viewtopic.php?t=78]
[LINK http://ompf.org/forum/viewtopic.php?t=252]
There's also other options, mostly hybrids between BVH & kD, like the famous BIH etc...

Equipped with such kD-tree, on previous gen. processors, one ought to shoot about 2M primary mono-rays per second; that is without using packet/coherent rays & SSE etc... See [LINK http://homepages.paradise.net.nz/nickamy/benchmark.html] for such benchmark.

Then you can start shooting ray packet/bundles and deploy your bag-o-tricks, there's a metric ton of threads on this very forum about that. Or you could sample the Visual section, [LINK http://ompf.org/forum/viewforum.php?f=6]

(L) [2006/12/30] [tbp] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

I didn't meant to be mean, it's just that when i see reference to both OCaml and that ffconsultancy url my skin itches [SMILEY Wink]

It's quite difficult to compare renderers, there's so many knobs to tweak. Then using spheres as the main primitive isn't typical in so called fast renderers. POV isn't fast by any standard.

The closest thing i have [LINK http://ompf.org/ray/sphereflake/] renders 5.38M spheres (a lvl 8 sphereflake) with diffuse only lighting / shadows and a 2x2 FSAA @1024x1024 in 7.7s, construction included, on a my old 2ghz opteron. But of course that's cheating (sphere BVH).

Fast SAH kD-tree builders, without approximations, can chew thru a million triangle soup in a bunch of seconds. With approximations, it's an order of magnitude faster. Give or take a few order here or there [SMILEY Smile]

See:

[LINK http://ompf.org/forum/viewtopic.php?t=19]

[LINK http://ompf.org/forum/viewtopic.php?t=78]

[LINK http://ompf.org/forum/viewtopic.php?t=252]

There's also other options, mostly hybrids between BVH & kD, like the famous BIH etc...

Equipped with such kD-tree, on previous gen. processors, one ought to shoot about 2M primary mono-rays per second; that is without using packet/coherent rays & SSE etc... See [LINK http://homepages.paradise.net.nz/nickamy/benchmark.html] for such benchmark.

Then you can start shooting ray packet/bundles and deploy your bag-o-tricks, there's a metric ton of threads on this very forum about that. Or you could sample the Visual section, [LINK http://ompf.org/forum/viewforum.php?f=6]
_________________
May you live in interesting times.

[LINK https://gna.org/projects/radius/ radius] | [LINK http://ompf.org/ ompf] | [LINK http://ompf.org/wiki/ WompfKi]

[IMG #1]:

(L) [2006/12/30] [tbp] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

Quick & dirty run with a tesselated sphereflake, ./balls -r 15 -s 4. That's about 800k triangles, kD tree built in ~3.5s. Can't shoot packets atm, so that's with single rays.

[IMG #1 ]
_________________
May you live in interesting times.

[LINK https://gna.org/projects/radius/ radius] | [LINK http://ompf.org/ ompf] | [LINK http://ompf.org/wiki/ WompfKi]

[IMG #1]:

(L) [2006/12/31] [olliej] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> tbp wrote:I didn't meant to be mean, it's just that when i see reference to both OCaml and that ffconsultancy url my skin itches

Hehe, i saw that some time ago, and thought it looked suspicious, but i was having to work on my thesis so it didn't get too much attention from me. Reading the news group thread had me desperately tempted to make a comment about the fundamental difference in implementation (namely the ocaml version was using a switch where c++ was using virtual dispatch...).

>> tbp wrote:Quick & dirty run with a tesselated sphereflake, ./balls -r 15 -s 4. That's about 800k triangles, kD tree built in ~3.5s. Can't shoot packets atm, so that's with single rays.

I hate you (In a good way of course [SMILEY :D])

(L) [2006/12/31] [phresnel] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

Apropos language compos:

I am running gcc. I saw a comparison between g++ and ... (forgot it, cc? can't really trackback that comparison since my system is being abused by my ray tracer atm, everything is sloooooow), saying cc (?, I am talking about the gnu c-compiler) makes faster code than g++. Now what does gcc, teh collection of dem all? What I mean, the only c++ thingy I am using is namespacing and here and there a globally defined operator overload. Makes this my code slower? I don't see any reason why my code should be slower, since I am not using classes or COM or so, I am just asking.

Nice balls-render btw, tbp [SMILEY ;)] One question on the HUD: What stand the thingies for (ono, bk et cetera)

(L) [2006/12/31] [greenhybrid] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

Apropos language compos:

I am running gcc. I saw a comparison between g++ and ... (forgot it, cc? can't really trackback that comparison since my system is being abused by my ray tracer atm, everything is sloooooow), saying cc (?, I am talking about the gnu c-compiler) makes faster code than g++. Now what does gcc, teh collection of dem all? What I mean, the only c++ thingy I am using is namespacing and here and there a globally defined operator overload. Makes this my code slower? I don't see any reason why my code should be slower, since I am not using classes or COM or so, I am just asking.

Nice balls-render btw, tbp [SMILEY Wink] One question on the HUD: What stand the thingies for (ono, bk et cetera)
_________________
[LINK http://greenhybrid.net/ greenhybrid.net]

Real Men code Software Graphics.

(L) [2007/01/01] [elihu] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> I didn't meant to be mean, it's just that when i see reference to both OCaml and that ffconsultancy url my skin itches

I don't blame you, it doesn't seem like a fair comparison to write a mediocre raytracer in C++ and then in ocaml, and claim that ocaml and C++ deliver equivalent performance.  I thought I ought to mention it to preempt anyone from saying "hey, didn't you know someone already did that before?"
>> Quick & dirty run with a tesselated sphereflake, ./balls -r 15 -s 4. That's about 800k triangles, kD tree built in ~3.5s. Can't shoot packets atm, so that's with single rays.

I assume .401 seconds is the render time; that's about 6 times faster than mine for an equivalent sphereflake.  (btw, what's the 74.2 fps number?)

I'm wondering if there is a significant performance difference between a triangle-based representation and a sphere-based representation.  I've noticed that many high-performance renderers seem to only deal with triangles.  Is this because

a) Writing lots of ray-intersection tests for every kind of geometry you might possible want to use is tedious
b) Following a function pointer whenever we want to intersect a ray with an object is too much of a performance hit
c) Triangles, being smaller on average than the primitives they approximate, are less likely to straddle split planes and get copied into more than one location in the kd-tree
d) Ray-intersection testing with a triangle is cheaper than ray-intersecting with most other kinds of geometry
e) Most interesting models are all triangles
or
f) Some reason I haven't thought of?

Here's some statistics on the kd-tree I build:
spheres in sphereflake: 7381
branches in kdtree: 7117
leaves in kdtree: 7118
number of objects in the leaves of kdtree: 18224
(Interesting note: I you omit the big sphere in the middle, you get 17493 objects in the leaves of the kdtree.  In other words, the big sphere is copied into 751 cells.)

So, on average every sphere is copied into about 3 cells.  I thought this might be slowing performance, so I tried a few tricks to speed things up.

First, I tried storing objects in the branches instead of copying them into both leaf whenever they straddle a split plane.  This actually made things slower.  I think this is because small objects were getting placed high up in the tree, so they were getting intersected by more rays than they would have otherwise.  (This is before I implemented SAH - perhaps this idea might be a win as long as one employed a cost function such that only sufficiently large objects get placed high up in the tree.)

Second, for each sphere I stored an identifier of the last ray to be intersected with that sphere.  If I try to intersect the same ray again, I return a miss.  Unfortunately, this didn't speed things up either (actually, it made things slightly slower).  My hypothesis: even though the average sphere lives in 3 different cells, the average ray only rarely hits more than one of those cells.

Here's a screenshot, for the curious:
[IMG #1 Image]

I tend to get white dots here and there; probably some floating point corner case.

I also need to find up what's up with memory consumption; for some reason, each sphere seems to eat up about 2k of memory.  How strange.

[IMG #1]:Not scraped: https://web.archive.org/web/20101018105629im_/http://syn.cs.pdx.edu/~jsnow/images/sphereflake.png

(L) [2007/01/01] [olliej] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> elihu wrote:a) Writing lots of ray-intersection tests for every kind of geometry you might possible want to use is tedious
b) Following a function pointer whenever we want to intersect a ray with an object is too much of a performance hit
c) Triangles, being smaller on average than the primitives they approximate, are less likely to straddle split planes and get copied into more than one location in the kd-tree
d) Ray-intersection testing with a triangle is cheaper than ray-intersecting with most other kinds of geometry
e) Most interesting models are all triangles
or
f) Some reason I haven't thought of?

a) not really, though i imagine it would eventually get tedious [SMILEY :D]
b) yes
c) no idea
d) A sphere intersection can be quite fast, if not faster than triangle
e) yes, or can be tesselated into triangles

But the big one is that supporting multiple primitives introduces branching (best case) or function pointer/virtual dispatch (eg. b) into the heart of a number of tight loops, thus having a relatively high hit on performance even if none of the other primitives are used in the scene.

--Oliver

(L) [2007/01/02] [tbp] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> elihu wrote:I assume .401 seconds is the render time; that's about 6 times faster than mine for an equivalent sphereflake. (btw, what's the 74.2 fps number?)
Equivalent if you neglect a factor of 100 in the number of primitives. That run was silly: if i were to render efficiently such sphereflake i'd tesselate bigger spheres more than smaller ones etc... So it was more of a brute force demonstration than anything.
That 74.2 fps is just some random stat unrelated to raytracing and a testimony to my struggle against nVidia's drivers with multiple contexts & threads. Struggle that i'm currently losing.
>> elihu wrote:I'm wondering if there is a significant performance difference between a triangle-based representation and a sphere-based representation. I've noticed that many high-performance renderers seem to only deal with triangles.
To complement olliej's, once you build a proper tree, your limiting factor becomes ray/primitive intersection tests. You want them vectorized as much as possible and any form of branching kills you. That doesn't mean that multiple primitive types aren't an option, but certainly a software engineering issue.
SAH/approximated SAH, with additional heuristics fitting tradeoffs of your traversal (ie respective speed of traversals of empty/non empty cells, intersection vs traversal step etc...) is the only option as far as kD-tree are concerned.
>> elihu wrote:I tend to get white dots here and there; probably some floating point corner case.
Evil is in the details.

greenhybrid:
Even if C++ makes it (much sometimes) harder for compilers to see through the cruft (compared to C), you still only pay for what you use (ie exceptions).
Namespaces are just syntactic sugar. On the other hand you have to be extra careful when you do some fancy operator overloading on pods. Really. And it's quite subtle. If you pay attention you can get 'perfect' code generation even for templated classes with operator overloadage all around, basically getting the glorified assembler C used to be with the expressive power of C++. But it takes work and a good compiler.

PS: COM? What you say?

(L) [2007/01/02] [olliej] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> tbp wrote:That 74.2 fps is just some random stat unrelated to raytracing and a testimony to my struggle against nVidia's drivers with multiple contexts & threads. Struggle that i'm currently losing.

Oh? What are the nvidia drivers doing?

My current battle is with pthreads -- i've gone over things repeatedly with a fine tooth comb and yet i still get sporadic (say 30-40% of the time) errors, along the lines of illegal isntruction, or random failures in random pieces of code that looks fine -- it looks like a race condition but in what i cannot say :-/

--Oliver

(L) [2007/01/02] [tbp] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

In one corner x raytracing threads, one that gathers renders and runs them through shaders in its context and finally a front end UI thread and context asynchronously controlling the rest. In the other, nVidia's drivers outsmarting me and the asinine xp32 kernel.

That's the recipe for a mess:
a) currently OpenGL context sharing is either on or off, it's going to change but...
b) even without those OpenGL thread, it's hard to stop that silly xp kernel from pingponging threads all around.
c) nVidia's driver tries to optimize the load, with 2 user space threads + kernel stuff, and obviously its heuristics play against me; i still haven't found how to get it to cooperate.

I'm not blaming nVidia, my workload is certainly atypical. I've tried various kludges, i can deal with either xp's scheduler or nVidia drivers but not both at the same time. For now i've put that issue aside. Maybe there's some exotic extension i can subdue to get what i want. Or i'll get back to something more KISS compliant.

Meanwhile i have an offscreen path to bench stuff [SMILEY :)]

(L) [2007/01/02] [olliej] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

Haha, compare to my (alas very slow) raytracer -- render to buffer in onew set of threads. OpenGL thread continuously renders that buffer as -- wait for it -- a large series of quads.

The weirdness i get is more of the illegal instruction variety :-/

(L) [2007/01/02] [tbp] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

In one corner x raytracing threads, one that gathers renders and runs them through shaders in its context and finally a front end UI thread and context asynchronously controlling the rest. In the other, nVidia's drivers outsmarting me and the asinine xp32 kernel.

That's the recipe for a mess:

a) currently OpenGL context sharing is either on or off, it's going to change but...

b) even without those OpenGL thread, it's hard to stop that silly xp kernel from pingponging threads all around.

c) nVidia's driver tries to optimize the load, with 2 user space threads + kernel stuff, and obviously its heuristics play against me; i still haven't found how to get it to cooperate.

I'm not blaming nVidia, my workload is certainly atypical. I've tried various kludges, i can deal with either xp's scheduler or nVidia drivers but not both at the same time. For now i've put that issue aside. Maybe there's some exotic extension i can subdue to get what i want. Or i'll get back to something more KISS compliant.

Meanwhile i have an offscreen path to bench stuff [SMILEY Smile]
_________________
May you live in interesting times.

[LINK https://gna.org/projects/radius/ radius] | [LINK http://ompf.org/ ompf] | [LINK http://ompf.org/wiki/ WompfKi]

(L) [2007/01/04] [jdh30] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> tbp wrote:I'd advise caution when making such claims, unless a factor of 2 or 3 in execution time falls under the 'a bit faster' category. See this [LINK http://groups.google.com/group/comp.graphics.rendering.raytracing/browse_frm/thread/3119086d94a78fa3/3e82fd2ec7e68189?tvc=1 comp.graphics.rendering.raytracing thread] (or start from this [LINK http://groups.google.com/group/comp.graphics.rendering.raytracing/tree/browse_frm/thread/3119086d94a78fa3/0e84198649b291b4?rnum=21&_done=%2Fgroup%2Fcomp.graphics.rendering.raytracing%2Fbrowse_frm%2Fthread%2F3119086d94a78fa3%3F#doc_b2b377d781dbfa87 post]) for example.

That discussion concluded when I ported your C++ ray tracer back to OCaml and it was only 60% slower (and still unoptimised):

[LINK http://groups.google.co.uk/group/comp.graphics.rendering.raytracing/browse_frm/thread/eeefae7a63d2e86d/42e66688053c840a?lnk=gst&q=thierry+berger-perrin+jon+harrop&rnum=2&hl=en#42e66688053c840a comp.graphics.rendering.raytracing]

So I contest your assertion that OCaml is 3x slower.

PS: I just posted another OCaml implementation that is only 20% slower than your C++ and still 40% less code.

Cheers,
Jon.

(L) [2007/01/04] [jdh30] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> elihu wrote:I have been working on a raytracer in ocaml for some time, and recently decided I would strip out all the experimental stuff I'm working on (which amounts to about 2/3 of the code, and is at the moment relatively useless) and release it as a plain, bare-bones GPLed raytracer, in the hopes it might be useful to someone.

Awesome! You may also be interested in my F# ray tracer for Windows:

[LINK http://www.ffconsultancy.com/dotnet/fsharp/raytracer/index.html www.ffconsultancy.com]

F# is not as fast as OCaml but it does support concurrency and is a much nicer language than C++...

Cheers,
Jon.

(L) [2007/01/04] [olliej] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> jdh30 wrote:F# is not as fast as OCaml but it does support concurrency and is a much nicer language than C++...

Define "nice" -- in C++ i don't need to define new symbols for *,-,/ etc

The absence of operator overloading is a OCaml (possibly an ML) issue, Haskell supports it, and it supports it very nicely.

However it's really not worth arguing over which is the best language, though tbp's 3x slower assertion may have been relatively inflammatory no language is perfect for everything (or, in the case of COBOL, anything [SMILEY :D]).

Arguing over "niceness" or "clarity" of a language is incredibly subjective anyway -- i personally don't like the ML languages, in most places where they are "better" than C/C++ i prefer Haskell...  But at the same time in *many* cases i find C/C++/etc (eg. the standard brace styled imperative languages) to be much nicer than either.

Just my 2 cents

--Oliver

PS.  In reality there are some things COBOL is better for, but it's a function of language/type restrictions rather than programming style/syntax.
PPS.  I still think it's an awful language though [SMILEY :D]

(L) [2007/01/04] [tbp] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> jdh30 wrote:So I contest your assertion that OCaml is 3x slower.
PS: I just posted another OCaml implementation that is only 20% slower than your C++ and still 40% less code.

You can rightfully contest my claims, just like i did for the initial post, because they are inherently unsustainable.
In a contest where source code length is constrained you'd expect functional programming languages to 'win' at the end of the day if only because you'd expect their concision to buy the programmer space to cram fancier algorithmics in.
I'll just remark that your implementation is still slower than my horrible C++ version [SMILEY :)]

What matters, outside the world of artificial microbenchmarks, is how much time does it take to get to the proper answer on given hardware.
While C++ has no notion of concurrency it doesn't forbid it. That's why in something like radius, my rtrt, everything is iterative & distributed from acceleration structure construction to rendering, with fancy threading, wait free structures, coherent rays and whatnot, to maximize efficiency at run-time.
Am i distressed when i look at the plumbing, or how many lines it takes to get there? Certainly.
Does it matter to the end user? Not in the slightest.

I'll stress another key point, correctness/robustness. I'm sure you can prototype a raytracer in a few lines in an exotic language. But the real challenge is to get as close to the right answer as possible as fast as possible. That's quite different a problem.

(L) [2007/01/04] [elihu] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> Equivalent if you neglect a factor of 100 in the number of primitives.
Of course.  I am sorry for communicating imprecisely, I only meant equivalent in appearance - I suspect an exact reprensentation may be quite a bit quicker than one approximated by triangles (though I could see that the opposite could possibly be true as well, if function pointers really are as expensive as people seem to be hinting at).  Certainly, your kd-tree compiler is vastly superior to mine if it can handle 800,000 triangles in 3.5 seconds.
By the way, what hardware did you run that test on?  Were you using multiple processors?
I tried compiling radius from a cvs daily snapshot to see what the performance was like on my hardware; it seems to be successfully loading the sphereflake and building a kd-tree, but then it doesn't render anything.  I might have broke something by disabling a bunch of the compiler flags; gcc kept failing with an internal error.  Does radius currently work with gcc 4.1.1 and linux?
>> I tend to get white dots here and there; probably some floating point corner case.
It turns out this is caused by the numerical imprecision that results from adding a large number to a very small one.  Consequently, normals returned by my ray-sphere intersection test weren't actually quite normalized.
>> PS: I just posted another OCaml implementation that is only 20% slower than your C++ and still 40% less code.
I tried to compile this, and got an error:
>> This expression has type 'a but is here used with type
  vec * float * vec * float * 'a list
on this line:
>> | h::t -> intersects dir (intersect dir first h) t
And I don't understand your code well enough to figure out what's wrong.
>> F# is not as fast as OCaml but it does support concurrency and is a much nicer language than C++
Fast is a high priority for me.  If I ever re-implement my whole raytracer, it will probably be in plain C, but at the moment I'm mostly happy with ocaml.  (However, if you have any suggestions to make my ocaml code faster or use less memory, I'd be happy to hear them.)
>> Arguing over "niceness" or "clarity" of a language is incredibly subjective anyway -- i personally don't like the ML languages, in most places where they are "better" than C/C++ i prefer Haskell... But at the same time in *many* cases i find C/C++/etc (eg. the standard brace styled imperative languages) to be much nicer than either.

My opinion is that ocaml is an elegant, well designed language that just happens to have horrible syntax.  Once you get used to its quirks, though, it's really quite easy to use.  (If I had learned of it earlier, I would likely have written my raytracer using the much-improved revised syntax [LINK http://caml.inria.fr/pub/docs/manual-camlp4/manual007.html].)  Haskell is a language that I'm occasionally tempted to learn, but haven't yet had sufficient reason to.  It does seem to be making more progress than ocaml on the concurrency front, though.

(L) [2007/01/04] [tbp] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> elihu wrote:Of course.  I am sorry for communicating imprecisely, I only meant equivalent in appearance - I suspect an exact reprensentation may be quite a bit quicker than one approximated by triangles (though I could see that the opposite could possibly be true as well, if function pointers really are as expensive as people seem to be hinting at).
As you've remarked there's no loss of generality by triangulating everything. But function pointers when intersecting aren't the only concern if you go the multiple primitive way because you'll first have to find a way to (quickly) build an efficient acceleration structure. That's already not easy with a unique primitive type. Then vectorizing your rendering pipeline will also be troublesome.
>> elihu wrote:By the way, what hardware did you run that test on?  Were you using multiple processors?
Damn, you caught me. Yes everything is parallelized, from kD-tree construction to rendering. And i have a 2xOpteron 252 box (NUMA). Note that as said, you can build a kD-tree even faster with a streamed/approximated approach.
>> elihu wrote:I tried compiling radius from a cvs daily snapshot to see what the performance was like on my hardware; it seems to be successfully loading the sphereflake and building a kd-tree, but then it doesn't render anything.  I might have broke something by disabling a bunch of the compiler flags; gcc kept failing with an internal error.  Does radius currently work with gcc 4.1.1 and linux?
Radius works with msvc8, icc9.1 and g++ 4.x on 32bit windows and 64bit linux (debian) environments, because that's what i have [SMILEY :)]
I think you can get it to compile with a recent enough revision of gcc 4.1, but really you should use at least a 4.2 release (or 4.3 snapshot). If it's not available on your distro, you can always build one for yourself, see [LINK http://ompf.org/forum/viewtopic.php?t=29]
As i've never made any stable release but on cvs, i haven't comitted anything in 6 months. My bad. I'll address that soon but i need to polish an awful lot of stuff and it takes more time than envisonned.
Anyway if you feel like 'porting' it to a 32bit linux env (if that's what you have), it shouldn't take much work [SMILEY :)] But gcc 4.1 is a lost cause, as the latest release is 4.2 and it fixes many issues with SSE codegen.
EDIT:
>> elihu wrote:It turns out this is caused by the numerical imprecision that results from adding a large number to a very small one.  Consequently, normals returned by my ray-sphere intersection test weren't actually quite normalized.
Imprecisions in intersections can be stuffed under the carpet by saying that it's fundamentally an aliasing issue.
On the other hand a kD-tree traversal brings its own set of corner cases which should be treated [SMILEY ;)]

(L) [2007/01/04] [elihu] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> tbp wrote:elihu wrote:I tried compiling radius from a cvs daily snapshot to see what the performance was like on my hardware; it seems to be successfully loading the sphereflake and building a kd-tree, but then it doesn't render anything. I might have broke something by disabling a bunch of the compiler flags; gcc kept failing with an internal error. Does radius currently work with gcc 4.1.1 and linux?
Radius works with msvc8, icc9.1 and g++ 4.x on 32bit windows and 64bit linux (debian) environments, because that's what i have :)
I think you can get it to compile with a recent enough revision of gcc 4.1, but really you should use at least a 4.2 release (or 4.3 snapshot). If it's not available on your distro, you can always build one for yourself, see [LINK http://ompf.org/forum/viewtopic.php?t=29]
As i've never made any stable release but on cvs, i haven't comitted anything in 6 months. My bad. I'll address that soon but i need to polish an awful lot of stuff and it takes more time than envisonned.
Anyway if you feel like 'porting' it to a 32bit linux env (if that's what you have), it shouldn't take much work :) But gcc 4.1 is a lost cause, as the latest release is 4.2 and it fixes many issues with SSE codegen.

I tried out a shiny new gcc (thanks to your helpful instructions) and it compiles with only a few warnings. However, it runs the same as before; it opens a window displaying the radius logo, runs a sort, sucks up all available memory and grinds for a few minutes, then says "Hopla!", the cpu drops back to idle, and the window's title bar is updated with some informative statistics, yet nothing is displayed (aside from the radius logo). Is some incantation required to make it render an image?

I'm running 64-bit Fedora core 4 on a dual-core athlon 64 3800.

(L) [2007/01/04] [tbp] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> elihu wrote:I tried out a shiny new gcc (thanks to your helpful instructions) and it compiles with only a few warnings. However, it runs the same as before; it opens a window displaying the radius logo, runs a sort, sucks up all available memory and grinds for a few minutes, then says "Hopla!", the cpu drops back to idle, and the window's title bar is updated with some informative statistics, yet nothing is displayed (aside from the radius logo). Is some incantation required to make it render an image?
Not really. If i remember there was no fallback code from PBO texture streaming, but i doubt that's an issue there (and there's a console trace of probed extensions). Note: there's 2 streaming methods, switched by hitting 'p'... give it a try.

Given that there was no Wavefront/.obj parser in that version and the fact that it takes minutes (!) to build the tree, i have to wonder what you're feeding it with. Because you're supposed to give it raw triangles, such as [LINK http://ompf.org/forum/viewtopic.php?p=630#630]

If that's the culprit, i'm glad radius survived the random data input test [SMILEY ;)]

(L) [2007/01/05] [olliej] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> tbp wrote:Because you're supposed to give it raw triangles, such as [LINK http://ompf.org/forum/viewtopic.php?p=630#630]

Oooh, test scenes that are easy to read. I'll give them a run at some point.

Once i've made my renderer not suck as bad as it currently does (<1fps for sponza)

--Oliver

edit: oops, posted this msg multiple times as net connection spazzed out, sorry :O
LART notice: denial of service attempts are a prime offense, you've been warned.

(L) [2007/01/05] [jdh30] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> olliej wrote:jdh30 wrote:F# is not as fast as OCaml but it does support concurrency and is a much nicer language than C++...
Define "nice" -- in C++ i don't need to define new symbols for *,-,/ etc
In F#, you can use new symbols or overload +, - etc.
By nice I mean primarily concise and expressive.
>> The absence of operator overloading is a OCaml (possibly an ML) issue, Haskell supports it, and it supports it very nicely.
You need a few more type annotations if you use operator overloading in F# but it does work well.
>> However it's really not worth arguing over which is the best language,
I think it is worth arguing over which language is preferable for writing prototype ray tracers in a ray tracer forum.
>> Arguing over "niceness" or "clarity" of a language is incredibly subjective anyway

True, but we can quantify some important things. The point that I translated my original OCaml into C++ to get arguably bad C++ is a good one. It would be much more convincing if I could take a fast C++ ray tracer and translate it into succinct, clear and almost-as-efficient OCaml.

So let's look at my translation of TBP's C++ back into OCaml. The OCaml is 60% shorter and 20% slower than TBPs C++. The C++ uses low-level algorithmic optimisations, like skip lists, whereas the OCaml does not (beyond inlining).

In the face of that kind of objective evidence, I find it difficult to find in favour of C++.

Finally, if you really want to optimise your ray tracer, should you drop C++ and start writing CG/HLSL for the GPU?

Cheers,
Jon.

(L) [2007/01/05] [jdh30] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> tbp wrote:jdh30 wrote:So I contest your assertion that OCaml is 3x slower.
PS: I just posted another OCaml implementation that is only 20% slower than your C++ and still 40% less code.

You can rightfully contest my claims, just like i did for the initial post, because they are inherently unsustainable.
In a contest where source code length is constrained you'd expect functional programming languages to 'win' at the end of the day if only because you'd expect their concision to buy the programmer space to cram fancier algorithmics in.
On the contrary, your C++ uses fancier algorithmics (skip lists) to improve performance whereas my OCaml does not.
>> I'll just remark that your implementation is still slower than my horrible C++ version
Where is this horrible C++ version? I'd like to try translating an optimised (but smallish!) C/C++ ray tracer into OCaml. Do you have a suitable boiled-down program?
>> What matters, outside the world of artificial microbenchmarks, is how much time does it take to get to the proper answer on given hardware.
For production ray tracers, definitely. For language comparisons and educational material, people appreciate simplicity.
>> While C++ has no notion of concurrency it doesn't forbid it. That's why in something like radius, my rtrt, everything is iterative & distributed from acceleration structure construction to rendering, with fancy threading, wait free structures, coherent rays and whatnot, to maximize efficiency at run-time.
Am i distressed when i look at the plumbing, or how many lines it takes to get there? Certainly.
Does it matter to the end user? Not in the slightest.
I'll stress another key point, correctness/robustness. I'm sure you can prototype a raytracer in a few lines in an exotic language. But the real challenge is to get as close to the right answer as possible as fast as possible. That's quite different a problem.

I agree entirely but I do think there is merit in comparing solutions across the spectrum from simple to fast. For a lot of people, writing a ray tracer is a great way to learn a new language. Tens of thousands of people have read my ray tracer language comparison and they would not have done had I not posted simple programs as well as fast ones.

Cheers,
Jon.

(L) [2007/01/05] [tbp] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> jdh30 wrote:...It would be much more convincing if I could take a fast C++ ray tracer and translate it into succinct, clear and almost-as-efficient OCaml.
So let's look at my translation of TBP's C++ back into OCaml.
Wait a minute. That's not a fast C++ ray tracer but a silly toy. And it's not even fast.
While you certainly have a point about prototyping, that tells nothing about a 'production grade' version.
>> jdh30 wrote:Finally, if you really want to optimise your ray tracer, should you drop C++ and start writing CG/HLSL for the GPU?
That wouldn't buy you much once you move beyond primary/shadow rays, because GPU are even more ill equiped to deal with incoherence than a general purpose cpu. Plus you have to deal with x distinct non-IEEE754 compliant platforms. A nightmare.

(L) [2007/01/05] [tbp] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> jdh30 wrote:On the contrary, your C++ uses fancier algorithmics (skip lists) to improve performance whereas my OCaml does not.
It's a very natural solution, in C/C++ at least, and the cheapest when you consider that this tree is complete. Not using that property is a waste [SMILEY :)]
>> jdh30 wrote:Where is this horrible C++ version? I'd like to try translating an optimised (but smallish!) C/C++ ray tracer into OCaml. Do you have a suitable boiled-down program?
I was alluding to what has been [LINK http://ompf.org/ray/sphereflake/ there] for ages. Nowadays an 'optimized' ray tracer has to integrate coherent ray shooting, and that means using SIMD operations.
>> jdh30 wrote:I agree entirely but I do think there is merit in comparing solutions across the spectrum from simple to fast. For a lot of people, writing a ray tracer is a great way to learn a new language. Tens of thousands of people have read my ray tracer language comparison and they would not have done had I not posted simple programs as well as fast ones.

I guess part of the problem is agreeing about what exactly the term ray tracer covers. But you're right about the educational value of the exercise.

(L) [2007/01/05] [elihu] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> tbp wrote:elihu wrote:I tried out a shiny new gcc (thanks to your helpful instructions) and it compiles with only a few warnings.  However, it runs the same as before; it opens a window displaying the radius logo, runs a sort, sucks up all available memory and grinds for a few minutes, then says "Hopla!", the cpu drops back to idle, and the window's title bar is updated with some informative statistics, yet nothing is displayed (aside from the radius logo).  Is some incantation required to make it render an image?
Not really. If i remember there was no fallback code from PBO texture streaming, but i doubt that's an issue there (and there's a console trace of probed extensions). Note: there's 2 streaming methods, switched by hitting 'p'... give it a try.
Given that there was no Wavefront/.obj parser in that version and the fact that it takes minutes (!) to build the tree, i have to wonder what you're feeding it with. Because you're supposed to give it raw triangles, such as [LINK http://ompf.org/forum/viewtopic.php?p=630#630]
If that's the culprit, i'm glad radius survived the random data input test ;)

I was feeding it output from spd: "./balls -r 15 -s n" where n = 2 or 3 or 4, and assuming that if it didn't like the format of the file it would fail in a rather more obvious way.  Apparently, radius accepts any input; I tried feeding it its own binary as a scene file.  It told me it was composed of 3548 triangles.

I tried some scenes you linked to, and those are working quite nicely (happy buddha gives about 16 fps - very impressive).  It seems it only renders anything if I'm running as root.  Is it possible to enable lights, shadows, and textures in the current version, or are those features that haven't been committed to cvs yet?

(L) [2007/01/05] [tbp] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> elihu wrote:Apparently, radius accepts any input; I tried feeding it its own binary as a scene file.  It told me it was composed of 3548 triangles.
Failure is not an option.
I'll think about stuffing some geometry inside the binary to make it more interesting next time someone try [SMILEY ;)]
Please accept my apologies for the fuss.
>> elihu wrote:I tried some scenes you linked to, and those are working quite nicely (happy buddha gives about 16 fps - very impressive).  It seems it only renders anything if I'm running as root.
Ah. That's not as funny. I suspect that if you get to the point where it's about to render, then that problem has to do with OpenGL/X11. That's the only thing that would need special rights. I'll investigate, or perhaps you can run it via strace to see what's deal?
>> elihu wrote:Is it possible to enable lights, shadows, and textures in the current version, or are those features that haven't been committed to cvs yet?
Nope. There's only the basics, the kD construction is not even distributed. You'd have to go back to the previous incarnation, quadrille, but it sources were never published... blah blah blah...  Or wait for the next batch of commits; but i still have some nasty stuff to clear before it can be 'published'. Hopefully it won't be delayed too much.

(L) [2007/01/05] [elihu] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> elihu wrote:I tried some scenes you linked to, and those are working quite nicely (happy buddha gives about 16 fps - very impressive). It seems it only renders anything if I'm running as root.
Ah. That's not as funny. I suspect that if you get to the point where it's about to render, then that problem has to do with OpenGL/X11. That's the only thing that would need special rights. I'll investigate, or perhaps you can run it via strace to see what's deal?

Not so: real time scheduling requires root access. (I assume you're trying to set it to SCHED_RR or SCHED_FIFO?) Here is the console output of a non-root run:

Code: [LINK # Select all][jsnow@eponine gcc]$ ./radius FairyForestF160.ra2
VERSION: gcc v4.2.0
CPU: 2 cpu/core detected.
CPU: binded main thread to cpu #0
sys::cpu::set_process_priority: setpriority failure: Permission denied
sys::cpu::set_process_priority: setpriority failure: Permission denied
CPU: couldn't tweak priorities.
CPU: the penguin said we're running at 2000.000 mhz.
CPU: rdtsc [X] SSE [X] SSE2 [X]
ui::window_t::open: asked for 512x512, screen(0, 1280x1024) depth 24
OGL: OpenGL 2.0 [X]
OGL: support for GLSL [X] FBO [X] PBO [X]
sys::cpu::set_process_priority: setpriority failure: Permission denied
horde::grunt_t::init: grunt #0, no-go.
loading ./toxie/FairyForestF160.ra2...
sys::map_t: mapped './toxie/FairyForestF160.ra2' @ 0x0x2aaaabad5000, 6268212 bytes.
kdlib::compiler_t::injection: 174117 triangles
scene_bbox{ (-3.2,-0.1,-0.1), (3.1,1.5,1.5) }
512 ms to sort.
compiler_t::compile(): 3525 ms
max_lvl 41
num_nodes 1439893
num_max_ids 65 (3.39 avg/leaf)
num_leaves 719947
num_leaves_empty 171487 (23.8 %)
num_space_cuts 0
num_ids 1859047
kdlib::writer_t::init: allocated 10.986M for nodes, 7.092M for ids.
bake_intersection: 174117 triangles, 0 degenerated.
hopla!

(JB, edit: removed your accidental double post)

(L) [2007/01/06] [tbp] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> elihu wrote:Not so: real time scheduling requires root access. (I assume you're trying to set it to SCHED_RR or SCHED_FIFO?) Here is the console output of a non-root run:
Code: [LINK # Select all]sys::cpu::set_process_priority: setpriority failure: Permission denied
horde::grunt_t::init: grunt #0, no-go.

Ah! That's your 2 (one per core/cpu) rendering threads bailing out because they couldn't get what they were after, see [LINK http://cvs.gna.org/cvsweb/radius/src/horde.h?rev=1.2;cvsroot=radius] around line 70.

Pinning those threads to a cpu or tweaking scheduling isn't required for them to work, but it helps (well, pinning is quite critical performance wise on NUMA boxes). I don't remember precisely what SCHED_* i'm trying to set up, but that was just to get ultra-stable timings & behaviour. You can safely nuke priority boostage, or just ignore the return code.

(L) [2007/01/09] [elihu] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

So, I implemented triangle support so now I can make a better comparison with radius.
[IMG #1 Image]

As you can see, I don't have normal interpolation working.  But the triangle count is the same as tbp's run (797150).
Rendering takes about 7 seconds - about 8-10 times slower than radius, if one considers that tbp was running on two cores whereas I was utilizing one.

Building the kdtree takes about 36 minutes and consumes about 1.6 gigs of ram.  (I should probably work on that one of these days...)
The tree has about 3 million branches, about 800,000 empty cells, and about 2.2 million leaves containing geometry.  It appears that the average triangle occupies about 10 cells.

The equivalent sphereflake using spheres (7381 of them) takes about 3 seconds (those two triangles the sphereflake is sitting on slow things down quite abit; without them I had the rendering time down to around 1.8 seconds).  Here's a screenshot of that:
[IMG #2 Image]

And here's some random renderings with the kdtree visible:
[IMG #3 Image]
[IMG #4 Image]

A few random notes about implementing this stuff in ocaml:

I noticed jdh's ray tracer was using a vector type like this:

Code: [LINK # Select all]type vec = {x: float; y: float; z: float}
(note: this is like a c record)
instead of what I was using:
Code: [LINK # Select all]type vec = float array
(note: this is like a c array)
And I was curious if it makes a difference or not, so I converted my raytracer to use records instead of arrays.  It turns out it doesn't make a noticeable difference either way, except for in the kdtree code, where it is much more convenient to be able to use the "axis" value from a branch as an index into the array and not have to use hacks like this:
Code: [LINK # Select all]let va (v:vec) i = match i with
0 -> v.x
| 1 -> v.y
| _ -> v.z
(note: this is like a c switch)
or even this:
Code: [LINK # Select all]let va (v:vec) i =
(Obj.magic v).(i)
(note: this is like casting a record to an array, and code like this horrifies many in the ocaml community)
The latter seemed to have some performance cost (though not as much as the pattern match), so in the end I went back to using arrays.
For those curious what my ocaml code looks like, but too lazy to download the (somewhat old, at this point) tarball above, here is a small but relatively interesting excerpt, the kdtree shadow-ray intersection code (annotated):
Code: [LINK # Select all] method shadow_std ray dist = (* arguments: ray, and farthest distance we're willing to search *)
assert (normalized ray.dir); (* usually, I compile with -noassert, so this check isn't actually compiled in *)
incr kdtree_shadow_ctr;
let dir_rcp = vrcp ray.dir in (* taking the reciprocal of the ray direction avoids lots of divisions later *)
let rec rectraverse node near far =
match node with (* pattern match on the kdtree node type *)
   Branch b -> (* most common case *)
   let () = incr kdhit_shadow_ctr
   and split = b.split
   and origin = ray.origin.(b.axis)
   and dr = dir_rcp.(b.axis) in (* this is the aforementioned array access *)
   let d = (split -. origin) *. dr in
   let (nearcell,farcell) =
   if dr > 0.0 then
      (b.frontside, b.backside)
   else
      (b.backside, b.frontside)
   in
   if d < near then
      rectraverse farcell near far
   else if d <far> s#shadow_std ray dist (* invoke the shadow ray test on primitive s *)
   | EmptyLeaf -> false
in
let (near, far) = bbox_interval ray bb dist in (* clip to the bounding box *)
if near > far || far <= 0.0
then false
else rectraverse root near far (* start the recursive traversal *)

I'm not experienced enough with ocaml to know how optimal this code is.

Another change I implemented was the switch away from representing my primitives as a collection of closures (basically, every primitive is like a c struct holding pointers to ray-intersection functions and some other associated data)  towards representing them as objects.  I was worried that virtual function dispatch would be slower than calling a closure, but, once again, it didn't make a measurable difference.  (Of course, this would be moot if I only rendered triangles.)

One of the things that did gain me some noticeable speed improvements was re-arranging my if statements so the most common case was the "then" and not the "else", and sorting my pattern matches so the most common cases are on top.  (For this, there is a profiler (ocamlcp) for bytecode-compiled programs that generates, as output, source listings annotated with comments at each branch saying how many times that branch was executed.  Very useful.)

One of the more satisfying features I added over the weekend was joystick support - now I get to fly around (sort of; I don't have rotations quite down right).  This precipitated a switch from fedora to ubuntu, something I've been meaning to do for awhile, but no joystick support in fedora was the straw that broke the camel's back, so to speak.  Unfortunately, my joystick's 4th axis goes unused due to limitations of the glut api, but that's okay for now.

[IMG #1]:Not scraped: https://web.archive.org/web/20101018105629im_/http://syn.cs.pdx.edu/~jsnow/images/raytrace/sphereflake-l4-triangles.png
[IMG #2]:Not scraped: https://web.archive.org/web/20101018105629im_/http://syn.cs.pdx.edu/~jsnow/images/raytrace/sphereflake-l4-spheres.png
[IMG #3]:Not scraped: https://web.archive.org/web/20101018105629im_/http://syn.cs.pdx.edu/~jsnow/images/raytrace/tetra.png
[IMG #4]:Not scraped: https://web.archive.org/web/20101018105629im_/http://syn.cs.pdx.edu/~jsnow/images/raytrace/sombrero.png

[IMG #1]:Not scraped: https://web.archive.org/web/20071028084102im_/http://syn.cs.pdx.edu/~jsnow/images/raytrace/sphereflake-l4-triangles.png
[IMG #2]:Not scraped: https://web.archive.org/web/20071028084102im_/http://syn.cs.pdx.edu/~jsnow/images/raytrace/sphereflake-l4-spheres.png
[IMG #3]:Not scraped: https://web.archive.org/web/20071028084102im_/http://syn.cs.pdx.edu/~jsnow/images/raytrace/tetra.png
[IMG #4]:Not scraped: https://web.archive.org/web/20071028084102im_/http://syn.cs.pdx.edu/~jsnow/images/raytrace/sombrero.png

(L) [2007/01/09] [tbp] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> elihu wrote:Rendering takes about 7 seconds - about 8-10 times slower than radius, if one considers that tbp was running on two cores whereas I was utilizing one.
My box really has 2 cpu, not just cores [SMILEY :)] Overhead being that low, you can safely apply a *2 factor to compare. Note that i was shooting single rays and for some reason the tree sucked a bit.. on such simple scene shooting packets/bundles really pays off.

Even if your scene has more coherence (in those reflections) due to lack of interpolated normals, you're surprisingly fast. I would have expected another order of magnitude on top of that. Congrats [SMILEY :)]

(L) [2007/01/09] [Phantom] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

You could also take into consideration that we have been spending tons of time on optimizing kd-tree quality... Elihu, what did you do to get an optimized kd-tree?

(L) [2007/01/09] [tbp] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

I'm sure there's some margin to implement some optimizations with a 36mn kD-tree construction time budget [SMILEY Razz]
_________________
May you live in interesting times.

[LINK https://gna.org/projects/radius/ radius] | [LINK http://ompf.org/ ompf] | [LINK http://ompf.org/wiki/ WompfKi]

(L) [2007/01/09] [elihu] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> Phantom wrote:You could also take into consideration that we have been spending tons of time on optimizing kd-tree quality... Elihu, what did you do to get an optimized kd-tree?
The kd-tree optimization is helping quite a bit, but not so much that the renderer is useless without it: if I run the test again with a naive tree (split along axis of largest extent, put the split in the middle, stop when number of primitives in a cell drops below a threshold) I get about 10 seconds to render a frame (and 152 seconds to build the tree).
Here's what gprof has to say, when I employ a naive tree on a level 2 triangle-based sphereflake:
Code: [LINK # Select all] % cumulative self self total
time seconds seconds calls s/call s/call name
21.41 22.46 22.46 31871258 0.00 0.00 camlKdtree__rectraverse_258
19.38 42.79 20.33 269751153 0.00 0.00 camlTriangle__method_shadow_std_239
11.99 55.37 12.58 145861326 0.00 0.00 camlTriangle__method_rayint_std_244
10.21 66.08 10.72 12351558 0.00 0.00 camlKdtree__rectraverse_234
5.25 71.59 5.51 46247265 0.00 0.00 camlSolid__bbox_interval_123
3.89 75.67 4.08 11300856 0.00 0.00 camlLight__lighting_std_186
3.76 79.61 3.94 611337906 0.00 0.00 caml_modify
3.63 83.41 3.81 94081027 0.00 0.00 camlGroup__method_shadow_std_176
3.50 87.09 3.68 395703438 0.00 0.00 caml_send2
2.64 89.85 2.77 41809472 0.00 0.00 camlGroup__method_rayint_std_182
1.99 91.94 2.09 202046805 0.00 0.00 caml_send3
1.34 93.35 1.41 31871258 0.00 0.00 camlKdtree__method_shadow_std_346
1.26 94.67 1.32 228524 0.00 0.00 mark_slice
0.80 95.51 0.84 14376007 0.00 0.00 camlKdtree__method_rayint_std_356
0.75 96.30 0.79 218614924 0.00 0.00 caml_darken
0.71 97.05 0.75 12320768 0.00 0.00 camlTrace__trace_std_147
0.71 97.79 0.74 72076592 0.00 0.00 caml_oldify_one
....

It seems, at least on shallow trees, that I'm spending about a third of my time traversing the tree, a third of my time intersecting triangles, and the rest on other misclaneous details.  (caml_modify, caml_send2, caml_send3, mark_slice, caml_darken, and caml_oldify_one have to do with the ocaml runtime and aren't part of my code)
On average, with the optimized kdtree, a shadow ray intersects 37.8 non-leaf cells, 1.7 groups, and 5.1 triangles.
With the naive tree, the average shadow ray intersects 46.2 non-leaf cells, 3.7 groups, and  12.5 triangles.
(I'm quoting shadow ray statistics because my primary ray stats are somewhat confused by the reflections.)
(A group is a container primitive I use whenever a leaf cell has more than one primitive; I can simplify the kdtree code by treating leaves with multiple primitives as singleton leaves.)
Initially, I implemented the cost function from here [LINK http://www.acm.org/tog/resources/RTNews/html/rtnv17n1.html#art8], and that worked, but the formula was sufficiently complicated that I didn't really know how it worked.  Later, I switched over to a simpler cost function I found here: [LINK http://www.openrt.de/Siggraph05/UpdatedCourseNotes/Stoll_Realtime.ppt] which was much simpler and seemed to perform the same.
This is my cost function:
Code: [LINK # Select all]let tk1 = 1.0 (* node access cost *)
let tk2 = 1.0 (* plane intersect cost *)
let tk3 = 1.0 (* cost of accessing child node *)
let tk4 = 1.0 (* cost of aditional virtual function for "group" primitives *)
let tp = 2.0 (* primitive intersect cost *)
let tries = 8 (* how many potential splits do we try? *)
let max_exact = 32 (* how many objects do we allow before we stop trying all split candidates? *)
let split_cost solids n bb axis split =
let (bbl,bbr) = split_bbox bb axis split in (* split an AABB into two AABBs *)
let sap = sa bb.p1 bb.p2 (* parent box's surface area *)
and sal = sa bbl.p1 bbl.p2 (* surface area of left and right children *)
and sar = sa bbr.p1 bbr.p2
and pl = ref 0 (* primitive counters *)
and pr = ref 0
in let () = for i = 0 to n - 1
do
let s = solids.(i) in (* iterate over array of primitives *)
begin
   if (va (s#bbox()).p1 axis) < split then (* get primitive's AABB, then increment right and/or left counters accordingly *)
   incr pl
   else ();

   if (va (s#bbox()).p2 axis) > split then
   incr pr
   else ();
end
done
in tk1 +. tk2 +.
(((sal *. (float_of_int !pl) *. tp) +.
(sar *. (float_of_int !pr) *. tp)) /. sap)
(* what is the cost of an unsplit node ? *)
let nosplit_cost solids n bb =
(tk1 +. ((float_of_int n) *. tp))

"tries" and "max_exact" work as follows:  if there are less than "max_exact" primitives in a cell, we get our candidate split planes from the primitives contained in the cell.  Otherwise, we try "tries" evenly-spaced split planes.  It's kind of a hack, but it keeps the kd-tree construction time from being much worse than it is.
I also have a test in my kd-tree constructor that if more than half of the primitives are being duplicated into both cells, we just make that a leaf and don't try to split anymore.
I had an idea to reduce the cost of really big triangles by splitting them in half if they were larger than some threshold.  This didn't actually help - I think the decision to split a triangle needs to be made by the kd-tree constructor in order to be useful.
>> tbp wrote:Even if your scene has more coherence (in those reflections) due to lack of interpolated normals, you're surprisingly fast. I would have expected another order of magnitude on top of that. Congrats
Thank you.  I am somewhat curious what sort of tricks you are employing that would lead you to believe your renderer ought to be 80 times faster than mine...
>> I'm sure there's some margin to implement some optimizations with a 36mn kD-tree construction time budget

My constructor is pretty naive; there's no sorting, and I convert lists and arrays back and forth quite a bit more than necessary.

(L) [2007/01/10] [tbp] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> elihu wrote:My constructor is pretty naive; there's no sorting, and I convert lists and arrays back and forth quite a bit more than necessary.
Sorting at each step should get you into O(n.log²(n)) land and shave an order of magnitude in construction time without much hassle, if any at all. Sorting only once is about when even tough guys start to cry [SMILEY :)]
>> elihu wrote:Thank you. I am somewhat curious what sort of tricks you are employing that would lead you to believe your renderer ought to be 80 times faster than mine...
Coherence, coherence, coherence. Exploit it. Abuse it.
You start by shooting ray packets, that is on x86 4 rays stuffed into a SSE vector. Then you shoot bundles, that is a bunch of ray packets. Then you try to manipulate such bundles by their external properties.
That works wonderfully for first generation rays (primaries/shadows) and then as coherence deteriorates, you get diminishing returns unless you try hard to distillate the remaining coherence.

That and tight code [SMILEY :lol:]

(L) [2007/01/10] [jdh30] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> elihu wrote:Phantom wrote:You could also take into consideration that we have been spending tons of time on optimizing kd-tree quality... Elihu, what did you do to get an optimized kd-tree?
Here's what gprof has to say, when I employ a naive tree on a level 2 triangle-based sphereflake:
Code: [LINK # Select all] % cumulative self self total
time seconds seconds calls s/call s/call name
21.41 22.46 22.46 31871258 0.00 0.00 camlKdtree__rectraverse_258
19.38 42.79 20.33 269751153 0.00 0.00 camlTriangle__method_shadow_std_239
11.99 55.37 12.58 145861326 0.00 0.00 camlTriangle__method_rayint_std_244

OCaml's objects are very flexible and very slow so I'd recommend avoiding objects if they're using a significant amount of time (which I assume *__method_* means). I rarely use OOP these days...
>> It seems, at least on shallow trees, that I'm spending about a third of my time traversing the tree, a third of my time intersecting triangles, and the rest on other misclaneous details. (caml_modify, caml_send2, caml_send3, mark_slice, caml_darken, and caml_oldify_one have to do with the ocaml runtime and aren't part of my code)

mark_slice is part of the GC and altering your source code can often have a big effect on GC. For example, the last time I looked it was much slower to do "let a, b = 1, 2 in" than "let a=1 and b=2 in" because the former allocates a pair and then garbage collects it.
I think your definition:
Code: [LINK # Select all]type kdnode = Leaf of solid | Branch of branch
and branch = {
frontside : kdnode;
backside : kdnode;
axis : int;
split : float;
}

is very elegant but will incur unnecessary boxing. You might try:
Code: [LINK # Select all]type kdnode = Leaf of solid | Branch of kdnode * kdnode * int * float

Other than that, it looks good! I bet if you post this to the OCaml list you'll get lots of useful advice.

I'll try to alter my ray tracer so we can compare performance...

Cheers,
Jon.

(L) [2007/01/10] [olliej] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> jdh30 wrote:OCaml's objects are very flexible and very slow so I'd recommend avoiding objects if they're using a significant amount of time (which I assume *__method_* means)

I'm guessing by that you mean virtual dispatch -- which would make sense and C++, etc all get an equivalent penalty. I assume that caml doesn't let you make non-virtual class methods which would probably have no overhead...

(L) [2007/01/11] [elihu] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> tbp wrote:Coherence, coherence,  coherence. Exploit it. Abuse it.
You start by shooting ray packets, that is on x86 4 rays stuffed into a SSE vector. Then you shoot bundles, that is a bunch of ray packets. Then you try to manipulate such bundles by their external properties.
That works wonderfully for first generation rays (primaries/shadows) and then as coherence deteriorates, you get diminishing returns unless you try hard to distillate the remaining coherence.
That and tight code :lol:
You said earlier that your sphereflake rendering was done with single rays.  Are you exploiting coherence in other ways (like employing MLRTA, for instance)?  I'm just curious if your algorithms are significantly different than mine, or if most of your performance advantages come from using tight C code (with a good compiler and lots of SSE)  rather than ocaml code of unknown quality.
I attempted to make my code tighter by re-writing the kd-tree traversal as a non-recursive function, using an array as my stack.  Unfortunately, it ended up being quite a bit slower.  My guess is that the ocaml array-access routines are less efficient than the regular function call stack.
  >> jdh30 wrote:For example, the last time I looked it was much slower to do "let a, b = 1, 2 in" than "let a=1 and b=2 in" because the former allocates a pair and then garbage collects it.
The trouble is I want to assign to two variables based on the result of an if statement, and as far as I know the only way to do that without using refs is with tuples.  However, I gave refs a try:
Code: [LINK # Select all]let rec rectraverse node near far =
match node with
   Branch (l,r,axis,split) ->
   let () = incr kdhit_shadow_ctr
   and dr = dir_rcp.(axis) in
   let d = (split -. ray.origin.(axis)) *. dr in
   let nearcell = ref l
   and farcell = ref r in
   let () =
   if dr < 0.0 then
   ( nearcell := r;
      farcell := l;)
   else ()
   in
   if d < near then
   rectraverse !farcell near far
   else if d <far> s#shadow_std ray dist
| EmptyLeaf -> false
and despite being a bit uglier, it is quite a bit faster; the whole renderer is about %10 faster once I updated both the regular and shadow traversal code; the 800k triangle level 4 sphereflake now renders in about 6.3 seconds instead of 7.
>>  You might try:
Code: [LINK # Select all]type kdnode = Leaf of solid | Branch of kdnode * kdnode * int * float
Other than that, it looks good! I bet if you post this to the OCaml list you'll get lots of useful advice.
I'll try to alter my ray tracer so we can compare performance...

My kdnode type now looks like this:
Code: [LINK # Select all]type kdnode = Leaf of solid | Branch of kdnode * kdnode * int * float | EmptyLeaf
Unfortunately, this didn't noticeably affect performance.  However, it does make my traversal code a bit neater, and I suspect it might reduce the memory footprint.
Thank you for the optimization advice, perhaps I shall put together another code release and announce it on an ocaml list.
If you want to compare performance with my raytracer, I would recommend writing a scene parser for one of the formats used by the standard procedural database [LINK http://www.acm.org/tog/resources/SPD/].
>> olliej wrote:jdh30 wrote:
OCaml's objects are very flexible and very slow so I'd recommend avoiding objects if they're using a significant amount of time (which I assume *__method_* means)

I'm guessing by that you mean virtual dispatch -- which would make sense and C++, etc all get an equivalent penalty. I assume that caml doesn't let you make non-virtual class methods which would probably have no overhead...

Initially, I represented objects as records of functions, like so:
Code: [LINK # Select all]type solid = {
rayint_std : ray -> float -> rayint_std -> unit; (* standard ray intersect function *)
shadow_std : ray -> float -> bool; (* shadow intersection *)
inside : vec -> bool; (* some day I may use this for csg tests *)
bbox : bbox;
}

and declared primitives constructors like so:
Code: [LINK # Select all]let sphere center radius texture =
let radiussqr = radius *. radius
and invradius = 1.0 /. radius
in
(* algorithm adapted from graphics gems 1 *)
let rayint_std ray dist rayint =
...a bunch of code....
and shadow_std ray dist =
...some more code...
(* maybe some day I'll do csg operations *)
and inside pos = if (vdistsqr center pos) <= radiussqr
then true
else false
in
{rayint_std=rayint_std;
shadow_std=shadow_std;
inside = inside;
bbox = { p1 = vdec center radius; (* old code; I've since replaced this with a function because I only need the bbox to build the kdtree *)
p2 = vinc center radius }; (* and I don't want to waste the space keeping 6 floats around *)
}

However, memory consumption was quite high.  Storing a bunch of functions with every single object was kind of redundant.  Also, I was suspicious that maybe those functions might be eating more memory than it appeared; I don't really what happens when closures are allocated.
I thought of representing my primitives like this:
Code: [LINK # Select all] type solid = Sphere of vec * float | Triangle of vec * vec * vec | Group of solid array | kdtree of kdnode
and a generic rayintersection function something like this:
Code: [LINK # Select all]
let rayint ray object = match object with
Sphere (center,radius) -> Sphere.rayint ray center radius
| Triangle (p1,p2,p3) -> Triangle.rayint ray p1 p2 p3
| Group solids -> ....

Where each primitive's ray intersection code is defined in its own module.  Unfortunately, this gets complicated when I try to write the kdtree or group ray intersection functions, which must call back into the generic ray intersection function.  As far as I understand, this kind of mutual recursion across modules is impossible in ocaml, so I would have to radically restructure my code, sticking all the ray-intersection code together in one big ugly file.

It's also quite likely that the pattern match (i.e. switch)  is less efficient than however the virtual functions are implemented, in which case I really ought to be using objects.

An alternative is to give up on multiple primitives and just use triangles, but I'd much rather retain the flexibility of multiple primitive types.  I like the idea that I can, for instance, stick one kdtree inside another if I want to, should the need arise (I could see that being useful in animation if something in the parent tree changes but the other doesn't - no sense re-sorting everything).  I'm also somewhat biased in that I've spent a lot of time in the past building scenes in pov-ray using just a text editor - povray's basic primitives are much more intuitive than plain triangles.  I could use a regular gui modelling tool and let it handle the triangle conversion, but that just doesn't seem as elegant.

The virtual function overhead is comparatively small; if I'm only intersecting a dozen or so primitives per ray, it's probably not going to dominate the computation time.  In general, I agree with jdh that object oriented code should be avoided unless the application really requires it.

(L) [2007/01/11] [tbp] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> elihu wrote:You said earlier that your sphereflake rendering was done with single rays. Are you exploiting coherence in other ways (like employing MLRTA, for instance)? I'm just curious if your algorithms are significantly different than mine, or if most of your performance advantages come from using tight C code (with a good compiler and lots of SSE) rather than ocaml code of unknown quality.
No, you're right, there's no coherence exploitation tricks in the single ray version; i mean ie MLRTA only makes sense if you can amortize its costs by shaving individual ray traversals.

My single ray implementation aims at correctness over speed, because that's my reference path. And even if it uses SSE, it's only scalar SSE ops automatically generated by compilers (whereas in packet/bundle versions explicit vectored SSE ops are used). Other than that i've payed a lot of attention to memory layouts (think cachelines) and NUMA awareness.

(L) [2007/01/11] [jdh30] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> olliej wrote:jdh30 wrote:OCaml's objects are very flexible and very slow so I'd recommend avoiding objects if they're using a significant amount of time (which I assume *__method_* means)
I'm guessing by that you mean virtual dispatch -- which would make sense and C++, etc all get an equivalent penalty. I assume that caml doesn't let you make non-virtual class methods which would probably have no overhead...

Virtual function dispatch isn't that slow. TBP's sphereflake takes 3.832s on my machine. The same scene with the C++ from my site takes 4.065s and that uses classes and virtual functions. There are other differences (STL vs skip lists) but the time taken and amount of code required is quite similar. Also, I'd be surprised if C++ compilers didn't optimise the virtual function dispatch into a tagged union.

I believe the performance cost of OCaml's objects is due to polymorphism, not virtual functions. Converting my ray tracer to use objects to represent vectors instead of records makes it 10x slower, for example, and there is no virtual function dispatch there. You really want to avoid objects in high-performance OCaml code. Objects aren't very useful anyway, IMHO...

Cheers,
Jon.

(L) [2007/01/11] [jdh30] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> elihu wrote:tbp wrote:Coherence, coherence,  coherence. Exploit it. Abuse it.
You start by shooting ray packets, that is on x86 4 rays stuffed into a SSE vector. Then you shoot bundles, that is a bunch of ray packets. Then you try to manipulate such bundles by their external properties.
That works wonderfully for first generation rays (primaries/shadows) and then as coherence deteriorates, you get diminishing returns unless you try hard to distillate the remaining coherence.
That and tight code
You said earlier that your sphereflake rendering was done with single rays.  Are you exploiting coherence in other ways (like employing MLRTA, for instance)?  I'm just curious if your algorithms are significantly different than mine, or if most of your performance advantages come from using tight C code (with a good compiler and lots of SSE)  rather than ocaml code of unknown quality.
This is why I'd love to see an optimised trivial example (like the sphereflake). I want to know how much faster C code can be made.
I'm using AMD64 here. The only assembler I've done was ARM about 20 years ago but I'm assuming that SSE etc. have been built into the AMD64 architecture, so you'd expect compilers for this platform to take advantage of the modern equivalent of SSE. Is that about right?
>> I attempted to make my code tighter by re-writing the kd-tree traversal as a non-recursive function, using an array as my stack.  Unfortunately, it ended up being quite a bit slower.  My guess is that the ocaml array-access routines are less efficient than the regular function call stack.

You really don't want to be looking at such low-level optimisations with a high-level language like OCaml. There are complicated low-level trade-offs involved in everything. For example, using mutation can kick the GC into action, slowing everything down but it won't show up in an obvious way on a profile.
>>  jdh30 wrote:For example, the last time I looked it was much slower to do "let a, b = 1, 2 in" than "let a=1 and b=2 in" because the former allocates a pair and then garbage collects it.
The trouble is I want to assign to two variables based on the result of an if statement, and as far as I know the only way to do that without using refs is with tuples.  However, I gave refs a try:
Code: [LINK # Select all]let rec rectraverse node near far =
match node with
   Branch (l,r,axis,split) ->
   let () = incr kdhit_shadow_ctr
   and dr = dir_rcp.(axis) in
   let d = (split -. ray.origin.(axis)) *. dr in
   let nearcell = ref l
   and farcell = ref r in
   let () =
   if dr < 0.0 then
   ( nearcell := r;
      farcell := l;)
   else ()
   in
   if d < near then
   rectraverse !farcell near far
   else if d <far> s#shadow_std ray dist
| EmptyLeaf -> false
and despite being a bit uglier, it is quite a bit faster; the whole renderer is about %10 faster once I updated both the regular and shadow traversal code; the 800k triangle level 4 sphereflake now renders in about 6.3 seconds instead of 7.

I actually altered the same bit of code to a pair of "if"s. I didn't measure the performance though.
You know you can write:
f x;
instead of:
let () = f x in
>> My kdnode type now looks like this:
Code: [LINK # Select all]type kdnode = Leaf of solid | Branch of kdnode * kdnode * int * float | EmptyLeaf
Unfortunately, this didn't noticeably affect performance.  However, it does make my traversal code a bit neater, and I suspect it might reduce the memory footprint.
Thank you for the optimization advice, perhaps I shall put together another code release and announce it on an ocaml list.
If you want to compare performance with my raytracer, I would recommend writing a scene parser for one of the formats used by the standard procedural database [LINK http://www.acm.org/tog/resources/SPD/].

I'm quite busy with F# at the moment but I'll give it a go when I've got a spare moment...
>> Initially, I represented objects as records of functions, like so:
Code: [LINK # Select all]type solid = {
rayint_std : ray -> float -> rayint_std -> unit; (* standard ray intersect function *)
shadow_std : ray -> float -> bool; (* shadow intersection *)
inside : vec -> bool; (* some day I may use this for csg tests *)
bbox : bbox;
}

You're making the classic OCaml newbie mistake of trying to turn everything into an object. Things are much easier if you do the transpose.
>> and declared primitives constructors like so:
Code: [LINK # Select all]let sphere center radius texture =
let radiussqr = radius *. radius
and invradius = 1.0 /. radius
in
(* algorithm adapted from graphics gems 1 *)
let rayint_std ray dist rayint =
...a bunch of code....
and shadow_std ray dist =
...some more code...
(* maybe some day I'll do csg operations *)
and inside pos = if (vdistsqr center pos) <= radiussqr
then true
else false
in
{rayint_std=rayint_std;
shadow_std=shadow_std;
inside = inside;
bbox = { p1 = vdec center radius; (* old code; I've since replaced this with a function because I only need the bbox to build the kdtree *)
p2 = vinc center radius }; (* and I don't want to waste the space keeping 6 floats around *)
}

However, memory consumption was quite high.  Storing a bunch of functions with every single object was kind of redundant.  Also, I was suspicious that maybe those functions might be eating more memory than it appeared; I don't really what happens when closures are allocated.
I thought of representing my primitives like this:
Code: [LINK # Select all] type solid = Sphere of vec * float | Triangle of vec * vec * vec | Group of solid array | kdtree of kdnode
and a generic rayintersection function something like this:
Code: [LINK # Select all]
let rayint ray object = match object with
Sphere (center,radius) -> Sphere.rayint ray center radius
| Triangle (p1,p2,p3) -> Triangle.rayint ray p1 p2 p3
| Group solids -> ....

That's exactly how you should do it!
>> Where each primitive's ray intersection code is defined in its own module.  Unfortunately, this gets complicated when I try to write the kdtree or group ray intersection functions, which must call back into the generic ray intersection function.  As far as I understand, this kind of mutual recursion across modules is impossible in ocaml, so I would have to radically restructure my code, sticking all the ray-intersection code together in one big ugly file.

You can do mutual recursion across modules in the same compilation unit using "recursive modules" (see the manual). However, you've probably got a nice breakdown of your code into separate compilation units so you don't want to do that.
There is another, really easy, solution. People sometimes call it "untying the knot". Say you have a pair of mutually recursive functions:
let rec foo = bar 3
and bar = foo 4
You can remove the mutual recursion by passing bar to foo and foo to bar as arguments:
let foo bar = bar 3
let bar foo = foo 4
Now your functions are completely independent so you can put them in modules A and B, respectively. You can invoke "foo" as "A.foo B.bar".
>> It's also quite likely that the pattern match (i.e. switch)  is less efficient than however the virtual functions are implemented, in which case I really ought to be using objects.

Quite the opposite. Pattern matching is ridiculously fast, much faster than objects in C++. Also, exceptions are about 6x faster in OCaml than in C++, so you can use them to optimise things as well.
>> An alternative is to give up on multiple primitives and just use triangles, but I'd much rather retain the flexibility of multiple primitive types.  I like the idea that I can, for instance, stick one kdtree inside another if I want to, should the need arise (I could see that being useful in animation if something in the parent tree changes but the other doesn't - no sense re-sorting everything).  I'm also somewhat biased in that I've spent a lot of time in the past building scenes in pov-ray using just a text editor - povray's basic primitives are much more intuitive than plain triangles.  I could use a regular gui modelling tool and let it handle the triangle conversion, but that just doesn't seem as elegant.

I'd like to know what the performance trade-offs are. What was the conclusion when someone asked if it is faster to render a sphereflake as triangles?
>> The virtual function overhead is comparatively small; if I'm only intersecting a dozen or so primitives per ray, it's probably not going to dominate the computation time.  In general, I agree with jdh that object oriented code should be avoided unless the application really requires it.

Yes. I don't like OOP at the best of times but OCaml's structurally-subtyped OOP is very generic and slow, so you definitely want to avoid that. C++'s OOP isn't so bad because they're little more than structs.

Cheers,
Jon.

(L) [2007/01/11] [elihu] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

New version available! [LINK http://syn.cs.pdx.edu/~jsnow/glome/glome-0.2.tar.bz2] I decided to name it Glome after the fictitious kingdom in C.S. Lewis's book Till we have Faces. A google search revealed that it's also the name of a 4-dimensional hypersphere.

(L) [2007/01/12] [elihu] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> I actually altered the same bit of code to a pair of "if"s. I didn't measure the performance though.
That would work, but I wanted to avoid the extra branch.
>> You know you can write:
f x;
instead of:
let () = f x in
Yeah, sometimes I write things the former way rather than the latter; however, I usually only use the imperative syntax inside of parentheses or begin/end blocks because otherwise it's easy to write something that doesn't work as expected (especially with if/then/else statements).  I assume it compiles to identical code either way.
>> You're making the classic OCaml newbie mistake of trying to turn everything into an object. Things are much easier if you do the transpose.
...
Code: [LINK # Select all]
let rayint ray object = match object with
Sphere (center,radius) -> Sphere.rayint ray center radius
| Triangle (p1,p2,p3) -> Triangle.rayint ray p1 p2 p3
| Group solids -> ....

That's exactly how you should do it!
....
You can do mutual recursion across modules in the same compilation unit using "recursive modules" (see the manual). However, you've probably got a nice breakdown of your code into separate compilation units so you don't want to do that.
I actually don't care about separate compilation so much as the ability to organize my code into separate files, which as far as I can tell is incompatible with recursive modules.  I am also somewhat intimidated by the module syntax; I was greatly pleased one day when I discovered I could just stick code in different files and ocaml would treat them as modules automatically.
I am glad to see I have another option, though, if I decide to abandon objects.
>> There is another, really easy, solution. People sometimes call it "untying the knot". Say you have a pair of mutually recursive functions:
let rec foo = bar 3
and bar = foo 4
You can remove the mutual recursion by passing bar to foo and foo to bar as arguments:
let foo bar = bar 3
let bar foo = foo 4
Now your functions are completely independent so you can put them in modules A and B, respectively. You can invoke "foo" as "A.foo B.bar".
Alas, I do not understand your example.
>> It's also quite likely that the pattern match (i.e. switch)  is less efficient than however the virtual functions are implemented, in which case I really ought to be using objects.

Quite the opposite. Pattern matching is ridiculously fast, much faster than objects in C++. Also, exceptions are about 6x faster in OCaml than in C++, so you can use them to optimise things as well.

I tried a benchmark; you seem to be right:
Code: [LINK # Select all]type foo = Nothing | I of int | F of float
let add f1 (n1:float) =
match f1 with
Nothing -> n1
| I n -> (float_of_int n) +. n1
| F n -> n +. n1

class virtual bar =
object (self)
method virtual add : float -> float
end
class nothing () =
object (self)
inherit bar as b
initializer ()
method add n1 = n1
end
class i n2 =
object (self)
inherit bar as b
val n : int = n2
method add n1 = (float_of_int n) +. n1
end

class f n2 =
object (self)
inherit bar as b
val n : float = n2
method add n1 = n +. n1
end

let _ =
begin
let start = Sys.time()
and accum = ref 0.0
and x = new nothing ()
and y = new i (1)
and z = new f (1.0) in
begin
for i = 1 to 100000000
do
   accum := x#add !accum;
   accum := y#add !accum;
   accum := z#add !accum;
done;

Printf.printf "%f %f\n" (Sys.time () -. start) !accum;
flush stdout;
end;

let start = Sys.time()
and accum = ref 0.0
and x = Nothing
and y = I (1)
and z = F (1.0) in
begin
for i = 1 to 100000000
do
   accum := add x !accum;
   accum := add y !accum;
   accum := add z !accum;
done;

Printf.printf "%f %f\n" (Sys.time () -. start) !accum;
flush stdout;
end;
end

This result in:
6.296393 200000000.000000
1.856116 200000000.000000
The virtual call seems to be about 4 times as expensive as the pattern match.  I had assumed the pattern match was equivalent to a bunch of nested if-then-else blocks, but I tried adding additional constructors
Code: [LINK # Select all]type foo = A | B | C | D | Nothing | I of int | F of float
and placing them first in the pattern match, but it didn't affect performance.  Perhaps ocaml actually uses a jump table.  This is good, because I was concerned that a pattern match would get slower and slower as I add new primitive types.
Oddly, the pattern match was a bit slower (about 2.5 seconds) when I compiled with -inline 2 instead of 1 or 0.
>> An alternative is to give up on multiple primitives and just use triangles, but I'd much rather retain the flexibility of multiple primitive types.  I like the idea that I can, for instance, stick one kdtree inside another if I want to, should the need arise (I could see that being useful in animation if something in the parent tree changes but the other doesn't - no sense re-sorting everything).  I'm also somewhat biased in that I've spent a lot of time in the past building scenes in pov-ray using just a text editor - povray's basic primitives are much more intuitive than plain triangles.  I could use a regular gui modelling tool and let it handle the triangle conversion, but that just doesn't seem as elegant.

I'd like to know what the performance trade-offs are. What was the conclusion when someone asked if it is faster to render a sphereflake as triangles?

Well, when I tried it with the level 4 sphereflake, the triangle representation was more than twice as slow, and consumed 1.5 gigs of ram instead of 50 megs.  Unfortunately, using primitives like spheres and cones and cylinders and boxes really only works for artificial scenes like you might find in a game, and then only for certain kinds of objects.  Real world data sets (like scans of 3d objects) tend to be triangles.  So it's really a matter of what sort of scenes you plan on rendering.  As far as I can tell, research usually seems to focus on triangle based representations, because anything can be represented as triangles and it's easy to compare one algorithm with one another.

(L) [2007/01/12] [tbp] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> jdh30 wrote:This is why I'd love to see an optimised trivial example (like the sphereflake). I want to know how much faster C code can be made.
I'm using AMD64 here. The only assembler I've done was ARM about 20 years ago but I'm assuming that SSE etc. have been built into the AMD64 architecture, so you'd expect compilers for this platform to take advantage of the modern equivalent of SSE. Is that about right?
Yes and no.
x86 & x86-64 now have SSE instruction sets on top of the aging 387. That brings respectively 8 and 16 registers of that form struct { float x,y,z,w; } (or struct { double x,y; }), with 2 operation subsets massaging either one component (scalar) or all four (vector). Scalar ops are only marginally faster.

Compilers use scalar operation set to directly transcribe typical floating point code, but for them to use vector ops automatically requires much voodoo; that's called auto vectorization: [LINK http://gcc.gnu.org/projects/tree-ssa/vectorization.html]

So really, arguing about scalar mode speed is a bit senseless as that's throwing 3/4 of the computationnal power by the window from the start.

(L) [2007/01/19] [elihu] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

I just posted glome-0.3; get it here: [LINK http://syn.cs.pdx.edu/~jsnow/glome/].

I re-wrote the kd-tree compiler.  Theoretically, it should be NlogN, but it was still pretty slow - 25 minutes or so for the 800k triangle sphereflake.  However, I reduced that to about 7 minutes by caching the bounding boxes instead of recomputing them every time I need to do a comparison.  I can get that down to about 5 minutes by not trying all the boundaries once the number of primitives in a cell exceeded a threshold.

Unfortunately, constructing a large tree can now overflow the stack; running "ulimit -s 256000" allowed me to build the 800k tree.

The algorithm works something like this:

The initial input is a list of (primitive,bounding box) pairs.  This is sorted 6 times; for each axis, we sort by both maximum bounding box value and minimum bounding box value (ascending order for both).

In the recursive step, we scan for a good split candidate by sweeping through each list pair, keeping a counter of how many objects are on either side of the split plane.  (If the next primitive in the minimum-value list is lower than the next primitive in maximum-value list, we increment the left side counter, otherwise we decrement the right side counter (supposing we are scanning left to right).)  Once we've found the best split plane, we filter the 6 lists according to which side of the split plane each item belongs on, and recurse.

This seems a little different than Havran's algorithm, which I don't fully grok.  I think my approach is probably easier to implement, but slower.

I do a lot more iterations over the lists than is absolutely necessary, and I'm making liberal use of map, fold, and filter where loops might be more efficient.  I also suspect that arrays may be faster than lists, but I don't know if this is true in ocaml.  The initial sorts take about 10% of the total construction time.

---

As for the raytracer itself, it's been restructured a bit according to Jon Harrop's advice.  It isn't object oriented anymore, and ray-intersection tests are written in a more functional style.  I can now render the 800k sphereflake in about 5 seconds.

Btw, what's a good average number of triangles to hit per ray?  If I fly close to a level3 triangle sphereflake, I intersect about 8 triangles per primary and about 5 per shadow ray.  Zooming in on the teapot seems to get me about 5 or 6 triangles per ray, regardless of triangle density.  I don't often see more than 8 triangles per ray.  Does this sound about right, or is there room for improvement?

(L) [2007/01/20] [tbp] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

You cannot really disjoint a kD tree builder from traversal when talking about quality (whatever you stuff under that term).
If you're aiming for a typical mono traversal, then your intersection/ray should be as low as possible (ignoring pretty much all other factors & tweaks like bubbling up voids etc)... that means hitting few leaves with little content etc... By that standard, 2 or 3 intersections per ray would be good.

(L) [2007/01/20] [jdh30] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> elihu wrote:I just posted glome-0.3; get it here: [LINK http://syn.cs.pdx.edu/~jsnow/glome/].
I re-wrote the kd-tree compiler. Theoretically, it should be NlogN, but it was still pretty slow - 25 minutes or so for the 800k triangle sphereflake. However, I reduced that to about 7 minutes by caching the bounding boxes instead of recomputing them every time I need to do a comparison. I can get that down to about 5 minutes by not trying all the boundaries once the number of primitives in a cell exceeded a threshold.
Unfortunately, constructing a large tree can now overflow the stack; running "ulimit -s 256000" allowed me to build the 800k tree.

If you're stack overflowing then you can multiply your algorithmic complexity by n in OCaml (the GC treats the stack as an atom, traversing the whole stack periodically). Do you have a big scene that stack overflows it? Send it to me and I'll fix it for you... [SMILEY :-)]

Cheers,
Jon.

(L) [2007/01/20] [elihu] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> jdh30 wrote:Do you have a big scene that stack overflows it? Send it to me and I'll fix it for you... :-)
Cheers,
Jon.

I posted the level4 triangle sphereflake spd file on the glome page ([LINK http://syn.cs.pdx.edu/~jsnow/glome/]), there's a link near the bottom. Profiles would indicate that the gc is indeed the current bottleneck.

(L) [2007/01/20] [jdh30] [ocaml sphereflake raytracer with SAH-based kd-tree] Wayback!

>> elihu wrote:I posted the level4 triangle sphereflake spd file on the glome page ([LINK http://syn.cs.pdx.edu/~jsnow/glome/]), there's a link near the bottom. Profiles would indicate that the gc is indeed the current bottleneck.

Ah yes. The fix is quite simple. You're using the List.map and List.merge functions that are not tail recursive, so they eat stack space proportional to the length of the lists and then the GC slows down. The easiest fix is to write tail-recursive versions:

Code: [LINK # Select all]open List
let map f l = rev (rev_map f l)
let rec rev_merge l1 l2 accu =
match l1, l2 with
| [], l2 -> List.rev_append l2 accu
| l1, [] -> List.rev_append l1 accu
| h1::t1, h2::t2 ->
if h1 <= h2
then rev_merge t1 l2 (h1::accu)
else rev_merge l1 t2 (h2::accu)
let merge l1 l2 = rev (rev_merge l1 l2 [])

You could optimise that by avoiding reversals.

If you factor the build_kdtree_list function into many little functions (and try to avoid mutation) then profiling will tell you where the time is spent.

Cheers,
Jon.

back