Suggestions for Speedups

(L) [2008/09/23] [dr_eck] [Suggestions for Speedups] Wayback!

I finally got my code working.  Now it's time to optimize.  Here is the output from my first CodeAnalyst run:
Code:
Model::SurfTri::Intersect       61.99%
Accel::AccBVH::IntersectLeaf    27.08%
Model::Vec3::operator*           3.97%
This is pretty reasonable.  Now for the fun part.  Here's the breakdown for SurfTri::Intersect:
Code:
1  real SurfTri::Intersect(Ray r) const
2  {
3     real td = n*r.d;             // denominator of intersection equation
4     if (td == 0.0)
5        return ray_inf;           // ray is parallel to plane or rect is degenerate
6     real tn = n*Vec3(r.o) + D;   // -(numerator) of intersection equation
7     real tt = -tn / td;          // intersection distance
8     if (tt < r.tmin || tt > r.tmax)
9        return ray_inf;
10    // Wald's triangle intersection routine
11    Pnt3 ip = r(tt);             // adjusted intersection point
12    real beta = c1*ip[u] + c2*ip[v] + c3;
13    if (beta < 0 || beta > 1)    // >1 added by ske
14       return ray_inf;
15    real gamma = c4*ip[u] + c5*ip[v] + c6;
16    if (gamma < 0 || beta+gamma > 1)
17       return ray_inf;
18    return tt;
19 }
According to CA, line 3 (a simple dot product) takes 39% of the time while line 6 (a cast, a dot product and an add) takes 2.5%.  Could this be because of a cache miss?  Line 12 (2 multiplies & 2 adds) is the next biggest time hog at 22%, while line 15 takes only 0.12%.  This math is only 4 cycles, so I assume I'm again being killed by cache misses.  Is my assumption likely correct?  Suggestions?
The second most time consuming function I post as a warning to C++ programmers:
Code:
1  real AccBVH::IntersectLeaf(Ray &r, const AccBVHNode *node, Surface*& surf)
2  {
3     real t, tfirst = r.tmax;   // intersection distance
4     for (int sp = node->start; sp < node->start+node->ndPrims; ++sp) {
5        t = surfs[sp]->Intersect(r);
6        if (t > r.tmin && t < tfirst)
7           tfirst = t, surf = surfs[sp];
8     }
9     return r.tmax = tfirst;
10 }
Would you believe that line 5 consumes 91% of the cycles?  You would if I told you that Intersect() is a virtual function call.  Ouch!
(L) [2008/09/24] [Michael77] [Suggestions for Speedups] Wayback!

Have a look at the assembly output. From my experience, profiling the C++ source without having stats for the assembly-level instructions is pretty useless.
(L) [2008/09/24] [thomast] [Suggestions for Speedups] Wayback!

>> dr_eck wrote:The second most time consuming function I post as a warning to C++ programmers:
This is not my experience. Are you sure it is not the insides of Intersect() that take 91% of your time, rather than the virtual table lookup?
Here is a benchmark I have made:
[Attachment: bench.png — plot of call time against number of calls, virtual vs. non-virtual]
It shows that for a very fast function (one multiplication) you have to call the function over one billion times (10^9) to see any meaningful speed difference. For a larger function like Intersect, that would translate to a lot more calls than that. Considering there are about 0.8 million pixels at 1024 * 768, according to this data it might not make any difference whether the call is virtual or not. On top of this, my benchmarks on my own ray tracer have shown no speed difference at all between calling intersection as a virtual function or not.
(L) [2008/09/24] [tarlack] [Suggestions for Speedups] Wayback!

The main problem with virtual functions is the loss of a lot of optimizations, mainly inlining... inline your non-virtual function and you'll see a lot of difference [SMILEY :)]
Otherwise, I agree with you: the basic slowdown due only to the virtual table lookup is not that big [SMILEY :)]
(L) [2008/09/24] [thomast] [Suggestions for Speedups] Wayback!

The benchmark data I gave earlier made no sense. The virtual function call was actually faster when compiling with -O3 and -O2, and this shows in the graph! I made a new benchmark.
Test setup:
A base class with a virtual function F
Four inherited classes, which have identical functions F
Make four objects, one of each inherited class (C1, C2, C3, C4)
Have a pointer to the base class, P
Generate a random number and, according to it, assign P the address of one of the child objects
Call P->F() and C1.F() many times, summing the return values; measure the time difference and check that the sums are equal
I compiled with g++ with "-O3 -march=native" and I expect g++ to auto-inline the member functions. A rough sketch of this kind of setup is shown below.
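A minimal sketch of what such a benchmark could look like (my reconstruction, not thomast's actual code; the F = 1 + x body comes from the results below, while the use of clock() for timing and the details of the random assignment are assumptions):
Code:
#include <cstdio>
#include <cstdlib>
#include <ctime>

// Base class with a virtual F; four children with identical bodies (F = 1 + x).
struct Base { virtual ~Base() {}  virtual long long F(long long x) const { return x; } };
struct C1 : Base { long long F(long long x) const { return 1 + x; } };
struct C2 : Base { long long F(long long x) const { return 1 + x; } };
struct C3 : Base { long long F(long long x) const { return 1 + x; } };
struct C4 : Base { long long F(long long x) const { return 1 + x; } };

int main()
{
    C1 c1; C2 c2; C3 c3; C4 c4;
    Base *objs[4] = { &c1, &c2, &c3, &c4 };
    Base *P = objs[std::rand() % 4];     // runtime choice keeps the compiler from devirtualizing P->F()

    const long long N = 1000000000LL;    // ~10^9 calls
    long long sumV = 0, sumD = 0;

    std::clock_t t0 = std::clock();
    for (long long i = 0; i < N; ++i) sumV += P->F(i);   // virtual call through a base pointer
    std::clock_t t1 = std::clock();
    for (long long i = 0; i < N; ++i) sumD += c1.F(i);   // direct call on a known object, inlinable
    std::clock_t t2 = std::clock();

    std::printf("virtual %.2fs  direct %.2fs  sums %s\n",
                double(t1 - t0) / CLOCKS_PER_SEC,
                double(t2 - t1) / CLOCKS_PER_SEC,
                sumV == sumD ? "equal" : "differ");
    return 0;
}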
Result (64-bit signed integer math):
F = 1 + x: total time spent in the virtual calls is 410% higher than the time spent in the non-virtual calls
F = 1 + 5*x: 250%
F = ( 1 + 5*x ) / 5: 69%
F = 1 + sqrt(x): 0.4%
F = sqrt(x) + pow(x,2): the virtual function call is 0.1% faster...
Conclusion:
For a very light function the virtual function call cost is significant on my Core2 platform, but with even just one sqrt() in the function the fps probably isn't going to change by any significant amount, unless the compiler is losing some other optimization opportunity, mainly, I guess, because the call cannot be inlined.
(L) [2008/09/24] [greenhybrid] [Suggestions for Speedups] Wayback!

I also tried the other extreme and had a big fat templated thing like this (some months have passed since I switched to a layout inspired by pbrt):
Code:
graphics::samplers::screen::XYIterator<
        misc::templates::surface<
                graphics::image::color::AverageColor  // surface of Color ::= sum <1..m> (color_n / m)
        >,
        graphics::cameras::FromPointToRect,   // pinhole
        graphics::samplers::ray::Simple       // surface-integrator
> renderer;

This was quite efficient, but then to support only 3 basic types of surfaces (e.g. color, surface-normal-visualisation, intersection-distance-visualisation), only 3 basic types of cameras (pinhole, cylinder (360° view), cubemap), and only simple Path-Tracing and Whitted-style, you would already end up with 3*3*2=18 distinct types which you have to handle wrt user requests and executable size. This is why I gave up that design and 'fell back' to virtual types (and some say that C++ without virtual types could not really be called an OOP language).
edit: I forgot in my calculation that "graphics::samplers::screen::XYIterator" is a kind of screen-integrator, defined as iterating over (0..width) x (0..height). Compare that to an integrator used in MLT, where you need random access to the screen; with more than one screen-integrator the product gets even higher ...
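A minimal sketch of why the product blows up with this kind of design (the names below, Renderer, ColorSurface, renderFromUserRequest and so on, are made up for illustration, not greenhybrid's actual types): every runtime choice of surface, camera and integrator has to be mapped by hand to its own template instantiation.
Code:
// Each surface/camera/integrator combination is a distinct type, so the
// user's runtime choice has to be mapped to its own instantiation by hand.
template<class Surface, class Camera, class Integrator>
struct Renderer {
    void render() { /* fully inlined, statically dispatched inner loops */ }
};

struct ColorSurface {};  struct NormalSurface {};  struct DistanceSurface {};
struct PinholeCamera {}; struct CylinderCamera {}; struct CubemapCamera {};
struct PathTracer {};    struct Whitted {};

void renderFromUserRequest(int surface, int camera, int integrator)
{
    // 3 * 3 * 2 = 18 branches like these, each producing another Renderer<...> in the binary:
    if (surface == 0 && camera == 0 && integrator == 0)
        Renderer<ColorSurface, PinholeCamera, PathTracer>().render();
    else if (surface == 0 && camera == 0 && integrator == 1)
        Renderer<ColorSurface, PinholeCamera, Whitted>().render();
    // ... 16 more combinations, and the product grows again with every new screen-integrator.
}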
(L) [2008/09/24] [greenhybrid] [Suggestions for Speedups] Wayback!

>> thomast wrote:For a very light function the virtual function call cost is significant on my Core2 platform
Not only on your platform, but on any current one (imho). You always have to weigh the overhead against the work the function actually does. If your function is large enough, then even the vtable traversal of 10-deep instances becomes negligible.
(L) [2008/09/24] [tbp] [Suggestions for Speedups] Wayback!

A reminder, because it would be wrong to read this thread and come away with the impression that you can have your cake and eat it too.
For a virtual function call to happen, a final type has to be resolved, and that means a side memory access of some sort. Then an indirect call is used, which is, until the next Core2 generation, the slowest form of call of all (the others aren't benign either). As a side effect a whole array of transformations/optimizations can't be brought in, and each derived instance has to carry some reference to the vtable. You could statically, at compile time, (try to) resolve the type and flatten it all at the call site, and some compilers do, but for this to have any reasonable chance of success the whole program, and not just some translation units, has to be seen at once (LTO).
It may be convenient, but you got to pay for what you use.
(L) [2008/09/24] [raymo] [Suggestions for Speedups] Wayback!

A technique to amortize the virtual function cost is to use templates to wrap the loop. So the intersect function you have
could be turned into:
Code:
template<class ConcreteSurfaceType>
__forceinline real AccBVH::IntersectLeaf(Ray &r, const AccBVHNode *node, Surface*& surf)
{
   real t, tfirst = r.tmax;   // intersection distance
   for (int sp = node->start; sp < node->start+node->ndPrims; ++sp) {
      // Cast to the concrete type and qualify the call: this bypasses the vtable,
      // so the compiler can inline Intersect().  The static_cast is safe because
      // the leaf is known to contain only ConcreteSurfaceType primitives.
      t = static_cast<ConcreteSurfaceType*>(surfs[sp])->ConcreteSurfaceType::Intersect(r);
      if (t > r.tmin && t < tfirst)
         tfirst = t, surf = surfs[sp];
   }
   return r.tmax = tfirst;
}

If you have large leaf sizes then this moves the virtual-dispatch cost from per primitive to per leaf. It also allows the compiler to optimize the loop. You would need to make sure a leaf contains only one primitive type when building the tree, which isn't too difficult. The caller still needs one dispatch per leaf on the primitive type; a sketch of what that could look like follows.
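A minimal sketch of that per-leaf dispatch, assuming a type tag stored in the leaf node (the primType field, the PRIM_* enum values and the SurfSphere name are illustrative assumptions, not from the thread):
Code:
// One type decision per leaf instead of one virtual call per primitive:
// switch on the leaf's (single) primitive type, then run the templated,
// fully inlinable loop above over its contents.
real AccBVH::IntersectLeafDispatch(Ray &r, const AccBVHNode *node, Surface*& surf)
{
   switch (node->primType) {       // hypothetical per-leaf type tag set during tree build
      case PRIM_TRIANGLE: return IntersectLeaf<SurfTri>(r, node, surf);
      case PRIM_SPHERE:   return IntersectLeaf<SurfSphere>(r, node, surf);
      default:            return IntersectLeaf(r, node, surf);   // mixed/unknown: fall back to the virtual version
   }
}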
(L) [2008/09/24] [raymo] [Suggestions for Speedups] Wayback!

Another thing to note is to use packets of rays (you don't have to use SIMD) and amortize the virtual function cost further (so a 4-primitive leaf with 4 rays should reduce the virtual function cost to around 6% of what it is currently).
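A minimal sketch of that idea, assuming a simple RayPacket struct and the templated leaf routine above (the packet layout, the 16-ray size and the hit output array are made-up details, not from this thread): the per-leaf type dispatch is paid once for every ray in the packet.
Code:
// Hypothetical packet: one leaf dispatch serves all rays in the packet,
// and each primitive is intersected with concrete, inlinable calls.
struct RayPacket { Ray rays[16]; int count; };

template<class ConcreteSurfaceType>
void AccBVH::IntersectLeafPacket(RayPacket &p, const AccBVHNode *node, Surface **hit)
{
   for (int sp = node->start; sp < node->start + node->ndPrims; ++sp) {
      const ConcreteSurfaceType *s = static_cast<const ConcreteSurfaceType*>(surfs[sp]);
      for (int i = 0; i < p.count; ++i) {
         real t = s->ConcreteSurfaceType::Intersect(p.rays[i]);   // devirtualized, inlinable
         if (t > p.rays[i].tmin && t < p.rays[i].tmax) {
            p.rays[i].tmax = t;       // shrink the ray interval
            hit[i] = surfs[sp];       // remember the closest primitive so far
         }
      }
   }
}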
(L) [2008/09/25] [dr_eck] [Suggestions for Speedups] Wayback!

@raymo  I've been using 16x16 packets for primary rays, and they do help a lot.  Your idea of templatizing my function is an interesting one.  As I read your code, it looks as if I would have to have only one type of ConcreteSurface in each leaf.  The only way I can think of to do this would be to have a separate BVH for each surface type.  Can you think of a better way?  
If not, I guess I'm stuck with triangulating every surface type and building my BVH with only triangles.  I'm sure this would be faster anyway because of the fast triangle intersection routines that are available.  I'll probably use "ear-cutting" for polygons unless someone has code I could steal for a faster method.  Can anyone suggest a method for triangulating a torus, or do I have to reinvent the wheel?
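For reference, the standard parametric tessellation of a torus (not taken from this thread; R is the major radius, r the minor radius, and the grid resolutions, struct names and winding are arbitrary choices) just walks a 2D grid of angles and emits two triangles per cell:
Code:
#include <cmath>
#include <vector>

struct V3  { float x, y, z; };
struct Tri { V3 a, b, c; };

// Point on the torus at grid cell (i, j): u runs around the main ring, v around the tube.
static V3 TorusPoint(float R, float r, int nu, int nv, int i, int j)
{
    const float pi = 3.14159265358979f;
    float u = 2*pi * (i % nu) / nu;
    float v = 2*pi * (j % nv) / nv;
    V3 p = { (R + r*std::cos(v)) * std::cos(u),
             (R + r*std::cos(v)) * std::sin(u),
              r*std::sin(v) };
    return p;
}

// Tessellate a torus (major radius R, minor radius r) into 2*nu*nv triangles.
std::vector<Tri> TessellateTorus(float R, float r, int nu, int nv)
{
    std::vector<Tri> tris;
    for (int i = 0; i < nu; ++i)
        for (int j = 0; j < nv; ++j) {          // two triangles per quad cell
            Tri t1 = { TorusPoint(R, r, nu, nv, i,   j),
                       TorusPoint(R, r, nu, nv, i+1, j),
                       TorusPoint(R, r, nu, nv, i+1, j+1) };
            Tri t2 = { TorusPoint(R, r, nu, nv, i,   j),
                       TorusPoint(R, r, nu, nv, i+1, j+1),
                       TorusPoint(R, r, nu, nv, i,   j+1) };
            tris.push_back(t1);
            tris.push_back(t2);
        }
    return tris;
}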
(L) [2008/09/25] [Shadow007] [Suggestions for Speedups] Wayback!

>> dr_eck wrote:As I read your code, it looks as if I would have to have only one type of ConcreteSurface in each leaf.  The only way I can think of to do this would be to have a separate BVH for each surface type.  Can you think of a better way?
One way would be to store per-type counts in the leaf (with the leaf's primitives grouped by type). This way, you would call the right templated intersection for each run of surfaces of one type ...
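A minimal sketch of that, assuming the leaf's primitives are grouped by type in surfs[], a hypothetical typeCount[] array in the node, and an IntersectRun<T> helper (a loop like raymo's templated one, restricted to n primitives starting at sp); none of these names come from the thread:
Code:
real AccBVH::IntersectMixedLeaf(Ray &r, const AccBVHNode *node, Surface*& surf)
{
   int sp = node->start;
   real tfirst = r.tmax;
   for (int type = 0; type < NUM_PRIM_TYPES; ++type) {
      int n = node->typeCount[type];        // hypothetical per-type count stored in the leaf
      if (n == 0) continue;
      switch (type) {                       // one dispatch per run of same-typed primitives
         case PRIM_TRIANGLE: tfirst = IntersectRun<SurfTri>(r, sp, n, surf, tfirst); break;
         case PRIM_SPHERE:   tfirst = IntersectRun<SurfSphere>(r, sp, n, surf, tfirst); break;
      }
      sp += n;                              // advance to the next run
   }
   return r.tmax = tfirst;
}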
(L) [2008/09/25] [raymo] [Suggestions for Speedups] Wayback!

A simple technique to make sure only one type is in a leaf: before creating a leaf, check whether its contents are all the same type. If not, set a split point at the first type difference.
Code:
// split = index of the split point; -1 means we intend to make a leaf here
if (split == -1) {
   for (int i = 1; i < end; ++i)          // scan the would-be leaf's contents
      if (type(i) != type(i - 1)) {       // first type change found
         split = i;
         break;
      }
}
if (split == -1) { makeleaf(); return; }  // all one type: safe to make a leaf
// recursive bit
makeBVHTree(0, split);    // left
makeBVHTree(split, end);  // right
That way you know that each leaf is of only one type, and it adds very little overhead to the tree building. It means you only need to store one type for each leaf. 99% of your leaves will be a single type anyway.
(L) [2008/09/25] [raymo] [Suggestions for Speedups] Wayback!

I'd be wary of going down the triangles-only route, IMHO.
A primitive in a leaf node can be arbitrarily complex: for example, I've got a Menger sponge, a sphereflake, a water surface, bezier surfaces and other weirdness as primitives. It also allows you to make a transform node for instancing, or a proxy node for streaming.
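A minimal sketch of such a transform node, assuming the Surface/Ray interface from earlier in the thread plus a hypothetical Mat4 type with inverse and transform helpers (none of this is raymo's actual code):
Code:
// A transform node is itself a Surface: it inverse-transforms the ray into the
// child's local space, intersects the (shared) child, and reports the hit.
// Many instances can point at the same child with different transforms.
class SurfTransform : public Surface {
public:
    SurfTransform(Surface *child, const Mat4 &objToWorld)
        : child_(child), worldToObj_(objToWorld.Inverse()) {}   // hypothetical Mat4 helpers

    real Intersect(Ray r) const {
        Ray local = r;
        local.o = worldToObj_.TransformPoint(r.o);
        local.d = worldToObj_.TransformVector(r.d);   // not renormalized, so t stays valid in world space
        return child_->Intersect(local);
    }

private:
    Surface *child_;      // shared geometry, stored once
    Mat4     worldToObj_; // per-instance placement
};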
(L) [2008/09/25] [raymo] [Suggestions for Speedups] Wayback!

Oh, and using spheres for particles works really well (they are faster than triangles), and it allows you to mix motion-blurred primitives with the faster static variants.
