Suggestions for Speedups

(L) [2008/09/23] [dr_eck] [Suggestions for Speedups] Wayback!

I finally got my code working.  Now it's time to optimize.  Here is the output from my first CodeAnalyst run:
Code:
Model::SurfTri::Intersect       61.99%
Accel::AccBVH::IntersectLeaf    27.08%
Model::Vec3::operator*           3.97%
This is pretty reasonable.  Now for the fun part.  Here's the breakdown for SurfTri::Intersect:
Code:
1  real SurfTri::Intersect(Ray r) const
2  {
3     real td = n*r.d;             // denominator of intersection equation
4     if (td == 0.0)
5        return ray_inf;           // ray is parallel to plane or rect is degenerate
6     real tn = n*Vec3(r.o) + D;   // -(numerator) of intersection equation
7     real tt = -tn / td;          // intersection distance
8     if (tt < r.tmin || tt > r.tmax)
9        return ray_inf;
10    // Wald's triangle intersection routine
11    Pnt3 ip = r(tt);             // adjusted intersection point
12    real beta = c1*ip[u] + c2*ip[v] + c3;
13    if (beta < 0 || beta > 1)    // >1 added by ske
14       return ray_inf;
15    real gamma = c4*ip[u] + c5*ip[v] + c6;
16    if (gamma < 0 || beta+gamma > 1)
17       return ray_inf;
18    return tt;
19 }
According to CA, line 3 (a simple dot product) takes 39% of the time while line 6 (a cast, a dot product and an add) takes 2.5%.  Could this be because of a cache miss?  Line 12 (2 multiplies & 2 adds) is the next biggest time hog at 22%, while line 15 takes only 0.12%.  This math is only 4 cycles, so I assume I'm again being killed by cache misses.  Is my assumption likely correct?  Suggestions?
The second most time consuming function I post as a warning to C++ programmers:
Code:
1  real AccBVH::IntersectLeaf(Ray &r, const AccBVHNode *node, Surface*& surf)
2  {
3     real t, tfirst = r.tmax;   // intersection distance
4     for (int sp = node->start; sp < node->start+node->ndPrims; ++sp) {
5        t = surfs[sp]->Intersect(r);
6        if (t > r.tmin && t < tfirst)
7           tfirst = t, surf = surfs[sp];
8     }
9     return r.tmax = tfirst;
10 }
Would you believe that line 5 consumes 91% of the cycles?  You would if I told you that Intersect() is a virtual function call.  Ouch!
(L) [2008/09/24] [Michael77] [Suggestions for Speedups] Wayback!

Have a look at the assembly output. From my experience, profiling the C++ source without having stats for the assembly-level instructions is pretty useless.
(L) [2008/09/24] [thomast] [Suggestions for Speedups] Wayback!

>> dr_eck wrote:The second most time consuming function I post as a warning to C++ programmers:
This is not my experience. Are you sure it is not the insides of Intersect() that take 91% of your time, rather than the virtual table lookup?
Here is a benchmark I have made:
[Attachment: bench.png — plot of call time against number of calls, virtual vs. non-virtual]
It shows that for a very fast function (one multiplication) you have to call the function over one billion times (10^9) to see any meaningful speed difference. For a larger function like Intersect, that would translate to a lot more calls than that. Considering there are about 0.8 million pixels at 1024 * 768, according to this data it might not make any difference whether the call is virtual or not. On top of this, my benchmarks on my own ray tracer have shown no speed difference at all between calling intersection as a virtual function or not.
(L) [2008/09/24] [tarlack] [Suggestions for Speedups] Wayback!

The main problem with virtual functions is the loss of a lot of optimizations, mainly inlining... inline your non-virtual function and you'll see a lot of difference [SMILEY :)]
Otherwise, I agree with you: the basic slowdown due only to the virtual table lookup is not that big [SMILEY :)]
(L) [2008/09/24] [thomast] [Suggestions for Speedups] Wayback!

The benchmark data I gave earlier made no sense. The virtual function call was actually faster when compiling with -O3 and -O2, and this shows in the graph! I made a new benchmark.
Test setup:
A base class with a virtual function F
Four inherited classes, which have identical functions F
Make four objects, one of each inherited class (C1, C2, C3, C4)
Have a pointer to the base class, P
Generate a random number and, according to it, assign P the address of one of the child objects
Call P->F() and C1.F() many times, summing the return values; measure the time difference and check that the sums are equal
I compiled with g++ with "-O3 -march=native" and I expect g++ to auto-inline the member functions. A rough sketch of this kind of setup is shown below.
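A minimal sketch of what such a benchmark could look like (my reconstruction, not thomast's actual code; the F = 1 + x body comes from the results below, while the use of clock() for timing and the details of the random assignment are assumptions):
Code:
#include <cstdio>
#include <cstdlib>
#include <ctime>

// Base class with a virtual F; four children with identical bodies (F = 1 + x).
struct Base { virtual ~Base() {}  virtual long long F(long long x) const { return x; } };
struct C1 : Base { long long F(long long x) const { return 1 + x; } };
struct C2 : Base { long long F(long long x) const { return 1 + x; } };
struct C3 : Base { long long F(long long x) const { return 1 + x; } };
struct C4 : Base { long long F(long long x) const { return 1 + x; } };

int main()
{
    C1 c1; C2 c2; C3 c3; C4 c4;
    Base *objs[4] = { &c1, &c2, &c3, &c4 };
    Base *P = objs[std::rand() % 4];     // runtime choice keeps the compiler from devirtualizing P->F()

    const long long N = 1000000000LL;    // ~10^9 calls
    long long sumV = 0, sumD = 0;

    std::clock_t t0 = std::clock();
    for (long long i = 0; i < N; ++i) sumV += P->F(i);   // virtual call through a base pointer
    std::clock_t t1 = std::clock();
    for (long long i = 0; i < N; ++i) sumD += c1.F(i);   // direct call on a known object, inlinable
    std::clock_t t2 = std::clock();

    std::printf("virtual %.2fs  direct %.2fs  sums %s\n",
                double(t1 - t0) / CLOCKS_PER_SEC,
                double(t2 - t1) / CLOCKS_PER_SEC,
                sumV == sumD ? "equal" : "differ");
    return 0;
}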
Result (64-bit signed integer math):
F = 1 + x: total time spent in the virtual calls is 410% higher than the time spent in the non-virtual calls
F = 1 + 5*x: 250%
F = ( 1 + 5*x ) / 5: 69%
F = 1 + sqrt(x): 0.4%
F = sqrt(x) + pow(x,2): the virtual function call is 0.1% faster...
Conclusion:
For a very light function the virtual function call cost is significant on my Core2 platform, but with even just one sqrt() in the function the fps probably isn't going to change by any significant amount, unless the compiler is losing some other optimization opportunity, mainly, I guess, because the call cannot be inlined.
(L) [2008/09/24] [greenhybrid] [Suggestions for Speedups] Wayback!

I also tried the other extreme and had a big fat templated thing like this (some months have passed since I switched to a layout inspired by pbrt):
Code:
graphics::samplers::screen::XYIterator<
        misc::templates::surface<
                graphics::image::color::AverageColor  // surface of Color ::= sum <1..m> (color_n / m)
        >,
        graphics::cameras::FromPointToRect,   // pinhole
        graphics::samplers::ray::Simple       // surface-integrator
> renderer;

This was quite efficient, but then to support only 3 basic types of surfaces (e.g. color, surface-normal-visualisation, intersection-distance-visualisation), only 3 basic types of cameras (pinhole, cylinder (360° view), cubemap), and only simple Path-Tracing and Whitted-style, you would already end up with 3*3*2=18 distinct types which you have to handle wrt user requests and executable size. This is why I gave up that design and 'fell back' to virtual types (and some say that C++ without virtual types could not really be called an OOP language).
edit: I forgot in my calculation that "graphics::samplers::screen::XYIterator" is a kind of screen-integrator, defined as iterating over (0..width) x (0..height). Compare that to an integrator used in MLT, where you need random access to the screen; with more than one screen-integrator the product gets even higher ...
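A minimal sketch of why the product blows up with this kind of design (the names below, Renderer, ColorSurface, renderFromUserRequest and so on, are made up for illustration, not greenhybrid's actual types): every runtime choice of surface, camera and integrator has to be mapped by hand to its own template instantiation.
Code:
// Each surface/camera/integrator combination is a distinct type, so the
// user's runtime choice has to be mapped to its own instantiation by hand.
template<class Surface, class Camera, class Integrator>
struct Renderer {
    void render() { /* fully inlined, statically dispatched inner loops */ }
};

struct ColorSurface {};  struct NormalSurface {};  struct DistanceSurface {};
struct PinholeCamera {}; struct CylinderCamera {}; struct CubemapCamera {};
struct PathTracer {};    struct Whitted {};

void renderFromUserRequest(int surface, int camera, int integrator)
{
    // 3 * 3 * 2 = 18 branches like these, each producing another Renderer<...> in the binary:
    if (surface == 0 && camera == 0 && integrator == 0)
        Renderer<ColorSurface, PinholeCamera, PathTracer>().render();
    else if (surface == 0 && camera == 0 && integrator == 1)
        Renderer<ColorSurface, PinholeCamera, Whitted>().render();
    // ... 16 more combinations, and the product grows again with every new screen-integrator.
}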
(L) [2008/09/24] [greenhybrid] [Suggestions for Speedups] Wayback!

>> thomast wrote:For a very light function the virtual function call cost is significant on my Core2 platform
Not only on your platform, but on any current one (imho). You always have to weigh the overhead against the work the function actually does. If your function is large enough, then even the vtable traversal of 10-deep instances becomes negligible.
(L) [2008/09/24] [tbp] [Suggestions for Speedups] Wayback!

A reminder, because it would be wrong to read this thread and come away with the impression that you can have your cake and eat it too.
For a virtual function call to happen, a final type has to be resolved, and that means a side memory access of some sort. Then an indirect call is used, which is, until the next Core2 generation, the slowest form of call of all (the others aren't benign either). As a side effect a whole array of transformations/optimizations can't be brought in, and each derived instance has to carry some reference to the vtable. You could statically, at compile time, (try to) resolve the type and flatten it all at the call site, and some compilers do, but for this to have any reasonable chance of success the whole program, and not just some translation units, has to be seen at once (LTO).
It may be convenient, but you got to pay for what you use.
(L) [2008/09/24] [raymo] [Suggestions for Speedups] Wayback!

A technique to amortize the virtual function cost is to use templates to wrap the loop. So the intersect function you have
could be turned into:
Code:
template<class ConcreteSurfaceType>
__forceinline real AccBVH::IntersectLeaf(Ray &r, const AccBVHNode *node, Surface*& surf)
{
   real t, tfirst = r.tmax;   // intersection distance
   for (int sp = node->start; sp < node->start+node->ndPrims; ++sp) {
      // Cast to the concrete type and qualify the call: this bypasses the vtable,
      // so the compiler can inline Intersect().  The static_cast is safe because
      // the leaf is known to contain only ConcreteSurfaceType primitives.
      t = static_cast<ConcreteSurfaceType*>(surfs[sp])->ConcreteSurfaceType::Intersect(r);
      if (t > r.tmin && t < tfirst)
         tfirst = t, surf = surfs[sp];
   }
   return r.tmax = tfirst;
}

If you have large leaf sizes then this moves the virtual-dispatch cost from per primitive to per leaf. It also allows the compiler to optimize the loop. You would need to make sure a leaf contains only one primitive type when building the tree, which isn't too difficult. The caller still needs one dispatch per leaf on the primitive type; a sketch of what that could look like follows.
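A minimal sketch of that per-leaf dispatch, assuming a type tag stored in the leaf node (the primType field, the PRIM_* enum values and the SurfSphere name are illustrative assumptions, not from the thread):
Code:
// One type decision per leaf instead of one virtual call per primitive:
// switch on the leaf's (single) primitive type, then run the templated,
// fully inlinable loop above over its contents.
real AccBVH::IntersectLeafDispatch(Ray &r, const AccBVHNode *node, Surface*& surf)
{
   switch (node->primType) {       // hypothetical per-leaf type tag set during tree build
      case PRIM_TRIANGLE: return IntersectLeaf<SurfTri>(r, node, surf);
      case PRIM_SPHERE:   return IntersectLeaf<SurfSphere>(r, node, surf);
      default:            return IntersectLeaf(r, node, surf);   // mixed/unknown: fall back to the virtual version
   }
}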
(L) [2008/09/24] [raymo] [Suggestions for Speedups] Wayback!

Another thing to note is to use packets of rays (you don't have to use SIMD) and amortize the virtual function cost further (so a 4-primitive leaf with 4 rays should reduce the virtual function cost to around 6% of what it is currently).
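A minimal sketch of that idea, assuming a simple RayPacket struct and the templated leaf routine above (the packet layout, the 16-ray size and the hit output array are made-up details, not from this thread): the per-leaf type dispatch is paid once for every ray in the packet.
Code:
// Hypothetical packet: one leaf dispatch serves all rays in the packet,
// and each primitive is intersected with concrete, inlinable calls.
struct RayPacket { Ray rays[16]; int count; };

template<class ConcreteSurfaceType>
void AccBVH::IntersectLeafPacket(RayPacket &p, const AccBVHNode *node, Surface **hit)
{
   for (int sp = node->start; sp < node->start + node->ndPrims; ++sp) {
      const ConcreteSurfaceType *s = static_cast<const ConcreteSurfaceType*>(surfs[sp]);
      for (int i = 0; i < p.count; ++i) {
         real t = s->ConcreteSurfaceType::Intersect(p.rays[i]);   // devirtualized, inlinable
         if (t > p.rays[i].tmin && t < p.rays[i].tmax) {
            p.rays[i].tmax = t;       // shrink the ray interval
            hit[i] = surfs[sp];       // remember the closest primitive so far
         }
      }
   }
}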
(L) [2008/09/25] [dr_eck] [Suggestions for Speedups] Wayback!

@raymo  I've been using 16x16 packets for primary rays, and they do help a lot.  Your idea of templatizing my function is an interesting one.  As I read your code, it looks as if I would have to have only one type of ConcreteSurface in each leaf.  The only way I can think of to do this would be to have a separate BVH for each surface type.  Can you think of a better way?  
If not, I guess I'm stuck with triangulating every surface type and building my BVH with only triangles.  I'm sure this would be faster anyway because of the fast triangle intersection routines that are available.  I'll probably use "ear-cutting" for polygons unless someone has code I could steal for a faster method.  Can anyone suggest a method for triangulating a torus, or do I have to reinvent the wheel?
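For reference, the standard parametric tessellation of a torus (not taken from this thread; R is the major radius, r the minor radius, and the grid resolutions, struct names and winding are arbitrary choices) just walks a 2D grid of angles and emits two triangles per cell:
Code:
#include <cmath>
#include <vector>

struct V3  { float x, y, z; };
struct Tri { V3 a, b, c; };

// Point on the torus at grid cell (i, j): u runs around the main ring, v around the tube.
static V3 TorusPoint(float R, float r, int nu, int nv, int i, int j)
{
    const float pi = 3.14159265358979f;
    float u = 2*pi * (i % nu) / nu;
    float v = 2*pi * (j % nv) / nv;
    V3 p = { (R + r*std::cos(v)) * std::cos(u),
             (R + r*std::cos(v)) * std::sin(u),
              r*std::sin(v) };
    return p;
}

// Tessellate a torus (major radius R, minor radius r) into 2*nu*nv triangles.
std::vector<Tri> TessellateTorus(float R, float r, int nu, int nv)
{
    std::vector<Tri> tris;
    for (int i = 0; i < nu; ++i)
        for (int j = 0; j < nv; ++j) {          // two triangles per quad cell
            Tri t1 = { TorusPoint(R, r, nu, nv, i,   j),
                       TorusPoint(R, r, nu, nv, i+1, j),
                       TorusPoint(R, r, nu, nv, i+1, j+1) };
            Tri t2 = { TorusPoint(R, r, nu, nv, i,   j),
                       TorusPoint(R, r, nu, nv, i+1, j+1),
                       TorusPoint(R, r, nu, nv, i,   j+1) };
            tris.push_back(t1);
            tris.push_back(t2);
        }
    return tris;
}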
(L) [2008/09/25] [Shadow007] [Suggestions for Speedups] Wayback!

>> dr_eck wrote:As I read your code, it looks as if I would have to have only one type of ConcreteSurface in each leaf.  The only way I can think of to do this would be to have a separate BVH for each surface type.  Can you think of a better way?
One way would be to store per-type counts in the leaf (with the leaf's primitives grouped by type). This way, you would call the right templated intersection for each run of surfaces of one type ...
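A minimal sketch of that, assuming the leaf's primitives are grouped by type in surfs[], a hypothetical typeCount[] array in the node, and an IntersectRun<T> helper (a loop like raymo's templated one, restricted to n primitives starting at sp); none of these names come from the thread:
Code:
real AccBVH::IntersectMixedLeaf(Ray &r, const AccBVHNode *node, Surface*& surf)
{
   int sp = node->start;
   real tfirst = r.tmax;
   for (int type = 0; type < NUM_PRIM_TYPES; ++type) {
      int n = node->typeCount[type];        // hypothetical per-type count stored in the leaf
      if (n == 0) continue;
      switch (type) {                       // one dispatch per run of same-typed primitives
         case PRIM_TRIANGLE: tfirst = IntersectRun<SurfTri>(r, sp, n, surf, tfirst); break;
         case PRIM_SPHERE:   tfirst = IntersectRun<SurfSphere>(r, sp, n, surf, tfirst); break;
      }
      sp += n;                              // advance to the next run
   }
   return r.tmax = tfirst;
}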
(L) [2008/09/25] [raymo] [Suggestions for Speedups] Wayback!

A simple technique to make sure only one type is in a leaf: before creating a leaf, check whether its contents are all the same type. If not, set a split point at the first type difference.
Code:
// split = index of the split point; -1 means we intend to make a leaf here
if (split == -1) {
   for (int i = 1; i < end; ++i)          // scan the would-be leaf's contents
      if (type(i) != type(i - 1)) {       // first type change found
         split = i;
         break;
      }
}
if (split == -1) { makeleaf(); return; }  // all one type: safe to make a leaf
// recursive bit
makeBVHTree(0, split);    // left
makeBVHTree(split, end);  // right
That way you know that each leaf is of only one type, and it adds very little overhead to the tree building. It means you only need to store one type for each leaf. 99% of your leaves will be a single type anyway.
(L) [2008/09/25] [raymo] [Suggestions for Speedups] Wayback!

I'd be wary of going down the triangles-only route, IMHO.
A primitive in a leaf node can be arbitrarily complex: for example, I've got a Menger sponge, a sphereflake, a water surface, bezier surfaces and other weirdness as primitives. It also allows you to make a transform node for instancing, or a proxy node for streaming.
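A minimal sketch of such a transform node, assuming the Surface/Ray interface from earlier in the thread plus a hypothetical Mat4 type with inverse and transform helpers (none of this is raymo's actual code):
Code:
// A transform node is itself a Surface: it inverse-transforms the ray into the
// child's local space, intersects the (shared) child, and reports the hit.
// Many instances can point at the same child with different transforms.
class SurfTransform : public Surface {
public:
    SurfTransform(Surface *child, const Mat4 &objToWorld)
        : child_(child), worldToObj_(objToWorld.Inverse()) {}   // hypothetical Mat4 helpers

    real Intersect(Ray r) const {
        Ray local = r;
        local.o = worldToObj_.TransformPoint(r.o);
        local.d = worldToObj_.TransformVector(r.d);   // not renormalized, so t stays valid in world space
        return child_->Intersect(local);
    }

private:
    Surface *child_;      // shared geometry, stored once
    Mat4     worldToObj_; // per-instance placement
};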
(L) [2008/09/25] [raymo] [Suggestions for Speedups] Wayback!

Oh, and using spheres for particles works really well (they are faster than triangles), and it allows you to mix motion-blurred primitives with the faster static variants.
