Can't pass virtual function objects to CUDA kernel is a pain
(L) [2013/04/03] [ost by shiqiu1105] [Can't pass virtual function objects to CUDA kernel is a pain] Wayback!
In my CPU ray tracer, I used polymorphism intensively.
For example, I had a bunch of different light types, point light, area light, etc., all derived from a Light base class.
When rendering, all I need to do is loop through an array of Light pointers.
Now, without this capability, I have to explicitly store an array for each of the light types, and in order to query all light sources I need to loop through several arrays.
This is really ugly and inelegant to me.
Another alternative is to have a big switch-case clause and choose different query methods, which is also ugly.
I know that we can construct objects locally in the kernel code to call virtual functions too, but that's not cool to me either and I am afraid it will hurt performance.
So I am asking: is there a good solution to polymorphism in CUDA kernels? Such as a smart way to use templates so that I can call different implementations of a method just like in C++?
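For reference, the switch-based alternative mentioned above can be sketched as a tagged union: one flat struct for all light types, dispatched on a type tag. This is a minimal sketch, not the original poster's code; the `Light` fields, `sampleLight`, and the `sub` helper are all hypothetical names.

```cuda
// Tagged-union light: one flat struct covers all light types,
// dispatched with a switch instead of a virtual call.
enum LightType { POINT_LIGHT, AREA_LIGHT };

struct Light {
    LightType type;
    float3 position;   // used by both types
    float3 normal;     // area light only
    float  area;       // area light only
    float3 intensity;
};

// CUDA does not define float3 operators by default, so we need a helper.
__device__ inline float3 sub(float3 a, float3 b) {
    return make_float3(a.x - b.x, a.y - b.y, a.z - b.z);
}

__device__ float3 sampleLight(const Light& l, const float3& shadingPoint)
{
    switch (l.type) {
    case POINT_LIGHT:
        // direction toward an isotropic point source
        return sub(l.position, shadingPoint);
    case AREA_LIGHT:
        // placeholder: a real implementation would sample the surface;
        // here we just aim at the center
        return sub(l.position, shadingPoint);
    }
    return make_float3(0.f, 0.f, 0.f);
}

__global__ void shade(const Light* lights, int numLights)
{
    float3 shadingPoint = make_float3(0.f, 0.f, 0.f);  // hypothetical
    for (int i = 0; i < numLights; ++i) {
        float3 wi = sampleLight(lights[i], shadingPoint);
        // ... accumulate the light's contribution ...
        (void)wi;
    }
}
```

One upside of this layout over per-type arrays: the lights stay in a single contiguous array, so a single loop queries all of them.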
(L) [2013/04/06] [ost by keldor314] [Can't pass virtual function objects to CUDA kernel is a pain] Wayback!
A CPU will do polymorphism either through function pointers or through a big switch clause, depending on what the compiler thinks is fastest.  Hence, you have to eat that cost there too.
It's probably worth noting that GPUs should be able to handle function pointers more efficiently than CPUs for micro-architectural reasons (if you're interested, I can explain), so be sure to give them a try.
Looping through an array for each light type may actually be fastest, both on CPU and GPU, assuming you don't mess up data locality in the process.
(L) [2013/04/06] [ost by hobold] [Can't pass virtual function objects to CUDA kernel is a pain] Wayback!
>> keldor314 wrote: It's probably worth noting that GPUs should be able to handle function pointers more efficiently than CPUs for micro-architectural reasons (if you're interested, I can explain), so be sure to give them a try.
Are you referring to the GPU's latency hiding strategies with this comment? Or something else?
(L) [2013/04/06] [ost by graphicsMan] [Can't pass virtual function objects to CUDA kernel is a pain] Wayback!
Correct me if I'm wrong, but the problem is not that polymorphic (virtual) functions won't work in CUDA, it's that if you allocate the object on the CPU side, you can't copy the pointer to the GPU the way you do with simple structs and use the polymorphic functions.  This leaves you with two strategies: 1) manage polymorphism in a straight C fashion, or 2) have some kind of factory that can take CPU objects and explicitly make new GPU objects, which can then be used as normal.  Again, please correct me if I'm wrong; it's been several years since I programmed in CUDA.
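The factory idea in strategy 2 can be sketched as follows. The key point is that an object's vtable pointer is only valid for the address space it was constructed in, so the derived object must be constructed on the device, e.g. by a tiny "factory" kernel using placement new. All names here (`Light`, `PointLight`, `makePointLight`) are hypothetical; this is a sketch of the pattern, not a definitive implementation.

```cuda
#include <new>  // placement new

__device__ inline float3 sub(float3 a, float3 b) {
    return make_float3(a.x - b.x, a.y - b.y, a.z - b.z);
}

struct Light {
    __device__ virtual float3 sample(const float3& p) const = 0;
    __device__ virtual ~Light() {}
};

struct PointLight : Light {
    float3 position;
    __device__ PointLight(float3 pos) : position(pos) {}
    __device__ float3 sample(const float3& p) const override {
        return sub(position, p);
    }
};

// "Factory" kernel: constructs the derived object in device memory,
// so its vtable pointer refers to device code and virtual calls work
// in later kernels.
__global__ void makePointLight(Light** out, void* storage, float3 pos)
{
    if (threadIdx.x == 0 && blockIdx.x == 0)
        *out = new (storage) PointLight(pos);
}

// Host side (sketch):
//   Light** dLight;  void* dStorage;
//   cudaMalloc(&dLight, sizeof(Light*));
//   cudaMalloc(&dStorage, sizeof(PointLight));
//   makePointLight<<<1, 1>>>(dLight, dStorage, make_float3(0.f, 5.f, 0.f));
```

The plain data (positions, intensities) can still be copied up with cudaMemcpy as usual; only the construction of the polymorphic wrapper has to happen device-side.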
(L) [2013/04/06] [ost by fursund] [Can't pass virtual function objects to CUDA kernel is a pain] Wayback!
You might be able to solve your problems with function pointers?
[LINK http://stackoverflow.com/questions/9000388/device-function-pointers]
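As a rough sketch of that idea: `__device__` function pointers can be stored in a device-side table indexed by light type, replacing the virtual dispatch. The function names and table below are hypothetical illustrations, assuming a compute capability that supports device function pointers (2.0+).

```cuda
__device__ inline float3 sub(float3 a, float3 b) {
    return make_float3(a.x - b.x, a.y - b.y, a.z - b.z);
}

// One free function per light type, sharing a signature.
typedef float3 (*SampleFn)(const float3& lightPos, const float3& p);

__device__ float3 samplePoint(const float3& lightPos, const float3& p) {
    return sub(lightPos, p);
}
__device__ float3 sampleArea(const float3& lightPos, const float3& p) {
    return sub(lightPos, p);  // placeholder: would sample the surface
}

// Device-side dispatch table, indexed by a per-light type tag.
__device__ SampleFn sampleTable[2] = { samplePoint, sampleArea };

__global__ void shade(const int* lightTypes, const float3* lightPos, int n)
{
    float3 p = make_float3(0.f, 0.f, 0.f);  // hypothetical shading point
    for (int i = 0; i < n; ++i) {
        float3 wi = sampleTable[lightTypes[i]](lightPos[i], p);
        (void)wi;  // ... accumulate contribution ...
    }
}
```

Since the table lives in device memory, the pointers are valid device addresses; no host-side object copying is involved.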
(L) [2013/04/08] [ost by keldor314] [Can't pass virtual function objects to CUDA kernel is a pain] Wayback!
>> hobold wrote: keldor314 wrote: It's probably worth noting that GPUs should be able to handle function pointers more efficiently than CPUs for micro-architectural reasons (if you're interested, I can explain), so be sure to give them a try.
Are you referring to the GPU's latency hiding strategies with this comment? Or something else?
More or less - function pointers are a worst case for branch prediction, since they are heavily data dependent, and can go to any number of different addresses.  Hence, you're very likely to get a branch mispredict.
So what happens on a mispredict?  Basically, you have to flush all instructions in the pipeline, since these are from after the branch, and are invalid if you got the branch wrong.  This means something like 15 stages times maybe 5 superscalar ports = 75 instructions (actually, many CPUs are wider and have deeper pipelines).  In addition, any instructions that got issued early by out-of-order execution will also be invalidated.  Thus, the cost of a branch mispredict is a stall across about 100 instructions, which is rather expensive.
Now, what about a GPU?  A GPU will try to issue instructions from different threads every cycle, something like round robin among however many threads are resident on the processor, skipping over any threads that are stalled.  This means that when a thread hits a branch, it is simply marked as stalled, and not issued from again until after the 20 or so cycles it takes to go through the pipeline and resolve the branch.  Since it's aggressively cycling between threads, a given thread won't execute more than once every few cycles in most cases, meaning there's usually plenty of time to decode the instruction and determine whether it's a branch before issuing further instructions on that thread.  This means that for a GPU, there will usually be no stall following an indirect branch (function pointer).
(L) [2013/04/08] [ost by graphicsMan] [Can't pass virtual function objects to CUDA kernel is a pain] Wayback!
Right, the badness will occur if all your threads are invoking calls on different function pointers [SMILEY :)]  Divergence is your enemy on GPU.  If they all invoke the same virtual function (or function pointer), things are pretty good.
(L) [2013/04/30] [ost by lucian] [Can't pass virtual function objects to CUDA kernel is a pain] Wayback!
Speaking of virtual functions, is there a more elegant way to implement polymorphism in an OpenGL compute shader than with switches/branching?