Hybrid : CPU & GPU strategies
(L) [2014/01/17] [spectral] [Hybrid : CPU & GPU strategies] Hi there,
I'm currently playing with a CPU & GPU hybrid rendering approach... the goal is to "plug" the current GPU intersection engine into an existing CPU renderer.
So, here are the performance numbers I got:
Pure GPU : 180 MRPS
Hybrid : 12 MRPS
The GPU usage is:
Pure GPU : 97%
Hybrid : 37%
So, the hybrid approach is currently so slow that I think there must be a problem somewhere, simply because there is almost no CPU work: no complex shading, no bounces, etc. I use simple AO shading and rays always restart from the camera.
Some pseudo-code that shows how it works:
Code:
// Launch 8 render threads; each thread owns its own GPU command queue.
for (int i = 0; i < 8; i++)
    _renderThreads[i].Start();

class RenderThread
{
    void Run()
    {
        while (true)
        {
            GenerateCameraRays(rays, 200000);                   // 200,000 rays per set
            gpuQueue[threadId]->SendToGPU(rays);                // upload the ray batch
            gpuQueue[threadId]->ExecuteRaysIntersections(rays); // run the intersection kernel
            gpuQueue[threadId]->WaitForGPUCompletion();         // blocks: the CPU idles here
            gpuQueue[threadId]->ReadGPUHits(hits);              // download the hit results
            Shade(hits);                                        // CPU-side AO shading
        }
    }
};
So, I'm looking to improve the speed... if someone has experience with this, any comment is welcome [SMILEY ;-)]
(L) [2014/01/17] [jbikker] [Hybrid : CPU & GPU strategies] I suppose your GPU is spending too much time waiting for the transfers (and the CPU work). One thing you could do is to make the transfers asynchronous: have a double buffer for the rays (2x 200k), and fill one while the other is processed. In CUDA, the processing will suffer very little from all the copying going on in the background.
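For illustration, a minimal CUDA sketch of that double-buffering idea (an editorial sketch, not code from this thread; Ray, Hit and intersectKernel are placeholders, and GenerateCameraRays/Shade stand for the routines in the pseudo-code above):
Code:
// Two pinned buffers and two streams: the CPU fills one ray batch while
// the GPU intersects the other. Ray/Hit/intersectKernel are placeholders.
const int N = 200000;
Ray* hostRays[2]; Hit* hostHits[2];
Ray* devRays[2];  Hit* devHits[2];
cudaStream_t stream[2];

for (int i = 0; i < 2; i++)
{
    cudaHostAlloc(&hostRays[i], N * sizeof(Ray), cudaHostAllocDefault); // pinned
    cudaHostAlloc(&hostHits[i], N * sizeof(Hit), cudaHostAllocDefault);
    cudaMalloc(&devRays[i], N * sizeof(Ray));
    cudaMalloc(&devHits[i], N * sizeof(Hit));
    cudaStreamCreate(&stream[i]);
}

// Enqueue upload -> intersection -> download for buffer i on its stream.
auto issue = [&](int i)
{
    cudaMemcpyAsync(devRays[i], hostRays[i], N * sizeof(Ray),
                    cudaMemcpyHostToDevice, stream[i]);
    intersectKernel<<<(N + 255) / 256, 256, 0, stream[i]>>>(devRays[i], devHits[i], N);
    cudaMemcpyAsync(hostHits[i], devHits[i], N * sizeof(Hit),
                    cudaMemcpyDeviceToHost, stream[i]);
};

int cur = 0;
GenerateCameraRays(hostRays[cur], N);
issue(cur);
while (true)
{
    int next = cur ^ 1;
    GenerateCameraRays(hostRays[next], N); // CPU fills the other buffer...
    issue(next);                           // ...and queues it behind the current batch
    cudaStreamSynchronize(stream[cur]);    // wait only for the current batch
    Shade(hostHits[cur]);                  // CPU shading overlaps the next batch's GPU work
    cur = next;
}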
(L) [2014/01/17] [macnihilist] [Hybrid : CPU & GPU strategies] I actually tried the same thing, because "C++ on CPU for complex rendering algorithms and shading and GPU for chewing through large intersection batches" sounded really nice.
Also, PCIe throughput seemed reasonable enough these days to try this if you're not after real-time.
In practice, it (at least my approach) didn't work so well.
There was some improvement from using the GPU as an additional 'coprocessor', but it was in no way using the GPU to its full potential.
I didn't use async transfers to the GPU as jbikker suggested, but at least the CPU was doing some parallel work while waiting for the results from the GPU.
Async or not, I found it very hard to keep the workload balanced; the devices were constantly waiting for each other.
Another (smaller) problem was the huge memory consumption you get with large batches (it's not only the rays/results but also the 'interim results' you have to save for each ray/path so the integrator can continue once the results are there).
Maybe I just wasn't putting enough thought into it or doing something stupid, but I decided to ditch the approach (I'm mainly after interactivity, not flexibility and complex scenes).
What I'm trying now is writing the renderer (integrator/shader) as a C kernel and then compiling that to CUDA/OpenCL/ISPC.
Then there are intersection engines for each compilation target that allow the whole thing to run on a single device.
So, it is similar to what Cycles does, but not in a megakernel style: the renderer kernels and the intersection kernels are separated.
You can then use multiple devices to render a single image, but the devices are more decoupled and can run more concurrently (it's actually almost the same as cluster rendering over the network).
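As a rough illustration of that split (a hypothetical interface, not macnihilist's actual code; all names are made up): the integrator and the intersector only meet through flat ray/hit buffers, so each side can be compiled separately per target:
Code:
// Hypothetical renderer/intersector split (CUDA shown; the OpenCL/ISPC
// builds would share the same signatures and buffer layout).
struct Ray { float ox, oy, oz, dx, dy, dz, tmax; };
struct Hit { float t, u, v; int prim; };

// Intersection kernel: knows nothing about shading or integration.
__global__ void intersectKernel(const Ray* rays, Hit* hits, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    hits[i].t = -1.0f; hits[i].prim = -1; // stub: a real version traverses the BVH here
}

// Renderer kernel: consumes hits, shades, and emits rays for the next bounce.
__global__ void integratorKernel(const Hit* hits, Ray* nextRays, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // stub: a real version updates path state and writes nextRays[i]
}

// Host driver: alternate the two kernels; all state stays on one device,
// so nothing crosses PCIe between bounces.
void renderBounces(Ray* devRays, Hit* devHits, int numRays, int maxDepth)
{
    for (int bounce = 0; bounce < maxDepth; bounce++)
    {
        intersectKernel<<<(numRays + 255) / 256, 256>>>(devRays, devHits, numRays);
        integratorKernel<<<(numRays + 255) / 256, 256>>>(devHits, devRays, numRays);
    }
    cudaDeviceSynchronize();
}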
This is all in the early stages and I don't have any reliable numbers, but it seems to work much better.
Of course, you can't use all the nice C++ features and existing code.
(Which is really sad, because most of this stuff is just a compilation problem in the end -- not having virtual, templates, and operator overloading can be quite annoying...)
Well, after reading this again I realize this little experience report probably won't help you much with your actual problem, but you said any comment was welcome. [SMILEY ;)]
(L) [2014/01/17] [spectral] [Hybrid : CPU & GPU strategies] >> jbikker wrote:I suppose your GPU is spending too much time waiting for the transfers (and the CPU work). One thing you could do is to make the transfers asynchronous: have a double buffer for the rays (2x 200k), and fill one while the other is processed. In CUDA, the processing will suffer very little from all the copying going on in the background.
Hi Jacco,
That is what I would also expect... but note that in the current case I have 8 threads... and 8 "GPU command queues". So, I should have a lot of parallel asynchronous send/receive/execute commands in the queues...
... in the end it should be the same as having several "sets" in the same thread... or am I forgetting something [SMILEY :-P]
(L) [2014/01/17] [Dietger] [Hybrid : CPU & GPU strategies] >> spectral wrote:That is what I would also expect... but note that in the current case I have 8 threads... and 8 "GPU command queues". So, I should have a lot of parallel asynchronous send/receive/execute commands in the queues...
... in the end it should be the same as having several "sets" in the same thread... or am I forgetting something
Issuing memory transfers and kernel executions from different threads is not enough to achieve concurrent memory transfer and kernel execution. In CUDA, only memory transfers from pinned memory can overlap with kernel execution on the GPU, and only if both are issued on different (non-default) streams.
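In code form, the two requirements look something like this (a minimal sketch; Ray and devRays are placeholders, and this is also why the double-buffering sketch above uses cudaHostAlloc rather than malloc):
Code:
// A transfer can only overlap kernel execution if the host buffer is
// pinned AND the copy is issued on a non-default stream.
Ray* rays;
cudaHostAlloc(&rays, N * sizeof(Ray), cudaHostAllocDefault); // pinned (page-locked)

cudaStream_t s;
cudaStreamCreate(&s);                                        // non-default stream
cudaMemcpyAsync(devRays, rays, N * sizeof(Ray),
                cudaMemcpyHostToDevice, s);

// With plain malloc'd (pageable) memory, cudaMemcpyAsync degrades to a
// staged copy that cannot overlap kernel execution, whatever stream you use.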
(L) [2014/01/17] [spectral] [Hybrid : CPU & GPU strategies] Thanks Dietger,
But I also have a different set of pinned memory for each thread!
Of course, I use the same "command queue" for a thread's memory transfers and its kernel execution [SMILEY :-P]
(L) [2014/02/10] [tarlack] [Hybrid : CPU & GPU strategies] Hello,
Is your GPU workload linear with respect to the CPU workload? (you have N rays on CPU, and k*N work on GPU?) If so, unless k is gigantic, I don't think it will work; I think your best bet is to find another formulation of your rendering algorithm that exhibits something like N tasks on CPU -> N^a tasks on GPU, with a > 1.
For complex scenes, CPU/GPU hybridization is not just a matter of compilation. GPU-only algorithms cannot nicely handle tens of gigabytes of textures, arbitrarily complex shader code, measured BRDFs, BTFs and so on while making rays bounce everywhere in the scene. Even clustering and using on-device caches does not seem obvious, because assuming coherence in ray space for global illumination after even a few (say, two or three) diffuse bounces seems quite dubious to me. But again, I'm talking about complex (large) scenes, not the Cornell box or a (set of) glasses on a table, which, although very complex to render, I do not consider complex scenes. But put those glasses in a complete Scottish pub, with all the tables, objects, mirrors, food, a measured wood BRDF, a BTF for the fabric of the barman's kilt and all the other fabric clothes of all the clients, tens of small lights, paintings, curved shiny chrome, plants, etc., and now we are talking about a (somewhat more) complex scene...
(L) [2014/02/10] [spectral] [Hybrid : CPU & GPU strategies] Sure,
But here I'm just looking to "improve" the speed of an existing CPU renderer (only the intersection test)... all I need to store on the GPU is the geometry!
So, I agree with your vision of GPU renderers, but notice that over time GPUs get more & more memory... more & more cores, etc... so it is just a question of time, and even now GPUs perform very well for some scenes [SMILEY ;-)]
(L) [2014/02/10] [tarlack] [Hybrid : CPU & GPU strategies] Just for the intersection test, it is likely that memory throughput will be your main problem as long as you stick to one ray produced by the CPU -> one GPU task, even with double buffering and async transfers. Could you find a way to build something on the CPU that can "explode" on the GPU, even just for isect tests? Something like computing the parameters of a "procedural ray production" system?
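For instance (a purely hypothetical sketch, with made-up names): the CPU uploads only a small per-batch descriptor, and a GPU kernel expands it into the full ray set, so the bus carries a few dozen bytes per batch instead of megabytes of rays:
Code:
// Hypothetical "procedural ray production": the CPU sends this small
// descriptor; a GPU kernel expands it into the whole camera-ray batch.
struct RayBatchDesc
{
    float camPos[3];
    float camRight[3], camUp[3], camForward[3]; // camera basis
    int   tileX, tileY, tileW, tileH;           // pixel region to cover
    unsigned seed;                              // per-batch jitter seed
};

__global__ void generateRaysKernel(RayBatchDesc desc, Ray* rays, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int px = desc.tileX + i % desc.tileW;
    int py = desc.tileY + i / desc.tileW;
    rays[i] = MakePrimaryRay(desc, px, py, desc.seed + i); // made-up helper
}

// Host side per batch: upload ~60 bytes, then chain generateRaysKernel and
// the intersection kernel on the same stream; only the hits come back.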
(L) [2014/02/10] [Dade] [Hybrid : CPU & GPU strategies] >> tarlack wrote:Just for the intersection test, it is likely that memory throughput will be your main problem as long as you stick to one ray produced by the CPU -> one GPU task, even with double buffering and async transfers. Could you find a way to build something on the CPU that can "explode" on the GPU, even just for isect tests? Something like computing the parameters of a "procedural ray production" system?
Check: [LINK http://www.researchgate.net/publication/220507126_Combinatorial_Bidirectional_Path-Tracing_for_Efficient_Hybrid_CPUGPU_Rendering/file/9c96051824f2341ee2.pdf]
However, in practice, hybrid rendering will never provide the kind of speedup you can achieve with a full OpenCL renderer (or CUDA or whatever you are using).