Realtime Raytracing on an already busy CPU?
(L) [2014/07/03] [tby AranHase] [Realtime Raytracing on an already busy CPU?] Wayback!Hi, I'm hitting a major bottleneck in my system during the synchronization between my CPU and GPU and would like some help.
Right now I have a realtime raytracer working on a discrete GPU (OpenCL + AMD implementation). The raytracer is actually a very simple one and runs entirely on the GPU. It computes just one sample per pixel, and for every primary ray intersection it launches a single shadow ray. The GPU can render this fairly easily (200+ FPS).
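To give an idea of how simple the kernel is, here is a rough sketch of its structure (not my actual code: it traces a single hard-coded sphere with one point light instead of the octree, one sample per pixel, one shadow ray per hit):

    // OpenCL C. Illustrative only: hard-coded sphere + point light instead of the octree.
    __kernel void render(__global float4* out, int width, int height)
    {
        int x = get_global_id(0);
        int y = get_global_id(1);
        if (x >= width || y >= height) return;

        // Primary ray from a pinhole camera at the origin, looking down -z.
        float3 org = (float3)(0.0f, 0.0f, 0.0f);
        float3 dir = normalize((float3)((x + 0.5f) / width * 2.0f - 1.0f,
                                        1.0f - (y + 0.5f) / height * 2.0f,
                                        -1.0f));

        const float3 center = (float3)(0.0f, 0.0f, -3.0f);   // sphere
        const float  radius = 1.0f;
        const float3 light  = (float3)(2.0f, 2.0f, 0.0f);    // point light

        float3 color = (float3)(0.0f, 0.0f, 0.0f);

        // Primary ray vs. sphere.
        float3 oc = org - center;
        float b = dot(oc, dir);
        float c = dot(oc, oc) - radius * radius;
        float h = b * b - c;
        if (h > 0.0f) {
            float t = -b - sqrt(h);
            if (t > 0.0f) {
                float3 p = org + t * dir;
                float3 n = normalize(p - center);
                float3 l = normalize(light - p);

                // Single shadow ray: is the light visible from the hit point?
                float3 soc = (p + 1e-3f * n) - center;
                float sb = dot(soc, l);
                float sc = dot(soc, soc) - radius * radius;
                float sh = sb * sb - sc;
                int shadowed = (sh > 0.0f) && (-sb - sqrt(sh) > 0.0f);

                if (!shadowed)
                    color = (float3)(1.0f, 1.0f, 1.0f) * fmax(dot(n, l), 0.0f);
            }
        }

        out[y * width + x] = (float4)(color, 1.0f);
    }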
The model I'm rendering is actually being managed entirely on the CPU. The octree is created and modified on the CPU all the time; I have two threads running close to 100% doing it.
This works fine when I'm modifying just a small part of the octree, as I can send just a small chunk of memory to the GPU to update its state. The main problem I'm having right now is when many parts of the octree are modified: it can go from 200+ FPS to 10 FPS just because most of my time is spent copying memory from the CPU to the GPU.
One solution I see for this problem is to do the rendering entirely on the CPU, so no copy is needed. But I could not find any work that talks about doing rendering plus other computations on the CPU at the same time; most of the benchmarks focus just on the rendering part. Also, my CPU is not really that good (Intel i7 4770K Haswell). Another solution I'm thinking about is to use an APU (a GPU inside the CPU), so both devices share a unified memory and, again, no copy is needed.
So, I would like to know if someone around here has had any similar experience, or could give me a place to search for answers, or anything that can help me in any way. That would be much appreciated [SMILEY :)]
Thank you and sorry about my bad English...
(L) [2014/07/05] [tby lion] [Realtime Raytracing on an already busy CPU?] Wayback!For CPU rendering, try [LINK http://embree.github.io/]
How do you organize your octree? Is it a single OpenCL buffer?
(L) [2014/07/05] [tby AranHase] [Realtime Raytracing on an already busy CPU?] Wayback!>> lion wrote:For CPU rendering, try [LINK http://embree.github.io/]
I looked into Embree, and if I understood correctly they focus on photo-realistic rendering. What I want is actually a much simpler raytracer: no global illumination, no soft shadows, no AA, no reflection, no refraction. Just basic, direct, instant illumination. But since it's already done, I'll try doing some benchmarks with it. Thank you for the link.
 >> lion wrote:How do you organize your octree? Is it a single OpenCL buffer?
The octree is stored in a single contiguous one-dimensional array. Each node has a size of 64 bits: 32 bits are an "offset" to the first child node (the other children are stored in sequence). The "offset" is basically the node index in this array, so I can exchange it between the CPU and GPU without problems (no pointers). The next 16 bits I use for bit masks: 8 bits indicate which children are non-empty, and 8 bits indicate which ones are leaves. The remaining 16 bits are unused for now, but keep the data aligned.
* Ninja edit: Yes, I'm organizing it in an OpenCL buffer. I tried putting it in pinned memory to send it through DMA to the GPU, but did not get a huge performance gain.
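(In case it helps someone, the node described above maps to something like the struct below on the host side, together with a plain clCreateBuffer/clEnqueueWriteBuffer upload. Just a sketch with made-up names, not my exact code.)

    #include <CL/cl.h>
    #include <stdint.h>

    /* 64-bit octree node, matching the layout described above. The child
     * "offset" is an index into the same node array, so it is valid on both
     * the CPU and the GPU (no pointers). */
    typedef struct {
        uint32_t first_child;  /* index of the first child; siblings follow in sequence */
        uint8_t  valid_mask;   /* which of the 8 children are non-empty */
        uint8_t  leaf_mask;    /* which of the 8 children are leaves */
        uint16_t pad;          /* unused for now, keeps the node at 64 bits */
    } OctreeNode;

    /* Create the single OpenCL buffer and upload the whole node array once. */
    static cl_mem create_octree_buffer(cl_context ctx, cl_command_queue queue,
                                       const OctreeNode* nodes, size_t count,
                                       cl_int* err)
    {
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY,
                                    count * sizeof(OctreeNode), NULL, err);
        if (*err != CL_SUCCESS) return NULL;
        *err = clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0,
                                    count * sizeof(OctreeNode), nodes,
                                    0, NULL, NULL);
        return buf;
    }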
(L) [2014/07/06] [tby lion] [Realtime Raytracing on an already busy CPU?] Wayback!>> AranHase wrote:if I understood correctly they focus on photo-realistic rendering.
They split the renderer into two parts; the photo-realistic renderer is the other part: [LINK http://embree.github.io/renderer.html]
[LINK http://embree.github.io/] does only ray traversal and basic shading.
And how are static nodes fragmented with dynamic ones? I assume you update the GPU buffer either in a few big parts (affecting dynamic and static nodes alike) or in many small parts (but that increases the number of update commands).
Maybe do something to split the dynamic part from the static one, and update the dynamic part with a few commands and no data overhead?
Or move partial/full structure construction to the GPU too. Besides GPU-CPU memory sharing, I do not see any alternatives (even with UMA, or an equivalent where data is transferred through PCI).
(L) [2014/07/06] [tby AranHase] [Realtime Raytracing on an already busy CPU?] Wayback!>> lion wrote:They split the renderer into two parts; the photo-realistic renderer is the other part: [LINK http://embree.github.io/renderer.html]
[LINK http://embree.github.io/] does only ray traversal and basic shading.
Nice, I'll start working on embree soon.
 >> lion wrote:And how are static nodes fragmented with dynamic ones? I assume you update the GPU buffer either in a few big parts (affecting dynamic and static nodes alike) or in many small parts (but that increases the number of update commands).
At first I was sending the entire octree array to the GPU every frame. Obviously it was slow. So I divided the array into smaller chunks of memory: whenever a chunk changes, I mark it to be sent to the GPU, so only changed chunks are now uploaded. So, yes, static and dynamic nodes are all grouped in the same data structure. Here is a graph of how 8 MB chunks behave on the GPU:
[IMG #1 Image]
The blue line is the CPU->GPU copy and the green ones are the raytracer kernels. As can be seen, even relatively small chunks take some time to be uploaded (and some extra time doing nothing). And this is a well-behaved (70+ FPS) sample.
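(The update logic itself is roughly the sketch below: one dirty flag per chunk, and before launching the kernel I enqueue one write per dirty chunk. The chunk size and names are illustrative, not my exact code.)

    #include <CL/cl.h>
    #include <stdint.h>

    #define CHUNK_BYTES (8u * 1024u * 1024u)   /* 8 MB chunks, as in the graph above */

    /* The octree modification code calls this whenever it touches a node. */
    static void mark_dirty(uint8_t* dirty, size_t node_index, size_t node_size)
    {
        dirty[(node_index * node_size) / CHUNK_BYTES] = 1;
    }

    /* Before launching the raytracing kernel, upload only the chunks that changed. */
    static cl_int upload_dirty_chunks(cl_command_queue queue, cl_mem gpu_octree,
                                      const void* cpu_octree, size_t total_bytes,
                                      uint8_t* dirty)
    {
        size_t num_chunks = (total_bytes + CHUNK_BYTES - 1) / CHUNK_BYTES;
        for (size_t i = 0; i < num_chunks; ++i) {
            if (!dirty[i]) continue;
            size_t offset = i * CHUNK_BYTES;
            size_t size = (offset + CHUNK_BYTES <= total_bytes) ? CHUNK_BYTES
                                                                : total_bytes - offset;
            /* Non-blocking write; the in-order queue makes the kernel wait for it. */
            cl_int err = clEnqueueWriteBuffer(queue, gpu_octree, CL_FALSE, offset, size,
                                              (const char*)cpu_octree + offset,
                                              0, NULL, NULL);
            if (err != CL_SUCCESS) return err;
            dirty[i] = 0;
        }
        return CL_SUCCESS;
    }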
 >> lion wrote:Maybe do something to split the dynamic part from the static one, and update the dynamic part with a few commands and no data overhead? Or move partial/full structure construction to the GPU too. Besides GPU-CPU memory sharing, I do not see any alternatives (even with UMA, or an equivalent where data is transferred through PCI).
Yeah, I'm thinking I need a smarter way to exchange information between the devices. The problem is, most operations I'm doing are sequential, so there is no really good way to use the GPU for them. I guess I could try passing a list of changes to the GPU and letting the GPU apply the changes to global memory. But, in the example above, I actually made millions of changes to the octree, and only three chunks of memory changed.
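(By a "list of changes" I mean something like the tiny scatter kernel below, which applies (index, value) edits directly in GPU memory. Just a sketch, I haven't tried it, and with millions of edits per frame it would probably not beat the three chunk uploads anyway.)

    /* OpenCL C. One work-item per edit: edits[i].index says where in the
     * octree array the new 64-bit node value goes. Host/device struct layout
     * would need care; this is only to show the idea. */
    typedef struct {
        uint  index;   /* node index in the octree array */
        uint  pad;
        ulong value;   /* new 64-bit node contents */
    } Edit;

    __kernel void apply_edits(__global ulong* octree,
                              __global const Edit* edits,
                              uint num_edits)
    {
        uint i = get_global_id(0);
        if (i < num_edits)
            octree[edits[i].index] = edits[i].value;
    }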
My next step now is to try Embree and see if I can do realtime CPU rendering while other threads are consuming CPU compute power, cache/instruction memory (sigh), and main memory bandwidth.
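(For anyone following along, the Embree side of a "primary ray + one shadow ray" query looks roughly like the sketch below. I wrote it against the rtcIntersect1/rtcOccluded1 style of the current Embree documentation, so check the tutorials of whatever version you use; the single triangle is just a stand-in for real geometry.)

    #include <embree3/rtcore.h>
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        RTCDevice device = rtcNewDevice(NULL);
        RTCScene  scene  = rtcNewScene(device);

        /* One triangle as a stand-in for the real geometry. */
        RTCGeometry geom = rtcNewGeometry(device, RTC_GEOMETRY_TYPE_TRIANGLE);
        float* v = (float*)rtcSetNewGeometryBuffer(geom, RTC_BUFFER_TYPE_VERTEX, 0,
                                                   RTC_FORMAT_FLOAT3, 3 * sizeof(float), 3);
        unsigned* tri = (unsigned*)rtcSetNewGeometryBuffer(geom, RTC_BUFFER_TYPE_INDEX, 0,
                                                           RTC_FORMAT_UINT3, 3 * sizeof(unsigned), 1);
        v[0] = -1; v[1] = 0; v[2] = -3;   v[3] = 1; v[4] = 0; v[5] = -3;   v[6] = 0; v[7] = 1; v[8] = -3;
        tri[0] = 0; tri[1] = 1; tri[2] = 2;
        rtcCommitGeometry(geom);
        rtcAttachGeometry(scene, geom);
        rtcReleaseGeometry(geom);
        rtcCommitScene(scene);

        struct RTCIntersectContext ctx;
        rtcInitIntersectContext(&ctx);

        /* Primary ray towards the triangle. */
        struct RTCRayHit rh;
        rh.ray.org_x = 0; rh.ray.org_y = 0.3f; rh.ray.org_z = 0;
        rh.ray.dir_x = 0; rh.ray.dir_y = 0;    rh.ray.dir_z = -1;
        rh.ray.tnear = 0; rh.ray.tfar = INFINITY;
        rh.ray.mask = -1; rh.ray.time = 0; rh.ray.flags = 0; rh.ray.id = 0;
        rh.hit.geomID = RTC_INVALID_GEOMETRY_ID;
        rtcIntersect1(scene, &ctx, &rh);

        if (rh.hit.geomID != RTC_INVALID_GEOMETRY_ID) {
            /* One shadow ray from the hit point towards a point light. */
            struct RTCRay sh;
            sh.org_x = rh.ray.org_x + rh.ray.tfar * rh.ray.dir_x;
            sh.org_y = rh.ray.org_y + rh.ray.tfar * rh.ray.dir_y;
            sh.org_z = rh.ray.org_z + rh.ray.tfar * rh.ray.dir_z;
            sh.dir_x = 1; sh.dir_y = 1; sh.dir_z = 1;   /* direction to the light */
            sh.tnear = 1e-3f; sh.tfar = INFINITY;
            sh.mask = -1; sh.time = 0; sh.flags = 0; sh.id = 0;
            rtcOccluded1(scene, &ctx, &sh);             /* sets tfar to -inf if occluded */
            printf("hit, %s\n", sh.tfar < 0 ? "in shadow" : "lit");
        }

        rtcReleaseScene(scene);
        rtcReleaseDevice(device);
        return 0;
    }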
(L) [2014/07/06] [tby Dade] [Realtime Raytracing on an already busy CPU?] Wayback!>> AranHase wrote:The blue line is the CPU->GPU copy and the green ones are the raytracer kernels. As can be seen, even relatively small chunks take some time to be uploaded (and some extra time doing nothing). And this is a well-behaved (70+ FPS) sample.
To start, you should overlap OpenCL buffer write operations with kernel execution. How to achieve this depends a bit on the OpenCL implementation you are using, but you are probably going to need 2 buffers (one in use by the executing kernel and the other being updated by the CPU). It will work like a 2-stage pipeline. Indeed, this will increase the throughput (i.e. fps) but it will not reduce the per-frame latency.
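Roughly, the host side of that pipeline looks like the sketch below (two in-order queues, one for the kernel and one for the transfers, with events for the dependency; error handling omitted and names made up):

    #include <CL/cl.h>

    /* 2-stage pipeline: while the kernel reads octree buffer A, the CPU
     * uploads the next frame's data into buffer B, then the roles swap.
     * Assumes octree[0] already contains valid data for the first frame. */
    void render_loop(cl_command_queue exec_q, cl_command_queue copy_q,
                     cl_kernel kernel, cl_mem octree[2],
                     const void* cpu_octree, size_t bytes, int frames)
    {
        cl_event upload = NULL, run = NULL;
        size_t gsize[2] = {1920, 1080};

        for (int f = 0; f < frames; ++f) {
            int cur = f & 1;        /* buffer the kernel reads this frame  */
            int nxt = cur ^ 1;      /* buffer being filled for next frame  */

            /* Launch the kernel on the current buffer (waits for its upload). */
            clSetKernelArg(kernel, 0, sizeof(cl_mem), &octree[cur]);
            clEnqueueNDRangeKernel(exec_q, kernel, 2, NULL, gsize, NULL,
                                   upload ? 1 : 0, upload ? &upload : NULL, &run);
            if (upload) clReleaseEvent(upload);

            /* Meanwhile, upload the next frame's octree into the other buffer. */
            clEnqueueWriteBuffer(copy_q, octree[nxt], CL_FALSE, 0, bytes,
                                 cpu_octree, 0, NULL, &upload);

            clFlush(exec_q);
            clFlush(copy_q);
            clWaitForEvents(1, &run);
            clReleaseEvent(run);
            run = NULL;
        }
        if (upload) { clWaitForEvents(1, &upload); clReleaseEvent(upload); }
    }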
As a general rule with OpenCL (and GPGPU in general), CPU <-> GPU bandwidth is quite high compared to the overhead of starting a CPU <-> GPU transfer: your idea of breaking the octree buffer into multiple chunks and issuing multiple buffer writes is probably slower than a single large buffer write operation. In my experience, transferring 8 MB or 80 MB takes about the same time once you include the driver overhead.
(L) [2014/07/06] [tby AranHase] [Realtime Raytracing on an already busy CPU?] Wayback!>> Dade wrote:To start, you should overlap OpenCL buffer write operations with kernel execution. How to achieve this depends a bit on the OpenCL implementation you are using, but you are probably going to need 2 buffers (one in use by the executing kernel and the other being updated by the CPU). It will work like a 2-stage pipeline. Indeed, this will increase the throughput (i.e. fps) but it will not reduce the per-frame latency.
Thank you Dade. I thought it was impossible on AMD hardware because the queue is always in-order, but it seems it may be possible using two command queues (my google-fu is returning mixed results). I'll try it later and see how the code deals with two queues.
 >> Dade wrote:As a general rule with OpenCL (and GPGPU in general), CPU <-> GPU bandwidth is quite high compared to the overhead of starting a CPU <-> GPU transfer: your idea of breaking the octree buffer into multiple chunks and issuing multiple buffer writes is probably slower than a single large buffer write operation. In my experience, transferring 8 MB or 80 MB takes about the same time once you include the driver overhead.
Oh yes, you are completely right. My little test was just to see how small chunks behave, but I do get better performance than sending the entire array at once. There should be a sweet spot somewhere, but I doubt it will improve performance much; the overhead is way too high for an interactive application.
(L) [2014/07/07] [tby Dade] [Realtime Raytracing on an already busy CPU?] Wayback!>> AranHase wrote:Thank you Dade. I thought it was impossible on AMD hardware because the queue is always in-order, but it seems it may be possible using two command queues (my google-fu is returning mixed results). I'll try it later and see how the code deals with two queues.
There should be an example of how to obtain overlapped transfers inside the AMD OpenCL SDK (checking ... it is in the samples/opencl/cl/TransferOverlap directory).
(L) [2014/07/08] [tby AranHase] [Realtime Raytracing on an already busy CPU?] Wayback!>> Dade wrote:There should be an example of how to obtain overlapped transfers inside the AMD OpenCL SDK (checking ... it is in the samples/opencl/cl/TransferOverlap directory).
Thank you again Dade. I somehow missed the samples in the AMD APP SDK. "TransferOverlap" is a very interesting sample. They use a zero-copy buffer flag I didn't know about. It's AMD-specific by the looks of it, and the sample runs way faster than with the flags I've been using. I'll do some more testing.
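(For reference, I believe the flag in question is CL_MEM_USE_PERSISTENT_MEM_AMD from the AMD extension header, and the idea is to map the buffer and write into it directly instead of going through clEnqueueWriteBuffer. The sketch below is my rough understanding, not the sample's code.)

    #include <CL/cl.h>
    #include <CL/cl_ext.h>   /* CL_MEM_USE_PERSISTENT_MEM_AMD lives here on AMD platforms */
    #include <string.h>

    /* Create a buffer the host can map with (ideally) no staging copy. */
    cl_mem make_zero_copy_buffer(cl_context ctx, size_t bytes, cl_int* err)
    {
        return clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_USE_PERSISTENT_MEM_AMD,
                              bytes, NULL, err);
    }

    /* Update it by mapping instead of clEnqueueWriteBuffer. */
    cl_int update_buffer(cl_command_queue q, cl_mem buf, const void* src, size_t bytes)
    {
        cl_int err;
        void* p = clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_WRITE, 0, bytes,
                                     0, NULL, NULL, &err);
        if (err != CL_SUCCESS) return err;
        memcpy(p, src, bytes);   /* CPU writes go straight into the buffer */
        return clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL);
    }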