(L) [2014/02/01] [tby lion] [CUDA and branch prediction] Branching is a costly operation on the GPU, so we want to control branch weights when the compiler fails to do so.
Is there an analog of __builtin_expect() or __assume() for CUDA?
The same question was asked on the NVIDIA forum [LINK https://devtalk.nvidia.com/default/topic/458657/non-divergent-branch-nvcc-and-ptx/] but no one has answered.
My current workaround is to compile to PTX, modify it by hand, and then assemble it.
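As a sketch, that workflow could look like the following (the file names and the sm_35 target are placeholders; nvcc's -ptx flag and the ptxas assembler are the standard tools):

```
# Emit PTX instead of a device binary
nvcc -ptx kernel.cu -o kernel.ptx

# ... hand-edit the branches in kernel.ptx, e.g. change bra to bra.uni
# where you know all active threads take the branch together ...

# Assemble the modified PTX into a cubin for a specific architecture
ptxas -arch=sm_35 kernel.ptx -o kernel.cubin
```

The resulting cubin can then be loaded at runtime through the driver API (cuModuleLoad and friends) instead of the usual nvcc-linked fatbinary.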
(L) [2014/02/09] [tby papaboo] [CUDA and branch prediction] I haven't found anything similar to it, no. However, branching is mostly performance-intensive when some threads in a warp (or half-warp) take divergent paths. So instead of focusing on optimizing the branch itself, you might want to look at your overall algorithm and see if you can launch warps that take a relatively coherent path through the code.
For example, I recently sped up my path tracer by 2x by implementing a coherent tracing scheme, [LINK http://graphics.ucsd.edu/~henrik/papers/coherent_path_tracing.pdf], where all rays in a warp use the same random sequence. This did not remove any branches from the codebase, but it increased the likelihood that threads in the same warp take similar paths and thus perform the same computations and fetch the same memory.
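A minimal sketch of the per-warp random sequence idea, assuming the cuRAND device API (the kernel name and state layout are illustrative, not from the paper):

```cuda
#include <curand_kernel.h>

__global__ void init_states(curandState *states, unsigned long long seed)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Use the warp index, not the thread index, as the subsequence:
    // all 32 threads of a warp then draw identical random numbers,
    // so branches that depend only on those numbers never diverge
    // within the warp.
    int warp = tid / warpSize;
    curand_init(seed, warp, 0, &states[tid]);
}
```

Compare with the usual per-thread seeding, curand_init(seed, tid, 0, ...), which makes every thread's random walk independent and therefore divergent.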
If divergence is a huge problem, another approach is to periodically re-sort your input data to achieve some coherence.
If you can't ensure that your threads are coherent, some general tips to reduce the overhead are:
(Partial) unrolling of tight loops. The extra register for the loop variable and the increment and compare instructions can be significant overhead in a tight loop.
Use the ternary operator b ? x : y. As far as I know this compiles to a branchless predicated-select instruction.
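Both tips in one hedged sketch (the clamp-and-scale kernel is made up for illustration; #pragma unroll is a real nvcc directive):

```cuda
__global__ void clamp_scale(float *v, int n, float a, float lo, float hi)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    // Partial unrolling: nvcc replicates the body four times, so the
    // per-iteration loop-counter increment and compare-and-branch
    // disappear from the inner loop.
    #pragma unroll
    for (int j = 0; j < 4; ++j) {
        int i = base + j;
        if (i < n) {
            float x = v[i] * a;
            // Scalar ternaries typically compile to predicated
            // selp (or min/max) instructions rather than branches.
            x = (x < lo) ? lo : x;
            x = (x > hi) ? hi : x;
            v[i] = x;
        }
    }
}
```

Whether the ternary actually stays branchless depends on the compiler and how heavy the two arms are, so it is worth checking the generated PTX/SASS.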
(L) [2014/02/09] [tby lion] [CUDA and branch prediction] Yep, I know about that; we need the whole warp to execute exactly the same instructions to achieve full utilization.
I also noticed that converting bra to bra.uni instructions has no effect in simple kernels (probably a hardware branch prediction buffer), but gives some gain in complex kernels.
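For reference, the two branch forms in PTX (a hand-written illustration, not compiler output; register and label names are made up):

```
    setp.lt.s32  %p1, %r1, %r2;   // p1 = (r1 < r2)
@%p1 bra         L_then;          // conditional branch: may diverge
     bra.uni     L_join;          // .uni asserts all active threads
                                  // branch together (non-divergent)
```

The .uni suffix is a promise to the hardware that no divergence occurs at that branch; it is only safe to add by hand when you can prove all active threads take the branch uniformly.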