(L) [2013/04/06] [Post by shiqiu1105] [Random number generator on GPU]
I am currently writing my own GPU tracer.
I now need random numbers, or quasi-random numbers, for sampling.
How do I get random numbers with CUDA? What do people usually do?
Generate a bunch of random numbers on the CPU and pass them to the kernel? Use a library to generate numbers directly in the kernel?
Or implement their own RNG in the kernel?
(L) [2013/04/06] [Post by keldor314] [Random number generator on GPU]
Just write your own. This is the one I use:
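Code:
// Per-component Tausworthe parameters; each thread uses component (threadIdx.x & 3).
__constant__ unsigned int shift1[4] = {6, 2, 13, 3};
__constant__ unsigned int shift2[4] = {13, 27, 21, 12};
__constant__ unsigned int shift3[4] = {18, 2, 7, 13};
__constant__ unsigned int offset[4] = {4294967294u, 4294967288u, 4294967280u, 4294967168u};

// One 32-bit state word per lane; seed it before the first call.
__shared__ unsigned int randStates[32];

// Advances one Tausworthe component in place and returns the new state.
__device__ unsigned int TausStep(unsigned int &z)
{
    int index = threadIdx.x & 3;
    unsigned int b = (((z << shift1[index]) ^ z) >> shift2[index]);
    return z = (((z & offset[index]) << shift3[index]) ^ b);
}

// Steps this thread's state, then XORs it with the next three lanes' states.
// Note: there is no synchronization here, so this relies on the threads
// that share randStates running in lockstep.
__device__ unsigned int randInt()
{
    TausStep(randStates[threadIdx.x & 31]);
    return (randStates[(threadIdx.x) & 31] ^ randStates[(threadIdx.x + 1) & 31] ^ randStates[(threadIdx.x + 2) & 31] ^ randStates[(threadIdx.x + 3) & 31]);
}

// Returns a random float in [0,1) and updates the shared state.
__device__ float randFloat()
{
    unsigned int y = randInt();
    // Put 23 random bits in the mantissa of a float in [1,2), then shift down to [0,1).
    return __int_as_float((y & 0x007FFFFF) | 0x3F800000) - 1.0f;
}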
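A quick way to sanity-check the float conversion is to reproduce it on the CPU. The sketch below is a host-side C++ stand-in, not the device code: `bitsToUnitFloat` is a hypothetical name, and `std::memcpy` takes the place of `__int_as_float`. It shows that the result lies in [0,1) and never reaches 1.

```cpp
#include <cstdint>
#include <cstring>

// Host-side stand-in for the randFloat() bit trick (hypothetical helper name).
// Keeps the low 23 bits of y as the mantissa and forces the exponent of 1.0f,
// which yields a float in [1,2), then subtracts 1 to land in [0,1).
float bitsToUnitFloat(std::uint32_t y)
{
    std::uint32_t bits = (y & 0x007FFFFFu) | 0x3F800000u;
    float f;
    std::memcpy(&f, &bits, sizeof f);  // stands in for __int_as_float on the device
    return f - 1.0f;
}
```

For example, `bitsToUnitFloat(0)` is 0.0f, `bitsToUnitFloat(0x00400000)` is 0.5f, and the largest possible result, from `0xFFFFFFFF`, is 1 - 2^-23.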
You seed the RNG from the CPU - just pass in an array of 32 random ints and copy it into randStates.  Just be sure that the CPU is using a different algorithm!
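The seeding step might look like the following host-side sketch. It is an illustration under assumptions, not the code from this post: `makeSeeds` and `kMinSeed` are hypothetical names, and `std::mt19937` is just one possible choice of a different algorithm on the CPU. The minimum values come from the usual Tausworthe requirement that the bits kept by each `offset` mask must not all be zero, since otherwise that state word collapses to zero and stays there.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <random>

// Smallest usable seed for each of the four components (index = lane & 3).
// These are the lowest set bits of the offset masks in the posted code.
constexpr std::uint32_t kMinSeed[4] = {2u, 8u, 16u, 128u};

// Builds 32 seed words on the host with a generator unrelated to Tausworthe.
std::array<std::uint32_t, 32> makeSeeds(std::uint32_t hostSeed)
{
    std::mt19937 gen(hostSeed);
    std::array<std::uint32_t, 32> seeds{};
    for (std::size_t i = 0; i < seeds.size(); ++i) {
        std::uint32_t s = static_cast<std::uint32_t>(gen());
        // Avoid seeds whose masked bits are all zero by nudging them upward.
        if (s < kMinSeed[i & 3]) s += kMinSeed[i & 3];
        seeds[i] = s;
    }
    return seeds;
}
```

The resulting array would then be passed to the kernel (for instance through a device buffer filled with `cudaMemcpy`), and the first 32 threads of each block would copy one word each into `randStates` before the first `randInt()` call.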
It's a variant of the Tausworthe RNG, optimized to partially share state between threads.  This gives it a tiny seed footprint (32 bits per thread, which reduces cache/shared memory pressure) while retaining much higher quality than non-shared generators with 32 bits of state per thread.
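To make the state sharing concrete, here is a host-side C++ emulation of one warp. It is only a sketch under assumptions: `WarpRng`, `tausStep`, and `next` are hypothetical names, and the two loops model the 32 lanes running in lockstep (all lanes step their own word, then all lanes read their neighbors), which is one reading of the posted code.

```cpp
#include <array>
#include <cstdint>

constexpr std::uint32_t kShift1[4] = {6, 2, 13, 3};
constexpr std::uint32_t kShift2[4] = {13, 27, 21, 12};
constexpr std::uint32_t kShift3[4] = {18, 2, 7, 13};
constexpr std::uint32_t kMask[4] = {4294967294u, 4294967288u, 4294967280u, 4294967168u};

// One Tausworthe step for component c (0..3), same formula as TausStep.
std::uint32_t tausStep(std::uint32_t z, unsigned c)
{
    std::uint32_t b = ((z << kShift1[c]) ^ z) >> kShift2[c];
    return ((z & kMask[c]) << kShift3[c]) ^ b;
}

// Emulates the 32 shared state words of one warp (hypothetical type).
struct WarpRng {
    std::array<std::uint32_t, 32> state;

    // Returns what each of the 32 lanes would get from one randInt() call.
    std::array<std::uint32_t, 32> next()
    {
        // Phase 1: every lane advances its own word.
        for (unsigned lane = 0; lane < 32; ++lane)
            state[lane] = tausStep(state[lane], lane & 3);
        // Phase 2: every lane XORs four consecutive words, which always
        // covers exactly one word from each of the four components.
        std::array<std::uint32_t, 32> out{};
        for (unsigned lane = 0; lane < 32; ++lane)
            out[lane] = state[lane] ^ state[(lane + 1) & 31]
                      ^ state[(lane + 2) & 31] ^ state[(lane + 3) & 31];
        return out;
    }
};
```

Each lane stores only one 32-bit word, yet every output mixes four words, one per component, and neighboring lanes share three of their four inputs.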
There's some room for optimization: even just moving the shifts and offsets out of constant memory and into shared memory should help performance on Kepler (though not on Fermi or earlier).  A downside on recent hardware is that this generator uses a number of bit shifts.  You could modify it to turn the left shifts into multiplies, which is a win on Kepler but not so much on other hardware (I don't think it's actually a loss anywhere, though).  There's nothing you can do about the right shift, however.
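The left-shift-to-multiply idea can be sketched as follows, in host-side C++ so it is easy to check. The names `tausStepShift`, `tausStepMul`, and the `kMul*` tables are hypothetical. The multipliers are simply `1u << shift`, and unsigned 32-bit multiplication wraps modulo 2^32 exactly like a left shift, so both versions give identical results. Whether the multiply is actually faster depends on the hardware, which this sketch does not measure.

```cpp
#include <cstdint>

constexpr std::uint32_t kShift1[4] = {6, 2, 13, 3};
constexpr std::uint32_t kShift2[4] = {13, 27, 21, 12};
constexpr std::uint32_t kShift3[4] = {18, 2, 7, 13};
constexpr std::uint32_t kMask[4] = {4294967294u, 4294967288u, 4294967280u, 4294967168u};

// Precomputed multipliers: kMul1[c] == 1u << kShift1[c], kMul3[c] == 1u << kShift3[c].
constexpr std::uint32_t kMul1[4] = {64u, 4u, 8192u, 8u};
constexpr std::uint32_t kMul3[4] = {262144u, 4u, 128u, 8192u};

// Reference version with left shifts, same formula as the posted TausStep.
std::uint32_t tausStepShift(std::uint32_t z, unsigned c)
{
    std::uint32_t b = ((z << kShift1[c]) ^ z) >> kShift2[c];
    return ((z & kMask[c]) << kShift3[c]) ^ b;
}

// Variant with the two left shifts replaced by multiplies.
// The right shift has no multiply equivalent and stays as it is.
std::uint32_t tausStepMul(std::uint32_t z, unsigned c)
{
    std::uint32_t b = ((z * kMul1[c]) ^ z) >> kShift2[c];
    return ((z & kMask[c]) * kMul3[c]) ^ b;
}
```

On the device, the multiplier tables would replace `shift1` and `shift3`, while `shift2` stays as a shift count.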