SIMDy


(L) [2017/11/06] [tby Tahir007] [SIMDy] Wayback!

Hi there,
I'd like to take this opportunity to advertise a tool that might be very useful for developing ray tracers or any other
compute-intensive application on the CPU. I'm developing a package for Python that lets you use CPU SIMD instructions (SSE, AVX, AVX2, AVX-512, FMA).
Basically it's a JIT compiler that compiles simplified Python code to native x86 machine code. To use SIMD instructions I added
vector data types (float32x4, float32x8, float32x16, etc.) so you can easily do explicit vectorization. Before compilation I check which
instruction sets the CPU supports and then select the best one. So basically this means that if you want to achieve maximum performance, all you need
to do is use the biggest supported vector types (float32x16, float64x8, int32x16) as much as possible, and all the magic happens automatically.
Even if your CPU only has the SSE instruction sets, you still benefit from using wide vector types because of memory locality.
This tool is still WIP because there is lots of work left to be done, but even at this stage it is very useful. I started developing a
path tracer just to show how the tool is used.
Here is one trivial example (calculating pi with Monte Carlo):
Code:
from multiprocessing import cpu_count
from simdy import int64, float64, simdy_kernel
@simdy_kernel(nthreads=cpu_count())
def calculate_pi(n_samples: int64) -> float64:
    inside = int64x4(0)
    for i in range(n_samples):
        x = 2.0 * random_float64x4() - float64x4(1.0)
        y = 2.0 * random_float64x4() - float64x4(1.0)
        inside += select(int64x4(1), int64x4(0), x * x + y * y < float64x4(1.0))
    nn = inside[0] + inside[1] + inside[2] + inside[3]
    result = 4.0 * float64(nn) / float64(n_samples * 4)
    return result
result = calculate_pi(int64(25_000_000))
print(sum(result) / cpu_count())

LINKS:
SIMDy - [LINK http://www.tahir007.com/]
Path Tracer -  [LINK https://bitbucket.org/Tahir007/quark]

My question for you guys is: what do you think about this tool?
When I transformed the BVH tree into a BVH16, I got almost two times the performance with AVX-512 instructions. [SMILEY :)]
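To make the BVH16 remark concrete, here is a rough NumPy stand-in (not SIMDy, and the names and SoA box layout are only illustrative) for the work one BVH16 node does: the slab test of a single ray against 16 child AABBs at once, which is exactly the shape of computation that float32x16 lanes map onto with AVX-512.
Code:
# NumPy sketch (not SIMDy): one ray against 16 AABBs at once (slab test).
import numpy as np

def intersect_16_boxes(origin, inv_dir, box_min, box_max, t_max):
    # origin, inv_dir have shape (3,); box_min, box_max have shape (3, 16).
    t0 = (box_min - origin[:, None]) * inv_dir[:, None]
    t1 = (box_max - origin[:, None]) * inv_dir[:, None]
    t_near = np.minimum(t0, t1).max(axis=0)  # latest entry over the three slabs
    t_far = np.maximum(t0, t1).min(axis=0)   # earliest exit over the three slabs
    # Hit mask for the 16 children of this node.
    return (t_near <= t_far) & (t_far >= 0.0) & (t_near <= t_max)

origin = np.zeros(3)
direction = np.array([1.0, 1.0, 1.0]) / np.sqrt(3.0)
box_min = np.random.uniform(-5.0, 0.0, size=(3, 16))
box_max = box_min + np.random.uniform(0.5, 3.0, size=(3, 16))
print(intersect_16_boxes(origin, 1.0 / direction, box_min, box_max, t_max=1e30))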
(L) [2017/11/07] [tby mpeterson] [SIMDy] Wayback!

sorry, but absolutely useless. the n+1 invocation of an "auto-vectorizer" ... and python? what is it really good for?
(L) [2017/11/07] [tby Tahir007] [SIMDy] Wayback!

I don't get what you mean by
"the n+1 invocation of an "auto-vectorizer" ... and python?"
The idea is that using just Python + SIMDy you get similar performance to what you would get programming in C++.
In the project above I'm developing a path tracer using just Python + SIMDy that easily outperforms C++ implementations.
(L) [2017/11/07] [tby graphicsMan] [SIMDy] Wayback!

Does it generate object files? IMO, this is pretty neat. It would be cool to write kernels using Python and then link those into C++ code.
(L) [2017/11/07] [tby graphicsMan] [SIMDy] Wayback!

NM, re-reading, it is clear that it doesn't generate object code. I think it's a nifty project, and probably a good way to learn stuff, but if you spend effort writing optimized C++, I'd be very surprised to see this perform similarly.
(L) [2017/11/07] [tby Tahir007] [SIMDy] Wayback!

Thanks for the positive opinions about the project :)
Yes, you are right when you said that you'd be very surprised if this was as fast as optimized C++. I have been programming for about 15 years now, and on numerous occasions I tried to optimize some function with hand-written assembly and the compiler always beat me, but I learned a lot in the process. Over the years I got better at assembly, but I still admit that C++ compilers generate better code than I do. But when you turn to SIMD instructions, things
suddenly change. Now the programmer is responsible for writing the SIMD intrinsics, so now I compete with other programmers and not with the compiler.
And also, because I am doing JIT compilation, I have a lot more context to work with, because I know exactly what CPU you have. So in the end it's not clear which code will be faster; that's why I said that you get similar performance to optimized C++. [SMILEY :)]
Now I will show a simple example, just to see exactly what is going on and how SIMDy works. The example below is trivial, but it shows one of the biggest advantages of SIMDy: how it adapts to different instruction sets automatically. Depending on your CPU's capabilities, AVX-512, AVX2, AVX or SSE will be used to handle the float64x8 data type. The best thing here is that the programmer doesn't have to care about which CPU you have; it just works. Even if your CPU
has only SSE instructions, you still benefit from the float64x8 type because of memory locality. Hint: for best performance always use float64x8 [SMILEY :)]
Here I explicitly set AVX-512 as the preferred instruction set, because currently the default is AVX2; this will be fixed in the next version and the default will
be AVX-512.
 
Code:
from simdy import Kernel, float64x8, ISet
source = """
a = b * c + float64x8(2.0)
"""
args = [('a', float64x8()), ('b', float64x8()), ('c', float64x8())]
# I forgot to make AVX512 the default :x but this will be fixed in the next version
k = Kernel(source, args=args, iset=ISet.AVX512)  # if you don't set iset, the default is AVX2
# set some values for the kernel's parameters
k.set_value('b', float64x8(2.0))
k.set_value('c', float64x8(3.0))
k.run()
print(k.get_value('a'))
# you can of course inspect the generated assembly code if you want
print(k.asm)

Yes, you can write kernels in Python and use them from C++, but in that case you must embed Python in your project and use it from there.
Communication between Python and C++ can go in both directions; people are usually not aware of this. [SMILEY :)]
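As a sketch of the Python side of that setup (the C++ host would embed the interpreter with the usual CPython API and just import this module), here is a minimal module built only on the Kernel/float64x8/ISet calls shown in the example above; the module and function names are made up for illustration.
Code:
# simdy_kernels.py - hypothetical module an embedded interpreter would import.
from simdy import Kernel, float64x8, ISet

_source = """
a = b * c + float64x8(2.0)
"""
_args = [('a', float64x8()), ('b', float64x8()), ('c', float64x8())]
_kernel = Kernel(_source, args=_args, iset=ISet.AVX512)

def run_kernel(b_value, c_value):
    # Called from the C++ host through the embedded interpreter;
    # scalar arguments are broadcast across the eight lanes, as above.
    _kernel.set_value('b', float64x8(b_value))
    _kernel.set_value('c', float64x8(c_value))
    _kernel.run()
    return _kernel.get_value('a')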
(L) [2018/01/19] [tby ypoissant] [SIMDy] Wayback!

While this looks interesting for Python developers, I still prefer developing my renderers with C++.
And for C++ SIMD development, I found Vc ([LINK https://github.com/VcDevel/Vc]) to be a very good library (it is one of the two libraries being proposed for future C++ standardization, the other one being boost::simd). It also supports many CPUs and SIMD architectures.
It does the same as what you are proposing except in native C++.
I'd be curious to see a performance comparison between your Python path tracer and the same path tracer written in C++ using Vc for vectorization.
(L) [2018/02/15] [tby Tahir007] [SIMDy] Wayback!

In the future maybe I will do a comparison between the Vc library and the SIMDy package.
I do not agree that this is interesting only for Python developers, because it is very easy to embed a Python interpreter in a C++ application, and then SIMDy
can also be used from C++.
For example, in the context of renderers, if you embed a Python interpreter in C++, you can use SIMDy as a very flexible shading language, like OSL (Open Shading Language), so users can write scripts that are very fast.
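To illustrate that idea, here is a minimal sketch of what such a user-editable "shader" script might look like; it reuses only the Kernel interface shown earlier in the thread, and the parameter names (albedo, light_intensity) plus the trivial scale-and-offset "shading" are made up for illustration.
Code:
# Hypothetical "programmable shader" built on the Kernel example above:
# the user-editable part is just a source string compiled at render time.
from simdy import Kernel, float64x8, ISet

shader_source = """
out_color = albedo * light_intensity + float64x8(0.05)
"""
shader_args = [('out_color', float64x8()),
               ('albedo', float64x8()),
               ('light_intensity', float64x8())]

shader = Kernel(shader_source, args=shader_args, iset=ISet.AVX512)
shader.set_value('albedo', float64x8(0.7))
shader.set_value('light_intensity', float64x8(1.2))
shader.run()
print(shader.get_value('out_color'))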
