AVX512 MBVH4 Traversal

(L) [2016/09/23] [ost by mpeterson] [AVX512 MBVH4 Traversal] Wayback!

with intel's knl now available to everyone, people have started
asking for a native avx512 port of clpt ([LINK http://ompf2.com/viewtopic.php?f=3&t=2075]).
knl seems to be the first accelerator from intel with some real power under the hood
(knf and knc were simply nonstarters). so i did an optimized implementation
of clpt for avx512 and was surprised by the outcome.
clpt is by far the fastest rt-kernel for cpus today but was never compared to gpus.
so i looked around for some numbers. not much to find! so i used the
medium numbers (viewpoint 2) from amd firerays 2 on a firepro w9100 and measured the test scenes
on cuda using the implementation from nvidia ([LINK http://www.nvidia.com/object/nvidia_research_pub_011.html])
optimized for the nv titan. to make it short: using coherent ray traversal, knl can render most of the scenes
i have around in a stable sub-1 ms into a 1024x1024 framebuffer.

rem: cuda and knl numbers are avg. values calculated out of a sequence of several thousand frames (scene fly-thru).
amd firerays is single shot.
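
as a rough sanity check on what "below 1 ms into a 1024x1024 framebuffer" means in rays per second, here is a minimal sketch. it assumes exactly one primary ray per pixel, which the numbers above do not state explicitly; `rays_per_second` is a hypothetical helper name, not part of clpt.

```cpp
#include <cassert>
#include <cstdint>

// Primary rays per second for a full frame.
// ASSUMPTION: one primary ray per pixel (not stated in the post).
// width/height: framebuffer size in pixels; frame_ms: frame time in milliseconds.
inline double rays_per_second(std::uint32_t width, std::uint32_t height, double frame_ms) {
    assert(frame_ms > 0.0);  // a zero or negative frame time makes no sense here
    const double rays = static_cast<double>(width) * static_cast<double>(height);
    return rays / (frame_ms * 1e-3);  // convert ms to seconds
}
```

under that assumption, `rays_per_second(1024, 1024, 1.0)` comes out at roughly 1.05 billion rays/s, so any frame time below 1 ms is above that figure.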

[IMG #1 Image]
[IMG #1]:Not scraped: https://web.archive.org/web/20210621014638im_/https://picload.org/image/rdrigccw/cr_cmp.png
(L) [2016/09/26] [ost by rtpt] [AVX512 MBVH4 Traversal] Wayback!

An Nvidia GTX 1080 should be twice as fast as the
Titan. Can you please run your tests on current
hardware?
(L) [2016/09/27] [ost by jbikker] [AVX512 MBVH4 Traversal] Wayback!

Could you also test divergent rays? Architectures that rely on caches (i.e. CPUs) seem to suffer greatly from divergent memory access, while architectures that hide latencies using many threads typically fare much better. I would be surprised to see the latest CPU-like device outperform the latest GPU device in that setting (in fact, I don't expect it to come even close).
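
To make the cache argument concrete, here is a minimal sketch (an illustration, not a measurement) that counts how many distinct cache lines a ray packet touches when each ray loads one BVH node. The access model (one node load per ray per step), the 128-byte node size used in the note below, and the helper name `cache_lines_touched` are all assumptions for illustration; 64 bytes is the usual x86 cache line size.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Count distinct cache lines touched when each ray in a packet loads one node.
// node_index[i]: node visited by ray i (hypothetical traversal state).
// node_bytes:    size of one node in bytes (ASSUMPTION, must be > 0).
// line_bytes:    cache line size in bytes.
inline std::size_t cache_lines_touched(const std::vector<std::uint32_t>& node_index,
                                       std::size_t node_bytes,
                                       std::size_t line_bytes = 64) {
    std::unordered_set<std::size_t> lines;
    for (std::uint32_t n : node_index) {
        const std::size_t begin = static_cast<std::size_t>(n) * node_bytes;
        const std::size_t first = begin / line_bytes;
        const std::size_t last = (begin + node_bytes - 1) / line_bytes;
        // A node may span several lines; record each of them once.
        for (std::size_t l = first; l <= last; ++l) lines.insert(l);
    }
    return lines.size();
}
```

With 128-byte nodes, a 16-ray packet in which every ray visits the same node touches 2 lines, while 16 rays visiting 16 different nodes touch 32. That factor is the extra memory traffic a cache-based design has to absorb per traversal step in the divergent case.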
(L) [2016/09/28] [ost by atlas] [AVX512 MBVH4 Traversal] Wayback!

The price point of the devices is also a consideration; I'm not sure this is an apples-to-apples comparison. Power envelopes aside, how many GPUs can you buy for the price of a Knights Landing?

Getting over 2.5B rays/s on a CPU is exciting though, but I agree we have to see the incoherent numbers.
(L) [2016/09/28] [ost by MohamedSakr] [AVX512 MBVH4 Traversal] Wayback!

great results, but as others said, in the divergent case the CPU will crawl (cache misses, waiting on memory...).
(L) [2016/09/30] [ost by mpeterson] [AVX512 MBVH4 Traversal] Wayback!

yes, i would like to run the bench on the latest gpu gen, but the titan is all i have around. concerning the incoherent transport: yes, it will be a different story for sure. first of all, the implementation is not straightforward on avx512 (avx512 is pretty inflexible when it comes to random access streaming/computation -> there is no fast way to shuffle single elements around, limited integer/int16 support etc.). so implementation time is pretty high (a clear disadvantage here). on the other hand: running our full-blown pt with the avx2 backend on knl, the performance is great. on average more than 3x compared to octane renderer on the titan (except for simple scenes).
(L) [2016/09/30] [ost by MohamedSakr] [AVX512 MBVH4 Traversal] Wayback!

does this test include texture access? like a standard interior scene full of textures. (as the bottleneck is always memory).
(L) [2016/10/04] [ost by mpeterson] [AVX512 MBVH4 Traversal] Wayback!

yes (nn sampling and bi-linear sampling). keep in mind that knl has 90 GB/s on its pretty large main mem and an extra 400 GB/s on 16 GB.
atm we are playing around with all the diff. mem. options. besides this, we are trying to run the pt as a special kind of "stand-alone-app"
on knl without os noise. a lot of new stuff here to explore...
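
to put the two bandwidth figures in perspective, here is a minimal sketch of the per-ray memory budget they imply. it is a back-of-the-envelope upper bound only: it assumes the peak bandwidth is fully usable and takes the 2.5B rays/s figure mentioned earlier in the thread as the ray rate. `bytes_per_ray` is a hypothetical helper name.

```cpp
#include <cassert>

// Upper bound on memory traffic available per ray.
// bandwidth_gb_s: peak bandwidth in GB/s (1 GB = 1e9 bytes).
// rays_per_s:     ray throughput in rays per second.
// ASSUMPTION: peak bandwidth is fully achievable, which real workloads rarely reach.
inline double bytes_per_ray(double bandwidth_gb_s, double rays_per_s) {
    assert(rays_per_s > 0.0);
    return bandwidth_gb_s * 1e9 / rays_per_s;
}
```

at 2.5e9 rays/s this gives about 160 bytes per ray from the 400 GB/s memory and about 36 bytes per ray from the 90 GB/s memory, so where the bvh and textures end up matters a lot once cache hit rates drop.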
(L) [2016/10/05] [ost by MohamedSakr] [AVX512 MBVH4 Traversal] Wayback!

>> mpeterson wrote: yes (nn sampling and bi-linear sampling). keep in mind that knl has 90 GB/s on its pretty large main mem and an extra 400 GB/s on 16 GB.
atm we are playing around with all the diff. mem. options. besides this, we are trying to run the pt as a special kind of "stand-alone-app"
on knl without os noise. a lot of new stuff here to explore...
it would be interesting if you tested it on a production-ready renderer (like Cycles), as it is well known for its brute-force PT nature and it uses embree (it has CPU/OpenCL/CUDA backends).
