feb 24, Horde vs OpenMP, round #2

(L) [2006/02/24] [tbp] [feb 24, Horde vs OpenMP, round #2] Wayback!

Lesson learnt: thou shalt not disregard locality.

In those 2 shots, only primary packets are traced and then shaded with a simple Schlick (one light in that scene); to keep things simple and analysable there's no unbundling of packets, no shadows, no texturing or anything else. Among all the stats, only the timing (and fps) is correct; forget about the rest. The background color, when visible, tells which cpu was used to render that tile.

Click to enlarge, heh.

Horde.

[IMG #1 ]

OpenMP (Microsoft implementation).

[IMG #2 ]

So, that's 1024*768 shaded primary rays at ~10fps with 100% coverage (they all hit the scene bounding box), or about 7.8M shaded rays/s. That sounds reasonable.
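For the record, the arithmetic behind that figure can be sketched as below. The function name is made up for illustration; it only multiplies resolution by frame rate, assuming one primary ray per pixel and every ray being shaded, as stated above. With the quoted ~10fps it gives 7,864,320 rays/s, which the post rounds to 7.8M.

```cpp
#include <cassert>

// Hypothetical helper: shaded rays per second, assuming one primary ray
// per pixel and 100% coverage (every ray gets shaded).
double shaded_rays_per_second(unsigned width, unsigned height, double fps) {
    assert(fps >= 0.0);
    return static_cast<double>(width) * height * fps;
}
```

Calling `shaded_rays_per_second(1024, 768, 10.0)` reproduces the number quoted above; since the fps is only approximate, so is the result.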

Even if you can't really extrapolate, because there are too many 'feature' discrepancies, here's what the regular single-threaded renderer gives.

[IMG #3 ]

If you were wondering whether la Horde is using a static distribution among cpus/threads, it's not. It's just that the load per tile is pretty even in the first shot.

[IMG #4 ]

Now i need to clean up the mess and make the single-threaded and Horde renderers equivalent feature-wise.

I think i'm pretty close to a 2x speedup with 2 cpus now, even if there are still some inefficiencies, but that's just an educated guess. And that's worth peanuts.

I also think it should scale gracefully to 4-way, 8-way systems and more. And again that's wishful thinking. Sadly i don't have access to such a beast, and i'll never know for sure.

[IMG #1]:Not scraped: https://web.archive.org/web/20061004013147im_/http://ompf.org/ray/wip/pix/20060224-01-horde-small.jpg
[IMG #2]:Not scraped: https://web.archive.org/web/20061004013147im_/http://ompf.org/ray/wip/pix/20060224-02-openmp-small.jpg
[IMG #3]:Not scraped: https://web.archive.org/web/20061004013147im_/http://ompf.org/ray/wip/pix/20060224-03-standard-small.jpg
[IMG #4]:Not scraped: https://web.archive.org/web/20061004013147im_/http://ompf.org/ray/wip/pix/20060224-04-horde-distrib-small.jpg

(L) [2006/02/25] [tbp] [feb 24, Horde vs OpenMP, round #2] Wayback!

After a fierce fight with gcc that decided to stab me in the back, i got la Horde flying on linux.

It seems to be something over 10% faster than on win32, but i'm not too sure and i'm not done with it anyway.

[IMG #1 ]

Another major achievement is that i can now render 100 fps of nothingness, at 1024x768, when all rays miss the scene. I felt like you shouldn't leave this page without knowing that crucial fact.

[IMG #1]:Not scraped: https://web.archive.org/web/20061004013147im_/http://ompf.org/ray/wip/pix/20060225-01-horde-linux-small.jpg

(L) [2006/02/26] [tbp] [feb 24, Horde vs OpenMP, round #2] Wayback!

After fixing a little race condition [SMILEY Embarassed] and switching back to good old semaphores, i thought a little scalability/overhead test was due.

In both pictures, threads are bound to the 2 cpus alternately and all rays miss the scene (so we get a better feel for the overhead). The red channel is set proportionally to the thread id, green is set if running on cpu #0 and blue if on cpu #1.
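As a rough illustration of that color coding, here's a minimal sketch. The `Rgb` struct, the function name and the exact scaling of the red channel (thread id mapped linearly onto 0..255) are assumptions, not the renderer's actual code; only the red/green/blue meaning comes from the description above.

```cpp
#include <cassert>
#include <cstdint>

struct Rgb { std::uint8_t r, g, b; };

// Hypothetical debug tint: red grows with the thread id, green marks
// cpu #0 and blue marks cpu #1. The linear 0..255 scaling is an assumption.
Rgb debug_color(unsigned thread_id, unsigned thread_count, unsigned cpu) {
    assert(thread_count > 0 && thread_id < thread_count);
    const unsigned denom = thread_count > 1 ? thread_count - 1 : 1;
    Rgb c;
    c.r = static_cast<std::uint8_t>(255u * thread_id / denom);
    c.g = static_cast<std::uint8_t>(cpu == 0 ? 255 : 0);
    c.b = static_cast<std::uint8_t>(cpu == 1 ? 255 : 0);
    return c;
}
```

With 2 threads that yields pure green for thread 0 on cpu #0 and magenta for thread 1 on cpu #1; with 512 threads the red channel becomes a gradient over the thread ids.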

Reference pix: 2 threads, one on each cpu.

[IMG #1 ]

Messy pix: 512 threads, half bound to each cpu.

[IMG #2 ]

There's a 1.3ms difference out of a total of 9.9ms per frame, and frankly i don't know how much of it can be attributed to the kernel trying to schedule 512 threads on only 2 cpus.

That's a worst case scenario, and not quite realistic i might add, yet we're only taking a ~15% performance hit.
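A quick sanity check on that percentage, as a sketch. The post doesn't say which of the two runs the 9.9ms total belongs to, so both readings are shown; the helper name is made up for illustration.

```cpp
#include <cassert>

// Hypothetical helper: slowdown of the loaded run relative to the baseline run.
double relative_slowdown(double baseline_ms, double loaded_ms) {
    assert(baseline_ms > 0.0);
    return (loaded_ms - baseline_ms) / baseline_ms;
}

// Reading A (assumption): 9.9 ms is the 512-thread frame, so the
// 2-thread frame took 9.9 - 1.3 = 8.6 ms.
//   relative_slowdown(8.6, 9.9)  is about 0.151
// Reading B (assumption): 9.9 ms is the 2-thread frame, so the
// 512-thread frame took 9.9 + 1.3 = 11.2 ms.
//   relative_slowdown(9.9, 11.2) is about 0.131
```

Reading A lands on the ~15% quoted above; reading B would be closer to 13%. Either way it's the same ballpark.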

I think i can safely say the Horde scales well and the linux kernel deserves some mad props.

I'm still a bit puzzled by the fact that it's faster to render the scene from the bottom-up and not the other way around.

In the same vein, it's faster to have threads write "entire lines" than to have them render more rectangular tiles (that's why it's done that way in those shots, even if the software allows for fancier cuts). That smells like some cache coherence issue (or the like), because i don't write to the framebuffer via non-temporal moves (yes, i'm waving hands).

Scalability, checked. Now onto measuring if i'm getting my full x2 speedup.

[IMG #1]:Not scraped: https://web.archive.org/web/20061004013147im_/http://ompf.org/ray/wip/pix/20060226-01-horde-linux-2threads-small.png
[IMG #2]:Not scraped: https://web.archive.org/web/20061004013147im_/http://ompf.org/ray/wip/pix/20060226-02-horde-linux-512threads-small.png

(L) [2006/02/26] [tbp] [feb 24, Horde vs OpenMP, round #2] Wayback!

To clarify what i was saying, there are only 2 points where threads synchronize with their master: when a batch gets started and when it's done. Put another way, in this case that's at the start of the frame and once it's done; there's no contention/wait/synch/etc in between, and it all remains fully dynamic. It's up to the kernel scheduler to do its job [SMILEY Smile]
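One possible reading of that scheme, sketched with standard C++ threads. This is not la Horde's code: thread creation and `join` merely stand in for the semaphore-based "batch started" and "batch done" signalling, `shade_pixel` is a placeholder kernel, and the split into bands of whole scanlines is an assumption. The point is only that workers share nothing between the two sync points, so with more workers than cpus the balancing is left to the kernel scheduler.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Placeholder for tracing and shading one primary ray (hypothetical).
inline std::uint32_t shade_pixel(std::size_t x, std::size_t y) {
    return static_cast<std::uint32_t>((y << 16) ^ x);
}

// Renders one frame; each worker owns a band of whole scanlines.
std::vector<std::uint32_t> render_frame(std::size_t width, std::size_t height,
                                        std::size_t worker_count) {
    assert(worker_count > 0);
    std::vector<std::uint32_t> framebuffer(width * height);
    std::vector<std::thread> workers;
    workers.reserve(worker_count);

    // Sync point #1: the batch gets started.
    for (std::size_t i = 0; i < worker_count; ++i) {
        const std::size_t y0 = height * i / worker_count;        // first line of the band
        const std::size_t y1 = height * (i + 1) / worker_count;  // one past the last line
        workers.emplace_back([&framebuffer, width, y0, y1] {
            // No locks or shared counters here: bands never overlap.
            for (std::size_t y = y0; y < y1; ++y)
                for (std::size_t x = 0; x < width; ++x)
                    framebuffer[y * width + x] = shade_pixel(x, y);
        });
    }

    // Sync point #2: the batch is done.
    for (std::thread& w : workers) w.join();
    return framebuffer;
}
```

Something like `render_frame(1024, 768, 512)` mimics the 512-thread test above, minus the cpu binding, which is platform specific and left out here.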
