multithreading
(L) [2006/02/22] [Phantom] [multithreading] Wayback!On DevMaster.net, Davepermen mentions that his software got twice as fast by multithreading it, on a single-core system. I am not sure what kind of software he is developing (wasn't it a software rasterizer?) and he does not explain why he got the speedup (apparently he doesn't know either), but it's still an interesting claim. Could this be true? I need to check it. Will convert my current code to multithreaded code asap.
_________________
--------------------------------------------------------------
Whatever
(L) [2006/02/22] [playmesumch00ns] [multithreading] Wayback!Hmmm... sounds a bit fishy to me.
On a single-core system, multi-threading will probably be slightly slower since you've got to do all that mutex locking/unlocking stuff. My guess is the algorithmic changes he made to get it multi-threaded had the side-effect of making his code more efficient, rather than some magic happening.
That said, it's definitely a good idea to multi-thread, even if you #ifdef for separate single- and multi-threaded builds. Since a raytracer's fairly easy to multi-thread, it's well worth doing asap.
(L) [2006/02/22] [Ono-Sendai] [multithreading] Wayback!Getting a speedup with 2 threads on a non HT system sounds fishy to me as well.
playmesumch00ns: You don't necessarily have to do mutex stuff: the scene kd-tree can be read-only, and different threads can write to different places in the output image buffer.
_________________
[LINK http://indigorenderer.com/]
(L) [2006/02/22] [Phantom] [multithreading] Wayback!The guy mumbles something about stalls being exploited by the other threads and stuff like that. I fail to see how it could be such a big win, or a win at all for that matter. If I executed the tracer in two threads, I could easily send tiles (16x16 in my case) to each thread, but even then, one of the threads will be waiting after completing its last tile, so there has to be a small delay no matter what I do.
On the other hand, I could imagine that one thread gets data in the cache that another thread uses shortly after that. This effect is probably negated by threads that push data out of the cache by using data that the other thread never uses... Anyway, I really should give it a try. I will probably make a command line switch or whatever to specify the number of threads to be used for the rendering, so I can dynamically scale it.
_________________
--------------------------------------------------------------
Whatever
(L) [2006/02/22] [toxie] [multithreading] Wayback!2x seems to be definitely wrong. If this is really the case then the code has ENORMOUS stalls in it (e.g. a lot of loads from RAM not being in the cache).
The biggest improvement i can remember from my own experience was a 50% increase at the time i implemented the first version of our RTRT. But that was simply for tracing rays without any shading involved. In our current code (including simple shading) HT gives a 10%-20% speed increase (depending on scene size!).
(L) [2006/02/22] [Phantom] [multithreading] Wayback!He's not even talking about HT. He claims these gains on a single-core proc.
_________________
--------------------------------------------------------------
Whatever
(L) [2006/02/22] [Phantom] [multithreading] Wayback!Well if I could get a 4% or even a 20% speed increase by merely multithreading I would be a very happy man. In fact, I would be happy if multithreading didn't decrease performance on a single proc system, as this code would run better on HT and DC. Would be great if I wouldn't need separate versions.
_________________
--------------------------------------------------------------
Whatever
(L) [2006/02/22] [playmesumch00ns] [multithreading] Wayback!I'm planning to go even finer-grained with my raytracer and have each thread work on at least a packet (4 rays) at a time. Should keep everything hot since all the threads will access most of the same (read-only) data, and should also lead to a more even load distribution.
(L) [2006/02/22] [Phantom] [multithreading] Wayback!When can we expect results for that?
_________________
--------------------------------------------------------------
Whatever
(L) [2006/02/22] [playmesumch00ns] [multithreading] Wayback!When I get some time!
I've got to rewrite the sampler and bucket rendering code to support it properly. At the moment I'm rendering a bucket per thread, but that makes it really ugly for doing multi-pixel filtering over bucket boundaries, so I'm rewriting the whole lot, hopefully by the end of the week, as I'm on holiday next week.
I need to hack some object loading code together and try it with some more complex scenes. Then I can do some tests running single, dual and multi-process (we've got some dual-proc-dual-core opterons here at work [SMILEY Smile]) and see how the results stack up.
(L) [2006/02/22] [Phantom] [multithreading] Wayback!Hey, bring a laptop when you go on holiday. I always do. [SMILEY Smile] No laptop? Print out tons of papers. [SMILEY Wink]
_________________
--------------------------------------------------------------
Whatever
(L) [2006/02/22] [tbp] [multithreading] Wayback!Davepermen's result is just an artefact; it's hard to conjecture a reason and i won't try. I mean if you have 2 threads fighting for cycles on the same mono-core cpu, you'll end up wasting more time in the scheduler & switching contexts and that's it.
I won't even consider HT when talking about 'multithreading' as it's just a hack. Then there's some difference if you're running an SMP (aka UMA) or NUMA box but as that's mostly related to non-read-only data sharing it's not that relevant in our case (unless you're doing deferred something).
So far i only have an OpenMP render path going, and in some conditions i can realize a 2x speedup with my dual opteron (basically on scenes large enough). Problem is it's not the right tool for the job, it's wasting too much time in undue synchronization; at least for rendering, raytracing isn't asking for that much.
I haven't had the time to write my own dispatcher yet as it's certainly not trivial (well, if you want it to be efficient) and i have other stuff on the grill.
Another interesting problem, and a bit more complex, is how to parallelize the kd-tree construction.
(L) [2006/02/22] [Phantom] [multithreading] Wayback!Why do you use OpenMP? Why not simply start two threads, assume that each one gets dispatched to a single core, and assign them roughly half of the screen each?
And I don't see the kd-tree problem either: Can't you simply do the first split, and then dispatch each child to a separate thread?
_________________
--------------------------------------------------------------
Whatever
(L) [2006/02/22] [tbp] [multithreading] Wayback!I use OpenMP atm, but like i've said it's ill suited because there's more to it than being a thin wrapper over the OS threading; it's deeply coupled with the compiler.
You say, this is a parallel block. Then you say, for example, dispatch everything within that 'for' to different threads (with different dispatch tactics etc). Behind your back, OpenMP will ensure proper barriers and sharing.
Of course there's a cost to all that work, and in the case of // rendering it's unasked for. You want a fast lockless slot mechanism to distribute the work and that's it.
For the kd-tree it gets a bit more involved. You have to pay attention to all your pools (either you lock, or you do it locally, with variants in between), and then there's the problem of expressing the build in a non-recursive fashion.
Recursion isn't that much of a problem if careful, but then say each time you do a split you assign each child to a different worker etc... you have to pay for dispatch, synchronization (recollection point) etc...
And you don't want to waste all your cycles dispatching and synchronizing. Plus such a simple tactic (dispatch each child) isn't going to work in the real world i fear. But what do i know, i'm not there yet.
(L) [2006/02/22] [Phantom] [multithreading] Wayback!Not each child, just the two children that result from the first split. If you have four cores, dispatch new threads until you get two levels deep.
_________________
--------------------------------------------------------------
Whatever
(L) [2006/02/22] [tbp] [multithreading] Wayback!Yes, but no.
It's not like those trees are balanced, right? So you'll initially dispatch the topmost left to cpu#1, right to cpu#2, cpu #1 will leafify everything at, say, lvl 5 while cpu #2 is going down to lvl 9842895. And cpu #1 is idling.
So that won't cut it.
Not that re-dispatching at each step is a solution either.
(L) [2006/02/22] [Ho Ho] [multithreading] Wayback!What about splitting at e.g. the first four recursions and only keeping at most N threads active, where N is the number of cores? If one thread finishes, it notifies the master thread and it starts another one. It isn't that hard, I've done it before [SMILEY Smile]
_________________
In theory, there is no difference between theory and practice. But, in practice, there is.
Jan L.A. van de Snepscheut
(L) [2006/02/22] [tbp] [multithreading] Wayback!That's another way to put the "fast lockless slot mechanism to distribute the work" i've alluded to.
It's not hard to implement. What is hard is to do it right [SMILEY Wink]
(L) [2006/02/23] [playmesumch00ns] [multithreading] Wayback!Okay my first stab at multithreading on a sample-by-sample basis went horribly, horribly wrong. All sorts of nasty synchronisation issues, and horrific crashes too [SMILEY Smile]
I'm going to need a pretty major redesign of my sampler and bucket code. Might take some time. Can't take the lappy on holiday: girlfriend would kill me [SMILEY Wink]
(L) [2006/02/24] [tbp] [multithreading] Wayback!For a change, i've put my money where my mouth is and tonite i've implemented that lock-free-wait-free dispatcher i was ranting about. It's called the Horde(c)(tm).
I've hooked it up so it renders the same way the OpenMP renderer does.
It's really minimalist yet it's exactly on par with its OpenMP counterpart. Yay. So, my bottleneck is somewhere else and my next bet will be on locality.
<depressed programmer tone>
At least it works.
Right now stuff gets batched by the master, and then there's a fierce battle between threads to get a piece. The distribution is as fast as can be, there's no lock or synch but at the recollection point (memory barrier + synch).
It's ideal for anything remotely looking like deferred rendering, but it won't do the trick for a kd-tree compiler where the load is more random. Or maybe not, i can't think straight anymore.
I've learnt a few things in the process, and above all that you shouldn't try to do some fancy moves suspending/resuming threads on xp if you want a chance for IPC primitives to work as advertised (or at all). Surprisingly deadlocks aren't quite my idea of fun.
Oh and whoever designed msvc inline assembler should be shot.
(L) [2006/02/24] [playmesumch00ns] [multithreading] Wayback!Well I got something working last night. It's not too fancy and it's basically just a modification of my original bucket-per-thread code, but it will handle filtering a bit better. There's no reason why I couldn't do a pixel per thread, but I won't, for reasons I'll explain in a second.
The problem with splitting up the image between multiple threads comes down mainly to multi-pixel filtering. For instance, consider the case of divvying up into 32x32 pixel buckets, using a 2-pixel-wide filter. Each bucket will have a ~2-pixel strip down each edge that is shared with neighbouring buckets, since for a given x*y pixel region the actual area of the image you need to sample is (x+filterwidth)*(y+filterwidth). A naive solution (and indeed what I did first of all for the sake of simplicity) is just to extend the sample region of each bucket. Of course this means you're sampling many pixels twice over. Not good!
So what I'm doing now is having each bucket handle rendering its own pixels, but hold pointers to its neighbours (down, right, and bottom-right). So when a bucket has finished rendering a sample it checks to see if the sample's in the shared region, and adds it to the neighbour buckets if it is.
The usual way of implementing filtering is to keep an accumulator of the pixel radiance, together with the total sample weight for all samples under that pixel. This means filtering each sample onto all the pixels it covers as it comes in.
In a multi-thread context this is bad, because you need to lock the bucket memory while you do the filtering (quite a costly process), which means threads will often stall while they wait for the bucket to finish the sample it's currently processing. This situation gets worse the smaller your buckets (or thread regions) are, so doing a pixel per thread is the worst case.
So instead, I've chosen to do it the Reyes-style way. I hold all the samples for a bucket, unfiltered, until the bucket is finished, in an array of std::vectors. Pushing a sample onto a vector is very fast compared to a filtering operation, so threads shouldn't stall when adding samples to neighbouring buckets. Each sample is reference-counted, so buckets can just delete all their sample data when they're finished without worrying about whether neighbouring buckets still need it. This means that I only need to keep samples in memory for the buckets I'm rendering. For large buckets with many "deep samples" (lots of output channels), this could still be a significant amount of memory, but I can always dynamically split off chunks of each bucket as I go along to keep the memory usage within some limit.
Storing all the samples "live" and having the bucket drive the sampling also opens up more possibilities for doing adaptive super-sampling.
Once I've got the last couple of bugs squashed I'll try some benchmarks to see if I can get close to a perfect multiple gain in speed on 2- and 4-processor systems. I'll also want to try some more difficult scenes than my hard-coded cornell box. I figure for any multi-threading strategy that splits the image into large chunks, the worst case would be looking down a long tunnel, with geometry in many different planes, as neighbouring pixels are likely to access wildly different parts of the tree. I might try mocking something up when I get back from hols, and maybe even post some pictures!
(L) [2006/02/24] [tbp] [multithreading] Wayback!Eureka!
Locality, locality, locality.
Better than derailing this thread, i've spammed the [LINK http://ompf.org/forum/viewtopic.php?t=48 visual section].
I think i'm closing in on the heralded x speedup on x cpu/cores.
Playmesumch00ns, i don't have a good grasp of what you're really doing so i'll try to refrain from further commentage, nonetheless merely thinking about doing all that bookkeeping makes me sick [SMILEY Smile]
And i'm a bit envious you have access to a 4-way box.
(L) [2006/02/25] [Guest] [multithreading] Wayback!Thought about this last night and the filtering algorithm I described above won't work at all. I'll give it some deeper thought while I'm on holiday! And yes, the bookkeeping is a nightmare, but at the moment I can't think of a neat solution that doesn't involve a lot of that sort of thing.
Basically the problem is that for bucketed rendering with a multi-pixel filter, samples need to be shared between neighbouring buckets.
This isn't so much of a problem if you keep all the buckets in memory, but for rendering an arbitrarily large image (not just in terms of size, but more in terms of having multiple output variables per sample), I really need to send each bucket to the display and then free its memory as soon as possible.
The tricky part is working out when a bucket is finished and can be removed.
On a happier note, trying the simple cornell box scene I have at 512x512, 16spp gave me times of ~8 seconds on one thread, and ~4.5 seconds on two. That's a pretty good gain so far, and I'm sure there's room for improvement [SMILEY Smile]
(L) [2006/03/05] [davepermen] [multithreading] Wayback!Well, here i am. No algorithmic changes. Of course i had to rewrite my algorithm (for a raytracer, not a rasterizer) to draw several rectangles instead of one big one. Still, if i tell the engine to use only 1 thread (with the actual algorithm) it runs at about 50% of the speed as when i enable 64 (not 2) individual threads.
(L) [2006/03/12] [tbp] [multithreading] Wayback!I now have a perfect x2 speedup rendering scenes on 2 cpu, within measurement noise, even for degenerate cases like when all rays miss the scene. Phew.
(L) [2006/05/10] [Phantom] [multithreading] Wayback!I'm back at multithreading. Here's my plan, please shoot:
- The renderer that I have now renders lines of 4x4 packets. I'll isolate the code that renders a single line and put it in its own method.
- Once that works flawlessly, I will introduce threads. I want to spawn two threads for dual core, initially.
- The threads are going to watch a 'render stack'. As soon as something appears on this stack, it is rendered.
- In the meantime, the main thread fills the stack, until it's full. Then it waits until the last line is processed by the render threads.
- Done. Render threads are paused and waiting for the next frame.
Any problems with this approach?
_________________
--------------------------------------------------------------
Whatever
(L) [2006/05/11] [tbp] [multithreading] Wayback!That's what i'm doing.
Except i don't see the point of doing that by yourself, unless you want to tackle synchronisation on your own because you think you can make a better job than say, OMP.
OMP, being done on the compiler side, has access to more knobs than you could dream of. On the other hand pseudo-hard realtime constraints were never part of the design spec.
Besides, you haven't described the meat of what you're trying to achieve, and that's really how you synchronize.
(L) [2006/05/11] [Phantom] [multithreading] Wayback!I thought you did your own dispatcher?
And what is there to sync? There is a bucket with 256 jobs to do, and each thread picks one of these when it's done with the previous one. There is a counter that indicates how many jobs are on the stack, so picking a task is a matter of decreasing this counter (thus, no risk that two threads pick the same task). There's also a counter that keeps track of completed tasks. When this reaches 256, the main thread knows that the image is ready.
Writing to the screen should not be an issue, since two threads will not write to the same pixels.
The only catch I see is what the threads need to do when they have digested the last task: If they start polling the task stack frantically until it's filled again, they are eating valuable processor time by just waiting. Same for the main task: After it has filled the task stack, it's idling and waiting for the two other threads to complete.
_________________
--------------------------------------------------------------
Whatever
(L) [2006/05/11] [tbp] [multithreading] Wayback!Depends on what you call a dispatcher, ultimately only the kernel (well for 1:1 thread/process libs) really 'dispatch'.
For that task at end, which is really simple, there's not many options. What you describe is a FIFO. That's what i was discussing back then.
Note that atomically bumping the top of the stack index isn't good enough, you also have to ensure coherence across cores/cpu (x86 case).
Also what you describe is active polling. It's a waste of cycles.
What you need is a smarter way to synchronize workers with the mastah, a kind of docking sequence. That means scheduling, therefore OS assistance.
Details. Evil details.
(L) [2006/05/11] [Phantom] [multithreading] Wayback!So how does OpenMP help then? Do I simply specify a loop that renders the 256 lines of 4x4 tiles, and tell OpenMP to try to do that in parallel?
I know, I'm lazy, I didn't even bother reading anything about OpenMP. [SMILEY Smile] Will do so in a sec.
_________________
--------------------------------------------------------------
Whatever
(L) [2006/05/11] [tbp] [multithreading] Wayback!Yeah something like that. You stick a pragma before the loop to say: hey bozo! spread that thing over the hardware, will you?
You can guide it a bit if you want etc... and it will do all the nasty synchro work for you.
Like i've said it's not meant for pseudo-hard realtime, but it's way simpler than doing it by hand. Plus it has opportunities to exploit some properties only exposed to the compiler (because it's tightly integrated, generally).
EDIT: Perhaps it wasn't clear, but i don't use OMP because for our narrowly defined problem i came up with something more efficient (in terms of run-time, certainly not coding time [SMILEY Razz] )
(L) [2006/05/11] [Phantom] [multithreading] Wayback!That's what I thought before you started suggesting OpenMP. [SMILEY Smile] I believe you mentioned gains close to 10% or so when multithreading using OpenMP. Going dual core doesn't make sense if I don't get that 2x speedup, imho.
_________________
--------------------------------------------------------------
Whatever
(L) [2006/05/11] [tbp] [multithreading] Wayback!Well, it all depends on the workload.
What i got is a 2x ratio (+/- noise) on the whole envelope, even on light frames rendering really fast with lots of dispatching. That's at that end that OMP was kinda weak; elsewhere it's amortized enough.
It's not really a surprise that a specialized bit of code beats a more generic version at what it's designed for [SMILEY Smile]
Plus i'm sure Intel has put more work into OMP than m$, if not for one good reason they got into that OMP business earlier.
(L) [2006/05/11] [Lynx] [multithreading] Wayback!Uhm question, does mailboxing actually do you any good?
At least when i let my (still not O(n log n)) kd-tree do triangle clipping, the effect of mailboxing was pretty negligible, and making it threadable just didn't seem to be worth it...but probably my non-SIMD code just doesn't cut it...
Anyway, i'm doing the tile distribution simply by calling a threadsafe function from each thread that hands out tiles. That way you could also force the threads to suspend until you're ready for the next frame by simply not letting the request function return until the tiles are reset; you just keep holding the mutex that's needed to pop a tile from the tile array.
You could also try to avoid threads waiting on each other by creating subsets of tiles with their own mutex (e.g. 4 threads => each is expected to do 64 tiles), so they only block each other when one of the threads finishes its subset early; then you lock a global mutex once, try to re-balance the work and continue...
(L) [2006/05/11] [tbp] [multithreading] Wayback!But the trick is that you don't need no heavy synchronization primitives (mutex and horrors like that) but at 2 points that require scheduling: begin/end of frame rendering (assuming that's how grained your rendering is).
In between you need at most a bunch of user-space coherent atomic ops, x86 have plenty.
Now that assumes, ie in the case of mailboxing, that you keep it reasonable [SMILEY Razz]
Then competition takes care of load balancing.
(L) [2006/05/12] [Phantom] [multithreading] Wayback!OK, I have some basic multithreading working. Initially it took me some time to figure out why it worked in debug mode but not in release mode, until I found out that I had changed the calling convention from __cdecl to __fastcall, which appears not to be a good idea. [SMILEY Smile]
Performance is OKish, I'm experiencing something like an 80% speedup compared to single core tracing, which still means frame times are lower than anything I ever produced before.
Apparently there's some data fighting going on, as I get some random black packets and even a completely broken packet line every now and then (although far less than 1% of the packets appear to have problems, so that's probably going to be hard to track...), but I guess I'll find the problem soon enough.
Syncing is really simple now: The threads both run until there's nothing left on the task stack; the main thread keeps Sleep()ing for 1ms until all lines are done. That's an average oversleep time of 0.5ms, which is less than 1% even for scene6 at 800x480.
I'll keep you guys posted.
EDIT: Ow crap it gotta be my mailboxing that's messing up my show... That's not funny.
_________________
--------------------------------------------------------------
Whatever
(L) [2006/05/13] [Phantom] [multithreading] Wayback!That's what everyone keeps saying, yet I get about 8% more speed using it.
_________________
--------------------------------------------------------------
Whatever
(L) [2006/05/13] [lycium] [multithreading] Wayback!have you spent as much time optimising the hashing as you have the mailboxing? sure you aren't missing anything obvious? the only time i can imagine hashing being slower is due to capacity/replacement misses, which i would further imagine are fairly rare since you're usually testing < 8 or so objects per ray. that tiny table simply must be in l2 more often than mailboxes spread all over the place (how many hundreds of cycles to access mem on a modern cpu? hoho had some numbers- i think even re-intersecting might be faster than that!)
edit: the more i think about it, the stranger it seems; you have to (cold) load the mailbox, intersect the object, get a miss, write to the mailbox, intersect some more stuff, then read from that mailbox again. that means the write has to retire first, then you have to grab it back into l2. surely there's *no way* that's faster than spending a few cycles to intersect the data you have on hand?
on the other hand, this is handwaving based on experiments i did some time ago on a tbird 1.4ghz, and my mem management was especially sucky. the mailboxing/intersectionhashing speedup was also highly scene dependent, and helped most in scenes where rays went parallel to long objects spanning many leaves. so i can't speak for your implementation but, you know, strong intuition is difficult to silence... eh, maybe i should just stfu and do some work :/
(L) [2006/05/13] [Phantom] [multithreading] Wayback!Let me explain how I do 'mailboxing' right now (it's probably the wrong word for what I do):
When I touch a primitive with a ray, I tag that primitive. Next time I encounter the same primitive in another leaf, I can detect that I already processed that primitive, and so I skip it.
This requires of course that intersections for a ray and a primitive are allowed to be outside the leaf that is being processed. This in turn means that Toxie's suggestion to mask out invalid rays for the intersection loop doesn't work in my case.
I tested several setups, but of all these cases my simple approach is by far the fastest (though I realize that's just in my code, so in other tracers, other approaches may be better).
For my dual core tracer, this approach leads to problems btw: Since the other thread could be processing the same primitive (and thus setting the 'last checked ray ID' for a prim to something else), theoretically more primitives than necessary are tested. Right now I solved this by using 16 bits for 'thread 0 ray IDs' and the other 16 bits for 'thread 1 ray IDs', but this wouldn't work for quadcore, obviously.
I hope my 'algorithm' is clear; it's a bit different than what others do I guess, especially since I don't throw out hits outside the current voxel. It does pay off however.
_________________
--------------------------------------------------------------
Whatever
(L) [2006/05/13] [Phantom] [multithreading] Wayback!Loads and stores: I store the ray id directly in the primitive structure that I'm using anyway, so it's in L1 cache. I don't see how this could hurt. Only problem is that I can't dismiss groups of four rays based on the mask (like Toxie suggested), but in practice, this rarely happened, while hits outside the current voxel happen all the time.
EDIT: Hard to keep up with your posting speed. [SMILEY Smile]
_________________
--------------------------------------------------------------
Whatever
(L) [2006/05/13] [Lynx] [multithreading] Wayback!Hm...with 2x16 bit IDs you already have to zero all mailboxes every 65535 rays, don't you? Because otherwise you risk dismissing a triangle tested 64k rays ago, not with this ray. And 64k already means over 10 mailbox clearings a frame at 1024x768 [SMILEY Shocked]
Or is there a trick? Or are you arguing that the chance is still reasonably low for realtime usage?
(L) [2006/05/13] [Phantom] [multithreading] Wayback!That chance is *really* low, I guess. The chance that a primitive was hit in the previous frame by a ray that had the same id as a ray that is hitting it now is almost zero. But I agree, it could happen. In that case, a packet will not be rendered, leaving the previous contents on the screen (since I don't zero the frame buffer).
_________________
--------------------------------------------------------------
Whatever
(L) [2006/05/13] [Lynx] [multithreading] Wayback!Hm...but thinking a bit more of it, once you have shadows, reflections etc. i'd expect things to become pretty much random-alike...
e.g. no statistical relation between pixel number and mailbox-ID to rely on, and so you can't really avoid a 1/2^16 chance of doing a wrong intersection...sure that sounds really low, but doing ~2^20 pixels per frame and assuming you need more than one ray per pixel, you already risk several bad pixels per frame, like a wrong shadow here, a wrong reflection there...!?
I can only hope i'm wrong (as usual) [SMILEY Very Happy]
(L) [2006/05/13] [Phantom] [multithreading] Wayback!Well with the demo I just uploaded, there's one frame every 20 seconds or so that is completely garbled. I'm not sure if it's the mailboxing, but it seems likely this is the problem. So I guess I have to look for a better solution (also because of quad cores and things like that...)
_________________
--------------------------------------------------------------
Whatever
(L) [2006/05/15] [playmesumch00ns] [multithreading] Wayback!What's the speed hit like if you just remove the mailboxing code? I seem to remember someone (was it toxie in that video?) saying they thought it was best just to ignore it. That's what I'm doing (certainly makes async a lot easier), but I haven't got to the stage where I can create any trees that are likely to test how efficient that is.
(L) [2006/05/15] [Phantom] [multithreading] Wayback!It's not huge. The penalty for removing mailboxing is about 8%, but then I can enable some other tweaks, so the total hit would be more like 5%, perhaps even less. And that's only needed on dual cores, mono core rendering won't hit id cycling within several hours.
_________________
--------------------------------------------------------------
Whatever
(L) [2006/05/15] [Phantom] [multithreading] Wayback!Here's a link to someone who actually explains what a race condition is, plus some other goodies.
[LINK http://paulbridger.net/race_conditions]
_________________
--------------------------------------------------------------
Whatever
(L) [2006/05/15] [Phantom] [multithreading] Wayback!I'm not supreme at multithreading, which is why I mentioned I'm trying to learn. But you're right, I should have read up on the topic. I did read up by now, and so I wanted to share the info that I found valuable. I could have shared that in more neutral wordings of course, but then again, tbp (and hopefully others) aren't easily offended. [SMILEY Smile]
_________________
--------------------------------------------------------------
Whatever
(L) [2006/05/15] [tbp] [multithreading] Wayback!I'm easily irritated, even more so when i have the impression i'm discussing with a brick [SMILEY Razz]
I guess i was expecting a bit better of you Jacco.
Hmm, i'm not sure that article is a good introduction to the topic; in fact i'm sure it's not. Grab a decent CS book and look up the IPC chapter. Because there's a bunch of essential concepts to grok, and race conditions are only a symptom.
(L) [2006/05/15] [tbp] [multithreading] Wayback!There's no supermind out there.
Now, just like i expect someone taking the time to post a message about, say, raytracers (and taking others' time to read it) to know what a ray is, i expect people arguing about multithreading to know about IPC. You included.
That's it.
EDIT: And that's all there is behind my brick remark; you can't have a conversation when you're not discussing the same topic.
(L) [2006/05/15] [madmethods] [multithreading] Wayback!Ouch; don't do that.  Bad juju on caching multiprocessors.  You don't want to mix read-only data with read/write data, and you don't want to mix thread-specific data with non-thread-specific data.  Put the primitive data in one area, and separate out the mailboxing IDs by thread.
What you're seeing performance-wise is similar to what others have seen -- the fastest kd-tree RT implementations do use some variant of mailboxing (both the Intel stuff and the Saarland U stuff), but the impact is not huge.
-G
(L) [2006/05/15] [Phantom] [multithreading] Wayback!Good point, I'm glad I'm sharing data and experiences. It's paying off big time.
(little war settled outside forum)
_________________
--------------------------------------------------------------
Whatever