RT07


(L) [2007/06/25] [toxie] [RT07] Wayback!

who else is going to visit "us" in ulm? (except for the ones i already know: phantom, bouliiii, necro, hanatos. anyone i missed??!)
(L) [2007/06/25] [greenhybrid] [RT07] Wayback!

if the bucks are wise, I'll come [SMILEY ;)]
(L) [2007/06/25] [tbp] [RT07] Wayback!

Oï, judging by those handles, it sounds more like a goth party [SMILEY ;)]
(L) [2007/06/25] [greenhybrid] [RT07] Wayback!

>> tbp wrote:Oï, judging by those handles, it sounds more like a goth party

I lol'd!

Would really like to know who else will come visit us teutons [SMILEY :D]
(L) [2007/06/25] [Mark_Larson] [RT07] Wayback!

what is ULM?
(L) [2007/06/25] [Shadow007] [RT07] Wayback!

>> Mark_Larson wrote:what is ULM?

Ulm is a city in Germany, where the "Symposium on Interactive Ray Tracing 2007" will take place.

You'll find more information at the following links:
[LINK http://ompf.org/forum/viewtopic.php?t=376]
[LINK http://www.uni-ulm.de/rt07/RT07.html]
(L) [2007/06/25] [goodbyte] [RT07] Wayback!

I might show up as well.
(L) [2007/06/26] [moris] [RT07] Wayback!

I think I will come too.
(L) [2007/06/27] [Darnal] [RT07] Wayback!

Good chance I'll be there, just need to pester the supervisor to send me even though I don't have a paper ready ... next year I keep telling myself, next year [SMILEY ;-)]
(L) [2007/07/06] [GaRRiLL] [RT07] Wayback!

Hello everybody!
I'm new on this forum, but I want to visit the symposium very much!
I have a few issues. I'm from Russia, and I have to get a visa for Europe. I have never visited a foreign country before. The German embassy wants to know whether I'm invited to the symposium, but as far as I can see the registration is not open yet. Could I get a tourist visa now and visit Germany in September?
I also don't know a word of German [SMILEY :)].

Could somebody advise me on the above issues?
Is anybody from Russia going to visit Ulm?

I didn't submit my paper before 17 June because it wasn't ready, but I have some results in interactive RT on a GeForce 8800GTX with CUDA.
(L) [2007/09/07] [fpsunflower] [RT07] Wayback!

I'll be there too! In fact I'm on my way [SMILEY ;)]
(L) [2007/09/07] [toxie] [RT07] Wayback!

Boozing will start on Sunday evening.. Be there or be square!
(L) [2007/09/07] [goodbyte] [RT07] Wayback!

Awww... I won't get there until 9 in the evening ;-(
(L) [2007/09/08] [Wussie] [RT07] Wayback!

bah [SMILEY :(] poor student lacks the funds :'( Phantom's lil' thing is going to be pretty neat though [SMILEY :)]
Enjoy yourselves!
(L) [2007/09/12] [goodbyte] [RT07] Wayback!

Does anyone have a good picture of the open problem poll? Mine turned out very black...    [SMILEY :(]
(L) [2007/09/14] [fpsunflower] [RT07] Wayback!

[IMG #1 Image]
(L) [2007/09/14] [toxie] [RT07] Wayback!

i especially like: SIMD > 4, real world cameras, efficient anti-aliasing and soft-shadows..






..ironic mode OFF
(L) [2007/09/14] [davepermen] [RT07] Wayback!

i'd like to see SIMD gone.. more simple cores with no SIMD capabilities.. instead of 4 SIMD floats, i'd prefer 4 MIMD cores with one float each..

*dreaming* ..
(L) [2007/09/14] [greenhybrid] [RT07] Wayback!

>> davepermen wrote:i'd like to see SIMD gone.. more simple cores with no SIMD capabilities.. instead of 4 SIMD floats, i'd prefer 4 MIMD cores with one float each..
*dreaming* ..


ditto. and every core with its very own cache, and maybe a new kind of RAM that allows separate access for each core (but I am maybe naive and not up to date regarding the state of the art in RAM techniques, so feel free to enlighten me [SMILEY :D])
(L) [2007/09/14] [davepermen] [RT07] Wayback!

i'm interested in .. what was it called again.. well, that ram with the database-style write modes to it.. transactional ram, yes, that was it.. sort of "that's my ram," said the thread.. and so it was.. [SMILEY :)]

should help make threading much simpler with sort-of automatic locking.
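
As an aside, a rough sketch of the contrast (illustrative only, not a real transactional-memory API): today every shared update needs explicit locking, which is exactly the bookkeeping transactional memory would do implicitly.

Code:
#include <mutex>

struct SharedCounter {
    std::mutex m;       // explicit lock object, one per shared structure
    long value = 0;

    void add(long x)
    {
        std::lock_guard<std::mutex> lock(m);  // "that's my ram," said the thread
        value += x;       // a transactional-memory system would run this as one atomic
    }                     // transaction, with no lock visible in the source
};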

but yes, imagine that a current quadcore would instead be a 16-core with simple fpus.. this would help compilers big time.. parallelizing to 4 cores or 16 is no real difference.. but at the same time having to optimize for sse and all.. hell..

and i'd love it, working mostly in .NET nowadays. it would make JIT compilers much more effective as well. my raytracer would by default be about as fast as todays simd optimised ones (with same algorithms implemented, that is.. of course).
(L) [2007/09/14] [davepermen] [RT07] Wayback!

shameless plug

i just had to rant .. [SMILEY :)] [LINK http://davepermen.spaces.live.com/blog/cns!789D02F1FCA626E0!152.entry]

[SMILEY :)]

i know it's somewhat idealistic, especially the numbers in there [SMILEY :)] but still.. it's about the idea..
(L) [2007/09/14] [ingenious] [RT07] Wayback!

We should have added 'participating media' just to remind people not everything boils down to ray/surface intersections.

Oh, by the way, can we set up some place where people can upload pictures from the venue? I came too late for the group picture...
(L) [2007/09/14] [davepermen] [RT07] Wayback!

yes indeed. fog, smoke, etc.. and SSS (i love that acronym).
(L) [2007/09/14] [rogon] [RT07] Wayback!

The reason SIMD is here to stay is that from a hardware perspective, it is much easier. MIMD implies a complicated distribution network, something that will eat up transistors that I'd rather have spent on more functional units.

The problem is two-fold:

1. Compiler Optimization.

Compilers have a really hard time utilizing SIMD instructions. This is because most compilers are not equipped to "think" in parallel. Even supposedly good parallelizers fail miserably on slightly more complicated cases. I've tested several compilers on the following simple code:

Code:
// ALIGN(16) is the poster's macro, e.g. #define ALIGN(n) __attribute__((aligned(n))) on gcc
struct Elem { float a, b, c, d; } ALIGN(16);
void test(int n, Elem const *const __restrict input, Elem *__restrict output)
{
    for (int i = n; i >= 0; --i)
        output[i].d = ((input[i].a + input[i].b) * input[i].c) * input[i].d;
}


I'm giving the compiler all possible knowledge. It is aligned. It is a simple loop. It should be able to SIMD-ify this trivially by doing a four-at-a-time shuffle. But it just won't! Also, on non-Intel platforms (Altivec and SPE) - where there is no out-of-order execution, this loop can be software pipelined, but none come even close.
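
For reference, a hedged sketch of the hand-vectorized loop the compilers fail to produce, using the "four-at-a-time shuffle" mentioned above (SSE1 intrinsics; the upward loop direction and the assumption that n+1 is a multiple of 4 are mine, not the poster's):

Code:
#include <xmmintrin.h>   // SSE intrinsics; Elem as defined above

void test_sse(int n, Elem const *const __restrict input, Elem *__restrict output)
{
    for (int i = 0; i <= n; i += 4)
    {
        // load four Elems, each {a,b,c,d} in one register (16-byte aligned)
        __m128 e0 = _mm_load_ps(&input[i + 0].a);
        __m128 e1 = _mm_load_ps(&input[i + 1].a);
        __m128 e2 = _mm_load_ps(&input[i + 2].a);
        __m128 e3 = _mm_load_ps(&input[i + 3].a);

        // AoS -> SoA transpose: e0 = a's, e1 = b's, e2 = c's, e3 = d's
        _MM_TRANSPOSE4_PS(e0, e1, e2, e3);

        // d = ((a + b) * c) * d for four elements at once
        __m128 d = _mm_mul_ps(_mm_mul_ps(_mm_add_ps(e0, e1), e2), e3);

        // scatter the four results back into the .d fields
        _mm_store_ss(&output[i + 0].d, d);
        _mm_store_ss(&output[i + 1].d, _mm_shuffle_ps(d, d, _MM_SHUFFLE(1, 1, 1, 1)));
        _mm_store_ss(&output[i + 2].d, _mm_shuffle_ps(d, d, _MM_SHUFFLE(2, 2, 2, 2)));
        _mm_store_ss(&output[i + 3].d, _mm_shuffle_ps(d, d, _MM_SHUFFLE(3, 3, 3, 3)));
    }
}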

2. Language.

Only recently have there been efforts into nested data-parallel languages, Haskell being one of the forerunners. See for instance  [LINK http://haskell.org/haskellwiki/GHC/Data_Parallel_Haskell]. With code transformations we can still "think" one element in, one element out, while the compiler should be able to produce wonderful parallel code, even for a task such as ray tracing.

Rogon.
(L) [2007/09/14] [davepermen] [RT07] Wayback!

well, today's MIMD basis is already laid out: multicore.. just strip the SIMD out of the current cores, which makes them simpler.. and then put more on the die (like current quadcores)..

[SMILEY :)]

it would solve so many issues.. they should not push the problem onto the programmer..
(L) [2007/09/15] [Alan] [RT07] Wayback!

It takes a lot more transistors to add an extra core, especially with OOOE etc, than it does to add an extra functional unit to an existing core.
(L) [2007/09/15] [davepermen] [RT07] Wayback!

sure.. still, today we CAN have more cores, we can have hyperthreading, and even 80-core systems. SIMD is in such a system a bit of an old relic.

it has high performance, yes. but no scalability, and it means more work for developers and compilers, and thus mostly apps that are less performant than they could be.

hey, it _is_ just a rant. [SMILEY :)]
(L) [2007/09/15] [ingenious] [RT07] Wayback!

But Intel's plans are to both increase the number of cores and SIMD's width and flexibility. It's all about balancing. You can't just say "let's remove the SIMD units and free some space for more cores".

As they've added SIMD to their processors, there must have been a good reason to do so. One of the simplest comes from its name - single instruction. Telling the CPU "subtract these chunks component-wise" is more efficient than telling it "subtract this one from this one, then subtract this one from this one, etc.." And if you want to write only scalar programs for your quadcore, no problem. But 4x1 < 4x4  [SMILEY :D]
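
As an aside, a minimal illustration of that "single instruction" point (values are made up; SSE1 intrinsics, unaligned loads assumed):

Code:
#include <xmmintrin.h>

void scalar_vs_simd()
{
    float a[4] = { 1, 2, 3, 4 }, b[4] = { 4, 3, 2, 1 }, r[4];

    // "subtract this one from this one, then this one from this one, ...":
    // four separate subtract instructions
    for (int i = 0; i < 4; ++i)
        r[i] = a[i] - b[i];

    // "subtract these chunks component-wise": one SUBPS does all four at once
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(r, _mm_sub_ps(va, vb));
}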
(L) [2007/09/15] [nAo] [RT07] Wayback!

Educate games developers about what?
(L) [2007/09/15] [davepermen] [RT07] Wayback!

>> ingenious wrote:But Intel's plans are to both increase the number of cores and SIMD's width and flexibility. It's all about balancing. You can't just say "let's remove the SIMD units and free some space for more cores".
As they've added SIMD to their processors, there must have been a good reason to do so. One of the simplest comes from its name - single instruction. Telling the CPU "subtract these chunks component-wise" is more efficient than telling it "subtract this one from this one, then subtract this one from this one, etc.." And if you want to write only scalar programs for your quadcore, no problem. But 4x1 < 4x4  

problem is SIMD by default is restricted by this "Single Instruction".. normally you want to apply Single Algorithm, but not instruction.

16x1 >> 4x4 just because you could actually 100% utilise those units, without huge work and redesign and restructuring of all your existing stuff..

and i know that SIMD has its uses. btw, they currently expand SIMD for just one reason: because they can't get their compilers to simply utilize it.. that's why they expand it: to make it more compiler friendly.

remove it, make 4x as many floating-point units, re-enable hyperthreading (which they invented because it doesn't cost much). every hyperthread gets access to one of those 4 floating-point units (which cost as much silicon as one SSE unit.. or less, because of the lower instruction cost).. this would allow for SAMD (single algorithm multiple data) even on an ordinary single core without _that_ high a cost.

main reason they push SSE: they invented it, and can still build on it. train programmers on it, write compilers for it. it lets the industry be "magic", and lets them make money.
(L) [2007/09/15] [ingenious] [RT07] Wayback!

I agree that 16x1 >> 4x4, but do you really think it's that technologically simple? Do you have the required background in modern CPU construction and issues to say such strong words?
(L) [2007/09/15] [Alan] [RT07] Wayback!

Most CPUs are still designed more or less along the Von Neumann model. Adding SIMD to a CPU is just a matter of increasing bus widths, changing the datapath a little and adding parallel functional units (four adders instead of one etc). MIMD requires duplication of most of the frontend control logic as well as multiple backend execution units which, unlike SIMD, are capable of acting independently. That uses up a lot more die area, and produces a lot more heat.

Look at how limited the instruction set on the CellBE's SPUs is compared to a general purpose CPU for an example of the tradeoffs which have to be made to support a lot of cores. And remember those are in-order CPUs too, with really small control logic compared to the Intel chips most people work with.
(L) [2007/09/16] [lycium] [RT07] Wayback!

i think the current nvidia and ati/amd gpus are a neat little example of 16x1 vs 4x4 parallelism: 16x1 is more efficient on average with lower peak computation, while 4x4 is raw power to the max and needs optimising (athlon64 vs p4 was a similar story too). it's all about switching overhead / execution width, so if your problem is parallel and compute-dense it really helps to thin down the instruction decode requirements via simd.

before we talk about modern cpus ditching simd, we should talk about them ditching that 386-era cruft... simd makes a lot of sense in the current x86 scheme of things because it leverages that flexible+compressed instruction format.
(L) [2007/09/16] [greenhybrid] [RT07] Wayback!

I am really dreaming of a hypercore thingy, maybe with just 500MHz or even less per core, but then thousands of them .... Raw Monte Carlo Ray Tracing in Realtime anyone?
(L) [2007/09/16] [Lynx] [RT07] Wayback!

The problem indeed is that putting 4 cores without SIMD on a die needs many more transistors than one core with a 4x SIMD unit... everything has to be duplicated, while executing the actual float operation is almost a minor step in the CPU pipeline...
But SMT indeed seems to promise an affordable solution. Look at Sun's "Niagara": sure, the T1 only has a single FPU for the whole chip, but the T2 will at least have one per core (I recently read the CPU is already finished, but no mainboards exist yet), and maybe that "throughput computing" paradigm will meet high-performance floating-point computation soon, so we get CPUs with hardware-threaded SISD cores instead of software-threaded SIMD cores... i just suck so much at SIMD coding [SMILEY :)]
(L) [2007/09/16] [lycium] [RT07] Wayback!

<getting a little off-topic here>
(L) [2007/09/16] [davepermen] [RT07] Wayback!

a little? [SMILEY :)]

i'm sorry..
(L) [2007/09/16] [lycium] [RT07] Wayback!

no stress, i'm sure toxie will clip this off to a new thread in the "considered harmful" section or so :)

(btw, my remark about your blog: simd makes a lot of sense from transistors vs flops efficiency, and if you want speed you shouldn't complain while coding in c# ;) that's a little bit like "want to have your cake and eat it" :P)
(L) [2007/09/16] [davepermen] [RT07] Wayback!

sure, simd is great for chip developers.. but it's not great for coders, as nearly no algorithm maps 100% to simd (but most map 100% to a mimd, or simple sisd + threading algorithm)..

and no, just because we can do packet tracing doesn't mean simd fits 100% to us. packet tracing algorithms are normally never 100% as efficient as non-packet algorithms.. (but they are still great, given the current architectures..)
(L) [2007/09/16] [lycium] [RT07] Wayback!

>> davepermen wrote:sure, simd is great for chip developers.. but it's not great for coders

ehm... would all the coders who haven't used simd before please put up their hands?

i think you might be unique in this respect ;) there are plenty of opportunities to use simd, no matter how incoherent your program flow is (e.g. nick and i find plenty of places to use it in metropolis light transport rendering).
(L) [2007/09/16] [davepermen] [RT07] Wayback!

but it's still not your first choice.. normally you "prototype" your code with simple floating-point code, as there the codeflow, branches and logic are easy and visually simple to see. and then you remap for simd.

now with threading, you don't need a remap (especially not in raytracing or rasterizing, as you normally parallelize at a higher level). the only reason to remap is that simd doesn't fit the code, you have to fit your code to simd.

and you only do it because you know you can gain performance by it. now if instead of 4x simd we had 4x more (hyper or whatever suits your needs) threads running in parallel, you would not need to redesign the codeflow to fit simd and then actually implement it.

and no codeflow except basic examples fits 100% to simd, which means the simd code is more complex and less performing than the original sisd version on 4x as fast hw (be it 4x cores, or simply 4x ghz, or whatever mix). those losses may be minimal.. but they remind one that you're not coding for optimal hardware.. the niagara2 might be much nicer.. or something like cell..



i don't say we should not use SIMD, as it _is_right_there_to_use_ and it's great for enhancing your app's performance. i just state it should not be there in the first place, as it's simply something making all our lives unnecessarily complicated. simd puts parallelization at the wrong place.. there, where it can't scale (it's just a constant speed gain of max 4x.. threading or overclocking can scale without (non-hardware-based) limits), and there, where it needs a lot of manual work from compilers and coders.. at the lowest level possible.

if there were no simd, we would have to do only two things:
making our code scalable with threads (depending on the case, even over the network)
basic optimisation (sisd floating point can be done very well at compiler and coder level, without need for ugly low-level work at all)

the whole last step, replacing all your code with simd, would just not be there, making code by default quite high performing (meaning all our os' and apps would perform well by default..), and coding very rapid (which would help all our os' and apps, too).



but i never say don't use it, as it's there for usage.. there's a difference between using what's given, and thinking about how it might be without it. i will support simd everywhere needed, because i know what gains i can have using it. but i still know how life could be much easier for everyone without it.

and no, today there is no excuse except "because we've had it for years". once you've set up a multicore processor, the number of cores doesn't really matter anymore. most transistors go to cache today anyway, and with the intel way of putting multiple dies onto one package to save costs, it would not be that hard to spit out, say, small quadcores with no simd, put 2 or more of those dies on a single sellable unit, and call it an octocore or 16-core.. or even more if needed.

see niagara2 for a simple "threads are everything" cpu.. still waiting for some raytracing code on it.. i'd buy such a cpu if i could.. i guess [SMILEY :)]
(L) [2007/09/17] [lycium] [RT07] Wayback!

>> davepermen wrote:but it's still not your first choise.. normally you "prototype" your code with simple floatingpoint code, as there, codeflow, branches, logic is easy and visually simple to see. and then, you remap for simd.
foresight.
 >> davepermen wrote:now with threading, you don't need a remap (espencially not in raytracing or rastericing, as you normally parallelize at a higher level).
monte carlo.
 >> davepermen wrote:the only reason to remap is because simd doesn't fit to code, you have to fit your code to simd.
that's right, 4x performance doesn't come without some effort. this is a new thing in computing? am i so old that already people have forgotten the days of spending weeks optimising crucial things in assembly language? using a few intrinsics with register-variables is a walk in the park.
 >> davepermen wrote:and you only do it because you know you can gain performance by it.
?!
 >> davepermen wrote:now if we instead of 4x simd would have 4x more (hyper or what ever suits your needs) threads running in parallel, you would not have the need to redesign the codeflow to fit simd, and then actually implementing it.
nonsense. just because you don't use any thread-unsafe code doesn't mean everyone else doesn't; moreover, those threads have little chance to share data the way you can with simd (e.g. in k-d tree packet traversal you not only get 4x the computation, but you also cut your bandwidth requirements by 4x instead of exploding it 4x).
i'd like to name that crucial element again: foresight. after your third or fourth ray tracer, you begin to understand how to properly design them and can exercise a little foresight to see how you can cleanly account for wider execution, be that with fatter loops (good for x64) or simd (good in general). all it takes is a little time sitting and staring into space over a cup of coffee, really i don't think that's much...
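A minimal sketch of that bandwidth point, not taken from the post itself (the packet layout and names are illustrative): four rays cross one kd-node split plane, and the node data is fetched once for all four.

Code:
#include <xmmintrin.h>

struct RayPacket4 {               // SoA layout: four rays side by side
    __m128 org[3], inv_dir[3];    // origins and reciprocal directions, x/y/z
};

static inline __m128 split_distance(const RayPacket4 &p, int axis, float split_pos)
{
    __m128 split = _mm_set1_ps(split_pos);             // one node fetch, broadcast to four lanes
    return _mm_mul_ps(_mm_sub_ps(split, p.org[axis]),  // t = (split - org) * (1/dir)
                      p.inv_dir[axis]);                 // ...for all four rays at once
}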
 >> davepermen wrote:and no codeflow except basic examples fit 100% to simd, means the simd code is more complex, and less performing than the original sisd version on 4x as fast hw
there are quite a few things that fit exactly 100%. i should be doing other things atmo, so if you don't believe me that's fine, i'm not going to take the time to make some long list of simd'able things in graphics programming ;)
 >> davepermen wrote:simd fits parallelization at the wrong place.. there, where it can't scale (it's just a constant speedgain of max 4x.. threading or overclocking can scale without (non-hardware-based) limits), and there, where it needs a lot of manual work for compilers and coders.. at the lowest level possible.
again you fail to acknowledge that there are places where you can get >= 4x speedup (ofc you need the single cycle sse to see 4x improvement), and more importantly the transistors vs performance picture.

want more threads? hell, even that's too difficult, why not just demand, "MORE GHZ PLEASE!"? there are some realities you're just not taking into account, and i assure you that intel+amd's engineers aren't stupid.


well, i really can't go and reply to all the things in your post, i should be studying :( truly, you should just ask yourself, "why do chip manufacturers include simd at all?", "would they do it if no one used it?", "why do i see sse being used so much?", "i mentioned the cell, it has simd instructions in its spe units, why?", ... things like that. the answers are pretty logical once you step away from the purely "make it easy for me to have fast c# code" viewpoint ;)
(L) [2007/09/17] [davepermen] [RT07] Wayback!

"why do chip manufacturers include simd at all?"
because it was cheap to add it, and gain some performance over the other manufacturers

"would they do it if no one used it?"
they add it onto the one cpu that everyone uses.. people WILL use it to gain the performance

"why do i see sse being used so much?"
no other option to make high-performing code, given x86's long-standing monopoly

"i mentioned the cell, it has simd instructions in its spe units, why?"
because everyone is trained to think "having simd is a requirement"


and yes, i'm not stupid. i just take an extreme position currently to show the other side. i do use simd, and i appreciate the gain i get from it. i just see that it's a very complicated tool for a fixed gain, so it's not really a long-term, long-scaling solution. code scales with ghz, and code scales with threads (in massive data processing code, at least.. raytracing, graphics, audio processing, video/audio codecs). it never scales with simd. that has just a constant performance gain.

having a constant gain is GREAT for your app. but it's not great in general. that's why we currently all implement threading, because we know that without it our apps will run badly in the future, won't scale at all.


all i say is SIMD is complicated, not the default way you would code else, and a lowlevel hack to enable a bit higher performance.

i'm not talking about c# directly. but the fact that c# can't really use sse shows there is something inherently wrong with it. c# can use ghz to scale, it can use threads to scale, and it can use 100% of the cpu _except simd_. this shows there is an issue with simd. the issue is, it's possibly not really the right tool for the job. it's an overcomplicated, much too low-level tool.

tons of applications are written without simd in mind. a lot of these (especially server apps) have threading in them. apps like that have scaled for years, both with ghz and threads, even today. but they never get the full performance of a system, because simd doesn't "just work". old code can't use new simd either. it can use more ghz and more threads. apps scale to new hardware. they never scale to simd unless you do it manually with a lot more work.


thats my point. not the fact that you shouldn't use it. it's there, gives performance. so for anyone around here wanting high performance, it is the last resort. and it's needed.

i just say it shouldn't be. it makes all our lives complicated, which they wouldn't be without it.


and i never say it would be only easy. of course not. but it would be cool if ordinary code, fed through ordinary compilers that could nearly be written by you and me, came out as great-performing code. it would be great if JITted code were as high performing as ordinary code. it would be great if even javascript could use 100% of the cpu. every os would use 100% everywhere it needs to, every app would. if simd did not exist, we would be there by today.

and i guess the 4x speed we would lose by removing simd would have been solved some other way if they hadn't added it in the first place. so we wouldn't be 4x slower today without simd.


we have to use it today, i'm aware of that. and i design my code for it, and work with it. i just see the amount of work that would be saved without it. dunno if you actually see that. it took me years to see it, having worked only with c++, at a very low level, done all the stuff. but i don't have the time and energy to do that all the time, there are tools and languages which let you be much more productive, and stepping back is quite hard... having to step back is even harder.. [SMILEY :)]
(L) [2007/09/17] [toxie] [RT07] Wayback!

i'm all with davepermen..

but as SIMD saves a lot of space on the chip (compared to separate cores) while offering large (potential!) speed increases, hardware developers won't give it up in the next years..
(L) [2007/09/17] [davepermen] [RT07] Wayback!

yes..

what is strange is, they make simd more and more general, to finally make good compiler / language support possible. at this point the simd unit has got so complex, it might be easier to just strip it out and make a smaller non-simd design. the x86-decoding unit is a tiny piece today, even with simd decoding. and one could map all those instructions to the existing fpu for execution, so simd code would be as fast as fpu code. that way, a single core would drop quite a bit in transistor count. replace those transistors with individual fpu's, put back hyperthreading, and we could have similar performance without the lock to a single instruction.

i'm no expert, but i don't think the actual transistor cost would be that much different. once SIMD reaches a full 4x fpu replacement, you can just as well put in 4 fpu's with that transistor count. and hyperthreading doesn't cost many transistors, at least that's what intel stated all the time.

this would be sort of the "amd way" of processor: it can handle all code, and it handles all code well (athlons always had simd only at half speed => fpu code was quite nice in performance compared to simd.. now they have a full 4x simd unit.. step in the wrong direction [SMILEY :(] hehe).
(L) [2007/09/17] [Phantom] [RT07] Wayback!

The best way to reduce the pain of SIMD is to use it as a way to parallelize your application, if your code path allows this. But I do agree with davepermen, it's a strange thing. It feels like a 'quick optimization opportunity' on the HW side, with little consideration for the programming side. It does not give you any NEW functionality, it just improves performance of existing functionality, but with a ton of constraints.

In the early days of the Pentium, the CPU would try to fit code to the U and V pipelines, and to split code over the floating point unit and the integer unit in parallel. All this could be done by good compilers, and only rarely would the programmer write code to specifically exploit this. With SIMD, your C code is changed beyond recognition.

We love the extra power, and of course it's nice to have an edge over those silly C# boys, but to say it's a well-designed and demand-driven extension... No way.
(L) [2007/09/17] [davepermen] [RT07] Wayback!

those silly c# boys..

that hurts..

very



much


[SMILEY :)]

(if you call us silly, you don't know c# yet, really.. [SMILEY :)])
(L) [2007/09/17] [davepermen] [RT07] Wayback!

oh, and.. it's bad that this difference actually exists. it's mainly caused by features like sse which don't allow "the c# boys" to be as performant. this _is_ a bad thing. (and i know how unhappy the JIT compiler writers are about this.. they use sse for sisd optimisation over the fpu, where appropriate, but simd simply doesn't work well, especially not in a JIT environment, it would take too long to analyze the code for it..)
(L) [2007/09/17] [davepermen] [RT07] Wayback!

>> lycium wrote:that's a little bit like "want to have your cake and eat it" )
this should ALWAYS be our target in every place...



when will this thread be split? [SMILEY :P]
(L) [2007/09/17] [toxie] [RT07] Wayback!

i like the idea of stripping x86 down to the bare metal, resulting in just a tiny and simple RISC architecture, but with lots of cores..

after all there is a reason why vector-computers "died" some while ago.. so i wonder how long SIMD will survive..
(L) [2007/09/17] [Shadow007] [RT07] Wayback!

We're speaking of a 4-times speedup here. We could talk of a 2-year step into the future!


It may not be the "order of magnitude" you can get by using a better algorithm (which is what the researchers are paid to find [SMILEY Wink] ), but it's better than nothing!


Moore's law says twice the number of transistors each year (not sure about the period, but let's say 1 year).

By interpreting that as 2 times more cores each year, a 4-times speedup is something like 2 years won!


Of course SIMD is really harder than simple code... so is multithreading!

SIMD won't scale as well as multicore (it DOESN'T scale at all [SMILEY Smile] ), but it still IS a 2-year advance... which has its cost.



PS: please feel free to ignore my post, as I don't use SIMD personally/professionally and don't have any experience with it, etc...
(L) [2007/09/17] [Shadow007] [RT07] Wayback!

Forgot ...


Of course, if removing the SSE* parts of the core amounted to more than half of the processor, it would be better to trade it away... But it seems to me it accounts for no more than a few percent... Any specialist care to contradict me?
(L) [2007/09/17] [Shadow007] [RT07] Wayback!

You're right, we're talking about a "virtual" 4-times speedup from SIMD. But it would also be a "virtual" speedup from multicore: multicore doesn't necessarily scale really well either (depending on the algorithm, of course)...
(L) [2007/09/17] [davepermen] [RT07] Wayback!

jep. but you can take whatever algorithm, and if you can split the data into 2, 4, 8 ... etc pieces (easy in rendering situations), you can just spread the data and the algorithm. if there is no bandwidth or whatever issue, your algorithm will run just as well on many threads as on a single one, without _any_ change in code (which can still be high-level, clean, easily readable and maintainable).
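
A minimal sketch of that "just spread the data" idea (render_row is a hypothetical per-row routine; C++11 threads used for brevity): the inner code is untouched, only the outer loop is split.

Code:
#include <thread>
#include <vector>

void render_row(int y);   // unchanged scalar rendering code, defined elsewhere

void render_parallel(int height, int num_threads)
{
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t)
        workers.emplace_back([=] {
            for (int y = t; y < height; y += num_threads)  // interleaved rows per thread
                render_row(y);
        });
    for (std::thread &w : workers)
        w.join();
}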


simd never has that property. you have to rewrite your algorithm again in asm, or with the intrinsics, replacing its high-level form, its clean readability, its maintainability, etc.


and currently, we all have to use both features to scale to today's platforms. but one of those two is easier, cleaner, usable in most languages, and will scale well into the future. the other has a fixed scale of 2x max, or 4x max, depending on the cpu, will not scale automatically if new instructions get added, and does not automatically run on all systems, only on the ones supporting it (some have SSE, some SSE2, some SSE3, some SSE4, some SSE5 (or whatever the amd one is called again?)). incompatibility, multiple code paths, obscurity, etc..
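
That "multiple code paths" burden looks roughly like this in practice (a sketch; trace_scalar/trace_sse2 are hypothetical, and the cpuid wrapper assumes gcc/clang on x86):

Code:
#include <cpuid.h>   // gcc/clang helper for the CPUID instruction

static bool has_sse2()
{
    unsigned eax, ebx, ecx, edx;
    __get_cpuid(1, &eax, &ebx, &ecx, &edx);
    return (edx >> 26) & 1;           // CPUID leaf 1, EDX bit 26 = SSE2
}

void trace_scalar();                   // hypothetical fallback path
void trace_sse2();                     // hypothetical SSE2 path

void trace()
{
    if (has_sse2()) trace_sse2();      // every new instruction set adds another branch like this
    else            trace_scalar();
}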


now of those two features, both usable to make graphics more performant, i ask you: which one would you drop, if you had the choice?
(L) [2007/09/17] [Ho Ho] [RT07] Wayback!

The problem why SIMD stuff is not used by JIT'ed code is that computers are dumb [SMILEY Smile] They simply can't find good places where they could use it. One beautiful day, when regular compilers (GCC, ICC) can optimize for SIMD, JIT compilers should be able to handle it as well.


Cutting out SSE from a CPU to make room for additional cores is not exactly an efficient way to get more performance. Observe (single core of K10 without L2/3 cache):

[IMG #1 ]

SSE+x87 together take at most 20% of the computational core. You would need to have at least around 10 of those cores before you can add a new one made up from the transistors you saved from removing SSE units. Removing instruction decode and branch prediction units would free up about as much as complete removal of SSE+x87.


Now when looking at a really simple CPU things are a bit different but not too much:

[LINK http://en.wikipedia.org/wiki/Cell_microprocessor_implementations#Cell_floorplan]

[LINK http://en.wikipedia.org/wiki/Cell_microprocessor_implementations#SPE_floorplan]

In every SPE the computational units do not take too much die space (~17.5% per SPE), you can't really save too much by making them SISD instead of SIMD as you will still need the other ~72.5% of the SPE to make it useful.


So if you want to lose SIMD to get more die space for SISD cores, you should be asking if you are willing to accept losing potentially up to 4x in speed to get at most 10-20% better SISD throughput. Also, as Phantom said, besides giving more computational power SIMD can reduce memory bandwidth usage, and that is much more important than having slightly faster computations. Core count will be increasing by a lot in the coming years, but I'm quite sure that memory bandwidth and cache sizes will not magically explode.


I'd say that even though SIMD is tough to code it does make a whole lot of sense to have it. Also Intel has said that they will likely widen their SIMD units even more, Larrabee should have 512bit/16x32bit FP SIMD units.
_________________
In theory, there is no difference between theory and practice. But, in practice, there is.

Jan L.A. van de Snepscheut
[IMG #1]:Not scraped: https://web.archive.org/web/20071025012207im_/http://upload.wikimedia.org/wikipedia/en/2/27/K10h.jpg
(L) [2007/09/17] [davepermen] [RT07] Wayback!

well, how is memory bandwidth an issue? if you process the same amount of data it doesn't matter if you use simd or not.. a non-simd method would even be nicer, as you process serially, which means more time to cache and load the data while still at work.


and well, i'd have to look up how much a hyperthread uses in cpu space.. as i remember, intel said it doesn't cost much. and changing the simd to individual fpus would not cost much either. so putting, say, 4 hyperthreads into a single core would not cost that much, and splitting the simd unit into 4 fpu units wouldn't either.


i know that cores are costly... [SMILEY Sad]
(L) [2007/09/17] [davepermen] [RT07] Wayback!

[IMG #1 ]


hyperthreading doesn't cost much on die size.. as stated here. (this was just from a quick google image search [SMILEY Smile] but i guess at least the die increase number is correct)
[IMG #1]:Not scraped: https://web.archive.org/web/20071025012207im_/http://pcforum.hu/site.pc/text/quicknews/01413/hyperthreading.jpg
(L) [2007/09/17] [Michael77] [RT07] Wayback!

Hyperthreading is of nearly no use anyway [SMILEY Wink] Seriously, Hyperthreading only works well if you do two totally different calculations where one uses different processor resources than the other. Using Hyperthreading in raytracing doesn't help at all (in fact, on a dual P4 with hyperthreading my tests showed that going from 2 to 3 threads increases performance by about 20%, while going to 4 threads increases performance by only about 15% - so it is in fact slower than using just 3 threads). On a true quadcore the same code improves by about 80% per added core (which is still not optimal, but not so bad anyway).
(L) [2007/09/17] [davepermen] [RT07] Wayback!

well, if the 4 float units of the SIMD unit were split up, then the cpu would have 4 floating-point units for 'free'. then up to 4 hyperthreads could use the 4 individual fp units and run in parallel.


it actually looks like intel is planning something similar for the reintroduction of hyperthreading in the next cpu generation...




and in p4 days, it was not the hyperthreading that was bad, it was the whole cpu [SMILEY Smile] (especially since on that one ordinary code ran very badly, and only specifically tuned code could run well.. and sse has this same behaviour)
(L) [2007/09/17] [davepermen] [RT07] Wayback!

no, simd is a problem with nearly all algorithms. and it's a hardware problem, as you simply can't make good tools for it.


and please explain to me how a simd unit processing 4 floats costs less than 4 fp units. that split would not cost anything.


adding 4 thread units, hyperthreads or real cores, would increase the cost. as seen, for hyperthreads about 5%.



stating that simd is good because it's fast because it's in hardware is like stating that rasterizing is the only way to go because it's fast because it has gpus.


we here at the raytracing board know that this is _not_ the way to think about it. a simd algorithm is always a worse algorithm than a sisd algorithm, except when there is a 1:1 mapping. which is _not_ the case in _any_ existing tree-accelerated algorithm, where branching exists. which any tree has, and thus any fast raytracing algorithm as well.
(L) [2007/09/17] [Alan] [RT07] Wayback!

In modern CPUs the actual execution units are totally dwarfed by the control logic which is needed to drive them. Even in those die photos, the percentage of the SIMD area which actually contains adders, multipliers, shifters etc is tiny. Most of it is control logic.


If you drive multiple execution units all off the same control logic, like you do in SIMD, you can very cheaply (in terms of transistors) do more work per cycle, with the large limitation that all the execution hardware has to be doing the same thing.


If you want them doing different things you need separate control logic for each chunk of actual execution hardware.


This just isn't as efficient when 90% or more of the die area is control logic and less than 10% actually performs maths.
(L) [2007/09/17] [toxie] [RT07] Wayback!

but this is mainly due to the fact that x86 has become such a horribly complex bastard.

if one designed a reasonably simple RISC core (though maybe not as bare-bones as the Cell SPEs), the die space per core could be drastically decreased!
_________________
Eat plutonium death you disgusting alien weirdos!
(L) [2007/09/17] [ingenious] [RT07] Wayback!

Hey, and they raised the topic of fixed-point ray tracing at the conference. Let's move on and discuss an ultra-fast massively parallel fixed-point arithmetic processor in the future?  [SMILEY Cool]
(L) [2007/09/17] [davepermen] [RT07] Wayback!

well, we _do_ have some integer units nearly idling.. [SMILEY Smile]
(L) [2007/09/17] [toxie] [RT07] Wayback!

phantom already got the idea at the conference: why not interleave a standard FPU RTRT core and an integer RTRT core? could result in double performance..
(L) [2007/09/17] [davepermen] [RT07] Wayback!

jokes are only good when they have some truth in them, no? [SMILEY Smile]


and actually, we're drifting BACK ONTOPIC!!! [SMILEY Smile]


strange.. [SMILEY Smile]


i myself once thought about using the integer units for all the shading stuff (non-hdr, like 'back in those days' [SMILEY Smile]), and the floating-point units for the intersection stuff..


[SMILEY Smile]
(L) [2007/09/18] [ingenious] [RT07] Wayback!

Indeed, it's always nice to look at things from a different point of view [SMILEY Smile] I love brainstorming  [SMILEY Cool]
(L) [2007/09/18] [davepermen] [RT07] Wayback!

at a certain cost for the aa samples we could start talking about beams again [SMILEY Smile]
(L) [2007/09/18] [lycium] [RT07] Wayback!

... and what happens when you hit specular surfaces with high curvature?
(L) [2007/09/18] [davepermen] [RT07] Wayback!

then you either use several beams, or rays.. that's what i say.. what about using _both_, and weight'n'choose appropriately?
(L) [2007/09/18] [davepermen] [RT07] Wayback!

I'm impressed by how many different off-topic 'on-topic' topics we can have in one topic... [SMILEY Smile]
(L) [2007/09/18] [lycium] [RT07] Wayback!

i haven't even started about your dj'ing and club activities ;)
(L) [2007/09/18] [davepermen] [RT07] Wayback!

well, that would be offtopic offtopic.. [SMILEY Smile]


but we can always talk about that [SMILEY Smile] [LINK http://www.beatcast.ch/ www.beatcast.ch] I'm episode 4 [SMILEY Smile] *shameless plug*


btw, i think the idea I have about frustums/beams around rays to optimize the more-or-less coherent case is more or less covered in this year's vertex_frustum.pdf? I could only scroll-mouse-read it so far [SMILEY Smile]
