Single Vec3f SSE implementation - does it worth it?

[2007/09/17] [ingenious] [Single Vec3f SSE implementation - does it worth it?]

I want to ask anyone who has a deeper understanding of SSE and registers.


I have a Vec3f4 class, which (as the name suggests) represents a collection of four 3-component float vectors. It is implemented internally using the "standard" horizontal SSE memory layout:
[2007/09/17] [Wussie] [Single Vec3f SSE implementation - does it worth it?]

Hmmm, I'm not terribly familiar with the factual differences, but the 'big tutor' Phantom would definitely advise to step up and do the 3x4 approach: use SSE to make calculations with 4 vectors at the same time, stored as x,x,x,x + y,y,y,y + z,z,z,z. I believe he has a paper lying around somewhere explaining some more details; check out the recent rt07 threads on this forum, which should contain some useful info.
[2007/09/18] [lycium] [Single Vec3f SSE implementation - does it worth it?]

you should definitely keep the dummy variable just for alignment reasons even if you don't use sse; furthermore, you should give the compiler some explicit instruction to keep that stuff aligned, the convenient _MM_ALIGN16 does the job.
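A minimal sketch of the padded, aligned layout described above. The type name Vec3f and the helper add are only illustrative, and standard C++11 alignas(16) is used here as a portable stand-in for the _MM_ALIGN16 macro mentioned in the post.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>

// Hypothetical single-vector type: x, y, z in v[0..2] plus one padding
// float in v[3], so the struct is exactly 16 bytes and 16-byte aligned.
struct alignas(16) Vec3f {
    float v[4];
};

static_assert(sizeof(Vec3f) == 16, "Vec3f should fill one SSE register");
static_assert(alignof(Vec3f) == 16, "Vec3f should be 16-byte aligned");

// Component-wise add through SSE. _mm_load_ps and _mm_store_ps require
// 16-byte-aligned addresses, which alignas(16) guarantees for Vec3f.
inline Vec3f add(const Vec3f& a, const Vec3f& b) {
    Vec3f r;
    _mm_store_ps(r.v, _mm_add_ps(_mm_load_ps(a.v), _mm_load_ps(b.v)));
    return r;
}
```

Without the alignment request, the aligned load and store above could fault on an unaligned address, which is why the explicit instruction to the compiler matters.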


however, you won't get the full speedup with dot products etc this way. the best way to fully utilise sse is to go from AOS to SOA format, i.e. keep xxxx yyyy zzzz separate and then have 1 simd op correspond to 1 scalar op. and then, you should take care:


1. to store the constant data you process with in SOA format to avoid shuffling in the processing loop (e.g. matrices should be re-arranged outside the processing loop)

2. to have enough processing that your SOA->AOS shuffling at the end (usually via _MM_TRANSPOSE4_PS) is amortised
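The SOA idea above can be sketched roughly as follows. This is only an illustration: the struct reuses the Vec3f4 name from the first post, while dot4 and load_aos are hypothetical helpers. The transpose is shown on the AOS->SOA side, and the same macro serves for the SOA->AOS conversion at the end.

```cpp
#include <cassert>
#include <xmmintrin.h>

// Four 3-component vectors in SOA form: lane i of x, y and z belongs to
// vector i (xxxx yyyy zzzz).
struct Vec3f4 {
    __m128 x, y, z;
};

// Four dot products at once: three multiplies and two adds, no shuffles.
// Each SIMD op corresponds to exactly one scalar op of a single dot product.
inline __m128 dot4(const Vec3f4& a, const Vec3f4& b) {
    return _mm_add_ps(_mm_add_ps(_mm_mul_ps(a.x, b.x), _mm_mul_ps(a.y, b.y)),
                      _mm_mul_ps(a.z, b.z));
}

// AOS -> SOA: `in` points at four padded vectors laid out as
// x y z pad, x y z pad, ... (16 floats). Unaligned loads are used so the
// sketch makes no assumption about the caller's alignment.
inline Vec3f4 load_aos(const float* in) {
    __m128 r0 = _mm_loadu_ps(in + 0);
    __m128 r1 = _mm_loadu_ps(in + 4);
    __m128 r2 = _mm_loadu_ps(in + 8);
    __m128 r3 = _mm_loadu_ps(in + 12);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);  // r0 = xxxx, r1 = yyyy, r2 = zzzz, r3 = padding
    return Vec3f4{r0, r1, r2};
}
```

The transpose costs several shuffles, so it only pays off when enough SOA work happens between the conversions.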
[2007/09/18] [ingenious] [Single Vec3f SSE implementation - does it worth it?]

Yes, all this applies, but I am asking about the good old single Vec3f here. Does it make sense to implement it with SSE internally or not, given that its components will often be accessed separately and horizontal operations like the dot product don't benefit much from SSE? That's all I'm asking. See the first post of this thread.
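To make the "horizontal operations" concern concrete, here is a rough sketch of a single-vector dot product done both ways. The function names are made up for illustration, and the SSE version assumes the fourth lane of each operand is zero. It uses only SSE1 instructions, so it also applies to CPUs without SSE3.

```cpp
#include <cassert>
#include <xmmintrin.h>

// Scalar reference: three multiplies and two adds.
inline float dot_scalar(const float* a, const float* b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// SSE version for operands of the form (x, y, z, 0). One multiply produces
// all the products, but summing across lanes needs two shuffle+add steps.
inline float dot_sse(__m128 a, __m128 b) {
    __m128 p = _mm_mul_ps(a, b);                    // p0 p1 p2 p3
    __m128 s = _mm_add_ps(p, _mm_movehl_ps(p, p));  // lane0 = p0+p2, lane1 = p1+p3
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 1, 1, 1)));  // lane0 = sum of all
    return _mm_cvtss_f32(s);
}
```

The SSE path saves two multiplies but adds two shuffles and a dependent chain of adds, which is why a speedup over the scalar version is not guaranteed and has to be measured.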
[2007/09/18] [lycium] [Single Vec3f SSE implementation - does it worth it?]

hmm, i'm leaning towards a categorical "no", but "measure and see" is of course a better alternative. (aside: however, i've found sse to be much more useful for my colour class, since there are basically no horizontal ops and there's plenty of chance to use those integer conversion instructions.)


scalar ops are pretty damn fast, and if you're doing lots of dot products etc. then yeah stay away from sse. maybe investigate the sse3 horizontal ops? if you do, please report back here as i've never used them, nor heard any results from people who have except for tbp (who says it's slow).
[2007/09/18] [ingenious] [Single Vec3f SSE implementation - does it worth it?]

Well, of course the measure-and-see tactic is always better, and that's why I asked - maybe someone has tried it before me.


As for SSE3... My old Centrino doesn't support it, unfortunately :( But I think I've heard that _mm_hadd_ps is not the most robust intrinsic available (not sure if it even maps to a compound op)...
[2007/09/18] [ingenious] [Single Vec3f SSE implementation - does it worth it?]

OK, I get it... I guess I'll have to try, since my setting is a bit different.


And by Centrino I meant Pentium M :)
[2007/09/18] [Zakalwe] [Single Vec3f SSE implementation - does it worth it?]

I get no measurable speedup from doing this. I believe it's probably because I don't use Vector3 calls all that much compared to kD-Tree traversals and triangle intersections.
[2007/09/18] [lycium] [Single Vec3f SSE implementation - does it worth it?]

did you try benching it in isolation?
[2007/09/18] [Michael77] [Single Vec3f SSE implementation - does it worth it?]

You should also consider the fact that you will be wasting 25% of memory - which might be ok for something like a ray structure but is totally unacceptable for vertices/normals etc. when dealing with larger scenes (> 1M).
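The 25% figure can be checked with a small sketch. The type and function names are hypothetical, and the count of one million elements is just an example value.

```cpp
#include <cassert>
#include <cstddef>

struct Vec3Packed { float x, y, z; };                   // 12 bytes, no padding
struct alignas(16) Vec3Padded { float x, y, z, pad; };  // 16 bytes, SSE-friendly

static_assert(sizeof(Vec3Packed) == 12, "three floats, tightly packed");
static_assert(sizeof(Vec3Padded) == 16, "three floats plus one float of padding");

// Storage needed for n vectors in each layout.
constexpr std::size_t bytes_packed(std::size_t n) { return n * sizeof(Vec3Packed); }
constexpr std::size_t bytes_padded(std::size_t n) { return n * sizeof(Vec3Padded); }
```

For one million vectors this comes to 12,000,000 bytes packed versus 16,000,000 bytes padded, so 4,000,000 bytes (a quarter of the padded array) hold nothing useful.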
[2007/09/18] [lycium] [Single Vec3f SSE implementation - does it worth it?]

multiplication with scalars won't show improvement, and will probably be slower: _mm_set1_ps is a composite, made of shuffles, so until those new desktop chips with fast shuffling are out, that's going to be slower.
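A small sketch of where that broadcast shows up. Both helper names are hypothetical, and hoisting the broadcast out of a loop is only a suggestion on my part, not something measured in this thread.

```cpp
#include <cassert>
#include <xmmintrin.h>

// v * s: the scalar has to be broadcast to all four lanes before the
// multiply, and that broadcast is the extra cost discussed above.
inline __m128 scale(__m128 v, float s) {
    return _mm_mul_ps(v, _mm_set1_ps(s));
}

// When the same factor is applied to many vectors, broadcasting once
// outside the loop spreads that cost over all iterations.
inline void scale_all(__m128* v, int n, float s) {
    const __m128 ss = _mm_set1_ps(s);
    for (int i = 0; i < n; ++i) {
        v[i] = _mm_mul_ps(v[i], ss);
    }
}
```

For a one-off multiplication there is nothing to amortise, so three scalar multiplies may well be the cheaper option.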
[2007/09/18] [Zakalwe] [Single Vec3f SSE implementation - does it worth it?]

Yeah, I only used SSE for the direct multiplication of two vector types due to the _mm_set1_ps overhead. Speaking of which, I tried storing my splitting coordinate etc. as __m128s rather than incur the shuffle overheads, but the triaccels grew larger and swallowed any computational speedup, since fewer accels fit per cache line.


Swings and roundabouts, as they say :)
[2007/09/18] [goodbyte] [Single Vec3f SSE implementation - does it worth it?]

I have two versions which I call real3 (with sse) and real3_ms (without sse), where ms stands for minimum storage. I use real3 for ordinary calculations and real3_ms when I need to pack my data into fewer bytes. The main benefit of using the sse version is when the source is a transposed 3x4 matrix since your compiler can then keep the values in registers all the way. However I have learned (the hard way) that problems which don't fit the SoA approach are usually computed faster with a single scalar approach.


And also a final note, if you decide to go with the sse approach, make sure you initialize your dummy argument to avoid NaN values, they can really eat all your performance otherwise.
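A sketch of how the two-type split and the padding advice might fit together. The names real3 and real3_ms come from the post above, but the layout and both helpers here are my own assumptions, not goodbyte's actual code.

```cpp
#include <cassert>
#include <xmmintrin.h>

// Minimum-storage form: 12 bytes, used where memory footprint matters.
struct real3_ms {
    float x, y, z;
};

// Compute form: one SSE register. The fourth lane is set to zero
// explicitly, so it never carries uninitialised data (possibly a NaN)
// through later arithmetic.
inline __m128 load_real3(const real3_ms& p) {
    return _mm_set_ps(0.0f, p.z, p.y, p.x);  // lanes: x, y, z, 0
}

// Back to minimum storage: the padding lane is simply dropped.
inline real3_ms store_real3(__m128 v) {
    alignas(16) float t[4];
    _mm_store_ps(t, v);
    return real3_ms{t[0], t[1], t[2]};
}
```

Usage would be to load once, do the arithmetic on __m128 values, and convert back only when the result has to be stored compactly.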
