SSE3 detection & _mm_hadd_ps emulation with SSE2
(L) [2008/01/29] [ingenious] [SSE3 detection & _mm_hadd_ps emulation with SSE2] Wayback! Hi guys,
I want to use _mm_hadd_ps in my code but the problem is that my Pentium M only has SSE2. My questions to you are:
1) Is there any predefined constant that I can use to check which instruction set is available at compile time? For example:
Code: [LINK # Select all]
#ifdef SSE3_AVAILABLE
// use _mm_hadd_ps
#else
// don't use it
#endif
2) What is the most efficient way to simulate the behavior of _mm_hadd_ps with SSE2 intrinsics?
EDIT: OK, I can have the build system check for SSE3 and define some macro myself if there's nothing available in the standard headers, but what about emulating _mm_hadd_ps with SSE2? And how fast/slow is _mm_hadd_ps anyway?
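For what it's worth, here is one minimal way the two pieces could fit together (an untested sketch, not from the thread: the HAVE_SSE3 macro and my_hadd_ps helper are placeholder names). GCC defines __SSE3__ when SSE3 code generation is enabled (e.g. with -msse3); MSVC has no equivalent predefined macro, so there the build system would have to define the flag itself. The SSE2 fallback gathers the even and odd lanes with two shuffles and adds them, which matches the _mm_hadd_ps result.
Code: [LINK # Select all]
#if defined(__SSE3__)                /* set by GCC when -msse3 is in effect */
  #include <pmmintrin.h>             /* SSE3: _mm_hadd_ps lives here */
  #define HAVE_SSE3 1                /* placeholder name */
#else
  #include <emmintrin.h>             /* SSE2 only */
  #define HAVE_SSE3 0
#endif

/* placeholder helper: behaves like _mm_hadd_ps(a, b) */
static inline __m128 my_hadd_ps(__m128 a, __m128 b)
{
#if HAVE_SSE3
    return _mm_hadd_ps(a, b);
#else
    __m128 even = _mm_shuffle_ps(a, b, _MM_SHUFFLE(2, 0, 2, 0)); /* a0 a2 b0 b2 */
    __m128 odd  = _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 1, 3, 1)); /* a1 a3 b1 b3 */
    return _mm_add_ps(even, odd);    /* a0+a1, a2+a3, b0+b1, b2+b3 */
#endif
}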
(L) [2008/01/30] [lycium] [SSE3 detection & _mm_hadd_ps emulation with SSE2] Wayback! Horizontal ops are supposedly slower than the normal SIMD ops (I've never measured it myself though, having only recently got a Core 2 Duo). So, if possible, you should rearrange your data into SOA form and work on it like that; this brings the biggest gains, since it both utilises the super fast vertical SSE instructions (single cycle on C2D and K10) and also tends to lead to highly streamed code.
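To make the SOA suggestion concrete, here is a small untested sketch (the dot4_soa name is made up): four vectors stored as separate x/y/z registers are dotted against a single normal using nothing but vertical multiplies and adds, so no horizontal op is needed at all.
Code: [LINK # Select all]
#include <xmmintrin.h>

/* dx, dy, dz hold the x/y/z components of four vectors (SOA layout). */
static inline __m128 dot4_soa(__m128 dx, __m128 dy, __m128 dz,
                              float nx, float ny, float nz)
{
    __m128 r = _mm_mul_ps(dx, _mm_set1_ps(nx));
    r = _mm_add_ps(r, _mm_mul_ps(dy, _mm_set1_ps(ny)));
    r = _mm_add_ps(r, _mm_mul_ps(dz, _mm_set1_ps(nz)));
    return r;                        /* four dot products, one per lane */
}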
(L) [2008/01/30] [tbp] [SSE3 detection & _mm_hadd_ps emulation with SSE2] Wayback! Somewhat redundant remark: it would be more productive to re-examine whatever it is you're trying to achieve that calls for a haddps in the first place.
Besides, given how long it currently takes for any such horizontal op to run, question #2 doesn't make much sense.
[LINK http://mubench.sourceforge.net/results.html]
(L) [2008/01/30] [ingenious] [SSE3 detection & _mm_hadd_ps emulation with SSE2] Wayback! Thanks for the table, tbp.
Actually, what I'm trying to do is a pretty common task - I want to write a 4x4 matrix class which works well both for multiplication with standard 3-component vectors (stored in a __m128) and with a packet of 4 3-component vectors (stored in SOA format - (x, x, x, x), (y, y, y, y), (z, z, z, z)).
I'm wondering how to store the matrix components in SIMD vectors - by rows or by columns. Both have advantages and disadvantages. Any ideas/code you might suggest? 10x!
(L) [2008/01/30] [tbp] [SSE3 detection & _mm_hadd_ps emulation with SSE2] Wayback! I've never bothered to write a decent SIMD matrix class, and while I have marvelous ideas aplenty, the odds of making a fool of myself are real good; so I'd suggest looking into a BLAS package, like [LINK http://math-atlas.sourceforge.net/ ATLAS], to see what they are up to.
Feel free to report back [SMILEY ;)]
(L) [2008/01/30] [Michael77] [SSE3 detection & _mm_hadd_ps emulation with SSE2] Wayback! I am doing the multiplication of a simple transform matrix with a single 3-component vector like this (matrix stored with each column in one SSE register):
Code: [LINK # Select all]
const vec3f v0(1.0f, 2.0f, 3.0f);                  // simple non-SSE vector
const float4 v0x = _mm_load1_ps(&v0.x);            // broadcast x into all lanes
const float4 v0y = _mm_load1_ps(&v0.y);
const float4 v0z = _mm_load1_ps(&v0.z);
const float4 r0x = _mm_mul_ps(mat.v[0], v0x);      // x * column 0
const float4 r0y = _mm_mul_ps(mat.v[1], v0y);      // y * column 1
const float4 r0z = _mm_mul_ps(mat.v[2], v0z);      // z * column 2
const float4 v0term1 = _mm_add_ps(mat.v[3], r0x);  // add the translation column
const float4 v0term2 = _mm_add_ps(r0y, r0z);
const float4 p0 = _mm_add_ps(v0term1, v0term2);    // resulting SSE float with x, y, z, 0
Of course, if there is any more efficient way, I would love to hear it [SMILEY :)]
(L) [2008/01/30] [ingenious] [SSE3 detection & _mm_hadd_ps emulation with SSE2] Wayback! Well, I do the multiplication with a 3-component vector in the same way (I actually use _mm_set1_ps - is _mm_load1_ps faster?), and that is why I store the matrix's components in SIMD vectors by columns.
And I wanted to do the multiplication with a packet of 4 vectors in a similar way, but it turned out that storing the matrix by rows would be better. So I started thinking about how to multiply efficiently with a single 3-component vector if the matrix were stored "horizontally", and I figured out a way to do it with 4 multiplications and 3 hadd_ps's.
How would you compare the above implementation with one that uses 4 multiplications and 3 _mm_hadd_ps's to multiply a 4x4 matrix with a 4-component vector (x, y, z, 1)?
And any ideas about multiplying a 4x4 matrix with a packet of 4 3/4-component vectors?
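For comparison, here is an untested sketch of both row-oriented variants being discussed (the function names and matrix layout are placeholder assumptions, not code from the thread): the single-vector path uses the 4-multiply / 3-hadd arrangement, and the packet path broadcasts each scalar matrix element so it only needs vertical ops.
Code: [LINK # Select all]
#include <pmmintrin.h>   /* _mm_hadd_ps (SSE3) */

/* 4x4 matrix stored by rows, multiplied with v = (x, y, z, 1):
   4 multiplies + 3 horizontal adds. */
static inline __m128 mat_mul_vec_rows(const __m128 row[4], __m128 v)
{
    __m128 p0 = _mm_mul_ps(row[0], v);
    __m128 p1 = _mm_mul_ps(row[1], v);
    __m128 p2 = _mm_mul_ps(row[2], v);
    __m128 p3 = _mm_mul_ps(row[3], v);
    /* (dot(row0,v), dot(row1,v), dot(row2,v), dot(row3,v)) */
    return _mm_hadd_ps(_mm_hadd_ps(p0, p1), _mm_hadd_ps(p2, p3));
}

/* Same matrix (kept here as plain scalars m[row][col]) applied to a packet
   of four points in SOA form; each element is broadcast, so only vertical
   ops are used. The bottom row is ignored, i.e. an affine transform of
   points with w = 1 is assumed. */
static inline void mat_mul_packet(const float m[4][4],
                                  __m128 X, __m128 Y, __m128 Z,
                                  __m128 *outX, __m128 *outY, __m128 *outZ)
{
    *outX = _mm_add_ps(_mm_add_ps(_mm_mul_ps(_mm_set1_ps(m[0][0]), X),
                                  _mm_mul_ps(_mm_set1_ps(m[0][1]), Y)),
                       _mm_add_ps(_mm_mul_ps(_mm_set1_ps(m[0][2]), Z),
                                  _mm_set1_ps(m[0][3])));
    *outY = _mm_add_ps(_mm_add_ps(_mm_mul_ps(_mm_set1_ps(m[1][0]), X),
                                  _mm_mul_ps(_mm_set1_ps(m[1][1]), Y)),
                       _mm_add_ps(_mm_mul_ps(_mm_set1_ps(m[1][2]), Z),
                                  _mm_set1_ps(m[1][3])));
    *outZ = _mm_add_ps(_mm_add_ps(_mm_mul_ps(_mm_set1_ps(m[2][0]), X),
                                  _mm_mul_ps(_mm_set1_ps(m[2][1]), Y)),
                       _mm_add_ps(_mm_mul_ps(_mm_set1_ps(m[2][2]), Z),
                                  _mm_set1_ps(m[2][3])));
}
On an SSE2-only machine the three _mm_hadd_ps calls could be replaced by a shuffle/add fallback like the one sketched earlier, at which point the packet/SOA path is almost certainly the faster of the two.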