GCC SIMD vector types


[2014/05/01] [phkahler] [GCC SIMD vector types]

I'm taking my first dive into vectorization, using a recent version of GCC. I like their method of defining vector types described here:
[LINK http://jeanjacques.lacrampe.free.fr/webada/doc/gnat/gcc_6.html#SEC160]
This lets you write basic vector operations with ordinary operators, and pass vectors as parameters and return values. This raises a lot of questions for me, but for today I want to focus on hardware that does not directly support the types used. Specifically, I'm creating a 32-byte v4df on a machine (Athlon64, circa 2005) that only has SSE2. The docs say GCC will still allow the vector types but will fall back to a smaller size internally. I got the impression from somewhere that it will use SSE2 vectors in this case, but the documentation is so sparse that I can't confirm it. I really don't want to break it down myself or convert to v4sf; I just want whatever performance is available from architecture-independent code.
My question is: Will it internally use two 2-element vectors with SSE2, or will it break it all the way down to scalar code, since my hardware doesn't support the full 256-bit AVX registers?
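For readers unfamiliar with the extension, here is a minimal sketch of the kind of code being described. It assumes GCC's vector_size attribute with a double element type; the helper names scale_add and hsum are made up for illustration and do not come from the thread.

```c
#include <assert.h>   /* static_assert (C11) */

/* Four doubles, 32 bytes in total; GCC chooses the instructions for the target. */
typedef double v4df __attribute__ ((vector_size (32)));

static_assert(sizeof(v4df) == 4 * sizeof(double), "v4df holds four doubles");

/* Ordinary operators work element-wise; vectors are passed and returned by value. */
static inline v4df scale_add(v4df a, v4df b, double s)
{
    v4df sv = { s, s, s, s };   /* broadcast the scalar by hand */
    return a * sv + b;
}

/* Individual lanes can be read with array-style subscripts. */
static inline double hsum(v4df v)
{
    return v[0] + v[1] + v[2] + v[3];
}
```

Nothing here names SSE2 or AVX, so which instructions come out is left to the compiler and the target flags, which is exactly what the question is about.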
[2014/05/02] [Dade] [GCC SIMD vector types]

>> phkahler wrote: My question is: Will it internally use two 2-element vectors with SSE2, or will it break it all the way down to scalar code, since my hardware doesn't support the full 256-bit AVX registers?
You can just get the assembler source output ([LINK http://stackoverflow.com/questions/137038/how-do-you-get-assembler-output-from-c-c-source-in-gcc]) of a very simple test program (i.e. one line of code with a vector operation) and check what kind of instructions GCC emits in your case.
P.S. I'm afraid that GCC will fall back to scalar code in your case.
[2014/05/02] [phkahler] [GCC SIMD vector types]

Thanks for the tip. I put the following in test.c and compiled it with -S and -O3:
Code:
typedef v4df __attribute__ ((vector_size (32)));
v4df addem(v4df a, v4df b)
{
   return a+b;
}
And got this for output:
Code:
addem:
.LFB0:
   .cfi_startproc
   movdqa   8(%rsp), %xmm2
   movq   %rdi, %rax
   movdqa   40(%rsp), %xmm1
   paddd   %xmm2, %xmm1
   movdqa   %xmm1, -104(%rsp)
   movq   -104(%rsp), %rdx
   movdqa   %xmm1, -88(%rsp)
   movq   %rdx, (%rdi)
   movq   -80(%rsp), %rdx
   movdqa   24(%rsp), %xmm0
   movq   %rdx, 8(%rdi)
   paddd   56(%rsp), %xmm0
   movdqa   %xmm0, -72(%rsp)
   movq   -72(%rsp), %rdx
   movq   %rdx, 16(%rdi)
   movq   -64(%rsp), %rdx
   movq   %rdx, 24(%rdi)
   ret
   .cfi_endproc
While that seems like an awful lot of instructions for -O3, it is using a lot of SSE, and there are two vector add instructions. I also got the following warnings when I compiled:
 >> test.c: In function ‘addem’:
test.c:4:6: note: The ABI for passing parameters with 32-byte alignment has changed in GCC 4.6
 v4df addem(v4df a, v4df b)
      ^
test.c:4:6: warning: AVX vector argument without AVX enabled changes the ABI [enabled by default]

The "vector_size" attribute is different from "mode", which specifies the internal type to use. By specifying the length instead of the type, you leave GCC free to choose the size it actually uses. This makes sense because GCC has a vectorizer: even if it did generate scalar code (the worst kind, with a loop), it would be able to vectorize that for the target. It makes sense that it can do this; I just needed to be sure.
So this gives you vector code that can run on x86, ARM, PPC, or other targets without modification. GCC will make it use whatever vector resources are available on the target. Isn't it about time C and C++ got an official version of this?
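As an illustration of the portability point, here is a small sketch of a loop written only with the generic vector type. The function name axpy and the memcpy-based loads and stores are assumptions made for this example, not something from the thread; it uses no intrinsics or target-specific headers, so the same source should build for any target GCC supports.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* size_t */
#include <string.h>   /* memcpy */

typedef double v4df __attribute__ ((vector_size (32)));

static_assert(sizeof(v4df) == 32, "v4df is 32 bytes");

/* y[i] += a * x[i] for n doubles, four lanes at a time. */
static void axpy(double a, const double *x, double *y, size_t n)
{
    const v4df av = { a, a, a, a };
    size_t i = 0;

    /* Main loop. memcpy avoids assuming that x and y are 32-byte aligned. */
    for (; i + 4 <= n; i += 4) {
        v4df xv, yv;
        memcpy(&xv, x + i, sizeof xv);
        memcpy(&yv, y + i, sizeof yv);
        yv += av * xv;
        memcpy(y + i, &yv, sizeof yv);
    }

    /* Scalar tail for the leftover elements. */
    for (; i < n; ++i)
        y[i] += a * x[i];
}
```

Whether each 4-wide operation becomes one instruction, two narrower ones, or scalar code still depends on the target and flags, so checking the -S output as above remains the way to be sure.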
[2014/05/03] [phkahler] [GCC SIMD vector types]

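Oops. I left out the double in the typedef; it should read typedef double v4df __attribute__ ((vector_size (32)));. Without it the element type presumably defaulted to int, hence the integer paddd in the first listing. That seems to change the asm but not the conclusion:
Code:
addem: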
.LFB0:
   .cfi_startproc
   movapd   8(%rsp), %xmm2
   movq   %rdi, %rax
   movapd   40(%rsp), %xmm1
   addpd   %xmm2, %xmm1
   movapd   24(%rsp), %xmm0
   addpd   56(%rsp), %xmm0
   movapd   %xmm1, -104(%rsp)
   movq   -104(%rsp), %rdx
   movapd   %xmm1, -88(%rsp)
   movapd   %xmm0, -72(%rsp)
   movq   %rdx, (%rdi)
   movq   -80(%rsp), %rdx
   movq   %rdx, 8(%rdi)
   movq   -72(%rsp), %rdx
   movq   %rdx, 16(%rdi)
   movq   -64(%rsp), %rdx
   movq   %rdx, 24(%rdi)
   ret
   .cfi_endproc

But what is all that movq for?
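One plausible reading of the listing, offered as a guess rather than a confirmed answer: without AVX the 32-byte result cannot come back in a register, so it is returned through memory via the pointer in %rdi (copied to %rax), and the movq pairs look like the result being copied to that slot 8 bytes at a time. That would also fit the ABI warning in the earlier post. A sketch of a variant that avoids by-value vector arguments and return values follows; the name addem_ptr is hypothetical, and the output was not checked on the GCC version used in the thread.

```c
#include <assert.h>   /* static_assert (C11) */

typedef double v4df __attribute__ ((vector_size (32)));

static_assert(sizeof(v4df) == 32, "v4df is 32 bytes");

/* Hypothetical variant of addem: operands and result are passed by pointer,
   so no 32-byte vector crosses the call boundary by value.
   Assumes all three pointers are 32-byte aligned, as v4df objects are. */
void addem_ptr(const v4df *a, const v4df *b, v4df *out)
{
    *out = *a + *b;
}
```

Marking small helpers like this static inline is another option, since an inlined call has no calling convention to satisfy.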
