(L) [2014/05/01] [phkahler] [GCC SIMD vector types] Wayback!I'm taking my first dive into vectorization, using a recent version of GCC. I like their method of defining vector types described here:
[LINK http://jeanjacques.lacrampe.free.fr/webada/doc/gnat/gcc_6.html#SEC160]
This allows you to use ordinary code for basic vector operations, as well as passing vectors as parameters and return types. This raises a lot of questions for me, but for today I want to focus on hardware that does not directly support the types used. Specifically, I'm creating a 32-byte v4df on a machine (Athlon64, circa 2005) that only has SSE2. The docs say GCC will still allow the use of the vector types but will revert to a smaller size internally. I got the impression from somewhere that it will use SSE2 vectors in this case, but the documentation is so sparse that I can't confirm it. I really don't want to break it down myself or convert to v4sf. I just want whatever performance is available with architecture-independent code.
My question is: Will it internally use two 2-element vectors with SSE2, or will it break it all the way down to scalar code since my hardware doesn't support the full 256-bit AVX registers?
(L) [2014/05/02] [Dade] [GCC SIMD vector types] Wayback!>> phkahler wrote: My question is: Will it internally use two 2-element vectors with SSE2, or will it break it all the way down to scalar code since my hardware doesn't support the full 256-bit AVX registers?
You can just get the assembler source output ([LINK http://stackoverflow.com/questions/137038/how-do-you-get-assembler-output-from-c-c-source-in-gcc]) of a very simple test program (i.e. one line of code with a vector operation) and check what kind of instructions GCC emits in your case.
P.S. I'm afraid that GCC will fall back to scalar code in your case.
(L) [2014/05/02] [phkahler] [GCC SIMD vector types] Wayback!Thanks for the tip. I compiled the following, saved as test.c, with -S and -O3:
Code:
typedef v4df __attribute__ ((vector_size (32)));
v4df addem(v4df a, v4df b)
{
    return a+b;
}
And got this for output:
Code:
addem:
.LFB0:
.cfi_startproc
movdqa 8(%rsp), %xmm2
movq %rdi, %rax
movdqa 40(%rsp), %xmm1
paddd %xmm2, %xmm1
movdqa %xmm1, -104(%rsp)
movq -104(%rsp), %rdx
movdqa %xmm1, -88(%rsp)
movq %rdx, (%rdi)
movq -80(%rsp), %rdx
movdqa 24(%rsp), %xmm0
movq %rdx, 8(%rdi)
paddd 56(%rsp), %xmm0
movdqa %xmm0, -72(%rsp)
movq -72(%rsp), %rdx
movq %rdx, 16(%rdi)
movq -64(%rsp), %rdx
movq %rdx, 24(%rdi)
ret
.cfi_endproc
While that seems like an awful lot of instructions for -O3, it is using a lot of SSE, and there are two vector add instructions. I also got the following warnings when I compiled:
 >> test.c: In function ‘addem’:
test.c:4:6: note: The ABI for passing parameters with 32-byte alignment has changed in GCC 4.6
 v4df addem(v4df a, v4df b)
      ^
test.c:4:6: warning: AVX vector argument without AVX enabled changes the ABI [enabled by default]
The vector_size attribute is different from mode, which specifies the internal type to use. Because I specify only the total length instead of a machine type, GCC gets to choose the register size it actually uses. This makes sense because it has a vectorizer: even if it first generated scalar code (the worst kind, with a loop), it would be able to vectorize that for the target. It makes sense that it can do this, I just needed to be sure.
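To make that concrete, here is a minimal sketch of the kind of target-independent code I mean. It assumes the element type is spelled out as double in the typedef and a reasonably recent GCC (for the array-style subscripts). The helper names scale_add and lane_sum are made up for illustration and are not from the GCC docs:

```c
#include <assert.h>

/* Four doubles, 32 bytes total. GCC picks the instructions for the target. */
typedef double v4df __attribute__ ((vector_size (32)));

/* The size is fixed by the attribute, not by what the hardware supports. */
static_assert(sizeof(v4df) == 32, "v4df should be 32 bytes");

/* Hypothetical helper: a*s + b written as ordinary arithmetic on vectors. */
v4df scale_add(v4df a, v4df b, double s)
{
    v4df sv = { s, s, s, s };   /* broadcast the scalar by hand */
    return a * sv + b;
}

/* Hypothetical helper: sum the four lanes using array-style subscripts. */
double lane_sum(v4df v)
{
    return v[0] + v[1] + v[2] + v[3];
}
```

Nothing in the source names SSE2, AVX, or any other instruction set, so in principle the same file should build unchanged for another target. Without AVX enabled it will probably trigger the same ABI warning as above, since vectors are still passed by value.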
So this is vector code that can run on x86, ARM, PPC, or other targets without modification. GCC will make it use whatever vector resources are available on the target. Isn't it about time C and C++ got an official version of this?
(L) [2014/05/03] [phkahler] [GCC SIMD vector types] Wayback!Oops. I left out the double in the typedef. That seems to change the asm but not the conclusion:
Code:
addem:
.LFB0:
.cfi_startproc
movapd 8(%rsp), %xmm2
movq %rdi, %rax
movapd 40(%rsp), %xmm1
addpd %xmm2, %xmm1
movapd 24(%rsp), %xmm0
addpd 56(%rsp), %xmm0
movapd %xmm1, -104(%rsp)
movq -104(%rsp), %rdx
movapd %xmm1, -88(%rsp)
movapd %xmm0, -72(%rsp)
movq %rdx, (%rdi)
movq -80(%rsp), %rdx
movq %rdx, 8(%rdi)
movq -72(%rsp), %rdx
movq %rdx, 16(%rdi)
movq -64(%rsp), %rdx
movq %rdx, 24(%rdi)
ret
.cfi_endproc
But that's a lot of movq for what?
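My reading of the listing, which is an assumption and not something the docs confirm: without AVX enabled, the two 32-byte arguments arrive on the stack and the result is written through a hidden pointer in %rdi (which is also copied to %rax as the return value). The movq pairs would then be copying the result into the caller's return slot 8 bytes at a time. One way to check would be a variant that passes everything through explicit pointers, so no 32-byte vector crosses the call boundary by value. A sketch, with addem_ptr as a made-up name:

```c
#include <assert.h>
#include <stddef.h>

/* Same 32-byte vector of four doubles as in the corrected typedef. */
typedef double v4df __attribute__ ((vector_size (32)));

/* Hypothetical variant of addem: operands and result go through pointers,
   so the by-value vector ABI is not involved in the call. */
void addem_ptr(const v4df *a, const v4df *b, v4df *out)
{
    assert(a != NULL && b != NULL && out != NULL);
    *out = *a + *b;   /* still an ordinary element-wise vector add */
}
```

Compiling this with the same -S -O3 and comparing the two listings should show whether the extra movq traffic goes away. I have not verified what GCC actually emits for it.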