(L) [2006/09/13] [Michael77] [SSE2 Instruction cycles?]
Hi,
Does anyone know of a reference listing how many CPU cycles each SSE instruction needs (best case, of course)? I'm currently wondering whether it makes any difference to use _mm_load1_ps or _mm_set_ps1 to set a single float value in all four elements of a register.
Michael
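For reference, here is a minimal sketch of the two intrinsics being compared. Both are declared in `<xmmintrin.h>` and both broadcast one float into all four lanes; `_mm_load1_ps` takes a pointer, `_mm_set_ps1` takes a value. The wrapper names (`splat_from_memory`, `splat_from_value`, `lane`) are hypothetical, chosen only for illustration, and the comments describe typical code generation, not a guarantee for any particular compiler.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_load1_ps, _mm_set_ps1, _mm_storeu_ps */

/* Broadcast from memory: on plain SSE/SSE2 this is typically a scalar
   load (movss) followed by a shuffle (shufps). */
static __m128 splat_from_memory(const float *p)
{
    return _mm_load1_ps(p);
}

/* Broadcast from a value: a composite intrinsic, so the compiler is free
   to pick the instruction sequence (shuffle in registers, spill and reload, ...). */
static __m128 splat_from_value(float x)
{
    return _mm_set_ps1(x);
}

/* Hypothetical helper: read lane i back out of a vector for checking. */
static float lane(__m128 v, int i)
{
    float out[4];
    assert(i >= 0 && i < 4);
    _mm_storeu_ps(out, v);  /* unaligned store, so out needs no special alignment */
    return out[i];
}
```

Both functions produce the same vector; any difference is in the instructions the compiler emits, which is why the answer below points to disassembly.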
(L) [2006/09/13] [tbp] [SSE2 Instruction cycles?]
A full load is faster, if only because of reduced dependencies and load-execute mechanisms. But then, if that read is postponed for some reason...
Plus, you can't say exactly what will be generated for a given _mm_set_ps1, because it's a composite intrinsic and compilers can do as they please (e.g. writing the full vector to memory beforehand and loading it).
Sooooo... if you want to scrutinize what's happening at this level, disassemble.
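One reading of "a full load" is loading a 16-byte vector that already holds the value in all four lanes, so no shuffle is needed at the point of use. A minimal sketch under that assumption: the `splat4` type and its helper functions are hypothetical, not from the thread. To see what your compiler actually emits for this or for the intrinsics above, compile with e.g. `gcc -O2 -S` and read the `.s` file, or run `objdump -d` on the object file.

```c
#include <assert.h>
#include <stdint.h>     /* uintptr_t */
#include <xmmintrin.h>  /* _mm_load_ps */

/* Hypothetical type: one float replicated into four 16-byte-aligned lanes. */
typedef struct {
    _Alignas(16) float v[4];
} splat4;

/* Done once, ahead of time (e.g. outside a hot loop). */
static void splat4_init(splat4 *s, float x)
{
    for (int i = 0; i < 4; ++i)
        s->v[i] = x;
}

/* At use time: a single aligned 128-bit load (movaps), no shuffle. */
static __m128 splat4_load(const splat4 *s)
{
    /* _mm_load_ps requires a 16-byte-aligned address. */
    assert(((uintptr_t)s->v & 15u) == 0);
    return _mm_load_ps(s->v);
}
```

The trade-off is 16 bytes of memory per constant instead of 4, in exchange for a shorter dependency chain when the value is needed.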
[LINK http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF]
[LINK http://www.agner.org/optimize/]
No direct link to Intel's docs; they keep moving them around.