LATER: My mistake. I thought the vpshaq instruction took a single shift count and applied it to both the low and high 64-bit elements it operates on (similar to instructions that take an 8-bit immediate). Instead, vpshaq takes one 64-bit shift-count field for each 64-bit value it shifts, and my code only supplied one. The other count was zero (vmovsd from memory clears the upper 64 bits of the destination register), so the upper 64 bits of the source register were not shifted. Doh! Nice instruction, that vpshaq... but unfortunately I just noticed that Intel doesn't support it (it belongs to AMD's XOP extension). Bummer. I will leave the message posted in case anyone else has the same problem with vpshaq.
I am writing a function library in 64-bit assembly language, and the vpshaq instruction appears to not be working correctly. Either that or the documentation is wrong.
What happens is this: the low 128 bits of the destination register are written, but only the low 64 bits of the source register are shifted by the instruction. The upper 64 bits of the source register are not shifted; they are simply passed through to the destination register unmodified.
s64_m63: .quad -63, -63 # need two 64-bit shift-counts here!!!
vmovapd (%rsi), %xmm0 # ymm00.01 = arg1 : ymm00.23 = zero
vmovsd s64_m63, %xmm3 # xmm03.0 = -63 (vmovsd loads only 64 bits, so xmm03.1 = 0)
vpshaq %xmm3, %xmm0, %xmm1 # xmm01.01 = msbit of arg1
In a typical test, the first instruction loads 0x55555555555555555555555555555555 into register xmm0. That's two 64-bit values, each equal to 0x5555555555555555. The second instruction loads a shift count into register xmm3. My original purpose was to shift right by 63 bits, filling each 64-bit element of the destination register with all 0 bits or all 1 bits depending on the most significant bit of the corresponding source value. But I've tried many other shift counts from -1 to -63, and some positive shift counts too (which shift left instead of right).
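To make the intended result concrete, here is a minimal scalar sketch in C of what a right shift by 63 with sign extension should produce for one 64-bit element. The function name sign_mask64 is made up for illustration, and this is only a reference value to compare against, not the instruction itself.

```c
#include <stdint.h>

/* Reference result for an arithmetic right shift by 63 of one
   64-bit element: all ones if bit 63 is set, all zeros otherwise.
   sign_mask64 is a hypothetical helper name, not a library call. */
static uint64_t sign_mask64(uint64_t x)
{
    /* x >> 63 is 0 or 1; subtracting it from 0 gives 0 or all ones. */
    return (uint64_t)0 - (x >> 63);
}
```

For the test value used here, sign_mask64(0x5555555555555555) is 0, because bit 63 is clear.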
In every case, when my code executes the vpshaq instruction, the low 64 bits of the source register (xmm0) end up in the low 64 bits of the destination register (xmm1) shifted as expected, but the next higher 64 bits of the destination register (xmm1) end up containing the original, unshifted contents of the source register (xmm0).
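As the LATER note at the top says, this behaviour is what you get when each 64-bit element takes its own shift count and the upper count happens to be zero. The following is a rough scalar model in C, not the real instruction: the names shaq_lane and shaq128 are invented for illustration, the count is assumed to be the signed low byte of the matching count element, and only counts from -63 to 63 are modeled (anything else trips an assert rather than guessing at hardware behaviour).

```c
#include <assert.h>
#include <stdint.h>

/* Model of one 64-bit lane (hypothetical helper, not an intrinsic).
   Assumption: the count is the signed low byte of the count element. */
static uint64_t shaq_lane(uint64_t value, uint64_t count_elem)
{
    int count = (int)(count_elem & 0xff);   /* low byte of the count lane */
    if (count > 127)
        count -= 256;                       /* reinterpret as a signed byte */

    assert(count >= -63 && count <= 63);    /* larger counts are not modeled */

    if (count >= 0)
        return value << count;              /* positive count: shift left */

    int n = -count;                         /* negative count: shift right */
    if (value >> 63)
        return ~(~value >> n);              /* sign bit set: shift in ones */
    return value >> n;                      /* sign bit clear: shift in zeros */
}

/* Model of the 128-bit form: lane i is shifted by the count in lane i. */
static void shaq128(const uint64_t src[2], const uint64_t cnt[2],
                    uint64_t dst[2])
{
    for (int i = 0; i < 2; i++)
        dst[i] = shaq_lane(src[i], cnt[i]);
}
```

Feeding this model src = {0x5555555555555555, 0x5555555555555555} with cnt = {-63, 0} gives dst = {0, 0x5555555555555555}: the upper lane comes back untouched, which matches the behaviour described above. With cnt = {-63, -63} both lanes become 0.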
This is not what the documentation says, and not what would normally be expected.
I thought the gcc assembler might be assembling the instruction incorrectly, and I suppose that is still a possibility. However, offhand I do not know of any other instruction that shifts right by the number of places specified in the count register and performs the sign extension that does in fact occur (on the low 64 bits). So it appears more likely that the instruction itself isn't working properly.
Can anyone verify this for me?
Who do I need to report this to, and how do I do that?
I am compiling on 64-bit Ubuntu Linux with up-to-date gcc tools. I develop and debug my code with the Code::Blocks IDE (which invokes standard tools like gdb).
For anyone used to Intel syntax, note that the operand order in this assembler (AT&T syntax) is reversed, so the source operands come first and the destination register comes last on each line of assembly language.
My CPU is an AMD FX-8150 (Bulldozer).
I have tried many other values in the lower and upper 64-bit portions of the source xmm register (as well as different shift counts), but in every case only the low 64 bits are shifted and the upper 64 bits are left unmodified.
I am familiar with the SIMD instruction set, and have written many functions with AVX, FMA and other advanced instruction sets that work with the xmm and ymm registers, so while I may be doing something stupid here, I would normally be able to recognize it myself. This is, however, the first time I put the vpshaq instruction in my code.