Dear all,
I need to perform a number of 64-bit integer multiplications.
On the Xeon CPU, scalar multiplication is handled by the mulq assembly instruction: it multiplies two unsigned 64-bit integers and stores the 128-bit product in two 64-bit registers (RDX:RAX), one holding the low 64 bits and the other the high 64 bits. I want to perform the same calculation on the Xeon Phi and vectorize it.
I looked through the Xeon Phi Coprocessor Instruction Set Reference Manual and found the vector instructions VPMULHD and VPMULLD (and the equivalent intrinsics). However, they operate only on 32-bit integers, and I need two instructions (VPMULHD and VPMULLD) to get the high and low halves of each 64-bit product separately. This looks less efficient for 64-bit multiplication than mulq would be, if mulq could be vectorized on the Phi.
My questions are:
1. Is there an equivalent of mulq on the Xeon Phi, so that I can vectorize the operations?
2. If I have to use VPMULHD and VPMULLD for 64-bit integer multiplication, I first need them to multiply pairs of 32-bit integers. Each of them multiplies two 32-bit integers but stores only one half of the 64-bit result: VPMULHD the high half and VPMULLD the low half. So every 32-bit multiplication takes two instructions. Is there a way to perform a vectorized 32-bit multiplication that yields the full 64-bit product in a single instruction?
Thanks very much for the help.