Hi all,
I tried out the new Intel compiler (15.0.0 20140723) with a large intrinsics kernel on the MIC. The program runs 20% slower than with icpc 14.0.3 20140422. I analyzed the generated assembler code for a large block of the kernel and attached it for icc14 (block_icc14.s) and icc15 (block_icc15.s). Both programs were compiled with -O3.
There is no big difference between the two assembler files. The number of instructions for
prefetching: icc14 = 105; icc15 = 105
fmadd: icc14 = 63; icc15 = 64
is nearly the same. The order of the arithmetic and align/blend instructions is also mostly the same, but icc15 emits a lot of nop instructions of the form (mov al, al). Why? In total, icc15 generates 350 lines of assembler with 11 nop instructions. Icc14 generates only 333 lines of assembler.
The biggest difference seems to be the order of the prefetch instructions, which is totally different. It also seems to me that something about the prefetches has changed from icc14 to icc15. At least the syntax is different:
vprefetchnta ZMMWORD PTR [2048+r8+r9*4] // icc14
vprefetchnta BYTE PTR [2048+r8+r9*4] // icc15
I insert prefetches by hand using intrinsics. If I remove all my prefetch intrinsics, there is no performance difference between icc14 and icc15. Is there any information on what has changed from icc14 to icc15, especially for MIC intrinsics? The information I found on the Intel website is quite sparse.
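For illustration, a hand-inserted prefetch of this kind looks roughly like the sketch below. This is not the actual kernel: the function name sum_with_prefetch, the plain scalar loop, the 64-byte cache line, and the 2048-byte prefetch distance (borrowed from the displacement in the assembler lines above) are all assumptions chosen to keep the example small. It uses _mm_prefetch with the _MM_HINT_NTA hint, which I would expect to map to vprefetchnta on the MIC.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_NTA

// Hypothetical example, not the real kernel: sums n floats and issues a
// non-temporal prefetch a fixed distance ahead of the current position.
float sum_with_prefetch(const float* data, std::size_t n) {
    // Assumed prefetch distance: 2048 bytes, i.e. 512 floats ahead.
    const std::size_t ahead = 2048 / sizeof(float);
    // Assumed cache line size: 64 bytes, i.e. 16 floats per line.
    const std::size_t floats_per_line = 64 / sizeof(float);

    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        // Issue one prefetch per assumed cache line, and only for
        // addresses that still lie inside the array.
        if (i % floats_per_line == 0 && i + ahead < n) {
            _mm_prefetch(reinterpret_cast<const char*>(data + i + ahead),
                         _MM_HINT_NTA);
        }
        sum += data[i];
    }
    return sum;
}
```

Compiling such a function with both compiler versions and comparing where the vprefetchnta instructions land relative to the loads should make the reordering easy to see in isolation.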
Thanks,
Patrick