Hi all,
The Intel compiler 14.0.3 does not insert software prefetches for the following linear test program:
    #include <iostream>
    #include <immintrin.h>

    int main()
    {
        const int elements = 1e7;
        const int mem_size = 16 * elements * sizeof(float); // 640 MB

        float *vec_a = (float*)_mm_malloc( mem_size, 64 );
        float *vec_b = (float*)_mm_malloc( mem_size, 64 );

        // initialization
        for ( int i = 0; i < 16*elements ; ++i )
        {
            vec_a[i] = 0.8f;
            vec_b[i] = 0.6f;
        }

        #pragma omp parallel
        {
            const __m512 mass_ = _mm512_set1_ps( 0.123f );
            __m512 vec_a_, vec_b_;

            #pragma omp for schedule(static)
            for ( int i = 0; i < 16*elements ; i += 16 )
            {
                vec_a_ = _mm512_load_ps( vec_a + i );
                vec_b_ = _mm512_load_ps( vec_b + i );
                vec_a_ = _mm512_fmadd_ps( mass_, vec_a_, vec_b_ );
                _mm512_storenrngo_ps( vec_b + i, vec_a_ );
            }
        }

        // prevent deadcode optimizations
        float delta = 0.0f;
        for ( int i = 0; i < 16*elements ; ++i )
        {
            delta += vec_b[i];
        }
        std::cout << delta << std::endl;

        _mm_free( vec_a );
        _mm_free( vec_b );
    }
The compiler generates the following assembly for the vectorized loop (icpc -O3 -mmic -openmp -S -masm=intel linear.cpp):
    ..B1.36:
            mov           r8, QWORD PTR [r13]
            add           rcx, 16
            vmovaps       zmm0, ZMMWORD PTR [r8+rax]
            mov           dl, dl
            mov           r9, QWORD PTR [r14]
            vfmadd213ps   zmm0, zmm1, ZMMWORD PTR [r9+rax]
            vmovnrngoaps  ZMMWORD PTR [r9+rax], zmm0
            add           rax, 64
            cmp           rcx, rdx
            jle           ..B1.36
So there are no software prefetches in the loop. Of course, I could insert prefetch intrinsics by hand, but I would expect the compiler to do a much better job of that for a linear memory access pattern. I tried #pragma prefetch and -opt-prefetch=4, both without success. It seems to be a compiler problem, because the Intel compiler 15.0b does insert prefetch instructions for the same code.
However, the current 15.0b compiler generates code that is 30% slower for my bigger program, so switching compilers is not an option for me right now.
So my question is: How can I force the 14.0 compiler to insert software prefetches for linear intrinsics code?
Thanks,
Patrick