Channel: Intel® Many Integrated Core Architecture

Missing compiler prefetches in intrinsics code with linear memory access


Hi all,

The Intel compiler 14.0.3 does not insert software prefetches for the following test program, which accesses memory linearly:

#include <iostream>
#include <immintrin.h>

int main() {

    const int elements = 1e7;

    const int mem_size = 16 * elements * sizeof(float); // 640 MB

    float *vec_a = (float*)_mm_malloc( mem_size, 64 );
    float *vec_b = (float*)_mm_malloc( mem_size, 64 );

    // initialization
    for ( int i = 0; i < 16*elements ; ++i ) {

        vec_a[i] = 0.8f;
        vec_b[i] = 0.6f;
    }

    #pragma omp parallel
    {
        const __m512 mass_ = _mm512_set1_ps( 0.123f );

        __m512 vec_a_, vec_b_;

        #pragma omp for schedule(static)
        for ( int i = 0; i < 16*elements ; i += 16 ) {

            vec_a_ = _mm512_load_ps( vec_a + i );
            vec_b_ = _mm512_load_ps( vec_b + i );

            vec_a_ = _mm512_fmadd_ps( mass_, vec_a_, vec_b_ );

            _mm512_storenrngo_ps( vec_b + i, vec_a_ );
        }
    }

    // prevent dead-code elimination of the loop above
    float delta = 0.0f;

    for ( int i = 0; i < 16*elements ; ++i ) {

        delta += vec_b[i];
    }

    std::cout << delta << std::endl;

    _mm_free( vec_a );
    _mm_free( vec_b );
}

The compiler generates the following assembly for the inner loop (icpc -O3 -mmic -openmp -S -masm=intel linear.cpp):

..B1.36:
             mov       r8, QWORD PTR [r13]
             add       rcx, 16
             vmovaps   zmm0, ZMMWORD PTR [r8+rax]
             mov       dl, dl
             mov       r9, QWORD PTR [r14]
             vfmadd213ps zmm0, zmm1, ZMMWORD PTR [r9+rax]
             vmovnrngoaps ZMMWORD PTR [r9+rax], zmm0
             add       rax, 64
             cmp       rcx, rdx
             jle       ..B1.36  

So there are no software prefetches. Of course, I could insert prefetch intrinsics myself, but I would expect the compiler to do a much better job of that for a linear memory access pattern. I tried #pragma prefetch and -opt-prefetch=4, both without success. It seems to be a compiler problem, because the Intel compiler 15.0b does insert prefetch instructions.
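For reference, this is roughly what I mean by inserting the prefetches by hand. It is only a minimal sketch, not my real code: the loop is a scalar stand-in for the intrinsics loop above, the helper name fma_with_prefetch and the prefetch_distance parameter are made up for illustration, and it assumes a GCC-compatible compiler that provides __builtin_prefetch. In the actual intrinsics code the corresponding call would be _mm_prefetch with a hint such as _MM_HINT_T0, and a good distance would have to be found by measurement.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of the test program above): computes
// b[i] = mass * a[i] + b[i] while manually prefetching ahead of the loop.
// prefetch_distance is an assumed tuning knob, given in elements.
inline void fma_with_prefetch(const float* a, float* b, std::size_t n,
                              float mass, std::size_t prefetch_distance) {
    constexpr std::size_t kLine = 16;  // 16 floats = one 64-byte cache line
    assert(n % kLine == 0);            // the sketch handles whole lines only

    for (std::size_t i = 0; i < n; i += kLine) {
        // Request the cache lines that will be used prefetch_distance
        // elements from now; skip the request near the end of the arrays.
        if (i + prefetch_distance < n) {
            __builtin_prefetch(a + i + prefetch_distance, 0, 3);  // for reading
            __builtin_prefetch(b + i + prefetch_distance, 1, 3);  // for writing
        }
        // One cache line of work; this is what a single 512-bit
        // load / fmadd / store sequence does in the intrinsics version.
        for (std::size_t j = 0; j < kLine; ++j) {
            b[i + j] = mass * a[i + j] + b[i + j];
        }
    }
}
```

A call would look like fma_with_prefetch(vec_a, vec_b, 16 * elements, 0.123f, 256), where 256 is just a placeholder distance. Having to tune that value by hand for every loop is exactly why I would prefer the compiler to generate the prefetches.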

However, the current 15.0b compiler generates code that is 30% slower for my larger program.

So my question is: How can I force the 14.0 compiler to insert software prefetches for linear intrinsics code?

 

Thanks,

Patrick

