Channel: Intel® Many Integrated Core Architecture

Intrinsic bad performance


Hi. I am writing an application for Intel MIC which does a stencil computation (5-point stencil) on a 2D matrix, and I would like to achieve good performance. I wrote the code so that the 4 HW threads running on the same core work on data around the same L2 cache, in order to reduce cache misses. When I ran the application, the parallel version of the algorithm without SIMD was close to 230 times faster than the serial version (this is how I measure performance). When I added intrinsics to the code, I expected the parallel algorithm with SIMD to be significantly faster, but the version with SIMD was slower than the version without SIMD (nearly 187 times faster than the serial version).

I calculate the stencil using intrinsics in this way (the loop below runs for each row i):

for(int j=8; j<n_real-8; j+=8)
{
   __m512d v_c = _mm512_load_pd(&mIn[i * n_real + j]);
   __m512d v_u = _mm512_load_pd(&mIn[(i - 1) * n_real + j]);
   __m512d v_d = _mm512_load_pd(&mIn[(i + 1) * n_real + j]);
   __m512d v_l = _mm512_loadu_pd(&mIn[i * n_real + (j - 1)]);
   __m512d v_r = _mm512_loadu_pd(&mIn[i * n_real + (j + 1)]);

   __m512d v_max = _mm512_max_pd(v_c, v_u);
   v_max = _mm512_max_pd(v_max, v_d);
   v_max = _mm512_max_pd(v_max, v_l);
   v_max = _mm512_max_pd(v_max, v_r);

   _mm512_storeu_pd(&mOut[i * n_real + j], v_max);
}
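For comparison, the same computation can be written as a plain scalar loop, which is useful for checking that the intrinsic version produces the same output. This is only a sketch: the function name stencil_max_scalar and the helper max2 are hypothetical, and it assumes that the first and last rows are halo rows (the post only states that the first and last eight columns are halo) and that the data contains no NaN values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: larger of two doubles (NaN handling not considered). */
static double max2(double a, double b) { return a > b ? a : b; }

/* Scalar reference for one sweep of the 5-point max stencil.
 * Layout taken from the post: row-major, m_real rows of n_real doubles,
 * n_real a multiple of 8, first and last 8 columns are halo.
 * Assumption: rows 0 and m_real - 1 are halo rows and are not written. */
void stencil_max_scalar(const double *mIn, double *mOut, int m_real, int n_real)
{
    assert(m_real >= 3 && n_real >= 24 && n_real % 8 == 0);
    const size_t stride = (size_t)n_real;

    for (int i = 1; i < m_real - 1; ++i) {
        for (int j = 8; j < n_real - 8; ++j) {
            const size_t c = (size_t)i * stride + (size_t)j;
            double v = mIn[c];              /* centre */
            v = max2(v, mIn[c - stride]);   /* up     */
            v = max2(v, mIn[c + stride]);   /* down   */
            v = max2(v, mIn[c - 1]);        /* left   */
            v = max2(v, mIn[c + 1]);        /* right  */
            mOut[c] = v;
        }
    }
}
```

Running both versions on the same input and comparing mOut element by element confirms that any difference between them is in speed only.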

The matrix is created in this way:

double* matrix = (double*)_mm_malloc(m_real*n_real*sizeof(double), 64);

where m_real is the row count and n_real is the row size, which is a multiple of 8.

In my code I start the calculation from j=8, because the first eight elements of each row are "halo elements", just like the last eight elements (one DP vector each).
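With this layout (64-byte aligned base, n_real a multiple of 8, j stepping by 8), the centre, up, and down loads and the store all start on a 64-byte boundary, while the left and right loads at j - 1 and j + 1 never do, so each of those spans two cache lines. The sketch below checks this property numerically. The names is_aligned64, alloc_matrix and count_aligned_starts are hypothetical, and C11 aligned_alloc is used as a portable stand-in for _mm_malloc.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if p lies on a 64-byte boundary (one 512-bit vector). */
int is_aligned64(const void *p) { return ((uintptr_t)p % 64) == 0; }

/* Allocates an m_real x n_real matrix of doubles on a 64-byte boundary.
 * Assumption: aligned_alloc stands in for _mm_malloc; release with free(). */
double *alloc_matrix(int m_real, int n_real)
{
    assert(m_real > 0 && n_real > 0 && n_real % 8 == 0);
    size_t bytes = (size_t)m_real * (size_t)n_real * sizeof(double);
    return aligned_alloc(64, bytes);
}

/* Counts the vector start columns j = 8, 16, ..., n_real - 16 in row i
 * whose address is 64-byte aligned after shifting by `shift` elements
 * (0 for centre/up/down and the store, -1 for left, +1 for right). */
int count_aligned_starts(const double *m, int n_real, int i, int shift)
{
    int count = 0;
    for (int j = 8; j < n_real - 8; j += 8) {
        const double *p = m + (size_t)i * (size_t)n_real + (size_t)(j + shift);
        count += is_aligned64(p);
    }
    return count;
}
```

For a matrix with n_real = 32, row 1 has two vector starts (j = 8 and j = 16); count_aligned_starts reports both as aligned for shift 0 and neither for shift -1 or +1.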

Could you explain where the problem is, and how I can resolve it? Regards.

 

