Poor performance with intrinsics

Hi. I am writing an application for Intel MIC that performs a stencil computation (5-point stencil) on a 2D matrix, and I would like to achieve good performance. I wrote the code so that the 4 HW threads running on the same core work on data that shares the same L2 cache, in order to reduce cache misses. The parallel version of the algorithm without SIMD ran about 230 times faster than the serial version (this is how I measure performance). When I added intrinsics to the code, I expected the parallel version with SIMD to be significantly faster, but it was slower than the version without SIMD: only about 187 times faster than the serial version.
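To make the two figures directly comparable, the arithmetic can be sketched as below. This is only an illustration, assuming both speedups are measured against the same serial run; the function names speedup and relative_time are hypothetical.

```c
#include <assert.h>

/* Speedup as used in the post: serial run time divided by parallel run time. */
double speedup(double t_serial, double t_parallel)
{
    assert(t_serial > 0.0 && t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Run time of version A relative to version B, given the speedup of each
   over the same serial baseline: t_A / t_B = S_B / S_A. */
double relative_time(double speedup_a, double speedup_b)
{
    assert(speedup_a > 0.0 && speedup_b > 0.0);
    return speedup_b / speedup_a;
}
```

With the numbers above, relative_time(187.0, 230.0) is 230/187, so the SIMD version takes roughly 1.23 times as long as the version without SIMD.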

I calculate the stencil using intrinsics in this way:

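/* i is the current row, set by the enclosing loop (not shown) */
for (int j = 8; j < n_real - 8; j += 8)
{
   /* centre, upper and lower neighbours: 64-byte aligned loads */
   __m512d v_c = _mm512_load_pd(&mIn[i * n_real + j]);
   __m512d v_u = _mm512_load_pd(&mIn[(i - 1) * n_real + j]);
   __m512d v_d = _mm512_load_pd(&mIn[(i + 1) * n_real + j]);
   /* left and right neighbours: shifted by one element, so unaligned loads */
   __m512d v_l = _mm512_loadu_pd(&mIn[i * n_real + (j - 1)]);
   __m512d v_r = _mm512_loadu_pd(&mIn[i * n_real + (j + 1)]);

   /* element-wise maximum over the five stencil points */
   __m512d v_max = _mm512_max_pd(v_c, v_u);
   v_max = _mm512_max_pd(v_max, v_d);
   v_max = _mm512_max_pd(v_max, v_l);
   v_max = _mm512_max_pd(v_max, v_r);

   _mm512_storeu_pd(&mOut[i * n_real + j], v_max);
}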
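For checking the SIMD loop's output before tuning it, a scalar reference of the same row update can be useful. The following is a minimal sketch, assuming the row-major layout and the 8-column halo described in this post; the names stencil_max_row_ref and max2 are hypothetical and not part of the original code.

```c
#include <assert.h>
#include <stddef.h>

/* Larger of two doubles (NaN handling is ignored in this sketch). */
static double max2(double a, double b) { return a > b ? a : b; }

/* Scalar reference for one interior row i of the 5-point max stencil.
   Assumes row-major storage with n_real doubles per row and a halo of
   8 columns at each end of the row, matching the SIMD loop bounds. */
void stencil_max_row_ref(const double *mIn, double *mOut, int i, int n_real)
{
    assert(i >= 1 && n_real >= 16);
    size_t n = (size_t)n_real;
    for (int j = 8; j < n_real - 8; ++j) {
        size_t c = (size_t)i * n + (size_t)j;  /* index of the centre point */
        double m = mIn[c];
        m = max2(m, mIn[c - n]);  /* upper neighbour */
        m = max2(m, mIn[c + n]);  /* lower neighbour */
        m = max2(m, mIn[c - 1]);  /* left neighbour  */
        m = max2(m, mIn[c + 1]);  /* right neighbour */
        mOut[c] = m;
    }
}
```

Running both versions on the same input and comparing mOut element by element over the interior columns should show whether the vectorized loop computes the same values.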

The matrix is created in this way:

double* matrix = (double*)_mm_malloc(m_real*n_real*sizeof(double), 64);

where m_real is the number of rows and n_real is the row length, which is a multiple of 8.
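Because the base pointer is 64-byte aligned and n_real is a multiple of 8, every element whose column index j is a multiple of 8 starts on a 64-byte boundary, while the elements at j - 1 and j + 1 do not. A small sketch to verify this on a given buffer, with hypothetical helper names is_aligned64 and elem_offset_bytes:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if p sits on a 64-byte boundary (one 512-bit vector), else 0. */
int is_aligned64(const void *p)
{
    return ((uintptr_t)p % 64u) == 0;
}

/* Byte offset of element (i, j) from the matrix base, row-major layout.
   Assumes n_real is a multiple of 8, as stated in the post. */
size_t elem_offset_bytes(int i, int j, int n_real)
{
    assert(i >= 0 && j >= 0 && n_real % 8 == 0);
    return ((size_t)i * (size_t)n_real + (size_t)j) * sizeof(double);
}
```

For example, with n_real = 16 the element (1, 8) lies at byte offset 192, a multiple of 64, whereas (2, 9) lies 8 bytes past a boundary, so a 64-byte load starting there spans two 64-byte lines.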

In my code I start the calculation from j=8 because the first eight elements of each row are halo elements, just like the last eight elements (one DP vector each).

Could you explain where the problem is and how I can resolve it? Regards.
