Channel: Intel® Many Integrated Core Architecture

Why is it so difficult to write AVX code on MIC?


Hello,

I am writing AVX code to perform complex multiplication. The code is listed below:

  1 typedef std::complex<float> Value;
  2 void Benchmark::gridKernel(const int support,
  3                            const Value C[],
  4                            Value grid[], const int gSize)
  5 {
  6     int Nvec=8;
  7     int nBlock,nrest,sSize_b;
  8
  9     nrest=sSize%Nvec;
 10     nBlock=(sSize-nrest)/Nvec;
 11     sSize_b=sSize-nrest;
 12 …
 13     for (int dind = bs; dind <= be; ++dind) {
 14 …
 15                 gind=…
 16                 cind=…
 17             Value gridc[sSize_b],Cc[sSize_b];
 18             for (int suppu = 0; suppu < sSize_b; suppu++) {
 19                gridc[suppu] = grid[gind+suppu];
 20                Cc[suppu]    = C[cind+suppu];
 21             }
 22             const Value d = samples[dind].data;
 23             for (int suppu = 0; suppu < nBlock; suppu++) {
 24               int sl=suppu*Nvec;
 25               __m512 sam = _mm512_load_ps(( Real *) &Cc[sl]);
 26               __m512 *gridptr = (__m512 *) &gridc[sl];
 27               __m512 data_r = _mm512_set1_ps(d.real());
 28               __m512 data_i = _mm512_set1_ps(d.imag());
 29               __m512 t7 = _mm512_mul_ps(data_r, sam);
 30               __m512 t6 = _mm512_mul_ps(data_i, sam);
 31               __m512 t8 = _mm512_swizzle_ps(t6,_MM_SWIZ_REG_CDAB);
 32               __m512 t7c= t7;
 33               __m512 t9 = _mm512_mask_sub_ps(t7c, 0x5555, t7, t8);
 34               __m512 t9c= t9;
 35               __m512 t10= _mm512_mask_add_ps(t9c, 0xAAAA, t9, t8);
 36               gridptr[0] = _mm512_add_ps(gridptr[0], t10);
 37             }//end suppu
 38
 39             for(int suppu=0;suppu<sSize_b;suppu++){
 40                 grid[gind+suppu]=gridc[suppu];
 41             }
 42
 43             for (int suppu = sSize_b; suppu < sSize; suppu++) {
 44                 grid[gind+suppu] += d * C[cind+suppu];
 45             }
 46     }//end dind
 47 }

As you can see above, this code multiplies the elements of “C” by “d” and adds the results into the array “grid”. The memory for the arrays “grid” and “C” is allocated in another function with the following code:

grid = (Value *) _mm_malloc(gSize*gSize*sizeof(Value),64);
if(grid == NULL) exit (1);
C = (Value *) _mm_malloc(sizeofC*sizeof(Value),64);
if(C == NULL) exit (1);

Both arrays are 64-byte aligned. This code runs correctly on MIC.
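To make the lane arithmetic in rows 27 to 36 easier to verify, here is a minimal scalar sketch in plain C++ that mimics the swizzle and the two masked operations on interleaved (real, imaginary) floats. The function name emulateLanes and its parameters are hypothetical and not part of the original code; it is meant only as a reference to compare against, not as the MIC implementation.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical helper (not in the original code): scalar emulation of the
// swizzle/mask sequence for interleaved data [re0, im0, re1, im1, ...].
//   c   : interleaved input values (plays the role of "sam")
//   acc : interleaved accumulator of the same length (plays the role of "grid")
//   d   : the complex scalar
inline void emulateLanes(const std::vector<float>& c,
                         std::vector<float>& acc,
                         std::complex<float> d)
{
    const std::size_t n = c.size();
    assert(n % 2 == 0 && acc.size() == n);

    std::vector<float> t7(n), t6(n), t8(n);
    for (std::size_t i = 0; i < n; ++i) {
        t7[i] = d.real() * c[i];   // data_r * sam
        t6[i] = d.imag() * c[i];   // data_i * sam
    }
    // CDAB swizzle: swap the two elements of every adjacent pair.
    for (std::size_t i = 0; i + 1 < n; i += 2) {
        t8[i]     = t6[i + 1];
        t8[i + 1] = t6[i];
    }
    for (std::size_t i = 0; i < n; ++i) {
        // Even lanes (mask 0x5555): subtract, giving the real part.
        // Odd lanes  (mask 0xAAAA): add, giving the imaginary part.
        const float t10 = (i % 2 == 0) ? t7[i] - t8[i] : t7[i] + t8[i];
        acc[i] += t10;
    }
}
```

For example, with d = 2+3i, a single input value 1+4i, and a starting accumulator of 1+1i, the accumulator ends at -9+12i, which matches grid += d * C.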

You may wonder why I use two temporary arrays, “gridc” and “Cc”, to hold pieces of the arrays “grid” and “C” before the computation, since that adds many memory copies and reduces performance. The reason is that if I delete the code in rows 17 to 21 and rows 39 to 41, and replace rows 25 and 26 with the following code,

__m512 sam = _mm512_load_ps(( Real *) &C[cind + sl]);

__m512 *gridptr = (__m512 *) &grid[gind + sl];

There is a “Segmentation fault (signal 11)” error when it runs on the MIC card. The icpc version is 14.0.2.144 Build 20140120.
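One thing that may be worth checking, offered as a possibility rather than a confirmed diagnosis: _mm512_load_ps and a dereferenced __m512 pointer both expect a 64-byte aligned address, and a 64-byte aligned base pointer does not guarantee that &C[cind + sl] or &grid[gind + sl] is aligned, since that depends on the index. The sketch below uses two hypothetical helpers, isAligned and offsetKeepsAlignment, to test this in portable C++; the names are illustrative and not part of the original code.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <cstdint>

typedef std::complex<float> Value;

// Hypothetical helper: true if p is a multiple of `alignment` bytes.
inline bool isAligned(const void* p, std::size_t alignment = 64)
{
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical helper: assuming the base pointer itself is aligned,
// does &base[index] stay aligned? This depends only on the byte offset.
// Assumption: sizeof(Value) is 8 bytes (two 4-byte floats).
inline bool offsetKeepsAlignment(std::size_t index, std::size_t alignment = 64)
{
    assert(alignment != 0);
    return (index * sizeof(Value)) % alignment == 0;
}
```

Under the assumption that sizeof(Value) is 8 bytes, only indices that are multiples of 8 keep a 64-byte aligned base aligned, so evaluating isAligned(&C[cind + sl]) and isAligned(&grid[gind + sl]) just before the load would show whether this applies here.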

I don’t know where this error comes from or how to fix it.

Any advice?

Shaohua
