Hi,
I am trying to figure out the effect of using array alignment in vectorization of MIC code.
Here is a simple piece of offload code from Intel:
#pragma offload target(mic:cardId)
#pragma omp parallel for private(j,k)
for (i=0; i<numthreads; i++)
{
    int offset = i*LOOP_COUNT;
    for (j=0; j<MAXFLOPS_ITERS; j++)
    {
        #pragma vector aligned
        for (k=0; k<LOOP_COUNT; k++)
        {
            fa[k+offset]=a*fa[k+offset]+fb[k+offset];
        }
    }
}
This program gets ~1900 GFlops, which is very promising. However, if I change line 8 (the "#pragma vector aligned" line) to
#pragma vector unaligned
or
#pragma simd
the performance drops significantly, to ~60 GFlops.
From the documentation I learned that "aligned" tells the compiler "to use aligned data movement instructions for all array references when vectorizing". Could you elaborate on this explanation a little?
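For context, here is a minimal C sketch of what the "aligned" assertion appears to require of the arrays. It assumes fa and fb hold floats (the element type is not shown in the snippet above) and that the coprocessor's 512-bit vectors want 64-byte alignment. The helper names (is_vec_aligned, alloc_aligned_floats, chunk_offsets_stay_aligned) are hypothetical and are not part of the Intel sample.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption: 512-bit vector registers, i.e. 64-byte alignment. */
#define VEC_ALIGN 64

/* Returns 1 if p starts on a VEC_ALIGN-byte boundary, 0 otherwise. */
static int is_vec_aligned(const void *p)
{
    return ((uintptr_t)p % VEC_ALIGN) == 0;
}

/* Allocates room for n floats on a VEC_ALIGN-byte boundary (C11 aligned_alloc).
 * The byte count is rounded up to a multiple of VEC_ALIGN, as C11 requires.
 * Returns NULL on failure; release the buffer with free(). */
static float *alloc_aligned_floats(size_t n)
{
    size_t bytes = n * sizeof(float);
    size_t rounded = (bytes + VEC_ALIGN - 1) / VEC_ALIGN * VEC_ALIGN;
    float *p = aligned_alloc(VEC_ALIGN, rounded);
    assert(p == NULL || is_vec_aligned(p)); /* sanity check on the allocator */
    return p;
}

/* Each thread starts at base + i*loop_count elements. Those starting
 * addresses stay aligned for every i only if one chunk spans a whole
 * number of VEC_ALIGN-byte blocks. */
static int chunk_offsets_stay_aligned(size_t loop_count, size_t elem_size)
{
    return (loop_count * elem_size) % VEC_ALIGN == 0;
}
```

With these assumptions, "#pragma vector aligned" would only be safe if both the base pointers of fa and fb and every per-thread offset i*LOOP_COUNT pass these checks. For example, a LOOP_COUNT of 128 floats keeps every chunk aligned, while 100 floats would not.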
Also, I noticed that the change I mentioned above makes little difference when the program runs on the host machine. Why is that?
Thanks!