Channel: Intel® Many Integrated Core Architecture

How to further optimize?


Hello everyone,

I have started overcoming the performance issues that I described in a previous thread (https://software.intel.com/en-us/forums/topic/516335). As suggested there, I enabled the vectorization reports and, of course, nothing was being vectorized. I started changing the code and yesterday I scored a small victory: execution time on the Phi dropped from 430 sec to 308 sec, which is also faster than the 330 sec required with 8 threads on the CPU (1x i7-3770 @3.40GHz). Now I am trying to optimize the code further, and I have identified two points where I believe a lot of time can be saved.

The main computation in my application is to calculate small (3x3) rotation matrices and multiply each of them with another 3x3 matrix. In my current experiment, about 15 billion such multiplications are performed. On the CPU, I found that simply unrolling all loops and writing out every calculation of the multiplication by hand gives the best performance. I was wondering, however, whether such a small calculation can be vectorized efficiently on the Phi. A single matrix does not even fill a vector register (I use float values, so 16 values fit into one register, while a matrix has only 9). I have searched a lot, but couldn't find anything relevant to this. I would like to hear your opinion on whether it is worth the effort. And if so, how can this actually be vectorized? Can pragmas provide better performance for such small calculations, or would it be better to use intrinsics directly?
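To make the question concrete, here is a minimal sketch of one layout I could imagine trying: instead of vectorizing inside a single 3x3 product, process 16 independent products at once, with the innermost loop running over the matrices. The names `mat3_batch`, `mat3_batch_mul` and `BATCH` are hypothetical and not taken from my application, and whether this actually beats the unrolled scalar version on the Phi is exactly what I do not know.

```c
/* Hypothetical structure-of-arrays layout (an assumption for illustration):
   element (r,c) of matrix k is stored at m[3*r + c][k], so the same element
   of consecutive matrices is contiguous in memory. */
#define BATCH 16  /* 16 floats fill one 512-bit vector register */

typedef struct {
    float m[9][BATCH];
} mat3_batch;

/* out = a * b for BATCH independent 3x3 matrix pairs.
   restrict promises the compiler that the three batches do not overlap. */
void mat3_batch_mul(const mat3_batch *restrict a,
                    const mat3_batch *restrict b,
                    mat3_batch *restrict out)
{
    for (int r = 0; r < 3; r++) {
        for (int c = 0; c < 3; c++) {
            /* Innermost loop runs over the matrices with unit stride,
               which is the loop a compiler would be expected to vectorize. */
            for (int k = 0; k < BATCH; k++) {
                out->m[3 * r + c][k] =
                    a->m[3 * r + 0][k] * b->m[0 * 3 + c][k] +
                    a->m[3 * r + 1][k] * b->m[1 * 3 + c][k] +
                    a->m[3 * r + 2][k] * b->m[2 * 3 + c][k];
            }
        }
    }
}
```

The cost of this approach is that the data must already be stored (or first be rearranged) in this interleaved form, so it only pays off if the rotation matrices can be generated directly into such batches.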

The second point is a calculation of the type:

for (i = 0; i < M; i++) {
    C[i] = sqrtf(A[i] * A[i] + B[i] * B[i]);
}

However, the vectorization report says that the sqrtf() function cannot be vectorized. After reading more on this, I found that sqrtf() can in fact be vectorized. I therefore started simplifying my application, and at some point the above loop does get vectorized. I have still not identified, however, what exactly hinders vectorization in the original code.

If I force vectorization in the original code with #pragma simd, I get remarks such as "remark: *MIC* vectorization support: gather was generated for the variable (unknown):  indirect access, 64bit indexed" and "remark: *MIC* vectorization support: scatter was generated for the variable C:  indirect access, 64bit indexed". Performance gets worse in this case.

I would appreciate any help on the above issues.

Ioannis E. Venetis

