Optimizing Intel Performance


Hello,
I am writing because for some time I have been working with an Intel Xeon E3 processor as well as with an Intel Xeon Phi (MIC) card. I have focused on the following books:
Structured Parallel Programming: Patterns for Efficient Computation
Intel Xeon Phi Coprocessor High Performance Programming

However, when comparing the efficiency of the Xeon Phi and the Xeon (using OpenMP, TBB and Intel Cilk Plus), I noticed either a slowdown or only a slight improvement of the Phi over the Xeon. Using the examples from Structured Parallel Programming, I noticed that, for instance, the SAXPY operation was faster on the Xeon when using three float vectors of size 90 MB. I am quite aware that the largest speedup is achieved when data resides in cache and is reused as long as possible (just like the stencil example in the latter book, where the memory access pattern yielded a 60x speedup; the helloflops example, by contrast, had comparable execution times on the Xeon and the Xeon Phi).

However, I am looking for other clues or suggestions on what to focus on in order to achieve the highest performance boost. It turns out that embarrassingly parallel algorithms such as SAXPY give either no speedup or a slowdown, even with vectorization enabled. Is this due to the architecture, or is there a method to achieve a high speedup? How can I make the TBB and Cilk Plus parallel fors fit into the cache lines of the Xeon Phi, so that the same line is not used by many cores and performance is not limited? I have seen the tutorials in the Xeon Phi developer zone, but I do not recall such clues being given there. A minimal sketch of the kind of SAXPY benchmark I am running is shown below.
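For reference, here is a minimal OpenMP sketch of what I am timing (the array size, alignment and coefficient are illustrative, not my exact values):

    #include <stdio.h>
    #include <stdlib.h>
    #include <immintrin.h>
    #include <omp.h>

    /* Minimal SAXPY kernel, roughly what I am timing. The arrays are 64-byte
     * aligned so each core's chunk of iterations starts on a cache-line
     * boundary on the Xeon Phi. */
    static void saxpy(float a, const float *restrict x, float *restrict y, size_t n)
    {
        #pragma omp parallel for
        for (size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        size_t n = 1 << 24;                            /* ~64 MB per float array */
        float *x = _mm_malloc(n * sizeof(float), 64);
        float *y = _mm_malloc(n * sizeof(float), 64);
        for (size_t i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        printf("threads = %d\n", omp_get_max_threads());
        saxpy(2.0f, x, y, n);

        _mm_free(x);
        _mm_free(y);
        return 0;
    }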

I also attempted to use Intel Advisor, but although it estimates execution time, it does not give clues on how to improve the algorithm for the Xeon Phi (it focuses mainly on vectorization).
What is also interesting: when scaling the problem (i.e. varying OMP_NUM_THREADS), the highest performance is not achieved at 56 threads or a multiple of it, but somewhere between 16 and 32 threads, when not all cores are even used for the computation (I used only native mode and timed only the computation, not the data transfer).
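To show what I mean by the scaling runs, this is roughly how I sweep the thread count (native mode, timing only the computation; thread placement is left at the default KMP_AFFINITY, and the sizes are again illustrative):

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    /* Sweep the number of OpenMP threads and time one SAXPY pass per setting.
     * This is the kind of run where the best time shows up between 16 and 32
     * threads instead of at a multiple of the core count. */
    int main(void)
    {
        size_t n = 1 << 24;
        float *x = malloc(n * sizeof *x);
        float *y = malloc(n * sizeof *y);
        for (size_t i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        for (int t = 8; t <= 64; t *= 2) {
            omp_set_num_threads(t);
            double t0 = omp_get_wtime();
            #pragma omp parallel for
            for (size_t i = 0; i < n; ++i)
                y[i] = 2.0f * x[i] + y[i];
            printf("%2d threads: %.4f s\n", t, omp_get_wtime() - t0);
        }

        free(x);
        free(y);
        return 0;
    }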

If you have any clues, I would be glad to hear them, as so far on the Xeon Phi I have only achieved insignificant speedups or slowdowns.

Best Regards

