Channel: Intel® Many Integrated Core Architecture

Ring bus problem?


Hi,

I'm trying to move a Monte Carlo Fortran program that uses OpenMP from Xeons to the Phi.  The program scales very well on Xeons but very badly on the Phi.  I've attached some plots to quantify our results.  The program itself is not huge, but the data the threads access -- reads only -- run to several GBytes.  The data are mostly contained in two arrays which are not accessed sequentially while the code is running.  The plots are all normalized to the same time for one thread, since the focus is on the shapes of the curves, not the absolute values.

The plot "Intel Comparison plot a" summarizes the results for two Xeon machines (a dual E5 with 32 threads and an i7 with 12 threads) and the Phi.  The Phi is much slower than the Xeons: it starts off 20x slower on the first thread, which is not so bad, but gets much worse quite rapidly.  The two Xeons show little sign of fading, even with all their threads in use and despite the code being heavily (80%) back-end bound.  The code was designed to be quite scalable (although not very vectorizable).  This is shown in more detail in the E5I timing plot: the grey line shows perfect scaling, and the Xeons show no sign of flattening out with the number of threads.  The bumps at 13 and 18 threads, and the few in the mid 20s, are real, not statistical fluctuations; we don't understand those.  Interestingly, the i7 is faster than the E5.

However, as the "PHI timing 1" plot shows in more detail, at about 8 threads the Phi results flatten out and stay flat until somewhere over 100 threads, when they pick up again.  In the end the Phi is about 100x slower than a Xeon with the same number of threads would be, and using all the Phi threads is 4-5x slower than the 32-thread E5 setup.  These results are pretty much the same no matter how we set the affinity (which shouldn't matter anyway, I think, since we want to run with all threads).
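For reference, by "setting the affinity" I mean the usual Intel OpenMP environment variables; the values below are illustrative of the kinds of settings we tried, not a recommendation:

```shell
# Illustrative affinity settings (Intel OpenMP runtime on the Phi):
export OMP_NUM_THREADS=240          # use all hardware threads
export KMP_AFFINITY=balanced        # also tried: compact, scatter
```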

Our only thought so far is that something about the ring bus is hosing up access to the large arrays.  We have not been able to think of a way around this, or to pin down exactly what causes the problem so early, well before we even reach one thread per core.  We expected to get good scaling at least until 56 threads.  Any thoughts on this would be greatly appreciated.

thanks,

 

