Channel: Intel® Many Integrated Core Architecture

Ring bus problem?


Hi,

I'm trying to move a Monte Carlo Fortran program that uses OpenMP from Xeons to the Phi.  The program scales very well on Xeons but very badly on the Phi.  I've attached some plots to quantify our results.  The program itself is not huge, but the data the threads access -- reads only -- run to several GBytes.  The data are mostly contained in two arrays which are not accessed sequentially while the code is running.  The plots are all normalized to the same time for one thread, since the focus is on the shapes of the curves, not the absolute values.

The plot "Intel Comparison plot a" summarizes the results for two Xeon machines (a dual E5 with 32 threads and an i7 with 12 threads) and the Phi.  The Phi is much slower than the Xeons: it starts off 20x slower on the first thread, which is not so bad, but gets much worse quite rapidly.  The two Xeons show little sign of fading, even with all their threads in use and despite the code being heavily (80%) back-end bound.  The code was designed to be quite scalable (although not very vectorizable).  This is shown in more detail in the E5I timing plot: the grey line shows perfect scaling, and the Xeons show no sign of flattening out with the number of threads.  The bumps at 13 and 18 threads, and the few in the mid 20s, are real, not statistical fluctuations; we don't understand those.  Interestingly, the i7 is faster than the E5.

However, as the "PHI timing 1" plot shows in more detail, at about 8 threads the Phi results flatten out and stay flat until somewhere over 100 threads, when they pick up again.  In the end the Phi is about 100x slower than a Xeon with the same number of threads would be, and using all the Phi threads is 4-5x slower than the 32-thread E5 setup.  These results are pretty much the same no matter how we set the affinity (which shouldn't matter anyway, I think, since we want to run with all threads).
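For reference, by "setting the affinity" I mean the usual Intel OpenMP environment variables; the values below are illustrative of the kinds of settings we tried, not a recommendation:

```shell
# Illustrative affinity settings (Intel OpenMP runtime on the Phi):
export OMP_NUM_THREADS=240          # use all hardware threads
export KMP_AFFINITY=balanced        # also tried: compact, scatter
```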

Our only thought so far is that something about the ring bus is hosing up access to the large arrays.  We have not been able to think of a way around this, or to pin down exactly what causes the problem so early, well before we even reach one thread per core.  We expected to get good scaling at least until 56 threads.  Any thoughts on this would be greatly appreciated.

thanks,

 

