
Understanding bad performance of an offloaded hybrid MPI-OpenMP application


Hello,

I have ported my application to nodes equipped with two Intel Xeon Phi cards, and I notice that performance is very disappointing.
As it is an MPI application, I need to give some more information about how it works (sorry for the long text).

 

MPI parallelization is done with a classical 3D domain decomposition using a Cartesian grid of subdomains (one process per subdomain). Each subdomain has ghost cells (26 neighbours) that need to be refreshed several times per time iteration (explicit multi-step scheme in time).
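
Schematically, the decomposition is set up like this (a simplified sketch written in C for brevity, with made-up names; the real code also builds the list of 26 neighbour ranks and does the non-blocking exchanges):

#include <mpi.h>

/* Simplified sketch: build the 3D Cartesian grid of subdomains,
 * one MPI process per subdomain, and get this process's position in it. */
static void setup_decomposition(MPI_Comm *cart_comm, int coords[3])
{
    int nprocs, rank;
    int dims[3]    = {0, 0, 0};   /* let MPI choose the process grid */
    int periods[3] = {0, 0, 0};   /* non-periodic domain             */

    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, 3, dims);
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, cart_comm);
    MPI_Comm_rank(*cart_comm, &rank);
    MPI_Cart_coords(*cart_comm, rank, 3, coords);

    /* The ranks of the 26 neighbours (faces, edges, corners) are obtained
     * with MPI_Cart_rank() on shifted coordinates; their ghost cells are
     * refreshed with non-blocking sends/receives several times per step.  */
}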

Next, hybridization is done with one large OpenMP parallel region, and nearly everything is done in parallel through collapsed OpenMP nested loops. On a cluster of multicore nodes, everything runs fine.
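
A typical compute kernel has this shape (again a simplified C sketch with illustrative array names):

#include <omp.h>

/* Simplified sketch of one compute kernel: the two outer loops are
 * collapsed so their iterations are distributed over all threads,
 * and the innermost loop is left to the compiler for vectorization. */
void update_field(int nx, int ny, int nz,
                  double *u, const double *rhs, double dt)
{
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < nx; ++i)
        for (int j = 0; j < ny; ++j)
            for (int k = 0; k < nz; ++k) {
                const long idx = ((long)i * ny + j) * nz + k;
                u[idx] += dt * rhs[idx];
            }
}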

With the offload technique, things have to change. As the MPI communications are done on the host server, data must be transferred between the host and the MIC (the goal is to use several servers with MIC cards). To minimize this amount of data, I create buffer arrays that I fill with the ghost-cell values wherever there is a neighbour, and only these buffer arrays travel between each MIC and the host. All the large data arrays are copied to the MIC at the beginning and stay there until the end. There is no longer one large OpenMP parallel region but several offloaded OpenMP regions, with only the MPI communications between them. Everything is computed in parallel inside these regions.
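
Schematically, the offload part looks like this (a simplified C sketch using the Intel offload (LEO) pragmas, with made-up names; the real code packs and unpacks the 26 ghost buffers):

/* Simplified sketch: the large arrays are allocated once on the coprocessor
 * and kept there (alloc_if/free_if), so that afterwards only the small
 * packed ghost-cell buffers travel between the host and the MIC.          */

void copy_fields_to_mic(double *u, long n)
{
    /* at start-up: allocate and fill the big array on the MIC and
     * keep it allocated when the offload region ends                      */
    #pragma offload target(mic:0) in(u : length(n) alloc_if(1) free_if(0))
    { }
}

void offloaded_substep(double *u, long n,
                       double *ghost_recv, double *ghost_send, long nghost)
{
    /* every sub-step: ship only the small ghost buffers, reuse the
     * resident field u without copying it                                 */
    #pragma offload target(mic:0) \
        in (ghost_recv : length(nghost)) \
        out(ghost_send : length(nghost)) \
        nocopy(u : length(n) alloc_if(0) free_if(0))
    {
        #pragma omp parallel
        {
            /* unpack ghost_recv into u, advance the scheme, pack ghost_send */
        }
    }
    /* the MPI exchange of ghost_send / ghost_recv with the neighbours is
     * done on the host, between two such offloaded regions                */
}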

 

I checked ifort's optimization report to verify that every loop is vectorized, and I added options and directives so that data are correctly aligned.
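
For the alignment I mean this kind of thing (simplified C sketch; 64-byte alignment for the 512-bit vector units of the MIC):

#include <immintrin.h>   /* _mm_malloc / _mm_free */

/* Simplified sketch of the alignment handling: 64-byte aligned allocation
 * plus hints so the compiler can generate aligned loads/stores.           */
double *allocate_field(long n)
{
    return (double *)_mm_malloc((size_t)n * sizeof(double), 64);
}

void axpy(long n, double a, double *restrict y, const double *restrict x)
{
    __assume_aligned(x, 64);      /* Intel-compiler alignment assertion     */
    __assume_aligned(y, 64);
    #pragma vector aligned        /* tell the compiler accesses are aligned */
    for (long i = 0; i < n; ++i)
        y[i] += a * x[i];
}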

I picked a mesh (800x300x100 cells) that fills the memory of the two MIC cards (13 GB) in order to compare the performance of the three versions of the code. On a 20-core node (Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz) with two MIC (5110P) cards, I get:
on node, MPI only (20 processes): 605 sec, CPI rate: 0.83
on node, hybrid MPI/OpenMP (4 processes, each with 5 threads): 747 sec, CPI rate: 0.77
offload MPI/OpenMP (4 processes, each with 118 offloaded threads, i.e. 2x118 threads on each MIC): 2615 sec, CPI rate: 4.0

With the offloaded version, I tried several combinations (number of processes, number of threads per process), and this one seems to give the best results I can get.

I used VTune to profile the behaviour of the application, and I notice that quite a lot of time is spent in the system or in external libraries (please see the attached snapshots). Moreover, the CPI rates of the time-consuming routines are worse on the MIC than on the host.
Could you please give me some advice on what I should check in my application?

Thanks in advance.

   Guy.

 

