
Understanding bad performance of an offloaded hybrid MPI-OpenMP application


Hello,

I have ported my application to nodes equipped with two Intel Xeon Phi cards, and I notice that performance is very disappointing.
As it is an MPI application, I need to give some more information about how it works (sorry for the long text).

 

MPI parallelization is done with a classical 3D domain decomposition using a Cartesian grid of subdomains (one process per subdomain). Each subdomain has ghost cells (26 neighbours) that need to be refreshed several times per time iteration (explicit multi-step scheme in time).
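
Schematically, the decomposition is set up like this (a simplified sketch written in C for brevity, with made-up names; the real code also builds the list of 26 neighbour ranks and does the non-blocking exchanges):

#include <mpi.h>

/* Simplified sketch: build the 3D Cartesian grid of subdomains,
 * one MPI process per subdomain, and get this process's position in it. */
static void setup_decomposition(MPI_Comm *cart_comm, int coords[3])
{
    int nprocs, rank;
    int dims[3]    = {0, 0, 0};   /* let MPI choose the process grid */
    int periods[3] = {0, 0, 0};   /* non-periodic domain             */

    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, 3, dims);
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, cart_comm);
    MPI_Comm_rank(*cart_comm, &rank);
    MPI_Cart_coords(*cart_comm, rank, 3, coords);

    /* The ranks of the 26 neighbours (faces, edges, corners) are obtained
     * with MPI_Cart_rank() on shifted coordinates; their ghost cells are
     * refreshed with non-blocking sends/receives several times per step.  */
}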

Next, hybridization is done with one large OpenMP parallel region, and nearly everything is done in parallel through collapsed OpenMP nested loops. On a cluster of multicore nodes, everything runs fine.
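
A typical compute kernel has this shape (again a simplified C sketch with illustrative array names):

#include <omp.h>

/* Simplified sketch of one compute kernel: the two outer loops are
 * collapsed so their iterations are distributed over all threads,
 * and the innermost loop is left to the compiler for vectorization. */
void update_field(int nx, int ny, int nz,
                  double *u, const double *rhs, double dt)
{
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < nx; ++i)
        for (int j = 0; j < ny; ++j)
            for (int k = 0; k < nz; ++k) {
                const long idx = ((long)i * ny + j) * nz + k;
                u[idx] += dt * rhs[idx];
            }
}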

With the offload technique, things have to change. As the MPI communications are done on the host server, data must be transferred between the host and the MIC (the goal is to use several servers with MIC cards). To minimize this amount of data, I create buffer arrays that I fill with the ghost-cell values wherever there is a neighbour, and only these buffer arrays travel between each MIC and the host. All the large data arrays are copied to the MIC at the beginning and stay there until the end. There is no longer one large OpenMP parallel region but several offloaded OpenMP regions, with only the MPI communications between them. Everything is computed in parallel inside these regions.
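
Schematically, the offload part looks like this (a simplified C sketch using the Intel offload (LEO) pragmas, with made-up names; the real code packs and unpacks the 26 ghost buffers):

/* Simplified sketch: the large arrays are allocated once on the coprocessor
 * and kept there (alloc_if/free_if), so that afterwards only the small
 * packed ghost-cell buffers travel between the host and the MIC.          */

void copy_fields_to_mic(double *u, long n)
{
    /* at start-up: allocate and fill the big array on the MIC and
     * keep it allocated when the offload region ends                      */
    #pragma offload target(mic:0) in(u : length(n) alloc_if(1) free_if(0))
    { }
}

void offloaded_substep(double *u, long n,
                       double *ghost_recv, double *ghost_send, long nghost)
{
    /* every sub-step: ship only the small ghost buffers, reuse the
     * resident field u without copying it                                 */
    #pragma offload target(mic:0) \
        in (ghost_recv : length(nghost)) \
        out(ghost_send : length(nghost)) \
        nocopy(u : length(n) alloc_if(0) free_if(0))
    {
        #pragma omp parallel
        {
            /* unpack ghost_recv into u, advance the scheme, pack ghost_send */
        }
    }
    /* the MPI exchange of ghost_send / ghost_recv with the neighbours is
     * done on the host, between two such offloaded regions                */
}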

 

I checked ifort's optimization report to verify that every loop is vectorized, and I added options and directives so that data are correctly aligned.
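
For the alignment I mean this kind of thing (simplified C sketch; 64-byte alignment for the 512-bit vector units of the MIC):

#include <immintrin.h>   /* _mm_malloc / _mm_free */

/* Simplified sketch of the alignment handling: 64-byte aligned allocation
 * plus hints so the compiler can generate aligned loads/stores.           */
double *allocate_field(long n)
{
    return (double *)_mm_malloc((size_t)n * sizeof(double), 64);
}

void axpy(long n, double a, double *restrict y, const double *restrict x)
{
    __assume_aligned(x, 64);      /* Intel-compiler alignment assertion     */
    __assume_aligned(y, 64);
    #pragma vector aligned        /* tell the compiler accesses are aligned */
    for (long i = 0; i < n; ++i)
        y[i] += a * x[i];
}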

I picked a mesh (800x300x100 cells) that fills the memory of the two MIC cards (13 GB) in order to compare the performance of the three versions of the code. On a 20-core node (Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz) with two MIC (5110P) cards, I get:
on node, MPI only (20 processes): 605 sec, CPI rate: 0.83
on node, hybrid MPI/OpenMP (4 processes, each with 5 threads): 747 sec, CPI rate: 0.77
offload MPI/OpenMP (4 processes, each with 118 offloaded threads, i.e. 2x118 threads on each MIC): 2615 sec, CPI rate: 4.0

With the offloaded version, I tried several combinations (number of processes, number of threads per process), and this one seems to give the best results I can get.

I used VTune to profile the behaviour of the application, and I notice that quite a lot of time is spent in the system or in external libraries (please see the attached snapshots). Moreover, the CPI rates of the time-consuming routines are worse on the MIC than on the host.
Could you please give me some advice on what I should check in my application?

Thanks in advance.

   Guy.

 

