Channel: Intel® Many Integrated Core Architecture

hybrid MPI/OpenMP with MICs - I cannot execute across MICs inside different nodes


Dear All,

I am working on a cluster with several MICs attached to it. The co-processors are distributed over four HP ProLiant SL250s Gen8 computing nodes, each with 2x Intel Xeon E5-2660 CPUs and 3x Intel Xeon Phi 5110P MICs, for a total of 12 co-processors in the entire cluster. The workload of the cluster is controlled with the SLURM Workload Manager, which was also compiled for the MICs, so they can be treated as independent computing nodes.
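For reference, this is roughly how I check that the co-processors really are visible to SLURM as independent nodes (the mics partition name is taken from my submission scripts below; the exact columns are just the ones I usually look at):

# List the nodes of the "mics" partition with their state and CPU count;
# each Xeon Phi should show up as its own node, e.g. cnf001-mic0.
sinfo -p mics -N -o "%N %T %c"

# Detailed SLURM view of one particular co-processor node.
scontrol show node cnf001-mic0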

Well, I have been able to execute a hybrid MPI/OpenMP code successfully both on a single MIC and on a group of MICs within the same node. For example, the following script executes my code across three MICs inside the same computing node (cnf001):

#!/bin/bash
#SBATCH -J omp_tutor7_mpi-MIC
#SBATCH -p mics
#SBATCH -N 3
#SBATCH -w cnf001-mic[0-2]
#SBATCH -o omp_tutor7_mpi-MIC-%j.out
#SBATCH -e omp_tutor7_mpi-MIC-%j.err

export PATH=/home/apps/intel/2016/impi/5.1.2.150/mic/bin/:$PATH
export LD_LIBRARY_PATH=/home/apps/intel/2016/impi/5.1.2.150/mic/lib/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/home/apps/intel/2016/lib/mic/:$LD_LIBRARY_PATH
export I_MPI_FABRICS=shm:tcp

export KMP_PLACE_THREADS=60c,4t
export KMP_AFFINITY=scatter

mpiexec.hydra -n 3 ./omp_tutor7_mpi -i omp_tutor7_mpi -p 521icru -o omp_tutor7_mpi -b
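
As a quick sanity check of this working setup (same environment and allocation as in the script above; replacing my binary with hostname is only a minimal test, not my real workload), something like the following confirms the rank placement:

# One trivial process per MIC: this should print cnf001-mic0, cnf001-mic1
# and cnf001-mic2 (in some order).
mpiexec.hydra -n 3 hostname

# With KMP_PLACE_THREADS=60c,4t each rank of the real code then runs up to
# 240 OpenMP threads (60 cores x 4 hardware threads per core on a 5110P).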

 

In this case MPI distributes the program across the three MICs (one rank per co-processor) and OpenMP then handles the multi-threading within each MIC. The problem arises when I try to use MICs located in different nodes (cnf001 and cnf002), for example with the following script:

#!/bin/bash
#SBATCH -J omp_tutor7_mpi-MIC
#SBATCH -p mics
#SBATCH -N 2
#SBATCH -w cnf001-mic0,cnf002-mic0
#SBATCH -o omp_tutor7_mpi-MIC-%j.out
#SBATCH -e omp_tutor7_mpi-MIC-%j.err

export PATH=/home/apps/intel/2016/impi/5.1.2.150/mic/bin/:$PATH
export LD_LIBRARY_PATH=/home/apps/intel/2016/impi/5.1.2.150/mic/lib/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/home/apps/intel/2016/lib/mic/:$LD_LIBRARY_PATH
export I_MPI_FABRICS=shm:tcp

export KMP_PLACE_THREADS=60c,4t
export KMP_AFFINITY=scatter

mpiexec.hydra -n 2 ./omp_tutor7_mpi -i omp_tutor7_mpi -p 521icru -o omp_tutor7_mpi -b

 

In this case I obtain no output from the MICs. The workload manager shows that the co-processors are running, but the execution never finishes and I get no output from my code and no communication errors between the MICs, so I suppose the co-processors are "hung" and are not executing my program. I have tried different values of the I_MPI_DEBUG variable, but again I obtain no output from the execution. The only "success" I have obtained so far was with the following command:

mpiexec.hydra -n 2 -hosts cnf001-mic0,cnf002-mic0 ./omp_tutor7_mpi -i omp_tutor7_mpi -p 521icru -o omp_tutor7_mpi -b

 

However, in that case the code is really executed only on the first listed MIC (cnf001-mic0), and the second is simply ignored. As for communication, I am able to ssh between the host and all the MICs, and between MICs both within the same computing node and across nodes, so it does not seem to be an obvious connectivity problem. I would like to kindly ask for any hint about where I should look to solve this problem. I am quite new to computing with MICs and I am rather lost with this issue. Thanks for your help!
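
For completeness, the connectivity and launcher checks I run look more or less like this (same hostnames as above; using hostname instead of my binary is just to isolate the launcher, and I_MPI_HYDRA_DEBUG is, as far as I understand, the switch that makes the hydra launcher log its proxy start-up):

# Password-less ssh works from the host to both MICs and between the MICs.
ssh cnf001-mic0 hostname
ssh cnf002-mic0 hostname
ssh cnf001-mic0 ssh cnf002-mic0 hostname

# Minimal cross-node launch with launcher debugging enabled; if the hydra
# proxies start correctly, this should print both MIC hostnames.
export I_MPI_HYDRA_DEBUG=1
mpiexec.hydra -n 2 -hosts cnf001-mic0,cnf002-mic0 hostname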

