
Problems when trying to run symmetric MPI jobs with MPSS 3.2, MLNX HCA and ofed-3.5.1-mic-beta1


Hi,

We have been struggling to get symmetric MPI jobs running on our cluster. Host-to-host MPI works fine, and MIC-native MPI also works between compute nodes. Intra-node host <-> MIC communication works as well, but inter-node host <-> MIC jobs just hang: the proxies never get "PMI response: cmd=barrier_out". Is this supposed to work at all with this HW/SW combination?

CentOS 6.5, MPSS 3.2, Slurm 2.6.7, and OFED 3.5.1-mic-beta1. The HCA is a Mellanox ConnectX-3, and mpxyd is running.

I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u

I_MPI_FABRICS=shm:dapl
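
Roughly, the launch looks like this (a simplified sketch: the rank counts and the app.host/app.mic binary names are placeholders, while the hostnames and the Intel MPI install path are the ones that appear in the log below):

    # host-side environment; I_MPI_MIC enables MIC support so hydra passes --enable-mic
    export I_MPI_MIC=1
    export I_MPI_FABRICS=shm:dapl
    export I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u
    source /appl/opt/cluster_studio_xe2013/impi/4.1.3.045/intel64/bin/mpivars.sh

    # symmetric launch: host ranks and MIC ranks combined in one MPMD command line
    mpiexec.hydra \
        -n 2 -host m41      ./app.host : -n 2 -host m42      ./app.host : \
        -n 2 -host m41-mic0 ./app.mic  : -n 2 -host m42-mic0 ./app.mic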

    Network:       Static bridge br0
        MIC IP:    10.10.5.X
        Host IP:   10.10.4.X
        Net Bits:  16
        NetMask:   255.255.0.0
        MtuSize:   1500

net.ipv4.ip_forward = 1
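
The remote proxies do start and connect back to mpiexec (see the log below), so basic IP connectivity over the bridge appears to be in place. For reference, this is the kind of sanity check we use (10.10.5.42 is just an example address from the MIC range above):

    # from host m41: reach the MIC on the other node over the bridge
    ping -c 3 10.10.5.42                    # m42-mic0 (example address)
    ssh m42-mic0 'ping -c 3 10.10.4.41'     # and back to the m41 host

    # HCA visible and the MIC proxy daemon running on each host
    ibv_devinfo | grep -e hca_id -e state
    pgrep -l mpxyd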

Here are the last lines of the debug output:

[mpiexec@m41] Launch arguments: /usr/bin/ssh -x -q m42-mic0 sh -c 'export I_MPI_ROOT="/appl/opt/cluster_studio_xe2013/impi/4.1.3.045" ; export PATH="/appl/opt/cluster_studio_xe2013/impi/4.1.3.045/intel64/bin//../../mic/bin:${I_MPI_ROOT}:${I_MPI_ROOT}/mic/bin:${PATH}" ; exec "$0""$@"' pmi_proxy --control-port 10.10.4.41:33072 --debug --pmi-connect lazy-cache --pmi-aggregate -s 0 --enable-mic --i_mpi_base_path /appl/opt/cluster_studio_xe2013/impi/4.1.3.045/intel64/bin/ --i_mpi_base_arch 0 --rmk slurm --launcher ssh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1532590129 --proxy-id 3

[mpiexec@m41] STDIN will be redirected to 1 fd(s): 9
[proxy:0:0@m41] Start PMI_proxy 0
[proxy:0:0@m41] STDIN will be redirected to 1 fd(s): 15
[proxy:0:0@m41] got pmi command (from 10): init
pmi_version=1 pmi_subversion=1
[proxy:0:0@m41] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:0@m41] got pmi command (from 10): get_maxes

[proxy:0:0@m41] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0:0@m41] got pmi command (from 10): barrier_in

[proxy:0:0@m41] forwarding command (cmd=barrier_in) upstream
[mpiexec@m41] [pgid: 0] got PMI command: cmd=barrier_in
[proxy:0:2@m41-mic0] Start PMI_proxy 2
[proxy:0:1@m42] Start PMI_proxy 1
[proxy:0:2@m41-mic0] got pmi command (from 6): init
pmi_version=1 pmi_subversion=1
[proxy:0:2@m41-mic0] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:2@m41-mic0] got pmi command (from 6): get_maxes

[proxy:0:2@m41-mic0] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=1024
[mpiexec@m41] [pgid: 0] got PMI command: cmd=barrier_in
[proxy:0:2@m41-mic0] got pmi command (from 6): barrier_in

[proxy:0:2@m41-mic0] forwarding command (cmd=barrier_in) upstream
[proxy:0:3@m42-mic0] Start PMI_proxy 3
[proxy:0:3@m42-mic0] got pmi command (from 6): init
pmi_version=1 pmi_subversion=1
[proxy:0:3@m42-mic0] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:3@m42-mic0] got pmi command (from 6): get_maxes

[proxy:0:3@m42-mic0] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=1024
[mpiexec@m41] [pgid: 0] got PMI command: cmd=barrier_in
[proxy:0:3@m42-mic0] got pmi command (from 6): barrier_in

[proxy:0:3@m42-mic0] forwarding command (cmd=barrier_in) upstream

Hangs here.
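
If more output would help, we can re-run with extra verbosity, e.g.:

    export I_MPI_DEBUG=5           # fabric/DAPL provider selection details from each rank
    export I_MPI_HYDRA_DEBUG=1     # full Hydra launcher trace
    mpiexec.hydra -verbose ...     # same symmetric command line as above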

 

