Hi
I am running WRF in symmetric mode. It runs successfully with one coprocessor, but I get the following error with two coprocessors. Can anyone help me?
[21] MPI startup(): shm and dapl data transfer modes
[17] MPI startup(): DAPL provider ofa-v2-scif0
[16] MPI startup(): DAPL provider ofa-v2-scif0
[17] MPI startup(): shm and dapl data transfer modes
[16] MPI startup(): shm and dapl data transfer modes
Meteo-Xeon-Phi-mic1:SCM:2dbb:f305e500: 216177 us(216177 us): modify_qp_state: ERR type 2 qpn 0xe gid 0x2b3cf40229ec (1) lid 0x3e9 port 1 state 1 mtu 4 rd 4 rnr 12 sl 0
Meteo-Xeon-Phi-mic1:SCM:2dbb:f305e500: 216348 us(171 us): DAPL ERR modify_qp_state Invalid argument
Meteo-Xeon-Phi-mic1:SCM:2dbb:f305e500: 216391 us(43 us): ACCEPT_USR: QPS_RTR ERR Invalid argument -> 10.10.10.1
Meteo-Xeon-Phi-mic1:SCM:2db8:1f47b500: 186585 us(186585 us): modify_qp_state: ERR type 2 qpn 0x14 gid 0x2b32240229ec (1) lid 0x3e9 port 1 state 1 mtu 4 rd 4 rnr 12 sl 0
Meteo-Xeon-Phi-mic1:SCM:2db8:1f47b500: 186763 us(178 us): DAPL ERR modify_qp_state Invalid argument
Meteo-Xeon-Phi-mic1:SCM:2db8:1f47b500: 186845 us(82 us): ACCEPT_USR: QPS_RTR ERR Invalid argument -> 10.10.10.1
[15:10.10.10.2][../../dapl_conn_rc.c:620] error(0x40000): ofa-v2-scif0: could not accept DAPL connection request: DAT_INTERNAL_ERROR()
Assertion failed in file ../../dapl_conn_rc.c at line 620: 0
internal ABORT - process 0
[16:10.10.10.2][../../dapl_conn_rc.c:620] error(0x40000): ofa-v2-scif0: could not accept DAPL connection request: DAT_INTERNAL_ERROR()
Assertion failed in file ../../dapl_conn_rc.c at line 620: 0
internal ABORT - process 0
Meteo-Xeon-Phi-mic1:SCM:2dbe:36708500: 196925 us(196925 us): modify_qp_state: ERR type 2 qpn 0x1a gid 0x2aae3c0229ec (1) lid 0x3e9 port 1 state 1 mtu 4 rd 4 rnr 12 sl 0
Meteo-Xeon-Phi-mic1:SCM:2dbe:36708500: 197101 us(176 us): DAPL ERR modify_qp_state Invalid argument
Meteo-Xeon-Phi-mic1:SCM:2dbe:36708500: 197182 us(81 us): ACCEPT_USR: QPS_RTR ERR Invalid argument -> 10.10.10.1
Meteo-Xeon-Phi-mic1:SCM:2dbc:15493500: 225066 us(225066 us): modify_qp_state: ERR type 2 qpn 0x21 gid 0x2acb180229ec (1) lid 0x3e9 port 1 state 1 mtu 4 rd 4 rnr 12 sl 0
Meteo-Xeon-Phi-mic1:SCM:2dbc:15493500: 225237 us(171 us): DAPL ERR modify_qp_state Invalid argument
Meteo-Xeon-Phi-mic1:SCM:2dbc:15493500: 225315 us(78 us): ACCEPT_USR: QPS_RTR ERR Invalid argument -> 10.10.10.1
[17:10.10.10.2][../../dapl_conn_rc.c:620] error(0x40000): ofa-v2-scif0: could not accept DAPL connection request: DAT_INTERNAL_ERROR()
Assertion failed in file ../../dapl_conn_rc.c at line 620: 0
internal ABORT - process 0
[18:10.10.10.2][../../dapl_conn_rc.c:620] error(0x40000): ofa-v2-scif0: could not accept DAPL connection request: DAT_INTERNAL_ERROR()
Assertion failed in file ../../dapl_conn_rc.c at line 620: 0
internal ABORT - process 0
Meteo-Xeon-Phi-mic1:SCM:2db9:60277500: 199595 us(199595 us): modify_qp_state: ERR type 2 qpn 0x27 gid 0x2aff640229ec (1) lid 0x3e9 port 1 state 1 mtu 4 rd 4 rnr 12 sl 0
Meteo-Xeon-Phi-mic1:SCM:2db9:60277500: 199760 us(165 us): DAPL ERR modify_qp_state Invalid argument
Meteo-Xeon-Phi-mic1:SCM:2db9:60277500: 199860 us(100 us): ACCEPT_USR: QPS_RTR ERR Invalid argument -> 10.10.10.1
[19:10.10.10.2][../../dapl_conn_rc.c:620] error(0x40000): ofa-v2-scif0: could not accept DAPL connection request: DAT_INTERNAL_ERROR()
Assertion failed in file ../../dapl_conn_rc.c at line 620: 0
internal ABORT - process 0
Meteo-Xeon-Phi-mic1:SCM:2dba:73fc9500: 231631 us(231631 us): modify_qp_state: ERR type 2 qpn 0x2e gid 0x2b84780229ec (1) lid 0x3e9 port 1 state 1 mtu 4 rd 4 rnr 12 sl 0
Meteo-Xeon-Phi-mic1:SCM:2dba:73fc9500: 231800 us(169 us): DAPL ERR modify_qp_state Invalid argument
Meteo-Xeon-Phi-mic1:SCM:2dba:73fc9500: 231904 us(104 us): ACCEPT_USR: QPS_RTR ERR Invalid argument -> 10.10.10.1
[20:10.10.10.2][../../dapl_conn_rc.c:620] error(0x40000): ofa-v2-scif0: could not accept DAPL connection request: DAT_INTERNAL_ERROR()
Assertion failed in file ../../dapl_conn_rc.c at line 620: 0
internal ABORT - process 0
Meteo-Xeon-Phi-mic1:SCM:2dbd:56c6500: 234974 us(234974 us): modify_qp_state: ERR type 2 qpn 0x36 gid 0x2b0d000229ec (1) lid 0x3e9 port 1 state 1 mtu 4 rd 4 rnr 12 sl 0
Meteo-Xeon-Phi-mic1:SCM:2dbd:56c6500: 235152 us(178 us): DAPL ERR modify_qp_state Invalid argument
Meteo-Xeon-Phi-mic1:SCM:2dbd:56c6500: 235195 us(43 us): ACCEPT_USR: QPS_RTR ERR Invalid argument -> 10.10.10.1
[21:10.10.10.2][../../dapl_conn_rc.c:620] error(0x40000): ofa-v2-scif0: could not accept DAPL connection request: DAT_INTERNAL_ERROR()
Assertion failed in file ../../dapl_conn_rc.c at line 620: 0
internal ABORT - process 0
[12:10.10.10.1] unexpected DAPL event 0x4006
Assertion failed in file ../../dapl_init_rc.c at line 1402: 0
internal ABORT - process 0
[13:10.10.10.1] unexpected DAPL event 0x4006
Assertion failed in file ../../dapl_init_rc.c at line 1402: 0
internal ABORT - process 0
[8:10.10.10.1] unexpected DAPL event 0x4006
Assertion failed in file ../../dapl_init_rc.c at line 1402: 0
internal ABORT - process 0
[10:10.10.10.1] unexpected DAPL event 0x4006
Assertion failed in file ../../dapl_init_rc.c at line 1402: 0
internal ABORT - process 0
[3:10.10.10.254] unexpected disconnect completion event from [10:10.10.10.1]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 3
[7:10.10.10.254] unexpected disconnect completion event from [10:10.10.10.1]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 7
[1:10.10.10.254] unexpected disconnect completion event from [10:10.10.10.1]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 1
[5:10.10.10.254] unexpected disconnect completion event from [10:10.10.10.1]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 5
My configuration is:
wrf_start.sh
#!/bin/bash
ulimit -s unlimited
ulimit -l unlimited
export I_MPI_PIN_MODE=mpd
export I_MPI_PIN_DOMAIN=auto
export I_MPI_MIC=1
export I_MPI_DEVICE=rdssm
export I_MPI_DEBUG=5
rm rsl.*
rm wrfout*
mpiexec.hydra -host 10.10.10.254 -n 8 ./wrf_sandy.sh : -host 10.10.10.1 -n 8 ./wrf_phi.sh : -host 10.10.10.2 -n 8 ./wrf_phi.sh
phi.envvars
#!/bin/sh
source /opt/intel/impi/4.1.3.048/mic/bin/mpivars.sh
export LD_LIBRARY_PATH=/opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/mic
export OMP_NUM_THREADS=30
export KMP_LIBRARY=turnaround
export KMP_BLOCKTIME=infinite
export KMP_STACKSIZE=32M
export OMP_SCHEDULE=STATIC
export KMP_AFFINITY=balanced
sandy.envvars
#!/bin/sh
export OMP_NUM_THREADS=2
export KMP_LIBRARY=turnaround
export KMP_BLOCKTIME=infinite
export KMP_STACKSIZE=32M
export OMP_SCHEDULE=DYNAMIC
wrf_phi.sh
#!/bin/sh
ulimit -s unlimited
ulimit -l unlimited
source ./phi.envvars
./wrf.mic
wrf_sandy.sh
#!/bin/sh
ulimit -s unlimited
ulimit -l unlimited
source ./sandy.envvars
./wrf.exe
My system is one host with two coprocessors using an internal bridge. The OS is SLES SP3 (kernel 3.0.76-0.11) with OFED 1.5.4.1 and MPSS 3.4.
Thanks in advance.