Channel: Intel® Many Integrated Core Architecture

CentOS 7 + MPSS 3.4.x + OFED 3.1x: Bug in ibp_server?


Hi,

I'm currently setting up the OS for a diskless cluster with two Xeon Phi cards per host.

Currently working with CentOS 7.0, MPSS 3.4.3, OFED 3.12-1 and Lustre 2.7.0.

Installing and booting the host and both Xeon Phis works fine so far, except that as soon as I try to load Lustre (using o2ib) on the second Xeon Phi, the complete system crashes due to an error within the ibp_server module (logs are included below). With only one Xeon Phi, Lustre works fine, including mounting over InfiniBand.

Anybody got any experience with setting up Lustre on a similar system?

I have already tried different versions of MPSS (3.4.x and 3.3.x), OFED (3.12-1, 3.18-rc1), and Lustre (2.6.0, 2.7.0).

For the Lustre installation on Xeon Phi, I followed the information posted here: https://software.intel.com/de-de/blogs/2014/11/06/lustre-on-intel-xeon-p...

Any help is highly appreciated.

Host + micx log files (MPSS 3.4.3, OFED 3.18-rc1, Lustre 2.7.0):

Mar 30 13:15:15 mac-node-015 systemd: Starting Intel(R) MPSS control service...
Mar 30 13:15:40 mac-node-015 kernel: mic0: Transition from state ready to booting
Mar 30 13:15:40 mac-node-015 kernel: mic image: /usr/share/mpss/boot/bzImage-knightscorner
Mar 30 13:15:40 mac-node-015 kernel: MIC 0 Booting
Mar 30 13:15:40 mac-node-015 kernel: mic1: Transition from state ready to booting
Mar 30 13:15:40 mac-node-015 kernel: mic image: /usr/share/mpss/boot/bzImage-knightscorner
Mar 30 13:15:40 mac-node-015 kernel: MIC 1 Booting
Mar 30 13:15:45 mac-node-015 kernel: Waiting for MIC 0 boot 5
Mar 30 13:15:45 mac-node-015 kernel: Waiting for MIC 1 boot 5
Mar 30 13:15:50 mac-node-015 kernel: Waiting for MIC 0 boot 10
Mar 30 13:15:50 mac-node-015 kernel: Waiting for MIC 1 boot 10
Mar 30 13:15:55 mac-node-015 kernel: Waiting for MIC 0 boot 15
Mar 30 13:15:55 mac-node-015 kernel: Waiting for MIC 1 boot 15
Mar 30 13:16:00 mac-node-015 kernel: Waiting for MIC 0 boot 20
Mar 30 13:16:00 mac-node-015 kernel: Waiting for MIC 1 boot 20
Mar 30 13:16:01 mac-node-015 kernel: MIC 0 Network link is up
Mar 30 13:16:01 mac-node-015 kernel: MIC 1 Network link is up
Mar 30 13:16:03 mac-node-015 kernel: mic0: Transition from state booting to online
Mar 30 13:16:03 mac-node-015 kernel: mic1: Transition from state booting to online
Mar 30 13:16:04 mac-node-015 mpss: Starting Intel(R) MPSS: [  OK  ]
Mar 30 13:16:04 mac-node-015 mpss: mic0: online (mode: linux image: /usr/share/mpss/boot/bzImage-knightscorner)
Mar 30 13:16:04 mac-node-015 mpss: mic1: online (mode: linux image: /usr/share/mpss/boot/bzImage-knightscorner)
Mar 30 13:16:04 mac-node-015 systemd: Started Intel(R) MPSS control service.
Mar 30 13:16:04 mac-node-015 kernel: device mic0 entered promiscuous mode
Mar 30 13:16:04 mac-node-015 kernel: br0: port 2(mic0) entered forwarding state
Mar 30 13:16:04 mac-node-015 kernel: br0: port 2(mic0) entered forwarding state
Mar 30 13:16:04 mac-node-015 kernel: device mic1 entered promiscuous mode
Mar 30 13:16:04 mac-node-015 kernel: br0: port 3(mic1) entered forwarding state
Mar 30 13:16:04 mac-node-015 kernel: br0: port 3(mic1) entered forwarding state
Mar 30 13:16:07 mac-node-015 systemd: Starting LSB: Start ofed layer on top of mpss...
Mar 30 13:16:07 mac-node-015 ofed-mic: Starting OFED Stack:
Mar 30 13:16:07 mac-node-015 kernel: CCL Direct Server v1.0
Mar 30 13:16:07 Copyright (c) 2011-2013 Intel Corporation
Mar 30 13:16:07 mac-node-015 kernel: CCL Direct CM Server v1.0
Mar 30 13:16:07 Copyright (c) 2011-2013 Intel Corporation
Mar 30 13:16:07 mac-node-015 kernel: CCL Direct SA Server v1.0
Mar 30 13:16:07 Copyright (c) 2011-2013 Intel Corporation
Mar 30 13:16:08 mac-node-015 kernel: ibscif: OpenFabrics IBSCIF Driver v0.1 built Mar 30 2015 10:38:18
Mar 30 13:16:08 mac-node-015 kernel: ibscif: max_pinned=50, window_size=40, blocking_send=0, blocking_recv=1, fast_rdma=1, host_proxy=0, rma_threshold=1024, scif_loopback=1, new_ib_type=1, verbose=0, check_grh=1
Mar 30 13:16:08 mac-node-015 kernel: ibscif: ibscif_add_one: my node_id is 0
Mar 30 13:16:08 mac-node-015 rdma-ndd[1345]: Device event: infiniband, scif0, add
Mar 30 13:16:08 mac-node-015 rdma-ndd[1345]: scif0: change (OpenFabrics IBSCIF Driver v0.1) -> (mac-node-015 scif0)
Mar 30 13:16:08 mac-node-015 rdma-ndd[1345]: scif0: change (OpenFabrics IBSCIF Driver v0.1) -> (mac-node-015 scif0)
Mar 30 13:16:08 mac-node-015 rdma-ndd[1345]: mlx4_0: change (mac-node-015 HCA-1) -> (mac-node-015 mlx4_0)
Mar 30 13:16:08 mac-node-015 ofed-mic: host[  OK  ]
Mar 30 13:16:08 mac-node-015 ntpd[654]: Listen normally on 9 mic0 fe80::4e79:baff:fe24:f79 UDP 123
Mar 30 13:16:08 mac-node-015 ntpd[654]: Listen normally on 10 mic1 fe80::4e79:baff:fe24:e59 UDP 123
Mar 30 13:16:09 mac-node-015 ibpd: pid 1682 /dev/ibp1 started 4 threads
Mar 30 13:16:13 mac-node-015 ofed-mic: mic0 : ib0 [  OK  ]
Mar 30 13:16:13 mac-node-015 ibpd: pid 1709 /dev/ibp2 started 4 threads
Mar 30 13:16:17 mac-node-015 ofed-mic: mic1 ib0 [  OK  ]
Mar 30 13:16:17 mac-node-015 systemd: Started LSB: Start ofed layer on top of mpss.
Mar 30 13:16:19 mac-node-015 ntpd[654]: Listen normally on 11 mic0:ib 192.0.2.100 UDP 123
Mar 30 13:19:57 mac-node-015 kernel: ibp_server: ibp_cmd_reg_user_mr(2670) ib_reg_user_mr returned -12
Mar 30 13:19:57 mac-node-015 kernel: BUG: unable to handle kernel NULL pointer dereference at           (null)
Mar 30 13:19:57 mac-node-015 kernel: IP: [<ffffffff812d0399>] __list_del_entry+0x29/0xd0
Mar 30 13:19:57 mac-node-015 kernel: PGD 4627b2067 PUD 457c0e067 PMD 0
Mar 30 13:19:57 mac-node-015 kernel: Oops: 0000 [#1] SMP
Mar 30 13:19:26 mac-node-015-mic0 kernel: [  218.607348] Module libcfs loaded at 0xffffffffa0164000
Mar 30 13:19:27 mac-node-015-mic0 kernel: [  218.731611] LNet: HW CPU cores: 228, npartitions: 12
Mar 30 13:19:27 mac-node-015-mic0 kernel: [  218.739217] Module crc32c loaded at 0xffffffffa01c6000
Mar 30 13:19:27 mac-node-015-mic0 kernel: [  218.742417] alg: No test for adler32 (adler32-zlib)
Mar 30 13:19:27 mac-node-015-mic0 kernel: [  218.742855] alg: No test for crc32 (crc32-table)
Mar 30 13:19:32 mac-node-015-mic0 kernel: [  223.826335] Module lnet loaded at 0xffffffffa01cc000
Mar 30 13:19:32 mac-node-015-mic0 kernel: [  223.922522] Module obdclass loaded at 0xffffffffa0226000
Mar 30 13:19:32 mac-node-015-mic0 kernel: [  224.203083] Lustre: Lustre: Build Version: v2_7_0_0--PRISTINE-2.6.38.8+mpss3.4.3
Mar 30 13:19:32 mac-node-015-mic0 kernel: [  224.271101] Module ptlrpc loaded at 0xffffffffa030d000
Mar 30 13:19:32 mac-node-015-mic0 kernel: [  224.699319] Module ko2iblnd loaded at 0xffffffffa042f000
Mar 30 13:19:34 mac-node-015-mic0 kernel: [  226.097243] LNet: Added LNI 10.100.22.15@o2ib [8/768/0/180]
Mar 30 13:19:34 mac-node-015-mic0 kernel: [  226.196620] Module fld loaded at 0xffffffffa046b000
Mar 30 13:19:34 mac-node-015-mic0 kernel: [  226.228555] Module lmv loaded at 0xffffffffa047b000
Mar 30 13:19:34 mac-node-015-mic0 kernel: [  226.264301] Module fid loaded at 0xffffffffa04b2000
Mar 30 13:19:34 mac-node-015-mic0 kernel: [  226.296241] Module mdc loaded at 0xffffffffa04bf000
Mar 30 13:19:34 mac-node-015-mic0 kernel: [  226.416045] Module lov loaded at 0xffffffffa04f2000
Mar 30 13:19:34 mac-node-015-mic0 kernel: [  226.534093] Module lustre loaded at 0xffffffffa0541000
Mar 30 13:19:34 mac-node-015-mic0 kernel: [  226.688858] modprobe used greatest stack depth: 4656 bytes left
Mar 30 13:19:50 mac-node-015-mic1 kernel: [  241.947416] Module libcfs loaded at 0xffffffffa0164000
Mar 30 13:19:50 mac-node-015-mic1 kernel: [  242.071345] LNet: HW CPU cores: 228, npartitions: 12
Mar 30 13:19:50 mac-node-015-mic1 kernel: [  242.079014] Module crc32c loaded at 0xffffffffa01c6000
Mar 30 13:19:50 mac-node-015-mic1 kernel: [  242.082280] alg: No test for adler32 (adler32-zlib)
Mar 30 13:19:50 mac-node-015-mic1 kernel: [  242.082724] alg: No test for crc32 (crc32-table)
Mar 30 13:19:55 mac-node-015-mic1 kernel: [  247.164349] Module lnet loaded at 0xffffffffa01cc000
Mar 30 13:19:55 mac-node-015-mic1 kernel: [  247.259874] Module obdclass loaded at 0xffffffffa0226000
Mar 30 13:19:55 mac-node-015-mic1 kernel: [  247.540370] Lustre: Lustre: Build Version: v2_7_0_0--PRISTINE-2.6.38.8+mpss3.4.3
Mar 30 13:19:55 mac-node-015-mic1 kernel: [  247.608764] Module ptlrpc loaded at 0xffffffffa030d000
Mar 30 13:19:56 mac-node-015-mic1 kernel: [  248.036393] Module ko2iblnd loaded at 0xffffffffa042f000
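
In the logs above, the host reports "ib_reg_user_mr returned -12" one second after mic1 loads ko2iblnd, immediately before the NULL pointer dereference. As a small aid for reading such lines, the following Python sketch decodes the negative return value into its errno name (on Linux, 12 is ENOMEM). The helper name decode_kernel_return and the regular expression are my own illustration and not part of MPSS or OFED; the sketch assumes the message keeps the "<function>(<number>) <call> returned -<N>" shape seen in the log.

```python
import errno
import os
import re

# Assumed message shape, taken from the single log line in this post:
#   "<function>(<number>) <call> returned -<N>"
RETURNED = re.compile(r"\w+\(\d+\) (\w+) returned -(\d+)")


def decode_kernel_return(line):
    """Return (call, errno_name, description) for a 'returned -N' line, else None."""
    match = RETURNED.search(line)
    if match is None:
        return None
    call = match.group(1)
    number = int(match.group(2))
    # errno.errorcode maps positive errno numbers to their symbolic names.
    name = errno.errorcode.get(number, "UNKNOWN")
    return call, name, os.strerror(number)


sample = (
    "Mar 30 13:19:57 mac-node-015 kernel: ibp_server: "
    "ibp_cmd_reg_user_mr(2670) ib_reg_user_mr returned -12"
)
print(decode_kernel_return(sample))
```

If the decoding is right, the memory registration requested through ibp_server failed with an out-of-memory error on the host, and the crash in __list_del_entry follows directly after it. Whether the failed registration actually triggers the oops is only a guess based on the timestamps.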

 

