Hi,
I'm setting up the OS for a diskless cluster with two Xeon Phi cards per host.
I'm currently working with CentOS 7.0, MPSS 3.4.3, OFED 3.12-1 and Lustre 2.7.0.
Installing and booting the host and both Xeon Phis works fine so far. However, as soon as I try to load Lustre (using o2ib) on the second Xeon Phi, the complete system crashes due to an error within the ibp_server module (logs are included below). With only one Xeon Phi, Lustre works fine, including mounting over InfiniBand.
Does anybody have experience with setting up Lustre on a similar system?
I have already tried different versions of MPSS (3.4.x and 3.3.x), OFED (3.12-1, 3.18-rc1) and Lustre (2.6.0, 2.7.0).
For the Lustre installation on the Xeon Phi, I followed the information posted here: https://software.intel.com/de-de/blogs/2014/11/06/lustre-on-intel-xeon-p...
Any help is highly appreciated.
Host and micX log files (MPSS 3.4.3, OFED 3.18-rc1, Lustre 2.7.0):
Mar 30 13:15:15 mac-node-015 systemd: Starting Intel(R) MPSS control service...
Mar 30 13:15:40 mac-node-015 kernel: mic0: Transition from state ready to booting
Mar 30 13:15:40 mac-node-015 kernel: mic image: /usr/share/mpss/boot/bzImage-knightscorner
Mar 30 13:15:40 mac-node-015 kernel: MIC 0 Booting
Mar 30 13:15:40 mac-node-015 kernel: mic1: Transition from state ready to booting
Mar 30 13:15:40 mac-node-015 kernel: mic image: /usr/share/mpss/boot/bzImage-knightscorner
Mar 30 13:15:40 mac-node-015 kernel: MIC 1 Booting
Mar 30 13:15:45 mac-node-015 kernel: Waiting for MIC 0 boot 5
Mar 30 13:15:45 mac-node-015 kernel: Waiting for MIC 1 boot 5
Mar 30 13:15:50 mac-node-015 kernel: Waiting for MIC 0 boot 10
Mar 30 13:15:50 mac-node-015 kernel: Waiting for MIC 1 boot 10
Mar 30 13:15:55 mac-node-015 kernel: Waiting for MIC 0 boot 15
Mar 30 13:15:55 mac-node-015 kernel: Waiting for MIC 1 boot 15
Mar 30 13:16:00 mac-node-015 kernel: Waiting for MIC 0 boot 20
Mar 30 13:16:00 mac-node-015 kernel: Waiting for MIC 1 boot 20
Mar 30 13:16:01 mac-node-015 kernel: MIC 0 Network link is up
Mar 30 13:16:01 mac-node-015 kernel: MIC 1 Network link is up
Mar 30 13:16:03 mac-node-015 kernel: mic0: Transition from state booting to online
Mar 30 13:16:03 mac-node-015 kernel: mic1: Transition from state booting to online
Mar 30 13:16:04 mac-node-015 mpss: Starting Intel(R) MPSS: [ OK ]
Mar 30 13:16:04 mac-node-015 mpss: mic0: online (mode: linux image: /usr/share/mpss/boot/bzImage-knightscorner)
Mar 30 13:16:04 mac-node-015 mpss: mic1: online (mode: linux image: /usr/share/mpss/boot/bzImage-knightscorner)
Mar 30 13:16:04 mac-node-015 systemd: Started Intel(R) MPSS control service.
Mar 30 13:16:04 mac-node-015 kernel: device mic0 entered promiscuous mode
Mar 30 13:16:04 mac-node-015 kernel: br0: port 2(mic0) entered forwarding state
Mar 30 13:16:04 mac-node-015 kernel: br0: port 2(mic0) entered forwarding state
Mar 30 13:16:04 mac-node-015 kernel: device mic1 entered promiscuous mode
Mar 30 13:16:04 mac-node-015 kernel: br0: port 3(mic1) entered forwarding state
Mar 30 13:16:04 mac-node-015 kernel: br0: port 3(mic1) entered forwarding state
Mar 30 13:16:07 mac-node-015 systemd: Starting LSB: Start ofed layer on top of mpss...
Mar 30 13:16:07 mac-node-015 ofed-mic: Starting OFED Stack:
Mar 30 13:16:07 mac-node-015 kernel: CCL Direct Server v1.0
Mar 30 13:16:07 Copyright (c) 2011-2013 Intel Corporation
Mar 30 13:16:07 mac-node-015 kernel: CCL Direct CM Server v1.0
Mar 30 13:16:07 Copyright (c) 2011-2013 Intel Corporation
Mar 30 13:16:07 mac-node-015 kernel: CCL Direct SA Server v1.0
Mar 30 13:16:07 Copyright (c) 2011-2013 Intel Corporation
Mar 30 13:16:08 mac-node-015 kernel: ibscif: OpenFabrics IBSCIF Driver v0.1 built Mar 30 2015 10:38:18
Mar 30 13:16:08 mac-node-015 kernel: ibscif: max_pinned=50, window_size=40, blocking_send=0, blocking_recv=1, fast_rdma=1, host_proxy=0, rma_threshold=1024, scif_loopback=1, new_ib_type=1, verbose=0, check_grh=1
Mar 30 13:16:08 mac-node-015 kernel: ibscif: ibscif_add_one: my node_id is 0
Mar 30 13:16:08 mac-node-015 rdma-ndd[1345]: Device event: infiniband, scif0, add
Mar 30 13:16:08 mac-node-015 rdma-ndd[1345]: scif0: change (OpenFabrics IBSCIF Driver v0.1) -> (mac-node-015 scif0)
Mar 30 13:16:08 mac-node-015 rdma-ndd[1345]: scif0: change (OpenFabrics IBSCIF Driver v0.1) -> (mac-node-015 scif0)
Mar 30 13:16:08 mac-node-015 rdma-ndd[1345]: mlx4_0: change (mac-node-015 HCA-1) -> (mac-node-015 mlx4_0)
Mar 30 13:16:08 mac-node-015 ofed-mic: host[ OK ]
Mar 30 13:16:08 mac-node-015 ntpd[654]: Listen normally on 9 mic0 fe80::4e79:baff:fe24:f79 UDP 123
Mar 30 13:16:08 mac-node-015 ntpd[654]: Listen normally on 10 mic1 fe80::4e79:baff:fe24:e59 UDP 123
Mar 30 13:16:09 mac-node-015 ibpd: pid 1682 /dev/ibp1 started 4 threads
Mar 30 13:16:13 mac-node-015 ofed-mic: mic0 : ib0 [ OK ]
Mar 30 13:16:13 mac-node-015 ibpd: pid 1709 /dev/ibp2 started 4 threads
Mar 30 13:16:17 mac-node-015 ofed-mic: mic1 ib0 [ OK ]
Mar 30 13:16:17 mac-node-015 systemd: Started LSB: Start ofed layer on top of mpss.
Mar 30 13:16:19 mac-node-015 ntpd[654]: Listen normally on 11 mic0:ib 192.0.2.100 UDP 123
Mar 30 13:19:57 mac-node-015 kernel: ibp_server: ibp_cmd_reg_user_mr(2670) ib_reg_user_mr returned -12
Mar 30 13:19:57 mac-node-015 kernel: BUG: unable to handle kernel NULL pointer dereference at (null)
Mar 30 13:19:57 mac-node-015 kernel: IP: [<ffffffff812d0399>] __list_del_entry+0x29/0xd0
Mar 30 13:19:57 mac-node-015 kernel: PGD 4627b2067 PUD 457c0e067 PMD 0
Mar 30 13:19:57 mac-node-015 kernel: Oops: 0000 [#1] SMP
Mar 30 13:19:26 mac-node-015-mic0 kernel: [ 218.607348] Module libcfs loaded at 0xffffffffa0164000
Mar 30 13:19:27 mac-node-015-mic0 kernel: [ 218.731611] LNet: HW CPU cores: 228, npartitions: 12
Mar 30 13:19:27 mac-node-015-mic0 kernel: [ 218.739217] Module crc32c loaded at 0xffffffffa01c6000
Mar 30 13:19:27 mac-node-015-mic0 kernel: [ 218.742417] alg: No test for adler32 (adler32-zlib)
Mar 30 13:19:27 mac-node-015-mic0 kernel: [ 218.742855] alg: No test for crc32 (crc32-table)
Mar 30 13:19:32 mac-node-015-mic0 kernel: [ 223.826335] Module lnet loaded at 0xffffffffa01cc000
Mar 30 13:19:32 mac-node-015-mic0 kernel: [ 223.922522] Module obdclass loaded at 0xffffffffa0226000
Mar 30 13:19:32 mac-node-015-mic0 kernel: [ 224.203083] Lustre: Lustre: Build Version: v2_7_0_0--PRISTINE-2.6.38.8+mpss3.4.3
Mar 30 13:19:32 mac-node-015-mic0 kernel: [ 224.271101] Module ptlrpc loaded at 0xffffffffa030d000
Mar 30 13:19:32 mac-node-015-mic0 kernel: [ 224.699319] Module ko2iblnd loaded at 0xffffffffa042f000
Mar 30 13:19:34 mac-node-015-mic0 kernel: [ 226.097243] LNet: Added LNI 10.100.22.15@o2ib [8/768/0/180]
Mar 30 13:19:34 mac-node-015-mic0 kernel: [ 226.196620] Module fld loaded at 0xffffffffa046b000
Mar 30 13:19:34 mac-node-015-mic0 kernel: [ 226.228555] Module lmv loaded at 0xffffffffa047b000
Mar 30 13:19:34 mac-node-015-mic0 kernel: [ 226.264301] Module fid loaded at 0xffffffffa04b2000
Mar 30 13:19:34 mac-node-015-mic0 kernel: [ 226.296241] Module mdc loaded at 0xffffffffa04bf000
Mar 30 13:19:34 mac-node-015-mic0 kernel: [ 226.416045] Module lov loaded at 0xffffffffa04f2000
Mar 30 13:19:34 mac-node-015-mic0 kernel: [ 226.534093] Module lustre loaded at 0xffffffffa0541000
Mar 30 13:19:34 mac-node-015-mic0 kernel: [ 226.688858] modprobe used greatest stack depth: 4656 bytes left
Mar 30 13:19:50 mac-node-015-mic1 kernel: [ 241.947416] Module libcfs loaded at 0xffffffffa0164000
Mar 30 13:19:50 mac-node-015-mic1 kernel: [ 242.071345] LNet: HW CPU cores: 228, npartitions: 12
Mar 30 13:19:50 mac-node-015-mic1 kernel: [ 242.079014] Module crc32c loaded at 0xffffffffa01c6000
Mar 30 13:19:50 mac-node-015-mic1 kernel: [ 242.082280] alg: No test for adler32 (adler32-zlib)
Mar 30 13:19:50 mac-node-015-mic1 kernel: [ 242.082724] alg: No test for crc32 (crc32-table)
Mar 30 13:19:55 mac-node-015-mic1 kernel: [ 247.164349] Module lnet loaded at 0xffffffffa01cc000
Mar 30 13:19:55 mac-node-015-mic1 kernel: [ 247.259874] Module obdclass loaded at 0xffffffffa0226000
Mar 30 13:19:55 mac-node-015-mic1 kernel: [ 247.540370] Lustre: Lustre: Build Version: v2_7_0_0--PRISTINE-2.6.38.8+mpss3.4.3
Mar 30 13:19:55 mac-node-015-mic1 kernel: [ 247.608764] Module ptlrpc loaded at 0xffffffffa030d000
Mar 30 13:19:56 mac-node-015-mic1 kernel: [ 248.036393] Module ko2iblnd loaded at 0xffffffffa042f000
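
A note on reading the host log: the "ib_reg_user_mr returned -12" message reports a negative errno value, as Linux kernel functions conventionally do, and errno 12 on Linux is ENOMEM (out of memory). The small Python sketch below only decodes such values; the helper name describe_kernel_return is made up for illustration and is not part of MPSS, OFED or Lustre. The decoding alone does not explain the NULL pointer dereference that follows in the log.

```python
import errno
import os


def describe_kernel_return(code: int) -> str:
    """Translate a kernel-style return value (negative errno) into a readable string.

    Hypothetical helper for illustration; it only looks up the local errno table.
    """
    if code >= 0:
        return f"{code}: not an error code"
    number = -code
    # errno.errorcode maps numeric values to symbolic names such as 'ENOMEM'.
    name = errno.errorcode.get(number, "UNKNOWN")
    # os.strerror returns the platform's message text for that number.
    return f"{code}: {name} ({os.strerror(number)})"


if __name__ == "__main__":
    # Value taken from the ibp_server line in the host log above.
    print(describe_kernel_return(-12))
```

Run on a Linux machine, this prints the symbolic name for -12 together with the system's message text for it.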