I'm trying to set up a new system:
SuperMicro 5018GR-T
2 Intel Xeon Phis:
Coprocessor Stepping : B1
Board SKU : B1PRQ-31S1P
MPSS 3.5 and Scientific Linux 7.1
# micflash -update -device all -smcbootloader
No image path specified - Searching: /usr/share/mpss/flash
mic0: Flash image: /usr/share/mpss/flash/EXT_HP2_B1_0391-02.rom.smc
mic1: Flash image: /usr/share/mpss/flash/EXT_HP2_B1_0391-02.rom.smc
mic0: SMC boot-loader image: /usr/share/mpss/flash/EXT_HP2_SMC_Bootloader_1_8_4326.css_ab
mic1: SMC boot-loader image: /usr/share/mpss/flash/EXT_HP2_SMC_Bootloader_1_8_4326.css_ab
mic1: SMC boot-loader update started
mic0: SMC boot-loader update started
mic1: SMC boot-loader update done
mic1: Transitioning to ready state
mic0: SMC boot-loader update done
mic0: Transitioning to ready state
mic1: Flash update started
mic1: Flash update done
mic1: SMC update started
mic0: Flash update started
mic0: Flash update done
mic0: SMC update started
mic1: SMC update done
mic1: Transitioning to ready state
mic0: SMC update done
mic0: Transitioning to ready state
Please restart host for flash changes to take effect

MPSS starts up fine, but at some point I lose a mic:
/var/log/messages:
May 21 15:15:44 smmic1 kernel: ------------[ cut here ]------------
May 21 15:15:44 smmic1 kernel: WARNING: at /home/build/rpmbuild/BUILD/mpss-modules-3.5/micscif/micscif_smpt.c:392 mic_map+0xf1/0x110 [mic]()
May 21 15:15:44 smmic1 kernel: micscif_handle_lostnode 1445 node 1
May 21 15:15:44 smmic1 kernel: Warning: Core image elf header not found
May 21 15:15:44 smmic1 kernel: Kdump: vmcore not initialized
May 21 15:15:44 smmic1 kernel: micscif_handle_lostnode 1457 node 1 crash dump failed status -22
May 21 15:15:44 smmic1 kernel: Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd sunrpc fscache intel_powerclamp coretemp intel_rapl kvm crct10dif_pclmul pcspkr crc32_pclmul i2c_i801 crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper sb_edac ablk_helper cryptd iTCO_wdt iTCO_vendor_support edac_core lpc_ich mfd_core wmi ipmi_devintf ipmi_si ipmi_msghandler acpi_power_meter ioatdma mei_me acpi_pad mei shpchp mic(OF) binfmt_misc xfs libcrc32c raid1 raid0 sd_mod crc_t10dif crct10dif_common ast syscopyarea sysfillrect sysimgblt drm_kms_helper ttm drm ahci libahci igb libata ptp pps_core dca i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod
May 21 15:15:44 smmic1 kernel: CPU: 3 PID: 3799 Comm: micinfo Tainted: GF IO-------------- 3.10.0-229.el7.x86_64 #1
May 21 15:15:44 smmic1 kernel: Hardware name: Supermicro SYS-5018GR-T/X10SRG-F, BIOS 1.0 10/21/2014
May 21 15:15:44 smmic1 kernel: 0000000000000000
May 21 15:15:44 smmic1 kernel: 00000000482cfbb1
May 21 15:15:44 smmic1 kernel: ffff8810053b7b30
May 21 15:15:44 smmic1 kernel: ffffffff81603f36
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: ffff8810053b7b68
May 21 15:15:44 smmic1 kernel: ffffffff8106e28b
May 21 15:15:44 smmic1 kernel: 0000000027be7000
May 21 15:15:44 smmic1 kernel: 0000000000001000
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: 0000001027be7000
May 21 15:15:44 smmic1 kernel: ffff881028255000
May 21 15:15:44 smmic1 kernel: 0000000000000000
May 21 15:15:44 smmic1 kernel: ffff8810053b7b78
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: Call Trace:
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffff81603f36>] dump_stack+0x19/0x1b
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffff8106e28b>] warn_slowpath_common+0x6b/0xb0
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffff8106e3da>] warn_slowpath_null+0x1a/0x20
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02e6871>] mic_map+0xf1/0x110 [mic]
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02e799f>] ? va_gen_init+0x6f/0x90 [mic]
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02df88d>] ? micscif_rma_ep_init+0xed/0x150 [mic]
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02c97a3>] ? __scif_open+0x93/0x110 [mic]
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02d2ed2>] ? scif_fdopen+0x32/0x70 [mic]
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02b6f68>] ? mic_open+0x48/0x50 [mic]
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02e698d>] mic_map_single+0xfd/0x160 [mic]
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02d9a1a>] micscif_setup_qp_connect+0x13a/0x240 [mic]
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02c8ea0>] scif_conn_func+0x50/0x8c0 [mic]
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffff8126ecee>] ? selinux_capable+0x2e/0x40
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02cafdc>] __scif_connect+0x1fc/0x3c0 [mic]
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02d3517>] scif_process_ioctl+0x537/0xe60 [mic]
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffff8160f294>] ? __do_page_fault+0x204/0x520
May 21 15:15:44 smmic1 kernel: mic0: Transition from state online to lost
May 21 15:15:44 smmic1 kernel:
May 21 15:15:44 smmic1 kernel: [<ffffffffa02b6fad>] mic_ioctl+0x3d/0x60 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffff811d9a75>] do_vfs_ioctl+0x2e5/0x4c0
May 21 15:15:44 smmic1 kernel: [<ffffffff8126ef4e>] ? file_has_perm+0xae/0xc0
May 21 15:15:44 smmic1 kernel: [<ffffffff811d9cf1>] SyS_ioctl+0xa1/0xc0
May 21 15:15:44 smmic1 kernel: [<ffffffff81613da9>] system_call_fastpath+0x16/0x1b
May 21 15:15:44 smmic1 kernel: micscif_handle_lostnode 1472 stopping node 1 to recover lost node!
May 21 15:15:44 smmic1 kernel: ---[ end trace 2eb53c750e757832 ]---
May 21 15:15:44 smmic1 kernel: mic_map failed board id 0 addr 0x00001027be7000 size 0x00000000001000
May 21 15:15:44 smmic1 kernel: micscif_setup_qp_connect 159 error -12
May 21 15:15:44 smmic1 kernel: scif_conn_func err -12 qp_offset 0x0
May 21 15:15:44 smmic1 kernel: micscif_dec_node_refcnt 158 dec dev ffffffffa0301210 node 1 ref -9223372036854775807 caller ffffffffa02d2f38 Lost Node??
May 21 15:15:44 smmic1 kernel: ------------[ cut here ]------------
May 21 15:15:44 smmic1 kernel: WARNING: at /home/build/rpmbuild/BUILD/mpss-modules-3.5/micscif/micscif_smpt.c:392 mic_map+0xf1/0x110 [mic]()
May 21 15:15:44 smmic1 kernel: Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd sunrpc fscache intel_powerclamp coretemp intel_rapl kvm crct10dif_pclmul pcspkr crc32_pclmul i2c_i801 crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper sb_edac ablk_helper cryptd iTCO_wdt iTCO_vendor_support edac_core lpc_ich mfd_core wmi ipmi_devintf ipmi_si ipmi_msghandler acpi_power_meter ioatdma mei_me acpi_pad mei shpchp mic(OF) binfmt_misc xfs libcrc32c raid1 raid0 sd_mod crc_t10dif crct10dif_common ast syscopyarea sysfillrect sysimgblt drm_kms_helper ttm drm ahci libahci igb libata ptp pps_core dca i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod
May 21 15:15:44 smmic1 kernel: CPU: 3 PID: 3799 Comm: micinfo Tainted: GF W IO-------------- 3.10.0-229.el7.x86_64 #1
May 21 15:15:44 smmic1 kernel: Hardware name: Supermicro SYS-5018GR-T/X10SRG-F, BIOS 1.0 10/21/2014
May 21 15:15:44 smmic1 kernel: 0000000000000000 00000000482cfbb1 ffff8810053b7b30 ffffffff81603f36
May 21 15:15:44 smmic1 kernel: ffff8810053b7b68 ffffffff8106e28b 000000002257e000 0000000000001000
May 21 15:15:44 smmic1 kernel: 000000102257e000 ffff881028255000 0000000000000000 ffff8810053b7b78
May 21 15:15:44 smmic1 kernel: Call Trace:
May 21 15:15:44 smmic1 kernel: [<ffffffff81603f36>] dump_stack+0x19/0x1b
May 21 15:15:44 smmic1 kernel: [<ffffffff8106e28b>] warn_slowpath_common+0x6b/0xb0
May 21 15:15:44 smmic1 kernel: [<ffffffff8106e3da>] warn_slowpath_null+0x1a/0x20
May 21 15:15:44 smmic1 kernel: [<ffffffffa02e6871>] mic_map+0xf1/0x110 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02e799f>] ? va_gen_init+0x6f/0x90 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02df88d>] ? micscif_rma_ep_init+0xed/0x150 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02c97a3>] ? __scif_open+0x93/0x110 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02d2ed2>] ? scif_fdopen+0x32/0x70 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02b6f68>] ? mic_open+0x48/0x50 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02e698d>] mic_map_single+0xfd/0x160 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02d9a1a>] micscif_setup_qp_connect+0x13a/0x240 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02c8ea0>] scif_conn_func+0x50/0x8c0 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffff8126ecee>] ? selinux_capable+0x2e/0x40
May 21 15:15:44 smmic1 kernel: [<ffffffffa02cafdc>] __scif_connect+0x1fc/0x3c0 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02d3517>] scif_process_ioctl+0x537/0xe60 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02b6fad>] mic_ioctl+0x3d/0x60 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffff811d9a75>] do_vfs_ioctl+0x2e5/0x4c0
May 21 15:15:44 smmic1 kernel: [<ffffffff8126ef4e>] ? file_has_perm+0xae/0xc0
May 21 15:15:44 smmic1 kernel: [<ffffffff811d9cf1>] SyS_ioctl+0xa1/0xc0
May 21 15:15:44 smmic1 kernel: [<ffffffff81613da9>] system_call_fastpath+0x16/0x1b
May 21 15:15:44 smmic1 kernel: ---[ end trace 2eb53c750e757833 ]---
May 21 15:15:44 smmic1 kernel: mic_map failed board id 0 addr 0x0000102257e000 size 0x00000000001000
May 21 15:15:44 smmic1 kernel: micscif_setup_qp_connect 159 error -12
May 21 15:15:44 smmic1 kernel: scif_conn_func err -12 qp_offset 0x0
May 21 15:15:44 smmic1 kernel: micscif_dec_node_refcnt 158 dec dev ffffffffa0301210 node 1 ref -9223372036854775806 caller ffffffffa02d2f38 Lost Node??
May 21 15:15:44 smmic1 kernel: ------------[ cut here ]------------
May 21 15:15:44 smmic1 kernel: WARNING: at /home/build/rpmbuild/BUILD/mpss-modules-3.5/micscif/micscif_smpt.c:392 mic_map+0xf1/0x110 [mic]()
May 21 15:15:44 smmic1 kernel: Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd sunrpc fscache intel_powerclamp coretemp intel_rapl kvm crct10dif_pclmul pcspkr crc32_pclmul i2c_i801 crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper sb_edac ablk_helper cryptd iTCO_wdt iTCO_vendor_support edac_core lpc_ich mfd_core wmi ipmi_devintf ipmi_si ipmi_msghandler acpi_power_meter ioatdma mei_me acpi_pad mei shpchp mic(OF) binfmt_misc xfs libcrc32c raid1 raid0 sd_mod crc_t10dif crct10dif_common ast syscopyarea sysfillrect sysimgblt drm_kms_helper ttm drm ahci libahci igb libata ptp pps_core dca i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod
May 21 15:15:44 smmic1 kernel: CPU: 3 PID: 3799 Comm: micinfo Tainted: GF W IO-------------- 3.10.0-229.el7.x86_64 #1
May 21 15:15:44 smmic1 kernel: Hardware name: Supermicro SYS-5018GR-T/X10SRG-F, BIOS 1.0 10/21/2014
May 21 15:15:44 smmic1 kernel: 0000000000000000 00000000482cfbb1 ffff8810053b7b30 ffffffff81603f36
May 21 15:15:44 smmic1 kernel: ffff8810053b7b68 ffffffff8106e28b 0000000022578000 0000000000001000
May 21 15:15:44 smmic1 kernel: 0000001022578000 ffff881028255000 0000000000000000 ffff8810053b7b78
May 21 15:15:44 smmic1 kernel: Call Trace:
May 21 15:15:44 smmic1 kernel: [<ffffffff81603f36>] dump_stack+0x19/0x1b
May 21 15:15:44 smmic1 kernel: [<ffffffff8106e28b>] warn_slowpath_common+0x6b/0xb0
May 21 15:15:44 smmic1 kernel: [<ffffffff8106e3da>] warn_slowpath_null+0x1a/0x20
May 21 15:15:44 smmic1 kernel: [<ffffffffa02e6871>] mic_map+0xf1/0x110 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02e799f>] ? va_gen_init+0x6f/0x90 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02df88d>] ? micscif_rma_ep_init+0xed/0x150 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02c97a3>] ? __scif_open+0x93/0x110 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02d2ed2>] ? scif_fdopen+0x32/0x70 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02b6f68>] ? mic_open+0x48/0x50 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02e698d>] mic_map_single+0xfd/0x160 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02d9a1a>] micscif_setup_qp_connect+0x13a/0x240 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02c8ea0>] scif_conn_func+0x50/0x8c0 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffff8126ecee>] ? selinux_capable+0x2e/0x40
May 21 15:15:44 smmic1 kernel: [<ffffffffa02cafdc>] __scif_connect+0x1fc/0x3c0 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02d3517>] scif_process_ioctl+0x537/0xe60 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffffa02b6fad>] mic_ioctl+0x3d/0x60 [mic]
May 21 15:15:44 smmic1 kernel: [<ffffffff811d9a75>] do_vfs_ioctl+0x2e5/0x4c0
May 21 15:15:44 smmic1 kernel: [<ffffffff8126ef4e>] ? file_has_perm+0xae/0xc0
May 21 15:15:44 smmic1 kernel: [<ffffffff811d9cf1>] SyS_ioctl+0xa1/0xc0
May 21 15:15:44 smmic1 kernel: [<ffffffff81613da9>] system_call_fastpath+0x16/0x1b
May 21 15:15:44 smmic1 kernel: ---[ end trace 2eb53c750e757834 ]---
May 21 15:15:44 smmic1 kernel: mic_map failed board id 0 addr 0x00001022578000 size 0x00000000001000
May 21 15:15:44 smmic1 kernel: micscif_setup_qp_connect 159 error -12
May 21 15:15:44 smmic1 kernel: scif_conn_func err -12 qp_offset 0x0
May 21 15:15:44 smmic1 kernel: micscif_dec_node_refcnt 158 dec dev ffffffffa0301210 node 1 ref -9223372036854775805 caller ffffffffa02d2f38 Lost Node??
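The negative numbers in the trace ("error -12", "status -22") look like standard Linux errno values. Below is a minimal sketch for decoding them, assuming the mic driver follows the usual kernel convention of returning negated errno codes (I have not confirmed this against the driver source). The helper name describe_errno is my own, not part of MPSS.

```python
import errno
import os


def describe_errno(status: int) -> str:
    """Map a negative kernel-style status (e.g. -12) to its errno name and message.

    Assumption: the status is a negated errno, as is conventional in the Linux kernel.
    """
    code = abs(status)
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({os.strerror(code)})"


if __name__ == "__main__":
    # The two status values that appear in the kernel log above.
    for status in (-12, -22):
        print(status, describe_errno(status))
```

On Linux, 12 is ENOMEM and 22 is EINVAL, so under that assumption mic_map is failing with an out-of-memory style error and the crash dump with an invalid-argument error.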
/var/log/mpssd:
Thu May 21 15:11:46 2015: MPSS Daemon start
Thu May 21 15:11:47 2015: mic1: Command line: quiet root=ramfs console=hvc0 cgroup_disable=memory highres=off
Thu May 21 15:11:47 2015: mic0: Command line: quiet root=ramfs console=hvc0 cgroup_disable=memory highres=off
Thu May 21 15:11:47 2015: mic1: Debug log buffer addr ffffffff818a3320 len @ ffffffff81724cc0
Thu May 21 15:11:47 2015: mic1: Generate /var/mpss/mic1.image.gz
Thu May 21 15:11:47 2015: mic0: Debug log buffer addr ffffffff818a3320 len @ ffffffff81724cc0
Thu May 21 15:11:47 2015: mic0: Generate /var/mpss/mic0.image.gz
Thu May 21 15:11:50 2015: mic0: State ready -> booting
Thu May 21 15:11:50 2015: mic0: Booting /usr/share/mpss/boot/bzImage-knightscorner initrd /var/mpss/mic0.image.gz
Thu May 21 15:11:52 2015: mic1: State ready -> booting
Thu May 21 15:11:52 2015: mic1: Booting /usr/share/mpss/boot/bzImage-knightscorner initrd /var/mpss/mic1.image.gz
Thu May 21 15:12:15 2015: mic1: Monitor connection established
Thu May 21 15:12:16 2015: mic0: Monitor connection established
Thu May 21 15:12:16 2015: mic1: State booting -> online
Thu May 21 15:12:17 2015: mic0: State booting -> online
Thu May 21 15:15:44 2015: mic0: State online -> lost
Thu May 21 15:15:44 2015: mic0: [SaveCrashdump] Aborted - open /proc/mic_vmcore/mic0 failed: No such file or directory
Thu May 21 15:16:14 2015: mic1: State online -> lost
Thu May 21 15:16:14 2015: mic1: [SaveCrashdump] Aborted - open /proc/mic_vmcore/mic1 failed: No such file or directory
Thu May 21 15:17:29 2015: mic0: State lost -> resetting
Thu May 21 15:17:29 2015: mic0: [SaveCrashDump] Waiting for reset
Thu May 21 15:17:31 2015: mic0: [SaveCrashDump] Waiting for reset
Thu May 21 15:17:31 2015: mic0: State resetting -> reset failed
Thu May 21 15:17:33 2015: mic0: [SaveCrashDump] Failed to reset card. Aborting reboot
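
To make the timeline easier to follow, here is a rough sketch that pulls only the card state transitions out of the mpssd log. It assumes the file on disk has one entry per line in the "timestamp: micN: State old -> new" shape seen above. The regular expression and the function name state_transitions are my own guesses at that format, not anything documented by MPSS.

```python
import re

# Assumed line shape: "Thu May 21 15:12:16 2015: mic1: State booting -> online"
TRANSITION = re.compile(
    r"^(?P<when>.+? \d{4}): (?P<card>mic\d+): State (?P<old>.+) -> (?P<new>.+)$"
)


def state_transitions(lines):
    """Yield (timestamp, card, old_state, new_state) for each state-change line."""
    for line in lines:
        match = TRANSITION.match(line.strip())
        if match:
            yield match["when"], match["card"], match["old"], match["new"]


if __name__ == "__main__":
    # A few entries copied from the log above; other entry types are skipped.
    sample = [
        "Thu May 21 15:12:17 2015: mic0: State booting -> online",
        "Thu May 21 15:15:44 2015: mic0: State online -> lost",
        "Thu May 21 15:17:29 2015: mic0: [SaveCrashDump] Waiting for reset",
        "Thu May 21 15:17:31 2015: mic0: State resetting -> reset failed",
    ]
    for when, card, old, new in state_transitions(sample):
        print(f"{when}  {card}: {old} -> {new}")
```

Read this way, both cards reach online about 30 seconds after the daemon starts, mic0 drops to lost roughly three and a half minutes later, mic1 follows 30 seconds after that, and the reset of mic0 fails.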