Hi,
Has anyone successfully built R to run natively on Phi?
Thanks,
George
I found in past topics that _mm512_unpacklo_* is not supported on the Phi. In my own implementation, _mm512_permute* and _mm512_shuffle* also appear to be unsupported. So far, all the matrix transpose implementations in past posts seem to use the _mm512_swizzle* and _mm512_blend* instructions. However, using these two operations requires twice as much element movement, which seems inefficient. Are there any other options for doing a matrix transpose?
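For reference, here is the swizzle+blend transpose pattern sketched in plain C (these are not KNC intrinsics; `swizzle_badc`/`swizzle_cdab`/`blend` are my scalar stand-ins modeled on what _mm512_swizzle* with _MM_SWIZ_REG_BADC/_MM_SWIZ_REG_CDAB and a masked blend would do on 4-element lanes). It also makes the doubled element movement visible: every element is copied in both butterfly stages.

```c
/* Conceptual sketch of the two-stage swizzle+blend 4x4 transpose.
   "swizzle" = fixed in-register element permutation (like the KNC
   _MM_SWIZ_REG_BADC / _MM_SWIZ_REG_CDAB patterns); "blend" = per-element
   select between two registers under a bit mask. */

typedef struct { float e[4]; } vec4;      /* stand-in for one SIMD lane group */

static vec4 swizzle_badc(vec4 v) {        /* swap adjacent pairs: b a d c */
    vec4 r = { { v.e[1], v.e[0], v.e[3], v.e[2] } };
    return r;
}
static vec4 swizzle_cdab(vec4 v) {        /* swap halves: c d a b */
    vec4 r = { { v.e[2], v.e[3], v.e[0], v.e[1] } };
    return r;
}
static vec4 blend(unsigned mask, vec4 a, vec4 b) { /* bit i set -> take b[i] */
    vec4 r;
    for (int i = 0; i < 4; i++)
        r.e[i] = ((mask >> i) & 1) ? b.e[i] : a.e[i];
    return r;
}

/* Transpose rows a,b,c,d in place; each element moves in BOTH stages,
   which is the 2x data movement mentioned above. */
void transpose4(vec4 m[4]) {
    /* stage 1: interleave row pairs 0/1 and 2/3 */
    vec4 t0 = blend(0xA, m[0], swizzle_badc(m[1])); /* a0 b0 a2 b2 */
    vec4 t1 = blend(0x5, m[1], swizzle_badc(m[0])); /* a1 b1 a3 b3 */
    vec4 t2 = blend(0xA, m[2], swizzle_badc(m[3])); /* c0 d0 c2 d2 */
    vec4 t3 = blend(0x5, m[3], swizzle_badc(m[2])); /* c1 d1 c3 d3 */
    /* stage 2: exchange halves between the interleaved pairs */
    m[0] = blend(0xC, t0, swizzle_cdab(t2));        /* a0 b0 c0 d0 */
    m[1] = blend(0xC, t1, swizzle_cdab(t3));        /* a1 b1 c1 d1 */
    m[2] = blend(0x3, t2, swizzle_cdab(t0));        /* a2 b2 c2 d2 */
    m[3] = blend(0x3, t3, swizzle_cdab(t1));        /* a3 b3 c3 d3 */
}
```

On the real hardware the same two-stage pattern is applied per 128-bit lane and then across lanes, but the data-movement count is the same: two passes over every element.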
On this page you will find the latest releases of the Intel(R) Manycore Platform Software Stack (MPSS) Long Term Support (LTS) product. The most recent release is found here: http://software.intel.com/en-us/articles/intel-many-integrated-core-architecture-intel-mic-architecture-platform-software-stack and we recommend customers use the latest release wherever possible.
MPSS version | Downloads available | Size (range) | MD5 Checksum |
---|---|---|---|
mpss-3.4.4 (released: June 2, 2015) | Linux (mpss-3.4.4-linux.tar) for RedHat 6.3, RedHat 6.4, RedHat 6.5, RedHat 6.6, RedHat 7.0, SuSE SLES11 SP2, SuSE SLES11 SP3, SuSE SLES12 | ~420MB | 603fee578662bd83ac78cb0293c0b4df |
| Software for Coprocessor OS (k1om) (mpss-3.4.4-k1om.tar) | ~700MB | 42c2eba4d727991e4e8f99dababeba63 |
| SOURCE (mpss-src-3.4.4.tar) | ~270MB | 0030c519e7740ad9d8552aa8bedc4e94 |
| Download Cache (mpss-downloadcache-3.4.4.tar) | ~1.1GB | 47031c23014ce5a0f43ff093ad42251d |
Documentation link | Description | Last Updated On | Size (approx) |
---|---|---|---|
releaseNotes-linux.txt | English - Release Notes | June 2015 | ~54KB |
readme.txt | Readme (includes installation instructions) for Linux (English) | June 2015 | ~20KB |
MPSS_Users_Guide.pdf | Complete Users Guide for MPSS for Linux (English) | June 2015 | ~2MB |
SCIF_UserGuide.pdf | SCIF User guide | June 2015 | ~700KB |
license.txt | INTEL SOFTWARE LICENSE AGREEMENT for Intel® Manycore Platform Software Stack (Intel® MPSS) | June 2015 | ~30KB |
MPSS version | Downloads available | Size (range) | MD5 Checksum |
---|---|---|---|
mpss-3.4.3 (released: February 20, 2015) | Linux (mpss-3.4.3-linux.tar) for RedHat 6.3, RedHat 6.4, RedHat 6.5, RedHat 6.6, RedHat 7.0, SuSE SLES11 SP2, SuSE SLES11 SP3, SuSE SLES12 | ~400MB | fa960e90045a1ab16e1b68920030233c |
| Software for Coprocessor OS (k1om) (mpss-3.4.3-k1om.tar) | ~700MB | 85b4f4b6873a8ec21cc9e1d6d95cec04 |
| SOURCE (mpss-src-3.4.3.tar) | ~270MB | 1fdd717f025ee6c6c999f991e76dde9f |
| Download Cache (mpss-downloadcache-3.4.3.tar) | ~1.1GB | 1ec83289d06ec8c12dea80f7a5482034 |
Documentation link | Description | Last Updated On | Size (approx) |
---|---|---|---|
releaseNotes-linux.txt | English - Release Notes | February 2015 | ~62KB |
readme.txt | Readme (includes installation instructions) for Linux (English) | February 2015 | ~20KB |
MPSS_Users_Guide.pdf | Complete Users Guide for MPSS for Linux (English) | February 2015 | ~2MB |
SCIF_UserGuide.pdf | SCIF User guide | February 2015 | ~700KB |
license.txt | INTEL SOFTWARE LICENSE AGREEMENT for Intel® Manycore Platform Software Stack (Intel® MPSS) | February 2015 | ~30KB |
MPSS version | Downloads available | Size | MD5 Checksum |
---|---|---|---|
mpss-3.4.3-windows.zip (released: February 20, 2015) | Microsoft* Windows | ~310MB | 588c1431fa0803f5b478aa771703efa2 |
| Software for Coprocessor OS (k1om) (mpss-3.4.3-k1om.tar) | ~700MB | 85b4f4b6873a8ec21cc9e1d6d95cec04 |
Documentation link | Description | Last Updated On | Size |
---|---|---|---|
releaseNotes-windows.txt | English - release notes | February 2015 | ~25KB |
readme-windows.pdf | English (includes installation instructions) for Microsoft* Windows | February 2015 | ~550KB |
MPSS_Users_Guide-windows.pdf | User, Cluster and Advanced Configuration Guide for MPSS | February 2015 | ~2MB |
MPSS version | Downloads available | Size (range) | MD5 Checksum |
---|---|---|---|
mpss-3.4.2 (released: December 3, 2014) | Linux (mpss-3.4.2-linux.tar) for RedHat 6.3, RedHat 6.4, RedHat 6.5, RedHat 6.6, RedHat 7.0, SuSE SLES11 SP2, SuSE SLES11 SP3 | ~400MB | 40896e317418fd20a758fd7ce2408aac |
| Software for Coprocessor OS (k1om) (mpss-3.4.2-k1om.tar) | ~700MB | 27004c1423bb3e29010de2284577d024 |
| SOURCE (mpss-src-3.4.2.tar) | ~270MB | b5031821ac8d4faaf12b4fbb1728e97a |
| Download Cache (mpss-downloadcache-3.4.2.tar) | ~1.1GB | 4d937079b4ef2a8eef821e12f2e61ebd |
Documentation link | Description | Last Updated On | Size (approx) |
---|---|---|---|
releaseNotes-linux.txt | English - Release Notes | December 2014 | ~75KB |
readme.txt | Readme (includes installation instructions) for Linux (English) | December 2014 | ~20KB |
MPSS_Users_Guide.pdf | Complete Users Guide for MPSS for Linux (English) | December 2014 | ~2MB |
SCIF_UserGuide.pdf | SCIF User guide | December 2014 | ~700KB |
license.txt | INTEL SOFTWARE LICENSE AGREEMENT for Intel® Manycore Platform Software Stack (Intel® MPSS) | September 2013 | ~30KB |
MPSS version | Downloads available | Size | MD5 Checksum |
---|---|---|---|
mpss-3.4.2-windows.zip (released: December 3, 2014) | Microsoft* Windows | ~310MB | 64b2bb347ce870098b2e8dafa10e5d67 |
| Software for Coprocessor OS (k1om) (mpss-3.4.2-k1om.tar) | ~700MB | 27004c1423bb3e29010de2284577d024 |
Documentation link | Description | Last Updated On | Size |
---|---|---|---|
releaseNotes-windows.txt | English - release notes | December 2014 | ~30KB |
readme-windows.pdf | English (includes installation instructions) for Microsoft* Windows | December 2014 | ~620KB |
MPSS_Users_Guide-windows.pdf | User, Cluster and Advanced Configuration Guide for MPSS | December 2014 | ~2MB |
MPSS version | Downloads available | Size (range) | MD5 Checksum |
---|---|---|---|
mpss-3.4.1 (released: October 22 2014) | Linux (mpss-3.4.1-linux.tar) for RedHat 6.3, RedHat 6.4, RedHat 6.5, RedHat 6.6, RedHat 7.0, SuSE SLES11 SP2, SuSE SLES11 SP3 | ~400MB | e985afee031baf542090883d3752fcfa |
| Software for Coprocessor OS (k1om) (mpss-3.4.1-k1om.tar) | ~700MB | 23d3db962c2abc659945598aa6793374 |
| SOURCE (mpss-src-3.4.1.tar) | ~270MB | 73ecb48cf74bd815ae8c3753868c80d8 |
| Download Cache (mpss-downloadcache-3.4.1.tar) | ~1.1GB | 3bdc15046dbd4b23a58cb1684d73e05f |
Documentation link | Description | Last Updated On | Size (approx) |
---|---|---|---|
releasenotes-linux.txt | English - Release Notes | October 2014 | ~75KB |
readme.txt | Readme (includes installation instructions) for Linux (English) | October 2014 | ~20KB |
MPSS_Users_Guide.pdf | Complete Users Guide for MPSS for Linux (English) | October 2014 | ~2MB |
SCIF_UserGuide.pdf | SCIF User guide | October 2014 | ~700KB |
license.txt | INTEL SOFTWARE LICENSE AGREEMENT for Intel® Manycore Platform Software Stack (Intel® MPSS) | September 2013 | ~30KB |
MPSS version | Downloads available | Size | MD5 Checksum |
---|---|---|---|
mpss-3.4.1-windows.zip (released: October 22 2014) | Microsoft* Windows | ~310MB | 27b8c2ced28569b58c9d00255bc3219f |
| Software for Coprocessor OS (k1om) (mpss-3.4.1-k1om.tar) | ~700MB | 23d3db962c2abc659945598aa6793374 |
Documentation link | Description | Last Updated On | Size |
---|---|---|---|
releaseNotes-windows.txt | English - release notes | October 2014 | ~30KB |
readme-windows.pdf | English (includes installation instructions) for Microsoft* Windows | October 2014 | ~620KB |
MPSS_Users_Guide-windows.pdf | User, Cluster and Advanced Configuration Guide for MPSS | October 2014 | ~2MB |
Hello, everyone. I've been lurking on the forums for a few days now while I schemed up a cooling solution for my shiny new 31S1P.
I'm pretty sure I've conquered the cooling requirements. Check!
However, I cannot get the card to work correctly. I'm using a Z97-WS motherboard with "4G Decoding" enabled in the BIOS settings. The CPU is a Celeron G1820 which is a cheap little lga1150 socket CPU that seemed to be enough for this rig. I'm running the latest BIOS (2403, I believe from 2015-06-18 or thereabouts), latest version of CentOS 7.1, which is 7.1.1503 (Core).
I've followed all of the advice and forums I could find online about this issue, to little avail. Here is a piece of my console log showing the relevant information I am likely to be asked to provide if I don't do it here:
----------------------------------------------------------------------------------------
[root@x mpss-3.5.2]# dmesg | grep MSI
[ 0.102438] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
[ 0.408378] pcieport 0000:00:01.0: irq 40 for MSI/MSI-X
[ 0.408786] pcieport 0000:01:00.0: irq 41 for MSI/MSI-X
[ 0.408881] pcieport 0000:02:08.0: irq 42 for MSI/MSI-X
[ 0.408972] pcieport 0000:02:10.0: irq 43 for MSI/MSI-X
[ 0.409070] pcieport 0000:06:00.0: irq 44 for MSI/MSI-X
[ 0.409184] pcieport 0000:07:01.0: irq 45 for MSI/MSI-X
[ 0.409349] pcieport 0000:07:02.0: irq 46 for MSI/MSI-X
[ 0.409465] pcieport 0000:07:03.0: irq 47 for MSI/MSI-X
[ 0.409579] pcieport 0000:07:04.0: irq 48 for MSI/MSI-X
[ 0.409692] pcieport 0000:07:05.0: irq 49 for MSI/MSI-X
[ 0.409808] pcieport 0000:07:06.0: irq 50 for MSI/MSI-X
[ 0.409920] pcieport 0000:07:07.0: irq 51 for MSI/MSI-X
[ 0.452551] xhci_hcd 0000:00:14.0: irq 52 for MSI/MSI-X
[ 0.518593] xhci_hcd 0000:10:00.0: irq 53 for MSI/MSI-X
[ 0.518597] xhci_hcd 0000:10:00.0: irq 54 for MSI/MSI-X
[ 0.518600] xhci_hcd 0000:10:00.0: irq 55 for MSI/MSI-X
[ 0.710232] e1000e 0000:00:19.0: irq 56 for MSI/MSI-X
[ 0.825566] igb 0000:0d:00.0: irq 57 for MSI/MSI-X
[ 0.825570] igb 0000:0d:00.0: irq 58 for MSI/MSI-X
[ 0.825573] igb 0000:0d:00.0: irq 59 for MSI/MSI-X
[ 0.825577] igb 0000:0d:00.0: irq 60 for MSI/MSI-X
[ 0.825581] igb 0000:0d:00.0: irq 61 for MSI/MSI-X
[ 0.855040] igb 0000:0d:00.0: Using MSI-X interrupts. 2 rx queue(s), 2 tx queue(s)
[ 0.984604] i915 0000:00:02.0: irq 62 for MSI/MSI-X
[ 1.187177] ahci 0000:00:1f.2: irq 63 for MSI/MSI-X
[ 1.189283] ahci 0000:0a:00.0: irq 64 for MSI/MSI-X
[ 1.190251] ahci 0000:0f:00.0: irq 65 for MSI/MSI-X
[ 12.487762] mei_me 0000:00:16.0: irq 66 for MSI/MSI-X
[ 12.702815] snd_hda_intel 0000:00:03.0: irq 67 for MSI/MSI-X
[ 12.702983] snd_hda_intel 0000:00:1b.0: irq 68 for MSI/MSI-X
[root@x mpss-3.5.2]# lspci | grep -i coproc
03:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor 31S1 (rev 11)
[root@x mpss-3.5.2]# lspci -s 03:00.0 -vv
03:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor 31S1 (rev 11)
Subsystem: Intel Corporation Device 2500
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 255
Region 0: Memory at <unassigned> (64-bit, prefetchable) [disabled] [size=8G]
Region 4: Memory at bf200000 (64-bit, non-prefetchable) [disabled] [size=128K]
Capabilities: [44] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [4c] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <4us, L1 unlimited
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [88] MSI: Enable- Count=1/16 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [98] MSI-X: Enable- Count=16 Masked-
Vector table: BAR=4 offset=00017000
PBA: BAR=4 offset=00018000
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
[root@x mpss-3.5.2]# dmesg | grep mic
[ 0.000000] CPU0 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.061159] CPU1 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.068898] atomic64 test passed for x86-64 platform with CX8 and with SSE
[ 0.089803] ACPI: Dynamic OEM Table Load:
[ 0.091965] ACPI: Dynamic OEM Table Load:
[ 0.093790] ACPI: Dynamic OEM Table Load:
[ 0.387895] microcode: CPU0 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.387899] microcode: CPU1 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.387920] microcode: Microcode Update Driver: v2.00 <tigran@aivazian.fsnet.co.uk>, Peter Oruba
[ 0.526732] mousedev: PS/2 mouse device common for all mice
[ 0.710216] e1000e 0000:00:19.0: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
[ 3.071226] usb 5-2: ep 0x81 - rounding interval to 1024 microframes, ep desc says 2040 microframes
[ 3.439111] usb 5-2.1: ep 0x81 - rounding interval to 64 microframes, ep desc says 80 microframes
[ 3.439113] usb 5-2.1: ep 0x82 - rounding interval to 1024 microframes, ep desc says 2040 microframes
[root@x mpss-3.5.2]# micinfo
MicInfo Utility Log
Created Mon Aug 17 04:01:04 2015
System Info
HOST OS : Linux
OS Version : 3.10.0-229.el7.x86_64
Driver Version : NotAvailable
MPSS Version : 3.5.2
Host Physical Memory : 16141 MB
micinfo: No devices found : host driver is not loaded: No such file or directory
[root@x j]# depmod
[root@x j]# modprobe mic
modprobe: FATAL: Module mic not found.
[root@x j]# service mpss start
Starting mpss (via systemctl): [ OK ]
[root@x j]# micctrl -s
[Error] micrasrelmond: State failed - non existent MIC device
[root@x j]#
----------------------------------------------------------------------------------------
If I can get it to show up in a dmesg | grep mic output again, I'll post it here; I've gotten that output to vary a little.
I don't have a special BIOS from ASUS, but as I said above, it is the latest available and it only came out a few weeks ago. Could this be an instance of what Frances was talking about here? https://software.intel.com/en-us/forums/topic/538897#comment-1811230
In other words, MSI-X doesn't appear to be operative for my 31S1P. I cannot for the life of me figure out how to force it to be enabled. Is this going to require recompiling my kernel?
If anyone has any ideas, I'm all ears/eyes.
Thanks!
Hello,
Could you please take a look at this problem? My machine has 16 CPUs and 4 MICs (47 cores each), and I run my program with 8 MPI processes (mpi_comm_size = 8); I want to use MKL routines in automatic offload (AO) mode. As you can see in the attached test code, I tried three different methods.
METHOD-1: I assign one of the 4 MICs to each of the first 4 CPUs and let the other CPUs run without a MIC. In this case the program works as expected, and I got the following performance result when solving zgemm for 5k*5k complex dense matrices.
CPU_ID 0 1 2 3 4 5 6 7
time(s) 1.67 1.93 1.97 1.93 13.85 12.94 12.94 12.93
METHOD-2: Now, this is the problematic case. I want all 8 CPUs to share the 4 MICs equally, expecting each CPU to take about 4 seconds for the same zgemm problem as in METHOD-1. However, this method does not work: it produces error messages either right away or after solving the first zgemm problem,
*** glibc detected *** ../../../bin/test: malloc(): memory corruption: 0x00007f59fc000010 ***
or
CPU_ID 0 1 2 3 4 5 6 7
time(s) 101 10 101 95 26 25 14 14
*** glibc detected *** ../../../bin/test: free(): corrupted unsorted chunks: 0x0000000009f47270 ***
METHOD-3: If I replace mkl_mic_set_workdivision() with mkl_mic_set_resource_limit(), the program does not crash, but there is no response at all; I see that CPU and MIC usage are almost zero.
Please take a look at the attached piece of my code and give some advice.
Thank you.
Hello,
I am having a really hard time figuring out how to use the Xeon Phi offload mode from within MATLAB MEX files under Linux. I have managed to force MATLAB to use icc for compilation and verified that the MEX files run fine. The problems start when using the offload pragma: as far as I can tell, nobody has tried this yet, and I suspect it is some (fixable?) issue with libraries. Can someone here help me with this?
Consider the following simple code
int main() {
    __attribute__((target(mic:0))) int vsize;
    #pragma offload target(mic:0)
    vsize = 10;
}
When I execute this with OFFLOAD_REPORT=3, I get the following output
$ ./test
[Offload] [HOST] [State] Initialize logical card 0 = physical card 0
[Offload] [MIC 0] [File] test.c
[Offload] [MIC 0] [Line] 23
[Offload] [MIC 0] [Tag] Tag 0
[Offload] [HOST] [Tag 0] [State] Start target
[Offload] [HOST] [Tag 0] [State] Setup target entry: __offload_entry_test_c_23mainicc0101288930704RqbsVt
[Offload] [HOST] [Tag 0] [State] Host->target pointer data 0
[Offload] [HOST] [Tag 0] [Signal] signal : none
[Offload] [HOST] [Tag 0] [Signal] waits : none
[Offload] [HOST] [Tag 0] [State] Host->target pointer data 0
[Offload] [HOST] [Tag 0] [State] Host->target copyin data 4
[Offload] [HOST] [Tag 0] [State] Execute task on target
[Offload] [HOST] [Tag 0] [State] Target->host pointer data 0
[Offload] [MIC 0] [Tag 0] [State] Start target entry: __offload_entry_test_c_23mainicc0101288930704RqbsVt
[Offload] [MIC 0] [Tag 0] [Var] vsize INOUT
[Offload] [HOST] [Tag 0] [CPU Time] 0.301827(seconds)
[Offload] [MIC 0] [Tag 0] [CPU->MIC Data] 4 (bytes)
[Offload] [MIC 0] [Tag 0] [MIC Time] 0.000171(seconds)
[Offload] [MIC 0] [Tag 0] [MIC->CPU Data] 4 (bytes)
[Offload] [MIC 0] [Tag 0] [State] Target->host copyout data 4
I have written a MEX file that does the same thing. FYI, a MEX file is essentially a dynamic .so library with one specific symbol exported. The result of running the MEX file under MATLAB is as follows
>> mictest
[Offload] [HOST] [State] Initialize logical card 0 = physical card 0
offload error: cannot load library to the device 0 (error code 5)
------------------------------------------------------------------------
Segmentation violation detected at Fri Aug 21 14:57:31 2015
------------------------------------------------------------------------
[...]
I have looked around and tried to set the OFFLOAD_INIT=on_start variable before starting MATLAB. The results were VERY promising, but still some problems remain unsolved:
[Offload] [MIC 0] [File] mictest_mex.c
[Offload] [MIC 0] [Line] 41
[Offload] [MIC 0] [Tag] Tag 0
[Offload] [HOST] [Tag 0] [State] Start target
[Offload] [HOST] [Tag 0] [State] Setup target entry: __offload_entry_mictest_mex_c_41mexFunctionicc0104735023118W8NJ2
[Offload] [HOST] [Tag 0] [State] Host->target pointer data 0
[Offload] [HOST] [Tag 0] [Signal] signal : none
[Offload] [HOST] [Tag 0] [Signal] waits : none
[Offload] [HOST] [Tag 0] [State] Host->target pointer data 0
[Offload] [HOST] [Tag 0] [State] Host->target copyin data 4
[Offload] [HOST] [Tag 0] [State] Execute task on target
offload error: cannot create pipeline on the device 0 (error code 14)
So it seems that the MIC is indeed doing something, but one last step is missing to make this work. The library paths and the whole bash environment are the same in both cases. I have also looked at the output of the nm command, and it seems that in both cases (C standalone and MATLAB MEX) the number and names of symbols containing the word 'offload' are the same or similar.
I think this can be solved: I have seen a document about MKL using offload inside MATLAB, alas for Windows. Does anybody have a clue where to start?
Thanks a lot!
Marcin Krotkiewski
Hi Intel forums,
I've had difficulty reproducing the performance reported on the following page:
https://www-ssl.intel.com/content/www/us/en/benchmarks/server/xeon-phi/xeon-phi-sgemm-dgemm.html
Using the mkl sgemm routine on my 3120 series Xeon Phi, I haven't even approached the 1.7 TFLOP/S level claimed above. The best performance I achieve is ~0.7 TFLOP/S. Presumably, this is because I don't fully understand the threading and vectorization APIs, and I'm not using them optimally. I was wondering if anyone knows where to find the source & environment details used for Intel's official benchmark. Maybe I could compare "correct" usage with my code to better understand the tools.
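For what it's worth, the arithmetic behind those figures is simple: sgemm computes C = alpha*A*B + beta*C, which for n x n matrices is about 2n^3 floating-point operations, so the achieved rate is 2n^3 divided by wall time. A small plain-C helper (the sizes and times in the comment are my own illustrative numbers, not Intel's benchmark configuration):

```c
/* Effective GFLOP/S for a square sgemm: C = alpha*A*B + beta*C on n x n
   matrices performs roughly 2*n^3 floating-point operations
   (n^3 multiplies plus n^3 adds). */
double sgemm_gflops(long n, double seconds) {
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / seconds / 1e9;
}
/* Illustration: a 10000x10000 sgemm finishing in ~1.2 s would be
   ~1.7 TFLOP/S; the same run taking ~2.9 s is the ~0.7 TFLOP/S range. */
```

This at least makes it easy to check whether a given runtime is even in the right ballpark before digging into affinity and threading settings.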
Thanks,
Chris
Hello,
Could you please provide us with a matrix of which OFED should be used with which OS distribution for the latest MPSS (3.5.2, Linux)? We need to know in which cases MPSS supports the bundled OFED and in which cases it is better to use an alternative OFED distribution.
Best regards,
Taras
Hello,
I have been working with the Knights Corner platform for some time. As is done with libnuma and DPDK, I have been wondering if I could write cache- and memory-controller-aware memory allocation code for Xeon Phi. Last time I asked, I didn't get much information on the subject (https://software.intel.com/en-us/comment/1799811#comment-1799811), but then I came across this while browsing through the datasheet (http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/x...).
"Communication around the ring follows a Shortest Distance Algorithm (SDA). Coresident with each core structure is a portion of a distributed tag directory. These tags are hashed to distribute workloads across the enabled cores. Physical addresses are also hashed to distribute memory accesses across the memory controllers."
I believe a full description of this scheme is the answer I'm seeking.
If someone from Intel could provide a more detailed explanation, or point me to where I could find one, I would be very grateful. More honestly, I NEED to know this.
For instance,
a) How does it hash physical addresses? Does it divide the 40-bit physical address space by the cache-line size (64B) and distribute the 0x400000000 (2^34) cache lines across the DTDs by performing a modulo operation on the ordinal number of each cache line?
b) Is the L2 address-space segmentation in PA space somewhat preserved in VA space as well? For instance, would every 60th cache line belong to a specific core's L2 tag directory?
c) Which of the following does "enabled cores" mean: i) all on-board cores, ii) cores with any executing threads, or iii) cores not disabled by some means I'm not aware of? If iii) is the case, how do you disable a core?
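To make question a) concrete, here is the scheme I am hypothesizing, in plain C. The modulo hash and the core count are my assumptions, not anything documented in the datasheet:

```c
#include <stdint.h>

/* HYPOTHETICAL mapping, not Intel's documented hash: question a) asks
   whether the distributed tag directory (DTD) home of a physical address
   is simply the cache-line ordinal modulo the number of enabled cores. */
enum {
    CACHE_LINE    = 64,  /* bytes per cache line */
    ENABLED_CORES = 60   /* assumed: a Xeon Phi 5110P has 60 cores */
};

uint32_t dtd_home(uint64_t paddr) {
    uint64_t line = paddr / CACHE_LINE;          /* strip 6 offset bits */
    return (uint32_t)(line % ENABLED_CORES);     /* round-robin over cores */
}
```

If the real hash is something else (for instance an XOR-fold over upper address bits), this simple modulo model would of course not match it; that is exactly what I am hoping someone can confirm or correct.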
For your information, I am currently using Xeon Phi 5110P, and could possibly be purchasing/using more 31S1Ps.
Thank you for your attention.
Jun
Intel® Parallel Studio XE 2016, launched on August 25, 2015, is the latest installment in our developer toolkit for high performance computing (HPC) and technical computing applications. This suite of compilers, libraries, debugging facilities, and analysis tools targets Intel® architecture, including support for the latest Intel® Xeon® processors (codenamed Skylake) and Intel® Xeon Phi™ processors (codenamed Knights Landing). Intel® Parallel Studio XE 2016 helps software developers design, build, verify, and tune code in Fortran, C++, C, and Java.
There are four things that I like to highlight when I describe this year's tool release:
Data scientists are finding Intel® DAAL very exciting because it helps speed up big data analytics. It’s designed for use with popular data platforms including Hadoop, Spark, R, and Matlab, for highly efficient data access. We’ve seen Intel DAAL accelerate PCA by 4-7X, and one customer has seen 200X for the Alternating Least Squares prediction algorithm, compared with the latest open source Spark + MLlib (details for both claims are in my blog about DAAL). Intel DAAL was created by the renowned team behind the Intel® Math Kernel Library (Intel® MKL). Intel DAAL can be thought of as “Intel MKL for Big Data” – but it is actually much more! Many more details on Intel DAAL, including ways to download it today for free, are in my blog about DAAL. Intel DAAL is available for Linux, OS X, and Windows.
Vectorization is the process of using SIMD instructions in processors. In the quest to “modernize” applications to get top performance out of any modern processor, a software developer needs to tackle multithreading, vectorization, and fabric scaling. Intel® Advisor XE 2016 provides tools to help with multithreading and vectorization:
Threading Advisor has gained a reputation over the past five years for helping developers find the right approach to multithreading an application more quickly and without costly oversights. The experience of refining this ‘advisor’ helped us create the new advisor for vectorization, drawing on our knowledge of the best ways to give advice based on program analysis.
Vectorization Advisor cannot tell you anything we could not show you how to do yourself. However, when I teach ‘vectorization’ I tend to rattle off a list of things to check, and each item involves using a tool in a particular way. Bringing all of that into one tool makes life easier and definitely makes the process faster and more efficient. One of the key Vectorization Advisor features is a Survey Report that offers integrated compiler report data and performance data all in one place, including GUI-embedded advice on how to fix vectorization issues specific to your code. This page augments that GUI-embedded advice with links to web-based vectorization resources.
An excellent 12-minute introduction to the Vectorization Advisor is available as a video online.
The MPI Performance Snapshot is a scalable lightweight performance tool for MPI applications. It collects a variety of MPI application statistics (such as communication, activity, and load balance) and presents it in an easy-to-read format. The tool is not available separately but is provided as part of the Intel® Parallel Studio XE 2016 Cluster Edition.
The MPI Performance Snapshot is trying to solve the following problems as it relates to analysis of MPI application when scaling out to thousands of ranks:
By addressing these three items, MPI Performance Snapshot improves scaling to at least 32K ranks, an order of magnitude beyond what was tolerable with the prior Intel Trace Analyzer and Collector. For a large-scale run (anything above 1000 MPI ranks), we therefore now recommend starting with the MPI Performance Snapshot to figure out where you need to dig deeper (which processes are slowing you down, where the peaks in your memory usage are, etc.). Then do another run with the Intel Trace Analyzer and Collector on a subset of selected ranks to get more detailed per-process information, in order to visualize how a communication algorithm is implemented and to see whether there are apparent bottlenecks.
MPI Performance Snapshot combines lightweight statistics from the Intel® MPI Library with OS and hardware-level counters to provide you with high-level categorization of your application: MPI vs. OpenMP load imbalance info, memory usage, and a break-down of MPI vs. computation vs. serial time.
For more details, you should check out the full MPI Performance Snapshot User's Guide and Analyzing MPI Applications with MPI Performance Snapshot on the Intel Trace Analyzer and Collector documentation page.
The latest Intel® processors are supported, including the Skylake and Knights Landing microarchitectures.
We take pride in having very strong support for industry standards – we aim to be a leader and maintain our reputation of being second-to-none.
Our Fortran support even includes a feature from the draft Fortran 2015 standard which can help MPI-3 users. The current status of features of Fortran can be found in Dr. Fortran’s blog “Intel® Fortran Compiler - Support for Fortran language standards.”
The current status of C/C++ standard support features can be found in Jennifer’s blogs “C++14 Features Supported by Intel® C++ Compiler” and “C11 Support in Intel C++ Compiler.”
Our OpenMP support is detailed in the latest user guide for the C/C++ compiler and the latest user guide for the Fortran compiler.
Operating system support includes Debian 7.0, 8.0; Fedora 21, 22; Red Hat Enterprise Linux 5, 6, 7; SuSE LINUX Enterprise Server 11, 12; Ubuntu 12.04 LTS (64-bit only), 13.10, 14.04 LTS, 15.04; OS X 10.10; Windows 7 through 10; and Windows Server 2008-2012. These are just the versions we have tested; many additional operating systems should work (for instance, CentOS).
There is a series of webinars being held starting in September 2015 which cover many topics related to Intel Parallel Studio XE 2016. The webinars can be attended live, and offer interactive question and answer time. The webinars will also be available for replay after the live webinar is held. The first webinar is on September 1 – “What’s New in Intel® Parallel Studio XE 2016?”
Many more ways to learn more are on the Intel® Parallel Studio XE 2016 website. A number of benchmarks illustrating performance measurements are online as well.
There are many new features that I did not dive into, including great new support for MPI+OpenMP tuning with Intel VTune Amplifier XE, as well as a number of enhancements to Intel® Threading Building Blocks, including the increasingly popular flow graph capabilities and task arenas.
An evaluation copy can be obtained by requesting an evaluation copy of Intel® Parallel Studio XE 2016. It is available for purchase worldwide.
Students, educators, academic researchers and open source contributors may qualify for some free tools.
The Intel Performance Libraries are also available via the Community Licensing for Intel Performance Libraries. Under this option, the libraries are free for anyone who registers, with no royalties and no restrictions on company or project size. The community licensing program offers the current versions of the libraries without Intel® Premier Support access. (Intel® Premier Support offers exclusive 1-on-1 support via an interactive, secure web site where you can submit questions or problems and monitor previously submitted issues; it requires registration after purchase of the software, or special qualification offered to students, educators, academic researchers, and open source contributors.)
Intel® Parallel Studio XE is a very popular product from Intel that includes the Intel Compilers, Intel Performance Libraries, tools for analysis, debugging and tuning, tools for MPI and the Intel MPI Library. Did you know that some of these are available for free?
Here is a guide to “what is available free” from the Intel Parallel Studio XE suites.
Who | What is Free? | Information | Where? |
---|---|---|---|
Community Licenses for Everyone | Intel® Math Kernel Library, Intel® Data Analytics Acceleration Library, Intel® Threading Building Blocks, Intel® Integrated Performance Primitives | Community Licensing for Intel Performance Libraries – free for all, registration required, no royalties, no restrictions on company or project size, current versions of libraries, no Intel Premier Support access. (Linux, Windows or OS X versions) | Community Licensing for Intel Performance Libraries |
Evaluation Copies for Everyone | Compilers, libraries and analysis tools (most everything!) | Evaluation Copies – Try before you buy. (Linux, Windows or OS X versions) | Try before you buy |
Use as an Academic Researcher | Linux, Windows or OS X versions of: Intel® Math Kernel Library Intel® Data Analytics Acceleration Library Intel® Threading Building Blocks Intel® Integrated Performance Primitives Intel® MPI Library (not available for OS X) | For use in conjunction with academic research at institutions of higher education. (Linux, Windows or OS X versions, except Intel® MPI Library, which is not supported on OS X) | Qualify for Use as an Academic Researcher |
Student | Compilers, libraries and analysis tools (most everything!) | If you are a current student at a degree-granting institution. (Linux, Windows or OS X versions) | Qualify for Use as a Student |
Teacher | Compilers, libraries and analysis tools (most everything!) | If you will use in a teaching curriculum. (Linux, Windows or OS X versions) | Qualify for Use as an Educator |
Use as an Open Source Contributor | Intel® Parallel Studio XE Professional Edition for Linux | If you are a developer actively contributing to open source projects, and that is why you will utilize the tools. (Linux versions) | Qualify for Use as an Open Source Contributor |
Free licenses for certain users have always been an important dimension in our offerings. One thing that really distinguishes Intel is that we sell excellent tools and provide second-to-none support for software developers who buy our tools. We provide multiple options - and we hope you will find exactly what you need in one of our options.
Intel® Xeon Phi™ coprocessor is a product based on the Intel® Many Integrated Core Architecture (Intel® MIC). Intel® offers a debug solution for this architecture that can debug applications running on an Intel® Xeon Phi™ coprocessor.
There are many reasons why a debug solution for Intel® MIC is needed. Some of the most important are the following:
For Linux* host, Intel offers a debug solution for Intel® MIC which is based on GNU* GDB. It can be used on the command line for both host and coprocessor. There is also an Eclipse* IDE integration that eases debugging of applications with hundreds of threads thanks to its user interface. It also supports debugging offload enabled applications.
There are currently two ways to obtain Intel’s debug solution for Intel® MIC Architecture on Linux* host:
Both packages contain debug solutions for Intel® MIC Architecture!
Attention:
Never mix debugging tools from Intel® Parallel Studio XE with the ones from Intel® Manycore Platform Software Stack! Use all tools from the very same package. Different packages might have different debugger versions with different feature sets.
Note:
Intel® Composer XE 2013 SP1 contains GNU* GDB 7.5. Intel® Parallel Studio XE 2015 ships GNU* GDB 7.7, and Intel® Parallel Studio XE 2015 Update 2 ships GNU* GDB 7.8 (host only; 7.7 for the coprocessor). Intel® Parallel Studio XE 2016 contains GNU* GDB 7.8 for both host & coprocessor.
MPSS versions ship different versions of GNU* GDB – please check the Release Notes of the individual MPSS releases.
There has been a change in product naming: Intel® Parallel Studio XE Composer Edition is the successor of Intel® Composer XE, starting with 2015.
The debug solution from Intel provides support for the latest Intel hardware and features!
The command line with GNU* GDB has the following advantages:
Using the Eclipse* IDE provides more features:
Intel’s GNU* GDB, starting with version 7.5, provides additional extensions that are available:
The features for Intel® MIC Architecture highlighted above are described in the following.
Note that newer GNU* GDB versions with more features are already available, but those do not add anything in addition for Intel® MIC Architecture.
Compared to Intel® architecture on host systems, Intel® MIC Architecture comes with a different instruction and register set. Intel’s GNU* GDB comes with transparently integrated support for those. Use is no different than with host systems, e.g.:
(gdb) disassemble $pc, +10
Dump of assembler code from 0x11 to 0x24:
0x0000000000000011 <foobar+17>: vpackstorelps %zmm0,-0x10(%rbp){%k1}
0x0000000000000018 <foobar+24>: vbroadcastss -0x10(%rbp),%zmm0
⁞
(gdb) info registers zmm
k0     0x0    0
⁞
zmm31  {v16_float = {0x0 <repeats 16 times>}, v8_double = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
        v64_int8 = {0x0 <repeats 64 times>}, v32_int16 = {0x0 <repeats 32 times>},
        v16_int32 = {0x0 <repeats 16 times>}, v8_int64 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
        v4_uint128 = {0x0, 0x0, 0x0, 0x0}}
If you use the Eclipse* IDE integration you’ll get the same information in dedicated windows:
A quick excursion about what data races are:
int a = 1;
int b = 2;

/* executed concurrently (time flows downwards): */
int thread1() {            int thread2() {
    return a + b;              b = 42;
}                          }
What are typical symptoms of data races?
GDB data race detection points out unsynchronized data accesses. Not all of them necessarily constitute harmful data races. It is the responsibility of the user to decide which ones are unexpected and to filter out the rest (see next).
Due to technical limitations, not all unsynchronized data accesses can be found, e.g. those in 3rd party libraries or in any object code not compiled with -debug parallel (see next).
How to detect data races?
(gdb) pdbx enable
(gdb) c
data race detected
1: write shared, 4 bytes from foo.c:36
3: read shared, 4 bytes from foo.c:40
Breakpoint -11, 0x401515 in L_test_..._21 () at foo.c:36
*var = 42; /* bp.write */
Data race detection requires an additional library libpdbx.so.5:
Supported parallel programming models:
Data race detection can be enabled/disabled at any time
There is finer grained control for minimizing overhead and selecting code sections to analyze by using filter sets.
More control about what to analyze with filters:
(gdb) pdbx filter line foo.c:36
(gdb) pdbx filter code 0x40518..0x40524
(gdb) pdbx filter var shared
(gdb) pdbx filter data 0x60f48..0x60f50
(gdb) pdbx filter reads # read accesses
(gdb) pdbx fset suppress
(gdb) pdbx fset focus
(gdb) help pdbx
Use cases for filters:
Some additional hints using PDBX:
(gdb) run
data race detected
1: write question, 4 bytes from foo.c:36
3: read question, 4 bytes from foo.c:40
Breakpoint -11, 0x401515 in foo () at foo.c:36
*answer = 42;
(gdb)
Note:
PDBX is not available for Eclipse* IDE and will only work for remote debugging of native coprocessor applications. See section Debugging Remotely with PDBX for more information on how to use it.
There are multiple versions available:
Debug natively on Intel® Xeon Phi™ coprocessor
This version of Intel’s GNU* GDB runs natively on the coprocessor. It is included in Intel® MPSS only and needs to be made available on the coprocessor first in order to run it. Depending on the MPSS version it can be found at the provided location:
Execute GNU* GDB on host and debug remotely
There are two ways to start GNU* GDB on the host and debug remotely using GDBServer on the coprocessor:
$ source compilervars.[sh|csh] [ia32|intel64]
$ gdb-mic
Sourcing the debugger environment is only needed once. If you have already sourced the corresponding compilervars.[sh|csh] script, you can omit this step; gdb-mic should then already be in your default search path.
Attention: Do not mix GNU* GDB & GDBServer from different packages! Always use both from either Intel® MPSS or Intel® Parallel Studio XE Composer Edition!
$ scp /usr/linux-k1om-4.7/linux-k1om/usr/bin/gdb mic0:/tmp
$ ssh -t mic0 /tmp/gdb
$ ssh -t mic0 /usr/bin/gdb
(gdb) attach <pid>
(gdb) file <path_to_application>
Some additional hints:
(gdb) set env LD_LIBRARY_PATH=/tmp/
(gdb) set substitute-path <from> <to>
Debugging is no different than on host thanks to a real Linux* environment on the coprocessor!
$ scp /usr/linux-k1om-4.7/linux-k1om/usr/bin/gdbserver mic0:/tmp
$ scp <install-dir>/debugger_2016/gdb/targets/mic/bin/gdbserver mic0:/tmp
$ source compilervars.[sh|csh] [ia32|intel64]
$ gdb-mic
(gdb) target extended-remote | ssh -T mic0 /tmp/gdbserver --multi -
(gdb) set sysroot /opt/mpss/3.1.4/sysroots/k1om-mpss-linux/
(gdb) file <path_to_application> (gdb) attach <pid>
(gdb) file <path_to_application> (gdb) set remote exec-file <remote_path_to_application>
Some additional hints:
(gdb) target extended-remote | ssh mic0 LD_LIBRARY_PATH=/tmp/ /tmp/gdbserver --multi -
(gdb) set substitute-path <from> <to>
(gdb) set solib-search-path <lib_paths>
Debugging is no different than on host thanks to a real Linux* environment on the coprocessor!
PDBX has some pre-requisites that must be fulfilled for proper operation. Use the pdbx check command to see whether PDBX is working:
(gdb) pdbx check
checking inferior...failed.
(gdb) pdbx check
checking inferior...passed.
checking libpdbx...failed.
(gdb) pdbx check
checking inferior...passed.
checking libpdbx...passed.
checking environment...failed.
Intel offers an Eclipse* IDE debugger plug-in for Intel® MIC that has the following features:
The plug-in is part of both Intel® MPSS and Intel® Parallel Studio XE Composer Edition.
In order to use the provided plug-in the following pre-requisites have to be met:
We recommend: Eclipse* IDE for C/C++ Developers (4.5)
Install Intel® C++ Compiler plug-in (optional):
Add plug-in via “Install New Software…”:
This Plug-in is part of Intel® Parallel Studio XE Composer Edition (<install-dir>/ide_support_2016/eclipse/compiler_xe/). It adds Intel® C++ Compiler support which is not mandatory for debugging. For Fortran the counterpart is the Photran* plug-in. These plug-ins are recommended for the best experience.
Note:
Uncheck “Group items by category”, as the list will be empty otherwise!
In addition, it is recommended to disable checking for latest versions. If not done, installation could take unnecessarily long and newer components might be installed that did not come with the vanilla Eclipse package. Those could cause problems.
Add plug-in via “Install New Software…”:
Plug-in is part of:
Note:
Uncheck “Group items by category”, as the list will be empty otherwise!
In addition, it is recommended to disable checking for latest versions. If not done, installation could take unnecessarily long and newer components might be installed that did not come with the vanilla Eclipse package. Those could cause problems.
Debugging offload enabled applications is not much different from debugging applications running natively on the host:
This is an example (Fortran) of what offload debugging looks like. On the left side we see host & mic0 threads running. One thread (11) from the coprocessor has hit the breakpoint we set inside the loop of the offloaded code. Run control (stepping, continuing, etc.), setting breakpoints, evaluating variables/memory, … work as they used to.
For debugging offload enabled applications additional environment variables need to be set:
Set those variables before starting Eclipse* IDE!
Those are currently needed but might become obsolete in the future.
For MPSS 2.1, please be aware that the debugger cannot and should not be used in combination with Intel® VTune™ Amplifier XE (i.e. with COI_SEP_DISABLE=FALSE). Hence, disabling SEP (which is part of Intel® VTune™ Amplifier XE) is valid.
For MPSS 3.*, AMPLXE_COI_DEBUG_SUPPORT=TRUE extracts K1OM object code map files from fat SOs (containing both host & K1OM object code) and places them under /tmp/coi_procs/<card #>/<process ID>/load_lib/ on the coprocessor. This is required not only for Intel® VTune™ Amplifier XE but also for the debugger. Additionally, use the mic_extract tool to extract K1OM object code from fat SOs on the host (where the Eclipse* IDE runs). Otherwise the current debugger won’t find the K1OM object code on the host, e.g.:
$ mic_extract libx.so
If libx.so contains K1OM object code as well, another file is created alongside libx.so, like libxMIC.so. The latter contains the K1OM object code. See https://software.intel.com/en-us/node/524818 for more information.
In addition, the watchdog monitor must be disabled because a debugger can stop execution for an unspecified amount of time. Hence the system watchdog might assume that a debugged application, if not reacting anymore, is dead and will terminate it. For debugging we do not want that.
Note:
Do not set those variables for a production system!
For Intel® MPSS 3.2 and later:
MYO debug libraries are no longer installed with Intel MPSS 3.2 by default. This is a change from earlier Intel MPSS versions. Users must install the MYO debug libraries manually in order to debug MYO enabled applications using the Eclipse plug-in for offload debugging. For Intel MPSS 3.2 (and later) the MYO debug libraries can be found in the package mpss-myo-dbg-* which is included in the mpss-*.tar file.
MPSS 3.2 and 3.2.1 do not support offload debugging with Intel® Composer XE 2013 SP1, please see Errata for more information!
Configure Remote System Explorer
To debug native coprocessor applications we need to configure the Remote System Explorer (RSE).
Note:
Before you continue, make sure SSH works (e.g. via command line). You can also specify different credentials (user account) via RSE and save the password.
The basic steps are quite simple:
Repeat this step for each coprocessor!
Transfer GDBServer
Transfer of the GDBServer to the coprocessor is required for remote debugging. We choose /tmp/gdbserver as target on the coprocessor here (important for the following sections).
Copy GDBServer to coprocessor, e.g.:
$ scp /usr/linux-k1om-4.7/linux-k1om/usr/bin/gdbserver mic0:/tmp
$ scp <install-dir>/debugger_2016/gdb/targets/mic/bin/gdbserver mic0:/tmp
Debug Configuration
To debug a native coprocessor application (here: native_c++), create a new debug configuration of type C/C++ Remote Application.
Set Connection to the coprocessor target configured with RSE before (here: mic0).
Specify the remote path of the application, wherever it was copied to (here: /tmp/native_c++). We’ll address how to manually transfer files later.
Set the flag for “Skip download to target path.” if you don’t want the debugger to upload the executable to the specified path. This can be meaningful if you have complex projects with external dependencies (e.g. libraries) and don’t want to manually transfer the binaries.
(for MPSS 3.1.2 or 3.1.4, please see Errata)
Note that we use C/C++ Remote Application here. This is also true for Fortran applications because there’s no remote debug configuration section provided by the Photran* plug-in!
In Debugger tab, specify the provided Intel GNU* GDB for Intel® MIC (here: gdb-mic).
In the above example, set sysroot from MPSS installation in .gdbinit, e.g.:
set sysroot /opt/mpss/3.1.4/sysroots/k1om-mpss-linux/
Note:
See section Debugging on Command Line above for the correct path of GDBServer, depending on the chosen package (Intel® MPSS or Intel® Parallel Studio XE Composer Edition)!
In Debugger/Gdbserver Settings tab, specify the uploaded GDBServer (here: /tmp/gdbserver).
Configuration depends on the installed plug-ins. For C/C++ applications we recommend installing the Intel® C++ Compiler XE plug-in that comes with Intel® Parallel Studio XE Composer Edition. For Fortran, install Photran* (3rd party) and select the Intel® Fortran Compiler manually.
Make sure to use the debug configuration and provide options as if debugging on the host (-g). Optionally, disabling optimizations with -O0 can make the instruction flow comprehensible when debugging.
The only difference compared to host builds is that you need to cross-compile for the coprocessor: use the -mmic option, e.g.:
After configuration, clean your build. This is needed because Eclipse* IDE might not notice all dependencies. And finally, build.
Note:
The configuration dialog shown exists only for the Intel® C++ Compiler plug-in. For Fortran, users need to install the Photran* plug-in, switch the compiler/linker to ifort by hand, and add -mmic manually. This has to be done for both the compiler & linker!
Transfer the executable to the coprocessor, e.g.:
Note:
It is crucial that the executable can be executed on the coprocessor. In some cases the execution bits might not be set after copying.
Start debugging using the C/C++ Remote Application created in the earlier steps. It should connect to the coprocessor target and launch the specified application via the GDBServer. Debugging is the same as for local/host applications.
Note:
This works for coprocessor native Fortran applications the exact same way!
More information can be found in the official documentation:
The PDF gdb.pdf is the original GNU* GDB manual for the base version Intel ships, extended by all features added. So, this is the place to get help for new commands, behavior, etc.
README-INTEL from Intel® MPSS contains a short guide how to install and configure the Eclipse* IDE plug-in.
PDF eclmigdb_config_guide.pdf provides an overall step-by-step guide how to debug with the command line and with Eclipse* IDE.
Using Intel® C++ Compiler with the Eclipse* IDE on Linux*:
http://software.intel.com/en-us/articles/intel-c-compiler-for-linux-using-intel-compilers-with-the-eclipse-ide-pdf/
The knowledgebase article (Using Intel® C++ Compiler with the Eclipse* IDE on Linux*) is a step-by-step guide on how to install, configure and use the Intel® C++ Compiler with the Eclipse* IDE.
Intel® Xeon Phi™ coprocessor is a product based on the Intel® Many Integrated Core Architecture (Intel® MIC). Intel® offers a debug solution for this architecture that can debug applications running on an Intel® Xeon Phi™ coprocessor.
There are many reasons why a debug solution for Intel® MIC is needed. Some of the most important are the following:
For Windows* host, Intel offers a debug solution, the Intel® Debugger Extension for Intel® MIC Architecture Applications. It supports debugging offload enabled applications as well as native Intel® MIC applications running on the Intel® Xeon Phi™ coprocessor.
To obtain Intel® Debugger Extension for Intel® MIC Architecture on Windows* host, you need the following:
Debug solution from Intel® based on GNU* GDB:
Note:
Pure native debugging on the coprocessor is also possible by using Intel’s version of GNU* GDB for the coprocessor. This is covered in the following article for Linux* host:
http://software.intel.com/en-us/articles/debugging-intel-xeon-phi-applications-on-linux-host
Why integration into Microsoft Visual Studio*?
The following components are required to develop and debug for Intel® MIC Architecture:
It is crucial to make sure that the coprocessor setup is correctly working. Otherwise the debugger might not be fully functional.
Setup Intel® MPSS:
Before debugging applications with offload extensions:
Debugger integration for Intel® MIC Architecture only works when debug information is available:
Applications can only be debugged in 64-bit mode.
Start Microsoft Visual Studio* IDE and open or create an Intel® Xeon Phi™ project with offload extensions. Examples can be found in the Samples directory of Intel® Parallel Studio XE Composer Edition (formerly Intel® Composer XE), that is:
C:\Program Files (x86)\IntelSWTools\samples_2016\en
We’ll use intro_SampleC from the official C++ examples in the following.
Compile the project with Intel® C++/Fortran Compiler.
Note the mixed breakpoints here:
The ones set in the normal code (not offloaded) apply to the host. Breakpoints on offloaded code apply to the respective coprocessor(s) only.
The Breakpoints window shows all breakpoints (host & coprocessor(s)).
Start debugging as usual via menu (shown) or <F5> key:
While debugging, continue till you reach a set breakpoint in offloaded code to debug the coprocessor code.
Information of host and coprocessor(s) is mixed. In the example above, the threads window shows two processes with their threads. One process comes from the host, which does the offload. The other one is the process hosting and executing the offloaded code, one for each coprocessor.
For debugging offload enabled applications additional environment variables need to be set:
Set those variables before starting Visual Studio* IDE!
Those are currently needed but might become obsolete in the future. Please be aware that the debugger cannot and should not be used in combination with Intel® VTune™ Amplifier XE; hence, disabling SEP (which is part of Intel® VTune™ Amplifier XE) is valid. The watchdog monitor must be disabled because a debugger can stop execution for an unspecified amount of time; otherwise the system watchdog might assume that a debugged application that is no longer reacting is dead and terminate it. For debugging we do not want that.
Note:
Do not set those variables for a production system!
Create a native Intel® Xeon Phi™ coprocessor application, then transfer the application to the coprocessor target and execute it:
micnativeloadex.exe transfers the specified application to the specified coprocessor and directly executes it. The command blocks until the transferred application terminates.
micnativeloadex.exe also takes care of dependencies (i.e. libraries) and transfers them, too.
Other ways to transfer and execute native applications are also possible (but more complex):
Debugging native applications from the Visual Studio* IDE is only possible via Attach to Process…:
static int lockit = 1;
while (lockit) {
    sleep(1);
}
Only one coprocessor at a time can be debugged this way.
Open the options via TOOLS/Options… menu:
It tells the debugger extension where to find the binary and sources. This needs to be changed every time a different coprocessor native application is being debugged.
The entry solib-search-path directories works the same as the analogous GNU* GDB command. It allows mapping paths from the build system to the host system running the debugger.
The entry Host Cache Directory is used for caching symbol files. It can speed up lookup for large applications.
Open the options via TOOLS/Attach to Process… menu:
Specify the Intel(R) Debugger Extension for Intel(R) MIC Architecture. Set the IP and port the GDBServer should be executed with. The usual port for GDBServer is 2000 but we recommend using a non-privileged port (e.g. 16000).
After a short delay the processes of the coprocessor card are listed. Select one to attach.
Note:
The checkbox Show processes from all users has no function for the coprocessor, as user accounts cannot be mapped between host and target (Linux* vs. Windows*).
More information can be found in the official documentation from Intel® Parallel Studio XE Composer Edition:
C:\Program Files (x86)\IntelSWTools\documentation_2016\en\debugger\ps2016\get_started.htm
hi all,
I'm trying to build something for the Phi that depends on iconv; the library routines are present, but the following application fails when run on the Phi:
#include <stdlib.h>
#include <iconv.h>

int main () {
    iconv_t cd;
    cd = iconv_open("latin1", "UTF-8");
    if (cd == (iconv_t)(-1))
        exit(1);
    iconv_close(cd);
    exit(0);
}
If I build this using "icc -o iconv_test iconv_test.c" and run it on the host, it returns no error (exit code 0).
However, if I build it for the Phi with "icc -mmic -o iconv_test iconv_test.c", it always returns exit code 1. An strace shows the following:
open("/usr/lib64/gconv/gconv-modules.cache", O_RDONLY) = -1 ENOENT (No such file or directory)
brk(0)                                  = 0x714000
brk(0x735000)                           = 0x735000
open("/usr/lib64/gconv/gconv-modules", O_RDONLY) = -1 ENOENT (No such file or directory)
exit_group(1)
and indeed, those module files are missing - where can I find them?
I have a system with two Phi cards installed, running Red Hat 7.0. I am able to run code on the cards in pure offload mode and I can ssh into the cards. I am trying to get symmetric mode to work.
1) Does symmetric mode require OFED, or is OFED only required when there is a physical Infiniband card?
2) What are the proper steps to verify that the SCIF driver is properly loaded? mic shows up as a driver but there is no indication of anything named SCIF.
[root@infinity ~]# lsmod
Module Size Used by
mic 666166 16
vtsspp 372813 0
sep3_15 527535 0
pax 13181 0
bridge 115385 0
stp 12976 1 bridge
llc 14552 2 stp,bridge
ipt_REJECT 12541 2
xt_comment 12504 2
nf_conntrack_ipv4 14862 2
nf_defrag_ipv4 12729 1 nf_conntrack_ipv4
xt_conntrack 12760 2
nf_conntrack 105702 2 xt_conntrack,nf_conntrack_ipv4
iptable_filter 12810 1
ip_tables 27239 1 iptable_filter
intel_powerclamp 18764 0
coretemp 13435 0
intel_rapl 18773 0
kvm 461126 0
iTCO_wdt 13480 0
crct10dif_pclmul 14289 0
crc32_pclmul 13113 0
crc32c_intel 22079 0
ghash_clmulni_intel 13259 0
iTCO_vendor_support 13718 1 iTCO_wdt
cryptd 20359 1 ghash_clmulni_intel
mei_me 18646 0
sb_edac 26819 0
pcspkr 12718 0
nfsd 290215 13
mei 82723 1 mei_me
edac_core 57650 1 sb_edac
lpc_ich 21073 0
mfd_core 13435 1 lpc_ich
i2c_i801 18135 0
auth_rpcgss 59343 1 nfsd
nfs_acl 12837 1 nfsd
lockd 93977 1 nfsd
ipmi_si 53353 0
ipmi_msghandler 45603 1 ipmi_si
sunrpc 295293 15 nfsd,auth_rpcgss,lockd,nfs_acl
shpchp 37032 0
ioatdma 67762 0
acpi_power_meter 18087 0
acpi_pad 116305 0
ext4 562391 7
mbcache 14958 1 ext4
jbd2 102940 1 ext4
raid10 48128 2
sd_mod 45499 12
crc_t10dif 12714 1 sd_mod
crct10dif_common 12595 2 crct10dif_pclmul,crc_t10dif
ast 56119 1
syscopyarea 12529 1 ast
sysfillrect 12701 1 ast
sysimgblt 12640 1 ast
nvidia 8374856 0
drm_kms_helper 98226 1 ast
ttm 93488 1 ast
drm 311588 5 ast,ttm,drm_kms_helper,nvidia
igb 192078 0
ahci 29870 8
libahci 32009 1 ahci
ptp 18933 1 igb
libata 218854 2 ahci,libahci
pps_core 19106 1 ptp
dca 15130 2 igb,ioatdma
i2c_algo_bit 13413 2 ast,igb
i2c_core 40325 7 ast,drm,igb,i2c_i801,drm_kms_helper,i2c_algo_bit,nvidia
wmi 19070 0
dm_mirror 22135 0
dm_region_hash 20862 1 dm_mirror
dm_log 18411 2 dm_region_hash,dm_mirror
dm_mod 104038 25 dm_log,dm_mirror
Dear Intel Staff,
I just got to know some details of your great presentation of Knights Landing (KNL) at Hot Chips this year. Information about KNL on the website is still sparse. From your slides I understand that there will be a version of KNL that is socketed and can be used as a primary CPU in a rack. However, this raises quite a few questions for which I cannot find satisfying answers.
Our scenario:
We have a research cluster that consists mostly of two-socket systems with normal Ivy Bridge Xeon CPUs. Our main application is a JVM-based machine learning system that uses the MKL via JNI to accelerate computations. We intend to extend this cluster soon and would like to utilize Phi processors, but whether we can use them depends on a few things (see below).
What I would like to know from you:
Many thanks in advance,
Matt
Hello,
I'm attempting to run a simple offload example:
#include <stdio.h>
#include <omp.h>

int main(){
    double sum;
    int i, n, nt;
    n = 2000000000;
    sum = 0.0e0;
    #pragma offload target(mic:0)
    {
        #pragma omp parallel for reduction(+:sum)
        for (i = 1; i <= n; i++) {
            sum = sum + i;
        }
        //nt = omp_get_max_threads();
        #pragma omp parallel
        {
            #pragma omp single
            nt = omp_get_num_threads();
        }
#ifdef __MIC__
        printf("Hello MIC reduction %f threads: %d\n", sum, nt);
#else
        printf("Hello CPU reduction %f threads: %d\n", sum, nt);
#endif
    }
}
This program ran fine previously but we recently rebooted our Phi nodes in our cluster and since then this offloading example will not run. The native compiled MIC binaries still run without a problem since the reboot.
Before running I type:
. /usr/local/intel/ClusterStudioXE_2013/composer_xe_2013_sp1/bin/compilervars.sh intel64
make
export MIC_OMP_NUM_THREADS=120
export MIC_ENV_PREFIX=MIC
export OFFLOAD_REPORT=3
Here is my Makefile:
CC=icc
CFLAGS=-std=c99 -O3 -vec-report3 -openmp -offload
EXE=reduce_offload_mic

$(EXE) : reduce_omp_mic.c
	$(CC) -o $@ $< $(CFLAGS)

.PHONY: clean
clean:
	rm $(EXE)
However, when I run the program here is the output:
[frenchwr@vmp903 Offload]$ ./reduce_offload_mic
offload error: cannot offload to MIC - device is not available
[Offload] [HOST]  [State]   Unregister data tables
I have ensured that mpss is running and even restarted the service with:
sudo service mpss restart
but still the same error (even after re-building the executable).
All of my mic tests pass:
[frenchwr@vmp903 Offload]$ miccheck
MicCheck 3.4-r1
Copyright 2013 Intel Corporation All Rights Reserved
Executing default tests for host
Test 0: Check number of devices the OS sees in the system ... pass
Test 1: Check mic driver is loaded ... pass
Test 2: Check number of devices driver sees in the system ... pass
Test 3: Check mpssd daemon is running ... pass
Executing default tests for device: 0
Test 4 (mic0): Check device is in online state and its postcode is FF ... pass
Test 5 (mic0): Check ras daemon is available in device ... pass
Test 6 (mic0): Check running flash version is correct ... pass
Test 7 (mic0): Check running SMC firmware version is correct ... pass
Executing default tests for device: 1
Test 8 (mic1): Check device is in online state and its postcode is FF ... pass
Test 9 (mic1): Check ras daemon is available in device ... pass
Test 10 (mic1): Check running flash version is correct ... pass
Test 11 (mic1): Check running SMC firmware version is correct ... pass
Status: OK
Here's the output from micinfo:
[frenchwr@vmp903 Offload]$ micinfo
MicInfo Utility Log
Created Fri Aug 28 18:14:23 2015

System Info
    HOST OS              : Linux
    OS Version           : 2.6.32-431.29.2.el6.x86_64
    Driver Version       : 3.4-1
    MPSS Version         : 3.4
    Host Physical Memory : 132110 MB

Device No: 0, Device Name: mic0

Version
    Flash Version           : 2.1.02.0390
    SMC Firmware Version    : 1.16.5078
    SMC Boot Loader Version : 1.8.4326
    uOS Version             : 2.6.38.8+mpss3.4
    Device Serial Number    : ADKC42900304

Board
    Vendor ID               : 0x8086
    Device ID               : 0x225c
    Subsystem ID            : 0x7d95
    Coprocessor Stepping ID : 2
    PCIe Width              : Insufficient Privileges
    PCIe Speed              : Insufficient Privileges
    PCIe Max payload size   : Insufficient Privileges
    PCIe Max read req size  : Insufficient Privileges
    Coprocessor Model       : 0x01
    Coprocessor Model Ext   : 0x00
    Coprocessor Type        : 0x00
    Coprocessor Family      : 0x0b
    Coprocessor Family Ext  : 0x00
    Coprocessor Stepping    : C0
    Board SKU               : C0PRQ-7120 P/A/X/D
    ECC Mode                : Enabled
    SMC HW Revision         : Product 300W Passive CS

Cores
    Total No of Active Cores : 61
    Voltage                  : 1037000 uV
    Frequency                : 1238095 kHz

Thermal
    Fan Speed Control : N/A
    Fan RPM           : N/A
    Fan PWM           : N/A
    Die Temp          : 46 C

GDDR
    GDDR Vendor     : Samsung
    GDDR Version    : 0x6
    GDDR Density    : 4096 Mb
    GDDR Size       : 15872 MB
    GDDR Technology : GDDR5
    GDDR Speed      : 5.500000 GT/s
    GDDR Frequency  : 2750000 kHz
    GDDR Voltage    : 1501000 uV

Device No: 1, Device Name: mic1

Version
    Flash Version           : 2.1.02.0390
    SMC Firmware Version    : 1.16.5078
    SMC Boot Loader Version : 1.8.4326
    uOS Version             : 2.6.38.8+mpss3.4
    Device Serial Number    : ADKC42900319

Board
    Vendor ID               : 0x8086
    Device ID               : 0x225c
    Subsystem ID            : 0x7d95
    Coprocessor Stepping ID : 2
    PCIe Width              : Insufficient Privileges
    PCIe Speed              : Insufficient Privileges
    PCIe Max payload size   : Insufficient Privileges
    PCIe Max read req size  : Insufficient Privileges
    Coprocessor Model       : 0x01
    Coprocessor Model Ext   : 0x00
    Coprocessor Type        : 0x00
    Coprocessor Family      : 0x0b
    Coprocessor Family Ext  : 0x00
    Coprocessor Stepping    : C0
    Board SKU               : C0PRQ-7120 P/A/X/D
    ECC Mode                : Enabled
    SMC HW Revision         : Product 300W Passive CS

Cores
    Total No of Active Cores : 61
    Voltage                  : 1040000 uV
    Frequency                : 1238095 kHz

Thermal
    Fan Speed Control : N/A
    Fan RPM           : N/A
    Fan PWM           : N/A
    Die Temp          : 47 C

GDDR
    GDDR Vendor     : Samsung
    GDDR Version    : 0x6
    GDDR Density    : 4096 Mb
    GDDR Size       : 15872 MB
    GDDR Technology : GDDR5
    GDDR Speed      : 5.500000 GT/s
    GDDR Frequency  : 2750000 kHz
    GDDR Voltage    : 1501000 uV
From searching online I see a few other users who have run into the:
offload error: cannot offload to MIC - device is not available
[Offload] [HOST]  [State]   Unregister data tables
issue, but I don't see any good resolution (other than restarting mpss, which does not resolve the issue for me).
MIC requires strict 64-byte data alignment to utilize the VPU, but why? I found that SPARC also has such a requirement, but other multi-core CPUs can handle unaligned data.
Since MIC can automatically vectorize a for loop over data (with compiler optimization), what happens if the data is unaligned in that case? Will the automatic optimization still work? If yes, how?
Hello,
I would like to pre-allocate a number of buffers for later data transfers from CPU to MIC, using explicit offloading in C++.
It works nicely if each buffer corresponds to an explicit variable name, as e.g. in the double-buffering examples. However, I would like to have a configurable number of such buffers (more than 2), i.e. an array of buffers. (the buffers are used for asynchronous processing on the MIC, and I need quite a few of them).
I do have a workaround, i.e. allocate a single very big buffer and cut it into pieces (by using offsets and 'into' for transfers), but as the buffers do not need to be contiguous, I'm afraid adding this constraint may make it hard to find a big enough block available at runtime. So I would prefer to have several smaller buffers if possible.
The code below should describe the issue easily. In the first part, it works fine with 2 variable names. But in the second part, with an array, I can't find how to proceed (or is it simply not possible?). I tried various syntaxes without success, but could not find one accepted by the compiler.
I would be glad if someone could help on this matter. Thanks in advance for any feedback on this!
cheers, Sylvain
#pragma offload_attribute (push,target(mic))
#include <stdio.h>
#pragma offload_attribute (pop)

#define ALLOC alloc_if(1) free_if(0)
#define FREE  alloc_if(0) free_if(1)
#define REUSE alloc_if(0) free_if(0)

int main() {
  int size=100;       // size of buffer
  char input[size];   // buffer for input data on the CPU
  char *ptr1=NULL;    // reference to MIC buffer 1
  char *ptr2=NULL;    // reference to MIC buffer 2

  // pre-allocate MIC buffers
  #pragma offload_transfer target(mic:0) nocopy(ptr1 : length(size) ALLOC)
  #pragma offload_transfer target(mic:0) nocopy(ptr2 : length(size) ALLOC)

  // test use of buffer 1
  snprintf(input,size,"valPtr1");
  #pragma offload target(mic:0) in(input[0:size] : REUSE into(ptr1[0:size]))
  {
    printf("MIC: %p = %s\n",ptr1,ptr1);
  }

  // test use of buffer 2
  snprintf(input,size,"valPtr2");
  #pragma offload target(mic:0) in(input[0:size] : REUSE into(ptr2[0:size]))
  {
    printf("MIC: %p = %s\n",ptr2,ptr2);
  }

  // try to do the same as above, but with an array instead of fixed variable names ptr1,ptr2
  // so that the number of elements can be increased and iterated over
  // e.g. instead of ptr1 and ptr2, use ptrX[1], ptrX[2] ... ptrX[N]
  // the compiler does not seem to complain for the allocation
  // but it crashes at runtime
  char *ptrX[2]={NULL,NULL};
  for (int i=0;i<2;i++) {
    #pragma offload_transfer target(mic:0) nocopy(ptrX[i] : length(size) ALLOC)
  }

  // and then, how to use the buffers ???
  /*
  for (int i=0;i<2;i++) {
    snprintf(input,size,"valPtrX%d",i);
    #pragma offload target(mic:0) in(input[0:size] : REUSE into((???)[0:size]))
    {
      printf("MIC: %p = %s\n",???,???);
    }
  }
  */
  return 0;
}
The Intel® Math Kernel Library (Intel® MKL), the high performance math library for x86 and x86-64, is available for free for everyone (click here now to register and download). Purchasing is only necessary if you want access to Intel® Premier Support (direct 1:1 private support from Intel), older versions of the library or access to other tools in Intel® Parallel Studio XE. Intel continues to actively develop and support this very powerful library - and everyone can benefit from that!
Intel® Math Kernel Library (Intel® MKL) is a very popular library product from Intel that accelerates math processing routines to increase application performance. Intel® MKL includes highly vectorized and threaded Linear Algebra, Fast Fourier Transforms (FFT), Vector Math and Statistics functions. The easiest way to take advantage of all of that processing power is to use a carefully optimized computing math library; even the best compiler can’t compete with the level of performance possible from a hand-optimized library. If your application already relies on the BLAS or LAPACK functionality, simply re-link with Intel® MKL to get better performance on Intel and compatible architectures.
Intel® MKL is most often obtained with the Intel® Compilers and all the other Intel® Performance Libraries in various products from Intel. It can also be obtained, together with tools for analysis, debugging and tuning, tools for MPI and the Intel® MPI Library, by acquiring Intel® Parallel Studio XE. Did you know that some of these are available for free?
Here is a guide to various ways to obtain the latest version of the Intel® Math Kernel Library (Intel® MKL) for free without access to Intel® Premier Support (get support by posting to the Intel Math Kernel Library forum). Anytime you want, the full suite of tools (Intel® Parallel Studio XE) with Intel® Premier Support and access to previous library versions can be purchased worldwide.
Who | What is Free? | Information | Where? |
---|---|---|---|
Community Licenses for Everyone | Intel® Math Kernel Library (Intel® MKL) Intel® Data Analytics Acceleration Library Intel® Threading Building Blocks Intel® Integrated Performance Primitives (Intel® IPP) | Community Licensing for Intel® Performance Libraries – free for all, registration required, no royalties, no restrictions on company or project size, current versions of libraries, no Intel Premier Support access. (Linux*, Windows* or OS X* versions) Forums for discussion and support are open to everyone. | Community Licensing for Intel Performance Libraries |
Evaluation Copies for Everyone | Intel® Math Kernel Library (Intel® MKL) | Evaluation Copies – Try before you buy. (Linux, Windows or OS X versions) | Try before you buy |
Use as an Academic Researcher | Linux, Windows or OS X versions of: Intel® Math Kernel Library Intel® Data Analytics Acceleration Library Intel® Threading Building Blocks Intel® Integrated Performance Primitives Intel® MPI Library (not available for OS X) | If you will use it in conjunction with academic research at an institution of higher education. (Linux, Windows or OS X versions, except the Intel® MPI Library, which is not supported on OS X) | Qualify for Use as an Academic Researcher |
Student | Intel® Math Kernel Library (Intel® MKL) | If you are a current student at a degree-granting institution. (Linux, Windows or OS X versions) | Qualify for Use as a Student |
Teacher | Intel® Math Kernel Library (Intel® MKL) | If you will use it in a teaching curriculum. (Linux, Windows or OS X versions) | Qualify for Use as an Educator |
Use as an Open Source Contributor | Intel® Math Kernel Library (Intel® MKL) | If you are a developer actively contributing to open source projects – and that is why you will use the tools. (Linux versions) | Qualify for Use as an Open Source Contributor |
Free licenses for certain users have always been an important dimension of our offerings. One thing that really distinguishes Intel is that we sell excellent tools and provide second-to-none support for software developers who buy our tools. We provide multiple options – and we hope you will find exactly what you need in one of our options.