Hi,
Has anyone successfully built R to run natively on Phi?
Thanks,
George
I found in past topics that _mm512_unpacklo_* is not supported on the Phi. In my own implementation, _mm512_permute* and _mm512_shuffle* also appear to be unsupported. So far, all the matrix transpose implementations in past posts seem to use the _mm512_swizzle* and _mm512_blend* instructions. However, using these two operations requires twice as much element movement, which seems inefficient. Are there any other options for doing a matrix transpose?
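For reference, here is the swizzle+blend transpose pattern sketched in plain C (these are not KNC intrinsics; `swizzle_badc`/`swizzle_cdab`/`blend` are my scalar stand-ins modeled on what _mm512_swizzle* with _MM_SWIZ_REG_BADC/_MM_SWIZ_REG_CDAB and a masked blend would do on 4-element lanes). It also makes the doubled element movement visible: every element is copied in both butterfly stages.

```c
/* Conceptual sketch of the two-stage swizzle+blend 4x4 transpose.
   "swizzle" = fixed in-register element permutation (like the KNC
   _MM_SWIZ_REG_BADC / _MM_SWIZ_REG_CDAB patterns); "blend" = per-element
   select between two registers under a bit mask. */

typedef struct { float e[4]; } vec4;      /* stand-in for one SIMD lane group */

static vec4 swizzle_badc(vec4 v) {        /* swap adjacent pairs: b a d c */
    vec4 r = { { v.e[1], v.e[0], v.e[3], v.e[2] } };
    return r;
}
static vec4 swizzle_cdab(vec4 v) {        /* swap halves: c d a b */
    vec4 r = { { v.e[2], v.e[3], v.e[0], v.e[1] } };
    return r;
}
static vec4 blend(unsigned mask, vec4 a, vec4 b) { /* bit i set -> take b[i] */
    vec4 r;
    for (int i = 0; i < 4; i++)
        r.e[i] = ((mask >> i) & 1) ? b.e[i] : a.e[i];
    return r;
}

/* Transpose rows a,b,c,d in place; each element moves in BOTH stages,
   which is the 2x data movement mentioned above. */
void transpose4(vec4 m[4]) {
    /* stage 1: interleave row pairs 0/1 and 2/3 */
    vec4 t0 = blend(0xA, m[0], swizzle_badc(m[1])); /* a0 b0 a2 b2 */
    vec4 t1 = blend(0x5, m[1], swizzle_badc(m[0])); /* a1 b1 a3 b3 */
    vec4 t2 = blend(0xA, m[2], swizzle_badc(m[3])); /* c0 d0 c2 d2 */
    vec4 t3 = blend(0x5, m[3], swizzle_badc(m[2])); /* c1 d1 c3 d3 */
    /* stage 2: exchange halves between the interleaved pairs */
    m[0] = blend(0xC, t0, swizzle_cdab(t2));        /* a0 b0 c0 d0 */
    m[1] = blend(0xC, t1, swizzle_cdab(t3));        /* a1 b1 c1 d1 */
    m[2] = blend(0x3, t2, swizzle_cdab(t0));        /* a2 b2 c2 d2 */
    m[3] = blend(0x3, t3, swizzle_cdab(t1));        /* a3 b3 c3 d3 */
}
```

On the real hardware the same two-stage pattern is applied per 128-bit lane and then across lanes, but the data-movement count is the same: two passes over every element.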
On this page you will find the latest releases of the Intel(R) Manycore Platform Software Stack (MPSS) Long Term Support (LTS) product. The most recent release is found here: http://software.intel.com/en-us/articles/intel-many-integrated-core-architecture-intel-mic-architecture-platform-software-stack and we recommend customers use the latest release wherever possible.
MPSS version | Downloads available | Size (range) | MD5 Checksum |
---|---|---|---|
mpss-3.4.4 (released: June 2, 2015) | Linux (mpss-3.4.4-linux.tar) for RedHat 6.3, RedHat 6.4, RedHat 6.5, RedHat 6.6, RedHat 7.0, SuSE SLES11 SP2, SuSE SLES11 SP3, SuSE SLES12 | ~420MB | 603fee578662bd83ac78cb0293c0b4df |
| Software for Coprocessor OS (k1om) (mpss-3.4.4-k1om.tar) | ~700MB | 42c2eba4d727991e4e8f99dababeba63 |
| SOURCE (mpss-src-3.4.4.tar) | ~270MB | 0030c519e7740ad9d8552aa8bedc4e94 |
| Download Cache (mpss-downloadcache-3.4.4.tar) | ~1.1GB | 47031c23014ce5a0f43ff093ad42251d |
Documentation link | Description | Last Updated On | Size (approx) |
---|---|---|---|
releaseNotes-linux.txt | English - Release Notes | June 2015 | ~54KB |
readme.txt | Readme (includes installation instructions) for Linux (English) | June 2015 | ~20KB |
MPSS_Users_Guide.pdf | Complete Users Guide for MPSS for Linux (English) | June 2015 | ~2MB |
SCIF_UserGuide.pdf | SCIF User guide | June 2015 | ~700KB |
license.txt | INTEL SOFTWARE LICENSE AGREEMENT for Intel® Manycore Platform Software Stack (Intel® MPSS) | June 2015 | ~30KB |
MPSS version | Downloads available | Size (range) | MD5 Checksum |
---|---|---|---|
mpss-3.4.3 (released: February 20, 2015) | Linux (mpss-3.4.3-linux.tar) for RedHat 6.3, RedHat 6.4, RedHat 6.5, RedHat 6.6, RedHat 7.0, SuSE SLES11 SP2, SuSE SLES11 SP3, SuSE SLES12 | ~400MB | fa960e90045a1ab16e1b68920030233c |
| Software for Coprocessor OS (k1om) (mpss-3.4.3-k1om.tar) | ~700MB | 85b4f4b6873a8ec21cc9e1d6d95cec04 |
| SOURCE (mpss-src-3.4.3.tar) | ~270MB | 1fdd717f025ee6c6c999f991e76dde9f |
| Download Cache (mpss-downloadcache-3.4.3.tar) | ~1.1GB | 1ec83289d06ec8c12dea80f7a5482034 |
Documentation link | Description | Last Updated On | Size (approx) |
---|---|---|---|
releaseNotes-linux.txt | English - Release Notes | February 2015 | ~62KB |
readme.txt | Readme (includes installation instructions) for Linux (English) | February 2015 | ~20KB |
MPSS_Users_Guide.pdf | Complete Users Guide for MPSS for Linux (English) | February 2015 | ~2MB |
SCIF_UserGuide.pdf | SCIF User guide | February 2015 | ~700KB |
license.txt | INTEL SOFTWARE LICENSE AGREEMENT for Intel® Manycore Platform Software Stack (Intel® MPSS) | February 2015 | ~30KB |
MPSS version | Downloads available | Size | MD5 Checksum |
---|---|---|---|
mpss-3.4.3-windows.zip (released: February 20, 2015) | Microsoft* Windows | ~310MB | 588c1431fa0803f5b478aa771703efa2 |
| Software for Coprocessor OS (k1om) (mpss-3.4.3-k1om.tar) | ~700MB | 85b4f4b6873a8ec21cc9e1d6d95cec04 |
Documentation link | Description | Last Updated On | Size |
---|---|---|---|
releaseNotes-windows.txt | English - release notes | February 2015 | ~25KB |
readme-windows.pdf | English (includes installation instructions) for Microsoft* Windows | February 2015 | ~550KB |
MPSS_Users_Guide-windows.pdf | User, Cluster and Advanced Configuration Guide for MPSS | February 2015 | ~2MB |
MPSS version | Downloads available | Size (range) | MD5 Checksum |
---|---|---|---|
mpss-3.4.2 (released: December 3, 2014) | Linux (mpss-3.4.2-linux.tar) for RedHat 6.3, RedHat 6.4, RedHat 6.5, RedHat 6.6, RedHat 7.0, SuSE SLES11 SP2, SuSE SLES11 SP3 | ~400MB | 40896e317418fd20a758fd7ce2408aac |
| Software for Coprocessor OS (k1om) (mpss-3.4.2-k1om.tar) | ~700MB | 27004c1423bb3e29010de2284577d024 |
| SOURCE (mpss-src-3.4.2.tar) | ~270MB | b5031821ac8d4faaf12b4fbb1728e97a |
| Download Cache (mpss-downloadcache-3.4.2.tar) | ~1.1GB | 4d937079b4ef2a8eef821e12f2e61ebd |
Documentation link | Description | Last Updated On | Size (approx) |
---|---|---|---|
releaseNotes-linux.txt | English - Release Notes | December 2014 | ~75KB |
readme.txt | Readme (includes installation instructions) for Linux (English) | December 2014 | ~20KB |
MPSS_Users_Guide.pdf | Complete Users Guide for MPSS for Linux (English) | December 2014 | ~2MB |
SCIF_UserGuide.pdf | SCIF User guide | December 2014 | ~700KB |
license.txt | INTEL SOFTWARE LICENSE AGREEMENT for Intel® Manycore Platform Software Stack (Intel® MPSS) | September 2013 | ~30KB |
MPSS version | Downloads available | Size | MD5 Checksum |
---|---|---|---|
mpss-3.4.2-windows.zip (released: December 3, 2014) | Microsoft* Windows | ~310MB | 64b2bb347ce870098b2e8dafa10e5d67 |
| Software for Coprocessor OS (k1om) (mpss-3.4.2-k1om.tar) | ~700MB | 27004c1423bb3e29010de2284577d024 |
Documentation link | Description | Last Updated On | Size |
---|---|---|---|
releaseNotes-windows.txt | English - release notes | December 2014 | ~30KB |
readme-windows.pdf | English (includes installation instructions) for Microsoft* Windows | December 2014 | ~620KB |
MPSS_Users_Guide-windows.pdf | User, Cluster and Advanced Configuration Guide for MPSS | December 2014 | ~2MB |
MPSS version | Downloads available | Size (range) | MD5 Checksum |
---|---|---|---|
mpss-3.4.1 (released: October 22 2014) | Linux (mpss-3.4.1-linux.tar) for RedHat 6.3, RedHat 6.4, RedHat 6.5, RedHat 6.6, RedHat 7.0, SuSE SLES11 SP2, SuSE SLES11 SP3 | ~400MB | e985afee031baf542090883d3752fcfa |
| Software for Coprocessor OS (k1om) (mpss-3.4.1-k1om.tar) | ~700MB | 23d3db962c2abc659945598aa6793374 |
| SOURCE (mpss-src-3.4.1.tar) | ~270MB | 73ecb48cf74bd815ae8c3753868c80d8 |
| Download Cache (mpss-downloadcache-3.4.1.tar) | ~1.1GB | 3bdc15046dbd4b23a58cb1684d73e05f |
Documentation link | Description | Last Updated On | Size (approx) |
---|---|---|---|
releasenotes-linux.txt | English - Release Notes | October 2014 | ~75KB |
readme.txt | Readme (includes installation instructions) for Linux (English) | October 2014 | ~20KB |
MPSS_Users_Guide.pdf | Complete Users Guide for MPSS for Linux (English) | October 2014 | ~2MB |
SCIF_UserGuide.pdf | SCIF User guide | October 2014 | ~700KB |
license.txt | INTEL SOFTWARE LICENSE AGREEMENT for Intel® Manycore Platform Software Stack (Intel® MPSS) | September 2013 | ~30KB |
MPSS version | Downloads available | Size | MD5 Checksum |
---|---|---|---|
mpss-3.4.1-windows.zip (released: October 22 2014) | Microsoft* Windows | ~310MB | 27b8c2ced28569b58c9d00255bc3219f |
| Software for Coprocessor OS (k1om) (mpss-3.4.1-k1om.tar) | ~700MB | 23d3db962c2abc659945598aa6793374 |
Documentation link | Description | Last Updated On | Size |
---|---|---|---|
releaseNotes-windows.txt | English - release notes | October 2014 | ~30KB |
readme-windows.pdf | English (includes installation instructions) for Microsoft* Windows | October 2014 | ~620KB |
MPSS_Users_Guide-windows.pdf | User, Cluster and Advanced Configuration Guide for MPSS | October 2014 | ~2MB |
Hello, everyone. I've been lurking on the forums for a few days now while I schemed up a cooling solution for my shiny new 31S1P.
I'm pretty sure I've conquered the cooling requirements. Check!
However, I cannot get the card to work correctly. I'm using a Z97-WS motherboard with "4G Decoding" enabled in the BIOS settings. The CPU is a Celeron G1820 which is a cheap little lga1150 socket CPU that seemed to be enough for this rig. I'm running the latest BIOS (2403, I believe from 2015-06-18 or thereabouts), latest version of CentOS 7.1, which is 7.1.1503 (Core).
I've followed all of the advice and forums I could find online about this issue, to little avail. Here is a piece of my console log showing the relevant information I am likely to be asked to provide if I don't do it here:
----------------------------------------------------------------------------------------
[root@x mpss-3.5.2]# dmesg | grep MSI
[ 0.102438] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
[ 0.408378] pcieport 0000:00:01.0: irq 40 for MSI/MSI-X
[ 0.408786] pcieport 0000:01:00.0: irq 41 for MSI/MSI-X
[ 0.408881] pcieport 0000:02:08.0: irq 42 for MSI/MSI-X
[ 0.408972] pcieport 0000:02:10.0: irq 43 for MSI/MSI-X
[ 0.409070] pcieport 0000:06:00.0: irq 44 for MSI/MSI-X
[ 0.409184] pcieport 0000:07:01.0: irq 45 for MSI/MSI-X
[ 0.409349] pcieport 0000:07:02.0: irq 46 for MSI/MSI-X
[ 0.409465] pcieport 0000:07:03.0: irq 47 for MSI/MSI-X
[ 0.409579] pcieport 0000:07:04.0: irq 48 for MSI/MSI-X
[ 0.409692] pcieport 0000:07:05.0: irq 49 for MSI/MSI-X
[ 0.409808] pcieport 0000:07:06.0: irq 50 for MSI/MSI-X
[ 0.409920] pcieport 0000:07:07.0: irq 51 for MSI/MSI-X
[ 0.452551] xhci_hcd 0000:00:14.0: irq 52 for MSI/MSI-X
[ 0.518593] xhci_hcd 0000:10:00.0: irq 53 for MSI/MSI-X
[ 0.518597] xhci_hcd 0000:10:00.0: irq 54 for MSI/MSI-X
[ 0.518600] xhci_hcd 0000:10:00.0: irq 55 for MSI/MSI-X
[ 0.710232] e1000e 0000:00:19.0: irq 56 for MSI/MSI-X
[ 0.825566] igb 0000:0d:00.0: irq 57 for MSI/MSI-X
[ 0.825570] igb 0000:0d:00.0: irq 58 for MSI/MSI-X
[ 0.825573] igb 0000:0d:00.0: irq 59 for MSI/MSI-X
[ 0.825577] igb 0000:0d:00.0: irq 60 for MSI/MSI-X
[ 0.825581] igb 0000:0d:00.0: irq 61 for MSI/MSI-X
[ 0.855040] igb 0000:0d:00.0: Using MSI-X interrupts. 2 rx queue(s), 2 tx queue(s)
[ 0.984604] i915 0000:00:02.0: irq 62 for MSI/MSI-X
[ 1.187177] ahci 0000:00:1f.2: irq 63 for MSI/MSI-X
[ 1.189283] ahci 0000:0a:00.0: irq 64 for MSI/MSI-X
[ 1.190251] ahci 0000:0f:00.0: irq 65 for MSI/MSI-X
[ 12.487762] mei_me 0000:00:16.0: irq 66 for MSI/MSI-X
[ 12.702815] snd_hda_intel 0000:00:03.0: irq 67 for MSI/MSI-X
[ 12.702983] snd_hda_intel 0000:00:1b.0: irq 68 for MSI/MSI-X
[root@x mpss-3.5.2]# lspci | grep -i coproc
03:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor 31S1 (rev 11)
[root@x mpss-3.5.2]# lspci -s 03:00.0 -vv
03:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor 31S1 (rev 11)
Subsystem: Intel Corporation Device 2500
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 255
Region 0: Memory at <unassigned> (64-bit, prefetchable) [disabled] [size=8G]
Region 4: Memory at bf200000 (64-bit, non-prefetchable) [disabled] [size=128K]
Capabilities: [44] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [4c] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <4us, L1 unlimited
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [88] MSI: Enable- Count=1/16 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [98] MSI-X: Enable- Count=16 Masked-
Vector table: BAR=4 offset=00017000
PBA: BAR=4 offset=00018000
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
[root@x mpss-3.5.2]# dmesg | grep mic
[ 0.000000] CPU0 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.061159] CPU1 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.068898] atomic64 test passed for x86-64 platform with CX8 and with SSE
[ 0.089803] ACPI: Dynamic OEM Table Load:
[ 0.091965] ACPI: Dynamic OEM Table Load:
[ 0.093790] ACPI: Dynamic OEM Table Load:
[ 0.387895] microcode: CPU0 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.387899] microcode: CPU1 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.387920] microcode: Microcode Update Driver: v2.00 <tigran@aivazian.fsnet.co.uk>, Peter Oruba
[ 0.526732] mousedev: PS/2 mouse device common for all mice
[ 0.710216] e1000e 0000:00:19.0: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
[ 3.071226] usb 5-2: ep 0x81 - rounding interval to 1024 microframes, ep desc says 2040 microframes
[ 3.439111] usb 5-2.1: ep 0x81 - rounding interval to 64 microframes, ep desc says 80 microframes
[ 3.439113] usb 5-2.1: ep 0x82 - rounding interval to 1024 microframes, ep desc says 2040 microframes
[root@x mpss-3.5.2]# micinfo
MicInfo Utility Log
Created Mon Aug 17 04:01:04 2015
System Info
HOST OS : Linux
OS Version : 3.10.0-229.el7.x86_64
Driver Version : NotAvailable
MPSS Version : 3.5.2
Host Physical Memory : 16141 MB
micinfo: No devices found : host driver is not loaded: No such file or directory
[root@x j]# depmod
[root@x j]# modprobe mic
modprobe: FATAL: Module mic not found.
[root@x j]# service mpss start
Starting mpss (via systemctl): [ OK ]
[root@x j]# micctrl -s
[Error] micrasrelmond: State failed - non existent MIC device
[root@x j]#
----------------------------------------------------------------------------------------
If I can get it to show up in a dmesg | grep mic output again, I'll post it here; I've gotten that output to vary a little.
I don't have a special BIOS from ASUS, but as I said above, it is the latest available and it only came out a few weeks ago. Could this be an instance of what Frances was talking about here? https://software.intel.com/en-us/forums/topic/538897#comment-1811230
In other words, MSI-X doesn't appear to be operative for my 31S1P. I cannot for the life of me figure out how to force it to be enabled. Is this going to require recompiling my kernel?
If anyone has any ideas, I'm all ears/eyes.
Thanks!
Hello,
Could you please take a look at this problem? My machine has 16 CPUs and 4 MICs (47 cores each), and I run my program with 8 MPI processes (mpi_comm_size = 8); I want to use MKL routines in automatic offload (AO) mode. As you can see in the attached test code, I tried three different methods.
METHOD-1: I assign one of the 4 MICs to each of the first 4 CPUs and let the other CPUs run without a MIC. In this case the program works as expected, and I got the following performance result when solving zgemm for 5k*5k complex dense matrices.
CPU_ID 0 1 2 3 4 5 6 7
time(s) 1.67 1.93 1.97 1.93 13.85 12.94 12.94 12.93
METHOD-2: Now, this is the problematic case. I want all 8 CPUs to share the 4 MICs equally, expecting each CPU to take about 4 seconds for the same zgemm problem as in METHOD-1. However, this method does not work: it produces error messages either right away or after solving the first zgemm problem,
*** glibc detected *** ../../../bin/test: malloc(): memory corruption: 0x00007f59fc000010 ***
or
CPU_ID 0 1 2 3 4 5 6 7
time(s) 101 10 101 95 26 25 14 14
*** glibc detected *** ../../../bin/test: free(): corrupted unsorted chunks: 0x0000000009f47270 ***
METHOD-3: If I replace mkl_mic_set_workdivision() with mkl_mic_set_resource_limit(), the program does not crash, but there is no response at all; I see that CPU and MIC usage are almost zero.
Please take a look at the attached piece of my code and give some advice.
Thank you.
Hello,
I am having a really hard time figuring out how to use the Xeon Phi offload mode from within MATLAB MEX files under Linux. I have managed to force MATLAB to use icc for compilation and verified that the MEX files run fine. The problems start when using the offload pragma: as far as I can tell, nobody has tried this yet, and I suspect it is some (fixable?) issue with libraries. Can someone here help me with this?
Consider the following simple code
int main() {
    __attribute__((target(mic:0))) int vsize;
    #pragma offload target(mic:0)
    vsize = 10;
}
When I execute this with OFFLOAD_REPORT=3, I get the following output
$ ./test
[Offload] [HOST] [State] Initialize logical card 0 = physical card 0
[Offload] [MIC 0] [File] test.c
[Offload] [MIC 0] [Line] 23
[Offload] [MIC 0] [Tag] Tag 0
[Offload] [HOST] [Tag 0] [State] Start target
[Offload] [HOST] [Tag 0] [State] Setup target entry: __offload_entry_test_c_23mainicc0101288930704RqbsVt
[Offload] [HOST] [Tag 0] [State] Host->target pointer data 0
[Offload] [HOST] [Tag 0] [Signal] signal : none
[Offload] [HOST] [Tag 0] [Signal] waits : none
[Offload] [HOST] [Tag 0] [State] Host->target pointer data 0
[Offload] [HOST] [Tag 0] [State] Host->target copyin data 4
[Offload] [HOST] [Tag 0] [State] Execute task on target
[Offload] [HOST] [Tag 0] [State] Target->host pointer data 0
[Offload] [MIC 0] [Tag 0] [State] Start target entry: __offload_entry_test_c_23mainicc0101288930704RqbsVt
[Offload] [MIC 0] [Tag 0] [Var] vsize INOUT
[Offload] [HOST] [Tag 0] [CPU Time] 0.301827(seconds)
[Offload] [MIC 0] [Tag 0] [CPU->MIC Data] 4 (bytes)
[Offload] [MIC 0] [Tag 0] [MIC Time] 0.000171(seconds)
[Offload] [MIC 0] [Tag 0] [MIC->CPU Data] 4 (bytes)
[Offload] [MIC 0] [Tag 0] [State] Target->host copyout data 4
I have written a MEX file that does the same thing. FYI, a MEX file is essentially a dynamic .so library with one specific symbol exported. The result of running the MEX file under MATLAB is as follows
>> mictest
[Offload] [HOST] [State] Initialize logical card 0 = physical card 0
offload error: cannot load library to the device 0 (error code 5)
------------------------------------------------------------------------
Segmentation violation detected at Fri Aug 21 14:57:31 2015
------------------------------------------------------------------------
[...]
I have looked around and tried to set the OFFLOAD_INIT=on_start variable before starting MATLAB. The results were VERY promising, but still some problems remain unsolved:
[Offload] [MIC 0] [File] mictest_mex.c
[Offload] [MIC 0] [Line] 41
[Offload] [MIC 0] [Tag] Tag 0
[Offload] [HOST] [Tag 0] [State] Start target
[Offload] [HOST] [Tag 0] [State] Setup target entry: __offload_entry_mictest_mex_c_41mexFunctionicc0104735023118W8NJ2
[Offload] [HOST] [Tag 0] [State] Host->target pointer data 0
[Offload] [HOST] [Tag 0] [Signal] signal : none
[Offload] [HOST] [Tag 0] [Signal] waits : none
[Offload] [HOST] [Tag 0] [State] Host->target pointer data 0
[Offload] [HOST] [Tag 0] [State] Host->target copyin data 4
[Offload] [HOST] [Tag 0] [State] Execute task on target
offload error: cannot create pipeline on the device 0 (error code 14)
So it seems that the MIC is indeed doing something, but one last step is missing to make this work. The library paths and the whole bash environment are the same in both cases. I have also looked at the output of the nm command, and it seems that in both cases (C standalone and MATLAB MEX) the number and names of symbols containing the word 'offload' are the same or similar.
I think this can be solved: I have seen a document about MKL using offload inside MATLAB, alas for Windows. Does anybody have a clue where to start?
Thanks a lot!
Marcin Krotkiewski
Hi Intel forums,
I've had difficulty reproducing the performance reported on the following page:
https://www-ssl.intel.com/content/www/us/en/benchmarks/server/xeon-phi/xeon-phi-sgemm-dgemm.html
Using the mkl sgemm routine on my 3120 series Xeon Phi, I haven't even approached the 1.7 TFLOP/S level claimed above. The best performance I achieve is ~0.7 TFLOP/S. Presumably, this is because I don't fully understand the threading and vectorization APIs, and I'm not using them optimally. I was wondering if anyone knows where to find the source & environment details used for Intel's official benchmark. Maybe I could compare "correct" usage with my code to better understand the tools.
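For what it's worth, the arithmetic behind those figures is simple: sgemm computes C = alpha*A*B + beta*C, which for n x n matrices is about 2n^3 floating-point operations, so the achieved rate is 2n^3 divided by wall time. A small plain-C helper (the sizes and times in the comment are my own illustrative numbers, not Intel's benchmark configuration):

```c
/* Effective GFLOP/S for a square sgemm: C = alpha*A*B + beta*C on n x n
   matrices performs roughly 2*n^3 floating-point operations
   (n^3 multiplies plus n^3 adds). */
double sgemm_gflops(long n, double seconds) {
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / seconds / 1e9;
}
/* Illustration: a 10000x10000 sgemm finishing in ~1.2 s would be
   ~1.7 TFLOP/S; the same run taking ~2.9 s is the ~0.7 TFLOP/S range. */
```

This at least makes it easy to check whether a given runtime is even in the right ballpark before digging into affinity and threading settings.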
Thanks,
Chris
Hello,
Could you please provide us with a matrix of which OFED should be used with which OS distribution for the latest MPSS (3.5.2, Linux)? We need to know in which cases MPSS supports the bundled OFED and in which cases it is better to use an alternative OFED distribution.
Best regards,
Taras
Hello,
I have been working with the Knights Corner platform for some time. As is done with libnuma and DPDK, I have been wondering if I could write cache- and memory-controller-aware memory allocation code for Xeon Phi. Last time I asked, I didn't get much information on the subject (https://software.intel.com/en-us/comment/1799811#comment-1799811), but then I came across this while browsing through the datasheet (http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/x...).
"Communication around the ring follows a Shortest Distance Algorithm (SDA). Coresident with each core structure is a portion of a distributed tag directory. These tags are hashed to distribute workloads across the enabled cores. Physical addresses are also hashed to distribute memory accesses across the memory controllers."
I believe a full description of this scheme is the answer I'm seeking.
If someone from Intel could provide a more detailed explanation, or point me to where I could find one, I would be very grateful. More honestly, I NEED to know this.
For instance,
a) How does it hash physical addresses? Does it divide the 40-bit physical address space by the cache-line size (64B) and distribute the 0x400000000 (2^34) cache lines across the DTDs by performing a modulo operation on the ordinal number of each cache line?
b) Is the L2 address-space segmentation in PA space somewhat preserved in VA space as well? For instance, would every 60th cache line belong to a specific core's L2 tag directory?
c) Which of the following does "enabled cores" mean: i) all on-board cores, ii) cores with any executing threads, or iii) cores not disabled by some means I'm not aware of? If iii) is the case, how do you disable a core?
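To make question a) concrete, here is the scheme I am hypothesizing, in plain C. The modulo hash and the core count are my assumptions, not anything documented in the datasheet:

```c
#include <stdint.h>

/* HYPOTHETICAL mapping, not Intel's documented hash: question a) asks
   whether the distributed tag directory (DTD) home of a physical address
   is simply the cache-line ordinal modulo the number of enabled cores. */
enum {
    CACHE_LINE    = 64,  /* bytes per cache line */
    ENABLED_CORES = 60   /* assumed: a Xeon Phi 5110P has 60 cores */
};

uint32_t dtd_home(uint64_t paddr) {
    uint64_t line = paddr / CACHE_LINE;          /* strip 6 offset bits */
    return (uint32_t)(line % ENABLED_CORES);     /* round-robin over cores */
}
```

If the real hash is something else (for instance an XOR-fold over upper address bits), this simple modulo model would of course not match it; that is exactly what I am hoping someone can confirm or correct.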
For your information, I am currently using Xeon Phi 5110P, and could possibly be purchasing/using more 31S1Ps.
Thank you for your attention.
Jun
Intel® Parallel Studio XE 2016, launched on August 25, 2015, is the latest installment in our developer toolkit for high performance computing (HPC) and technical computing applications. This suite of compilers, libraries, debugging facilities, and analysis tools targets Intel® architecture, including support for the latest Intel® Xeon® processors (codenamed Skylake) and Intel® Xeon Phi™ processors (codenamed Knights Landing). Intel® Parallel Studio XE 2016 helps software developers design, build, verify, and tune code in Fortran, C++, C, and Java.
There are four things that I like to highlight when I describe this year's tool release:
Data scientists are finding Intel® DAAL very exciting because it helps speed up big data analytics. It’s designed for use with popular data platforms including Hadoop, Spark, R, and Matlab, for highly efficient data access. We’ve seen Intel DAAL accelerate PCA by 4-7X, and one customer has seen 200X for the Alternating Least Squares prediction algorithm, compared with the latest open source Spark + MLlib (details for both claims are in my blog about DAAL). Intel DAAL was created by the renowned team behind the Intel® Math Kernel Library (Intel® MKL). Intel DAAL can be thought of as “Intel MKL for Big Data” – but it is actually much more! Many more details on Intel DAAL, including ways to download it today for free, are in my blog about DAAL. Intel DAAL is available for Linux, OS X, and Windows.
Vectorization is the process of using SIMD instructions in processors. In the quest to “modernize” applications to get top performance out of any modern processor, a software developer needs to tackle multithreading, vectorization, and fabric scaling. Intel® Advisor XE 2016 provides tools to help with multithreading and vectorization:
Threading Advisor has gained a reputation over the past five years for helping developers find the right approach to multithreading an application more quickly and without costly oversights. The experience of refining this ‘advisor’ helped us create the new advisor for vectorization, drawing on our knowledge of the best ways to give advice based on program analysis.
Vectorization Advisor cannot tell you anything we could not show you how to do yourself. However, when I teach ‘vectorization’ I tend to rattle off a list of things to check, and each item involves using a tool in a particular way. Bringing all of that into one tool makes life easier and definitely makes the process faster and more efficient. One of the key Vectorization Advisor features is a Survey Report that offers integrated compiler report data and performance data all in one place, including GUI-embedded advice on how to fix vectorization issues specific to your code. This page augments that GUI-embedded advice with links to web-based vectorization resources.
An excellent 12-minute introduction to the Vectorization Advisor is available as a video online.
The MPI Performance Snapshot is a scalable lightweight performance tool for MPI applications. It collects a variety of MPI application statistics (such as communication, activity, and load balance) and presents it in an easy-to-read format. The tool is not available separately but is provided as part of the Intel® Parallel Studio XE 2016 Cluster Edition.
The MPI Performance Snapshot is trying to solve the following problems as it relates to analysis of MPI application when scaling out to thousands of ranks:
By addressing these three items, MPI Performance Snapshot improves scaling to at least 32K ranks, an order of magnitude beyond what was tolerable with the prior Intel Trace Analyzer and Collector. For a large-scale run (anything above 1000 MPI ranks), we therefore now recommend starting with the MPI Performance Snapshot to figure out where you need to dig deeper (which processes are slowing you down, where the peaks in your memory usage are, etc.). Then do another run with the Intel Trace Analyzer and Collector on a subset of selected ranks to get more detailed per-process information, in order to visualize how a communication algorithm is implemented and to see whether there are apparent bottlenecks.
MPI Performance Snapshot combines lightweight statistics from the Intel® MPI Library with OS and hardware-level counters to provide you with high-level categorization of your application: MPI vs. OpenMP load imbalance info, memory usage, and a break-down of MPI vs. computation vs. serial time.
For more details, you should check out the full MPI Performance Snapshot User's Guide and Analyzing MPI Applications with MPI Performance Snapshot on the Intel Trace Analyzer and Collector documentation page.
The latest Intel® processors are supported, including the Skylake and Knights Landing microarchitectures.
We take pride in having very strong support for industry standards – we aim to be a leader and maintain our reputation of being second-to-none.
Our Fortran support even includes a feature from the draft Fortran 2015 standard which can help MPI-3 users. The current status of features of Fortran can be found in Dr. Fortran’s blog “Intel® Fortran Compiler - Support for Fortran language standards.”
The current status of C/C++ standard support features can be found in Jennifer’s blogs “C++14 Features Supported by Intel® C++ Compiler” and “C11 Support in Intel C++ Compiler.”
Our OpenMP support is detailed in the latest user guide for the C/C++ compiler and the latest user guide for the Fortran compiler.
Operating system support includes Debian 7.0, 8.0; Fedora 21, 22; Red Hat Enterprise Linux 5, 6, 7; SuSE LINUX Enterprise Server 11, 12; Ubuntu 12.04 LTS (64-bit only), 13.10, 14.04 LTS, 15.04; OS X 10.10; Windows 7 through 10; and Windows Server 2008-2012. These are just the versions we have tested; many additional operating systems should work (for instance, CentOS).
There is a series of webinars being held starting in September 2015 which cover many topics related to Intel Parallel Studio XE 2016. The webinars can be attended live, and offer interactive question and answer time. The webinars will also be available for replay after the live webinar is held. The first webinar is on September 1 – “What’s New in Intel® Parallel Studio XE 2016?”
Many more ways to learn more are on the Intel® Parallel Studio XE 2016 website. A number of benchmarks illustrating performance measurements are online as well.
There are many new features that I did not dive into, including great new support for MPI+OpenMP tuning with Intel VTune Amplifier XE, as well as a number of enhancements to Intel® Threading Building Blocks, including the increasingly popular flow graph capabilities and task arenas.
An evaluation copy can be obtained by requesting an evaluation copy of Intel® Parallel Studio XE 2016. It is available for purchase worldwide.
Students, educators, academic researchers and open source contributors may qualify for some free tools.
The Intel Performance Libraries are also available via the Community Licensing for Intel Performance Libraries. Under this option, the libraries are free for anyone who registers, with no royalties and no restrictions on company or project size. The community licensing program offers the current versions of the libraries without Intel® Premier Support access. (Intel® Premier Support offers exclusive 1-on-1 support via an interactive, secure web site where you can submit questions or problems and monitor previously submitted issues; it requires registration after purchase of the software, or special qualification offered to students, educators, academic researchers, and open source contributors.)
Intel® Parallel Studio XE is a very popular product from Intel that includes the Intel Compilers, Intel Performance Libraries, tools for analysis, debugging and tuning, tools for MPI and the Intel MPI Library. Did you know that some of these are available for free?
Here is a guide to “what is available free” from the Intel Parallel Studio XE suites.
Who | What is Free? | Information | Where? |
---|---|---|---|
Community Licenses for Everyone | Intel® Math Kernel Library, Intel® Data Analytics Acceleration Library, Intel® Threading Building Blocks, Intel® Integrated Performance Primitives | Community Licensing for Intel Performance Libraries – free for all, registration required, no royalties, no restrictions on company or project size, current versions of libraries, no Intel Premier Support access. (Linux, Windows or OS X versions) | Community Licensing for Intel Performance Libraries |
Evaluation Copies for Everyone | Compilers, libraries and analysis tools (most everything!) | Evaluation Copies – Try before you buy. (Linux, Windows or OS X versions) | Try before you buy |
Use as an Academic Researcher | Linux, Windows or OS X versions of: Intel® Math Kernel Library Intel® Data Analytics Acceleration Library Intel® Threading Building Blocks Intel® Integrated Performance Primitives Intel® MPI Library (not available for OS X) | For use in conjunction with academic research at institutions of higher education. (Linux, Windows or OS X versions, except Intel® MPI Library, which is not supported on OS X) | Qualify for Use as an Academic Researcher |
Student | Compilers, libraries and analysis tools (most everything!) | If you are a current student at a degree-granting institution. (Linux, Windows or OS X versions) | Qualify for Use as a Student |
Teacher | Compilers, libraries and analysis tools (most everything!) | If you will use in a teaching curriculum. (Linux, Windows or OS X versions) | Qualify for Use as an Educator |
Use as an Open Source Contributor | Intel® Parallel Studio XE Professional Edition for Linux | If you are a developer actively contributing to open source projects, and that is why you will utilize the tools. (Linux versions) | Qualify for Use as an Open Source Contributor |
Free licenses for certain users have always been an important dimension in our offerings. One thing that really distinguishes Intel is that we sell excellent tools and provide second-to-none support for software developers who buy our tools. We provide multiple options - and we hope you will find exactly what you need in one of our options.
Intel® Xeon Phi™ coprocessor is a product based on the Intel® Many Integrated Core Architecture (Intel® MIC). Intel® offers a debug solution for this architecture that can debug applications running on an Intel® Xeon Phi™ coprocessor.
There are many reasons why a debug solution for Intel® MIC is needed. Some of the most important are the following:
For Linux* host, Intel offers a debug solution for Intel® MIC which is based on GNU* GDB. It can be used on the command line for both host and coprocessor. There is also an Eclipse* IDE integration that eases debugging of applications with hundreds of threads thanks to its user interface. It also supports debugging offload enabled applications.
There are currently two ways to obtain Intel’s debug solution for Intel® MIC Architecture on Linux* host:
Both packages contain debug solutions for Intel® MIC Architecture!
Attention:
Never mix debugging tools from Intel® Parallel Studio XE with the ones from Intel® Manycore Platform Software Stack! Use all tools from the very same package. Different packages might have different debugger versions with different feature sets.
Note:
Intel® Composer XE 2013 SP1 contains GNU* GDB 7.5. Intel® Parallel Studio XE 2015 ships GNU* GDB 7.7, and Intel® Parallel Studio XE 2015 Update 2 ships GNU* GDB 7.8 (host only; 7.7 for the coprocessor). Intel® Parallel Studio XE 2016 contains GNU* GDB 7.8 for both host & coprocessor.
MPSS versions ship different versions of GNU* GDB – please check the Release Notes of the individual MPSS releases.
There has been a change in product naming: Intel® Parallel Studio XE Composer Edition is the successor of Intel® Composer XE, starting with 2015.
The debug solution from Intel provides support for the latest Intel hardware and features!
The command line with GNU* GDB has the following advantages:
Using the Eclipse* IDE provides more features:
Intel’s GNU* GDB, starting with version 7.5, provides additional extensions that are available:
The features for Intel® MIC Architecture highlighted above are described in the following.
Note that newer GNU* GDB versions with more features are already available, but those do not add anything in addition for Intel® MIC Architecture.
Compared to Intel® architecture on host systems, Intel® MIC Architecture comes with a different instruction and register set. Intel’s GNU* GDB comes with transparently integrated support for those. Use is no different than with host systems, e.g.:
(gdb) disassemble $pc, +10
Dump of assembler code from 0x11 to 0x24:
0x0000000000000011 <foobar+17>: vpackstorelps %zmm0,-0x10(%rbp){%k1}
0x0000000000000018 <foobar+24>: vbroadcastss -0x10(%rbp),%zmm0
⁞
(gdb) info registers zmm
k0     0x0    0
⁞
zmm31  {v16_float = {0x0 <repeats 16 times>}, v8_double = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
        v64_int8 = {0x0 <repeats 64 times>}, v32_int16 = {0x0 <repeats 32 times>},
        v16_int32 = {0x0 <repeats 16 times>}, v8_int64 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
        v4_uint128 = {0x0, 0x0, 0x0, 0x0}}
If you use the Eclipse* IDE integration you’ll get the same information in dedicated windows:
A quick excursion about what data races are:
int a = 1;
int b = 2;

/* executed concurrently (time flows downwards): */
int thread1() {            int thread2() {
    return a + b;              b = 42;
}                          }
What are typical symptoms of data races?
GDB data race detection points out unsynchronized data accesses. Not all of them necessarily constitute harmful data races. It is the responsibility of the user to decide which ones are unexpected and to filter out the rest (see next).
Due to technical limitations, not all unsynchronized data accesses can be found, e.g. those in 3rd party libraries or in any object code not compiled with -debug parallel (see next).
How to detect data races?
(gdb) pdbx enable
(gdb) c
data race detected
1: write shared, 4 bytes from foo.c:36
3: read shared, 4 bytes from foo.c:40
Breakpoint -11, 0x401515 in L_test_..._21 () at foo.c:36
*var = 42; /* bp.write */
Data race detection requires an additional library libpdbx.so.5:
Supported parallel programming models:
Data race detection can be enabled/disabled at any time
There is finer grained control for minimizing overhead and selecting code sections to analyze by using filter sets.
More control about what to analyze with filters:
(gdb) pdbx filter line foo.c:36
(gdb) pdbx filter code 0x40518..0x40524
(gdb) pdbx filter var shared
(gdb) pdbx filter data 0x60f48..0x60f50
(gdb) pdbx filter reads # read accesses
(gdb) pdbx fset suppress
(gdb) pdbx fset focus
(gdb) help pdbx
Use cases for filters:
Some additional hints using PDBX:
(gdb) run
data race detected
1: write question, 4 bytes from foo.c:36
3: read question, 4 bytes from foo.c:40
Breakpoint -11, 0x401515 in foo () at foo.c:36
*answer = 42;
(gdb)
Note:
PDBX is not available for Eclipse* IDE and will only work for remote debugging of native coprocessor applications. See section Debugging Remotely with PDBX for more information on how to use it.
There are multiple versions available:
Debug natively on Intel® Xeon Phi™ coprocessor
This version of Intel’s GNU* GDB runs natively on the coprocessor. It is included in Intel® MPSS only and needs to be made available on the coprocessor first in order to run it. Depending on the MPSS version it can be found at the provided location:
Execute GNU* GDB on host and debug remotely
There are two ways to start GNU* GDB on the host and debug remotely using GDBServer on the coprocessor:
$ source compilervars.[sh|csh] [ia32|intel64]
$ gdb-mic
Sourcing the debugger environment is only needed once. If you have already sourced the corresponding compilervars.[sh|csh] script, you can omit this step; gdb-mic should then already be in your default search path.
Attention: Do not mix GNU* GDB & GDBServer from different packages! Always use both from either Intel® MPSS or Intel® Parallel Studio XE Composer Edition!
$ scp /usr/linux-k1om-4.7/linux-k1om/usr/bin/gdb mic0:/tmp
$ ssh -t mic0 /tmp/gdb
$ ssh -t mic0 /usr/bin/gdb
(gdb) attach <pid>
(gdb) file <path_to_application>
Some additional hints:
(gdb) set env LD_LIBRARY_PATH=/tmp/
(gdb) set substitute-path <from> <to>
Debugging is no different than on host thanks to a real Linux* environment on the coprocessor!
$ scp /usr/linux-k1om-4.7/linux-k1om/usr/bin/gdbserver mic0:/tmp
$ scp <install-dir>/debugger_2016/gdb/targets/mic/bin/gdbserver mic0:/tmp
$ source compilervars.[sh|csh] [ia32|intel64]
$ gdb-mic
(gdb) target extended-remote | ssh -T mic0 /tmp/gdbserver --multi -
(gdb) set sysroot /opt/mpss/3.1.4/sysroots/k1om-mpss-linux/
(gdb) file <path_to_application> (gdb) attach <pid>
(gdb) file <path_to_application> (gdb) set remote exec-file <remote_path_to_application>
Some additional hints:
(gdb) target extended-remote | ssh mic0 LD_LIBRARY_PATH=/tmp/ /tmp/gdbserver --multi -
(gdb) set substitute-path <from> <to>
(gdb) set solib-search-path <lib_paths>
Debugging is no different than on host thanks to a real Linux* environment on the coprocessor!
PDBX has some pre-requisites that must be fulfilled for proper operation. Use the pdbx check command to see whether PDBX is working:
(gdb) pdbx check
checking inferior...failed.
(gdb) pdbx check
checking inferior...passed.
checking libpdbx...failed.
(gdb) pdbx check
checking inferior...passed.
checking libpdbx...passed.
checking environment...failed.
Intel offers an Eclipse* IDE debugger plug-in for Intel® MIC that has the following features:
The plug-in is part of both Intel® MPSS and Intel® Parallel Studio XE Composer Edition.
In order to use the provided plug-in the following pre-requisites have to be met:
We recommend: Eclipse* IDE for C/C++ Developers (4.5)
Install Intel® C++ Compiler plug-in (optional):
Add plug-in via “Install New Software…”:
This Plug-in is part of Intel® Parallel Studio XE Composer Edition (<install-dir>/ide_support_2016/eclipse/compiler_xe/). It adds Intel® C++ Compiler support which is not mandatory for debugging. For Fortran the counterpart is the Photran* plug-in. These plug-ins are recommended for the best experience.
Note:
Uncheck “Group items by category”, as the list will be empty otherwise!
In addition, it is recommended to disable checking for latest versions. If not done, installation could take unnecessarily long and newer components might be installed that did not come with the vanilla Eclipse package. Those could cause problems.
Add plug-in via “Install New Software…”:
Plug-in is part of:
Note:
Uncheck “Group items by category”, as the list will be empty otherwise!
In addition, it is recommended to disable checking for latest versions. If not done, installation could take unnecessarily long and newer components might be installed that did not come with the vanilla Eclipse package. Those could cause problems.
Debugging offload enabled applications is not much different from debugging applications running natively on the host:
This is an example (Fortran) of what offload debugging looks like. On the left side we see host & mic0 threads running. One thread (11) from the coprocessor has hit the breakpoint we set inside the loop of the offloaded code. Run control (stepping, continuing, etc.), setting breakpoints, evaluating variables/memory, … work as they used to.
For debugging offload enabled applications additional environment variables need to be set:
Set those variables before starting Eclipse* IDE!
Those are currently needed but might become obsolete in the future.
For MPSS 2.1, please be aware that the debugger cannot and should not be used in combination with Intel® VTune™ Amplifier XE (i.e. with COI_SEP_DISABLE=FALSE). Hence, disabling SEP (which is part of Intel® VTune™ Amplifier XE) is valid.
For MPSS 3.*, AMPLXE_COI_DEBUG_SUPPORT=TRUE extracts K1OM object code map files from fat SOs (containing both host & K1OM object code) and places them under /tmp/coi_procs/<card #>/<process ID>/load_lib/ on the coprocessor. This is required not only for Intel® VTune™ Amplifier XE but also for the debugger. Additionally, use the mic_extract tool to extract K1OM object code from fat SOs on the host (where the Eclipse* IDE runs). Otherwise the current debugger won’t find the K1OM object code on the host, e.g.:
$ mic_extract libx.so
If libx.so contains K1OM object code as well, another file is created alongside libx.so, like libxMIC.so. The latter contains the K1OM object code. See https://software.intel.com/en-us/node/524818 for more information.
In addition, the watchdog monitor must be disabled because a debugger can stop execution for an unspecified amount of time. Hence the system watchdog might assume that a debugged application, if not reacting anymore, is dead and will terminate it. For debugging we do not want that.
Note:
Do not set those variables for a production system!
For Intel® MPSS 3.2 and later:
MYO debug libraries are no longer installed with Intel MPSS 3.2 by default. This is a change from earlier Intel MPSS versions. Users must install the MYO debug libraries manually in order to debug MYO enabled applications using the Eclipse plug-in for offload debugging. For Intel MPSS 3.2 (and later) the MYO debug libraries can be found in the package mpss-myo-dbg-* which is included in the mpss-*.tar file.
MPSS 3.2 and 3.2.1 do not support offload debugging with Intel® Composer XE 2013 SP1, please see Errata for more information!
Configure Remote System Explorer
To debug native coprocessor applications we need to configure the Remote System Explorer (RSE).
Note:
Before you continue, make sure SSH works (e.g. via command line). You can also specify different credentials (user account) via RSE and save the password.
The basic steps are quite simple:
Repeat this step for each coprocessor!
Transfer GDBServer
Transfer of the GDBServer to the coprocessor is required for remote debugging. We choose /tmp/gdbserver as target on the coprocessor here (important for the following sections).
Copy GDBServer to coprocessor, e.g.:
$ scp /usr/linux-k1om-4.7/linux-k1om/usr/bin/gdbserver mic0:/tmp
$ scp <install-dir>/debugger_2016/gdb/targets/mic/bin/gdbserver mic0:/tmp
Debug Configuration
To debug a native coprocessor application (here: native_c++), create a new debug configuration of type C/C++ Remote Application.
Set Connection to the coprocessor target configured with RSE before (here: mic0).
Specify the remote path of the application, wherever it was copied to (here: /tmp/native_c++). We’ll address how to manually transfer files later.
Set the flag for “Skip download to target path.” if you don’t want the debugger to upload the executable to the specified path. This can be meaningful if you have complex projects with external dependencies (e.g. libraries) and don’t want to manually transfer the binaries.
(for MPSS 3.1.2 or 3.1.4, please see Errata)
Note that we use C/C++ Remote Application here. This is also true for Fortran applications because there’s no remote debug configuration section provided by the Photran* plug-in!
In Debugger tab, specify the provided Intel GNU* GDB for Intel® MIC (here: gdb-mic).
In the above example, set sysroot from MPSS installation in .gdbinit, e.g.:
set sysroot /opt/mpss/3.1.4/sysroots/k1om-mpss-linux/
Note:
See section Debugging on Command Line above for the correct path of GDBServer, depending on the chosen package (Intel® MPSS or Intel® Parallel Studio XE Composer Edition)!
In Debugger/Gdbserver Settings tab, specify the uploaded GDBServer (here: /tmp/gdbserver).
Configuration depends on the installed plug-ins. For C/C++ applications we recommend installing the Intel® C++ Compiler XE plug-in that comes with Intel® Parallel Studio XE Composer Edition. For Fortran, install Photran* (3rd party) and select the Intel® Fortran Compiler manually.
Make sure to use the debug configuration and provide options as if debugging on the host (-g). Optionally, disabling optimizations with -O0 can make the instruction flow comprehensible when debugging.
The only difference compared to host builds is that you need to cross-compile for the coprocessor: use the -mmic option, e.g.:
After configuration, clean your build. This is needed because Eclipse* IDE might not notice all dependencies. And finally, build.
Note:
The configuration dialog shown exists only for the Intel® C++ Compiler plug-in. For Fortran, users need to install the Photran* plug-in, switch the compiler/linker to ifort by hand, and add -mmic manually. This has to be done for both the compiler & linker!
Transfer the executable to the coprocessor, e.g.:
Note:
It is crucial that the executable can be executed on the coprocessor. In some cases the execution bits might not be set after copying.
Start debugging using the C/C++ Remote Application created in the earlier steps. It should connect to the coprocessor target and launch the specified application via the GDBServer. Debugging is the same as for local/host applications.
Note:
This works for coprocessor native Fortran applications the exact same way!
More information can be found in the official documentation:
The PDF gdb.pdf is the original GNU* GDB manual for the base version Intel ships, extended by all features added. So, this is the place to get help for new commands, behavior, etc.
README-INTEL from Intel® MPSS contains a short guide how to install and configure the Eclipse* IDE plug-in.
PDF eclmigdb_config_guide.pdf provides an overall step-by-step guide how to debug with the command line and with Eclipse* IDE.
Using Intel® C++ Compiler with the Eclipse* IDE on Linux*:
http://software.intel.com/en-us/articles/intel-c-compiler-for-linux-using-intel-compilers-with-the-eclipse-ide-pdf/
The knowledgebase article (Using Intel® C++ Compiler with the Eclipse* IDE on Linux*) is a step-by-step guide on how to install, configure and use the Intel® C++ Compiler with the Eclipse* IDE.
Intel® Xeon Phi™ coprocessor is a product based on the Intel® Many Integrated Core Architecture (Intel® MIC). Intel® offers a debug solution for this architecture that can debug applications running on an Intel® Xeon Phi™ coprocessor.
There are many reasons why a debug solution for Intel® MIC is needed. Some of the most important are the following:
For Windows* host, Intel offers a debug solution, the Intel® Debugger Extension for Intel® MIC Architecture Applications. It supports debugging offload enabled applications as well as native Intel® MIC applications running on the Intel® Xeon Phi™ coprocessor.
To obtain Intel® Debugger Extension for Intel® MIC Architecture on Windows* host, you need the following:
Debug solution from Intel® based on GNU* GDB:
Note:
Pure native debugging on the coprocessor is also possible by using Intel’s version of GNU* GDB for the coprocessor. This is covered in the following article for Linux* host:
http://software.intel.com/en-us/articles/debugging-intel-xeon-phi-applications-on-linux-host
Why integration into Microsoft Visual Studio*?
The following components are required to develop and debug for Intel® MIC Architecture:
It is crucial to make sure that the coprocessor setup is correctly working. Otherwise the debugger might not be fully functional.
Setup Intel® MPSS:
Before debugging applications with offload extensions:
Debugger integration for Intel® MIC Architecture only works when debug information is available:
Applications can only be debugged in 64-bit mode.
Start Microsoft Visual Studio* IDE and open or create an Intel® Xeon Phi™ project with offload extensions. Examples can be found in the Samples directory of Intel® Parallel Studio XE Composer Edition (formerly Intel® Composer XE), that is:
C:\Program Files (x86)\IntelSWTools\samples_2016\en
We’ll use intro_SampleC from the official C++ examples in the following.
Compile the project with Intel® C++/Fortran Compiler.
Note the mixed breakpoints here:
The ones set in the normal code (not offloaded) apply to the host. Breakpoints on offloaded code apply to the respective coprocessor(s) only.
The Breakpoints window shows all breakpoints (host & coprocessor(s)).
Start debugging as usual via menu (shown) or <F5> key:
While debugging, continue till you reach a set breakpoint in offloaded code to debug the coprocessor code.
Information of host and coprocessor(s) is mixed. In the example above, the threads window shows two processes with their threads. One process comes from the host, which does the offload. The other one is the process hosting and executing the offloaded code, one for each coprocessor.
For debugging offload enabled applications additional environment variables need to be set:
Set those variables before starting Visual Studio* IDE!
Those are currently needed but might become obsolete in the future. Please be aware that the debugger cannot and should not be used in combination with Intel® VTune™ Amplifier XE; hence, disabling SEP (which is part of Intel® VTune™ Amplifier XE) is valid. The watchdog monitor must be disabled because a debugger can stop execution for an unspecified amount of time; otherwise the system watchdog might assume that a debugged application that is no longer reacting is dead and terminate it. For debugging we do not want that.
Note:
Do not set those variables for a production system!
Create a native Intel® Xeon Phi™ coprocessor application, then transfer the application to the coprocessor target and execute it:
micnativeloadex.exe transfers the specified application to the specified coprocessor and directly executes it. The command blocks until the transferred application terminates.
micnativeloadex.exe also takes care of dependencies (i.e. libraries) and transfers them, too.
Other ways to transfer and execute native applications are also possible (but more complex):
Debugging native applications from the Visual Studio* IDE is only possible via Attach to Process…:
static int lockit = 1;
while (lockit) {
    sleep(1);
}
Only one coprocessor at a time can be debugged this way.
Open the options via TOOLS/Options… menu:
It tells the debugger extension where to find the binary and sources. This needs to be changed every time a different coprocessor native application is being debugged.
The entry solib-search-path directories works the same as the analogous GNU* GDB command. It allows mapping paths from the build system to the host system running the debugger.
The entry Host Cache Directory is used for caching symbol files. It can speed up lookup for large applications.
Open the options via TOOLS/Attach to Process… menu:
Specify the Intel(R) Debugger Extension for Intel(R) MIC Architecture. Set the IP and port the GDBServer should be executed with. The usual port for GDBServer is 2000 but we recommend using a non-privileged port (e.g. 16000).
After a short delay the processes of the coprocessor card are listed. Select one to attach.
Note:
The checkbox Show processes from all users has no function for the coprocessor, as user accounts cannot be mapped between host and target (Linux* vs. Windows*).
More information can be found in the official documentation from Intel® Parallel Studio XE Composer Edition:
C:\Program Files (x86)\IntelSWTools\documentation_2016\en\debugger\ps2016\get_started.htm
hi all,
I'm trying to build something for the Phi that depends on iconv; the library routines are present, but the following application fails when run on the Phi:
#include <stdlib.h>
#include <iconv.h>

int main () {
    iconv_t cd;
    cd = iconv_open("latin1", "UTF-8");
    if (cd == (iconv_t)(-1))
        exit(1);
    iconv_close(cd);
    exit(0);
}
If I build this using "icc -o iconv_test iconv_test.c" and run it on the host, it returns no error (exit code 0).
However, if I build it for the Phi with "icc -mmic -o iconv_test iconv_test.c", it always returns exit code 1. An strace shows the following:
open("/usr/lib64/gconv/gconv-modules.cache", O_RDONLY) = -1 ENOENT (No such file or directory)
brk(0)                                  = 0x714000
brk(0x735000)                           = 0x735000
open("/usr/lib64/gconv/gconv-modules", O_RDONLY) = -1 ENOENT (No such file or directory)
exit_group(1)
and indeed, those module files are missing - where can I find them?
I have a system with two Phi cards installed, running Red Hat 7.0. I am able to run code on the cards in pure offload mode and I can ssh into the cards. I am trying to get symmetric mode to work.
1) Does symmetric mode require OFED, or is OFED only required when there is a physical Infiniband card?
2) What are the proper steps to verify that the SCIF driver is properly loaded? mic shows up as a driver but there is no indication of anything named SCIF.
[root@infinity ~]# lsmod
Module Size Used by
mic 666166 16
vtsspp 372813 0
sep3_15 527535 0
pax 13181 0
bridge 115385 0
stp 12976 1 bridge
llc 14552 2 stp,bridge
ipt_REJECT 12541 2
xt_comment 12504 2
nf_conntrack_ipv4 14862 2
nf_defrag_ipv4 12729 1 nf_conntrack_ipv4
xt_conntrack 12760 2
nf_conntrack 105702 2 xt_conntrack,nf_conntrack_ipv4
iptable_filter 12810 1
ip_tables 27239 1 iptable_filter
intel_powerclamp 18764 0
coretemp 13435 0
intel_rapl 18773 0
kvm 461126 0
iTCO_wdt 13480 0
crct10dif_pclmul 14289 0
crc32_pclmul 13113 0
crc32c_intel 22079 0
ghash_clmulni_intel 13259 0
iTCO_vendor_support 13718 1 iTCO_wdt
cryptd 20359 1 ghash_clmulni_intel
mei_me 18646 0
sb_edac 26819 0
pcspkr 12718 0
nfsd 290215 13
mei 82723 1 mei_me
edac_core 57650 1 sb_edac
lpc_ich 21073 0
mfd_core 13435 1 lpc_ich
i2c_i801 18135 0
auth_rpcgss 59343 1 nfsd
nfs_acl 12837 1 nfsd
lockd 93977 1 nfsd
ipmi_si 53353 0
ipmi_msghandler 45603 1 ipmi_si
sunrpc 295293 15 nfsd,auth_rpcgss,lockd,nfs_acl
shpchp 37032 0
ioatdma 67762 0
acpi_power_meter 18087 0
acpi_pad 116305 0
ext4 562391 7
mbcache 14958 1 ext4
jbd2 102940 1 ext4
raid10 48128 2
sd_mod 45499 12
crc_t10dif 12714 1 sd_mod
crct10dif_common 12595 2 crct10dif_pclmul,crc_t10dif
ast 56119 1
syscopyarea 12529 1 ast
sysfillrect 12701 1 ast
sysimgblt 12640 1 ast
nvidia 8374856 0
drm_kms_helper 98226 1 ast
ttm 93488 1 ast
drm 311588 5 ast,ttm,drm_kms_helper,nvidia
igb 192078 0
ahci 29870 8
libahci 32009 1 ahci
ptp 18933 1 igb
libata 218854 2 ahci,libahci
pps_core 19106 1 ptp
dca 15130 2 igb,ioatdma
i2c_algo_bit 13413 2 ast,igb
i2c_core 40325 7 ast,drm,igb,i2c_i801,drm_kms_helper,i2c_algo_bit,nvidia
wmi 19070 0
dm_mirror 22135 0
dm_region_hash 20862 1 dm_mirror
dm_log 18411 2 dm_region_hash,dm_mirror
dm_mod 104038 25 dm_log,dm_mirror
Dear Intel Staff,
I just got to know some details of your great presentation of Knights Landing (KNL) at Hot Chips this year. Information about KNL on the website is still sparse. From your slides I understand that there will be a version of KNL that is socketed and can be used as a primary CPU in a rack. However, this raises quite a few questions for which I cannot find satisfying answers.
Our scenario:
We have a research cluster that consists mostly of two-socket systems with normal Ivy Bridge Xeon CPUs. Our main application is a JVM-based machine learning system that uses the MKL via JNI to accelerate computations. We intend to extend this cluster soon and would like to utilize Phi processors, but whether we can use them depends on a few things (see below).
What I would like to know from you:
Many thanks in advance,
Matt
Hello,
I'm attempting to run a simple offload example:
#include <stdio.h>
#include <omp.h>

int main(){
    double sum;
    int i, n, nt;
    n = 2000000000;
    sum = 0.0e0;
    #pragma offload target(mic:0)
    {
        #pragma omp parallel for reduction(+:sum)
        for (i = 1; i <= n; i++) {
            sum = sum + i;
        }
        //nt = omp_get_max_threads();
        #pragma omp parallel
        {
            #pragma omp single
            nt = omp_get_num_threads();
        }
#ifdef __MIC__
        printf("Hello MIC reduction %f threads: %d\n", sum, nt);
#else
        printf("Hello CPU reduction %f threads: %d\n", sum, nt);
#endif
    }
}
This program ran fine previously but we recently rebooted our Phi nodes in our cluster and since then this offloading example will not run. The native compiled MIC binaries still run without a problem since the reboot.
Before running I type:
. /usr/local/intel/ClusterStudioXE_2013/composer_xe_2013_sp1/bin/compilervars.sh intel64
make
export MIC_OMP_NUM_THREADS=120
export MIC_ENV_PREFIX=MIC
export OFFLOAD_REPORT=3
Here is my Makefile:
CC=icc
CFLAGS=-std=c99 -O3 -vec-report3 -openmp -offload
EXE=reduce_offload_mic

$(EXE) : reduce_omp_mic.c
	$(CC) -o $@ $< $(CFLAGS)

.PHONY: clean
clean:
	rm $(EXE)
However, when I run the program here is the output:
[frenchwr@vmp903 Offload]$ ./reduce_offload_mic
offload error: cannot offload to MIC - device is not available
[Offload] [HOST]  [State]   Unregister data tables
I have ensured that mpss is running and even restarted the service with:
sudo service mpss restart
but still the same error (even after re-building the executable).
All of my mic tests pass:
[frenchwr@vmp903 Offload]$ miccheck
MicCheck 3.4-r1
Copyright 2013 Intel Corporation All Rights Reserved
Executing default tests for host
Test 0: Check number of devices the OS sees in the system ... pass
Test 1: Check mic driver is loaded ... pass
Test 2: Check number of devices driver sees in the system ... pass
Test 3: Check mpssd daemon is running ... pass
Executing default tests for device: 0
Test 4 (mic0): Check device is in online state and its postcode is FF ... pass
Test 5 (mic0): Check ras daemon is available in device ... pass
Test 6 (mic0): Check running flash version is correct ... pass
Test 7 (mic0): Check running SMC firmware version is correct ... pass
Executing default tests for device: 1
Test 8 (mic1): Check device is in online state and its postcode is FF ... pass
Test 9 (mic1): Check ras daemon is available in device ... pass
Test 10 (mic1): Check running flash version is correct ... pass
Test 11 (mic1): Check running SMC firmware version is correct ... pass
Status: OK
Here's the output from micinfo:
[frenchwr@vmp903 Offload]$ micinfo
MicInfo Utility Log
Created Fri Aug 28 18:14:23 2015

System Info
    HOST OS              : Linux
    OS Version           : 2.6.32-431.29.2.el6.x86_64
    Driver Version       : 3.4-1
    MPSS Version         : 3.4
    Host Physical Memory : 132110 MB

Device No: 0, Device Name: mic0

Version
    Flash Version           : 2.1.02.0390
    SMC Firmware Version    : 1.16.5078
    SMC Boot Loader Version : 1.8.4326
    uOS Version             : 2.6.38.8+mpss3.4
    Device Serial Number    : ADKC42900304

Board
    Vendor ID               : 0x8086
    Device ID               : 0x225c
    Subsystem ID            : 0x7d95
    Coprocessor Stepping ID : 2
    PCIe Width              : Insufficient Privileges
    PCIe Speed              : Insufficient Privileges
    PCIe Max payload size   : Insufficient Privileges
    PCIe Max read req size  : Insufficient Privileges
    Coprocessor Model       : 0x01
    Coprocessor Model Ext   : 0x00
    Coprocessor Type        : 0x00
    Coprocessor Family      : 0x0b
    Coprocessor Family Ext  : 0x00
    Coprocessor Stepping    : C0
    Board SKU               : C0PRQ-7120 P/A/X/D
    ECC Mode                : Enabled
    SMC HW Revision         : Product 300W Passive CS

Cores
    Total No of Active Cores : 61
    Voltage                  : 1037000 uV
    Frequency                : 1238095 kHz

Thermal
    Fan Speed Control : N/A
    Fan RPM           : N/A
    Fan PWM           : N/A
    Die Temp          : 46 C

GDDR
    GDDR Vendor     : Samsung
    GDDR Version    : 0x6
    GDDR Density    : 4096 Mb
    GDDR Size       : 15872 MB
    GDDR Technology : GDDR5
    GDDR Speed      : 5.500000 GT/s
    GDDR Frequency  : 2750000 kHz
    GDDR Voltage    : 1501000 uV

Device No: 1, Device Name: mic1

Version
    Flash Version           : 2.1.02.0390
    SMC Firmware Version    : 1.16.5078
    SMC Boot Loader Version : 1.8.4326
    uOS Version             : 2.6.38.8+mpss3.4
    Device Serial Number    : ADKC42900319

Board
    Vendor ID               : 0x8086
    Device ID               : 0x225c
    Subsystem ID            : 0x7d95
    Coprocessor Stepping ID : 2
    PCIe Width              : Insufficient Privileges
    PCIe Speed              : Insufficient Privileges
    PCIe Max payload size   : Insufficient Privileges
    PCIe Max read req size  : Insufficient Privileges
    Coprocessor Model       : 0x01
    Coprocessor Model Ext   : 0x00
    Coprocessor Type        : 0x00
    Coprocessor Family      : 0x0b
    Coprocessor Family Ext  : 0x00
    Coprocessor Stepping    : C0
    Board SKU               : C0PRQ-7120 P/A/X/D
    ECC Mode                : Enabled
    SMC HW Revision         : Product 300W Passive CS

Cores
    Total No of Active Cores : 61
    Voltage                  : 1040000 uV
    Frequency                : 1238095 kHz

Thermal
    Fan Speed Control : N/A
    Fan RPM           : N/A
    Fan PWM           : N/A
    Die Temp          : 47 C

GDDR
    GDDR Vendor     : Samsung
    GDDR Version    : 0x6
    GDDR Density    : 4096 Mb
    GDDR Size       : 15872 MB
    GDDR Technology : GDDR5
    GDDR Speed      : 5.500000 GT/s
    GDDR Frequency  : 2750000 kHz
    GDDR Voltage    : 1501000 uV
From searching online I see a few other users who have run into the:
offload error: cannot offload to MIC - device is not available
[Offload] [HOST]  [State]   Unregister data tables
issue, but I don't see any good resolution (other than restarting mpss, which does not resolve the issue for me).
MIC requires strict 64-byte data alignment to utilize the VPU, but why? I found that SPARC also has such a requirement, but other multi-core CPUs can handle unaligned data.
Since MIC can automatically vectorize a for loop over data (with compiler optimization), what happens if the data is unaligned in that case? Will the automatic optimization still work? If yes, how?
Hello,
I would like to pre-allocate a number of buffers for later data transfers from CPU to MIC, using explicit offloading in C++.
It works nicely if each buffer corresponds to an explicit variable name, as e.g. in the double-buffering examples. However, I would like to have a configurable number of such buffers (more than 2), i.e. an array of buffers. (the buffers are used for asynchronous processing on the MIC, and I need quite a few of them).
I do have a workaround, i.e. allocate a single very big buffer and cut it into pieces (by using offsets and 'into' for transfers), but as the buffers do not need to be contiguous, I'm afraid adding this constraint may make it hard to find a big enough block available at runtime. So I would prefer to have several smaller buffers if possible.
The code below should describe the issue easily. In the first part, it works fine with 2 variable names. But in the second part, with an array, I can't find how to proceed (or is it simply not possible?). I tried various syntaxes without success, but could not find one accepted by the compiler.
I would be glad if someone could help on this matter. Thanks in advance for any feedback on this!
cheers, Sylvain
#pragma offload_attribute (push,target(mic))
#include <stdio.h>
#pragma offload_attribute (pop)

#define ALLOC alloc_if(1) free_if(0)
#define FREE  alloc_if(0) free_if(1)
#define REUSE alloc_if(0) free_if(0)

int main() {
  int size=100;       // size of buffer
  char input[size];   // buffer for input data on the CPU
  char *ptr1=NULL;    // reference to MIC buffer 1
  char *ptr2=NULL;    // reference to MIC buffer 2

  // pre-allocate MIC buffers
  #pragma offload_transfer target(mic:0) nocopy(ptr1 : length(size) ALLOC)
  #pragma offload_transfer target(mic:0) nocopy(ptr2 : length(size) ALLOC)

  // test use of buffer 1
  snprintf(input,size,"valPtr1");
  #pragma offload target(mic:0) in(input[0:size] : REUSE into(ptr1[0:size]))
  {
    printf("MIC: %p = %s\n",ptr1,ptr1);
  }

  // test use of buffer 2
  snprintf(input,size,"valPtr2");
  #pragma offload target(mic:0) in(input[0:size] : REUSE into(ptr2[0:size]))
  {
    printf("MIC: %p = %s\n",ptr2,ptr2);
  }

  // try to do the same as above, but with an array instead of fixed variable names ptr1,ptr2
  // so that the number of elements can be increased and iterated over
  // e.g. instead of ptr1 and ptr2, use ptrX[1], ptrX[2] ... ptrX[N]
  // the compiler does not seem to complain for the allocation
  // but it crashes at runtime
  char *ptrX[2]={NULL,NULL};
  for (int i=0;i<2;i++) {
    #pragma offload_transfer target(mic:0) nocopy(ptrX[i] : length(size) ALLOC)
  }

  // and then, how to use the buffers ???
  /*
  for (int i=0;i<2;i++) {
    snprintf(input,size,"valPtrX%d",i);
    #pragma offload target(mic:0) in(input[0:size] : REUSE into((???)[0:size]))
    {
      printf("MIC: %p = %s\n",???,???);
    }
  }
  */
  return 0;
}
The Intel® Math Kernel Library (Intel® MKL), the high performance math library for x86 and x86-64, is available for free for everyone (click here now to register and download). Purchasing is only necessary if you want access to Intel® Premier Support (direct 1:1 private support from Intel), older versions of the library or access to other tools in Intel® Parallel Studio XE. Intel continues to actively develop and support this very powerful library - and everyone can benefit from that!
Intel® Math Kernel Library (Intel® MKL) is a very popular library product from Intel that accelerates math processing routines to increase application performance. Intel® MKL includes highly vectorized and threaded Linear Algebra, Fast Fourier Transforms (FFT), Vector Math and Statistics functions. The easiest way to take advantage of all of that processing power is to use a carefully optimized computing math library; even the best compiler can’t compete with the level of performance possible from a hand-optimized library. If your application already relies on the BLAS or LAPACK functionality, simply re-link with Intel® MKL to get better performance on Intel and compatible architectures.
Intel® MKL is most often obtained with the Intel® Compilers and all the other Intel® Performance Libraries in various products from Intel. It can also be obtained, together with tools for analysis, debugging and tuning, tools for MPI and the Intel® MPI Library, by acquiring Intel® Parallel Studio XE. Did you know that some of these are available for free?
Here is a guide to various ways to obtain the latest version of the Intel® Math Kernel Library (Intel® MKL) for free without access to Intel® Premier Support (get support by posting to the Intel Math Kernel Library forum). Anytime you want, the full suite of tools (Intel® Parallel Studio XE) with Intel® Premier Support and access to previous library versions can be purchased worldwide.
Who | What is Free? | Information | Where? |
---|---|---|---|
Community Licenses for Everyone | Intel® Math Kernel Library (Intel® MKL) Intel® Data Analytics Acceleration Library Intel® Threading Building Blocks Intel® Integrated Performance Primitives (Intel® IPP) | Community Licensing for Intel® Performance Libraries – free for all, registration required, no royalties, no restrictions on company or project size, current versions of libraries, no Intel Premier Support access. (Linux*, Windows* or OS X* versions) Forums for discussion and support are open to everyone. | Community Licensing for Intel Performance Libraries |
Evaluation Copies for Everyone | Intel® Math Kernel Library (Intel® MKL) | Evaluation Copies – Try before you buy. (Linux, Windows or OS X versions) | Try before you buy |
Use as an Academic Researcher | Linux, Windows or OS X versions of: Intel® Math Kernel Library Intel® Data Analytics Acceleration Library Intel® Threading Building Blocks Intel® Integrated Performance Primitives Intel® MPI Library (not available for OS X) | If you will use it in conjunction with academic research at an institution of higher education. (Linux, Windows or OS X versions, except the Intel® MPI Library, which is not supported on OS X) | Qualify for Use as an Academic Researcher |
Student | Intel® Math Kernel Library (Intel® MKL) | If you are a current student at a degree-granting institution. (Linux, Windows or OS X versions) | Qualify for Use as a Student |
Teacher | Intel® Math Kernel Library (Intel® MKL) | If you will use it in a teaching curriculum. (Linux, Windows or OS X versions) | Qualify for Use as an Educator |
Use as an Open Source Contributor | Intel® Math Kernel Library (Intel® MKL) | If you are a developer actively contributing to open source projects – and that is why you will use the tools. (Linux versions) | Qualify for Use as an Open Source Contributor |
Free licenses for certain users have always been an important dimension of our offerings. One thing that really distinguishes Intel is that we sell excellent tools and provide second-to-none support for software developers who buy our tools. We provide multiple options – and we hope you will find exactly what you need in one of our options.