I am trying to perform an asynchronous data transfer to an Intel Xeon Phi. Asynchronous computation works as expected. However, when I combine data transfer and computation in an offload statement, the timings indicate that the data transfer is performed synchronously, while the computation that follows is performed asynchronously.
A test example that illustrates the point is given below. The output is
0.928997 0.288048
which indicates that almost a second is spent in the asynchronous call, while only 0.28 seconds are spent waiting for that call to complete.
Any help would be appreciated.
    #include <stdlib.h>
    #include <iostream>
    using namespace std;
    #include "timer.hpp"

    #define ALLOC  alloc_if(1)
    #define FREE   free_if(1)
    #define RETAIN free_if(0)
    #define REUSE  alloc_if(0)

    int main() {
        int n = 1000*1000*100;
        double *p = (double*)malloc(sizeof(double)*n);
        int rep = 10;

        #pragma offload target(mic:0) in(p:length(n) ALLOC RETAIN)
        {}

        timer t1, t2;
        for(int i=0;i<rep;i++) {
            t1.start();
            #pragma offload_transfer target(mic:0) out(p:length(n) REUSE RETAIN) signal(p)
            /* This works as expected
            #pragma offload_transfer target(mic:0) signal(p)
            {
                usleep(2e6);
            }
            */
            t1.stop();

            t2.start();
            #pragma offload_transfer target(mic:0) wait(p)
            t2.stop();
        }
        cout << t1.total() << " " << t2.total() << endl;

        #pragma offload target(mic:0) nocopy(p:length(n) REUSE FREE)
        {}
    }
The hardware seems to work properly:
    MicCheck 3.4.3-r1
    Copyright 2013 Intel Corporation All Rights Reserved

    Executing default tests for host
      Test 0: Check number of devices the OS sees in the system ... pass
      Test 1: Check mic driver is loaded ... pass
      Test 2: Check number of devices driver sees in the system ... pass
      Test 3: Check mpssd daemon is running ... pass
    Executing default tests for device: 0
      Test 4 (mic0): Check device is in online state and its postcode is FF ... pass
      Test 5 (mic0): Check ras daemon is available in device ... pass
      Test 6 (mic0): Check running flash version is correct ... pass
      Test 7 (mic0): Check running SMC firmware version is correct ... pass

    Status: OK