Hi everyone,
I tried to split the offload of axpy (y[i] = x[i] * a + y[i]) into separate steps: allocate and copy x/y to the coprocessor (Xeon Phi) -> run the kernel on the coprocessor -> copy the result back from the coprocessor to the host (CPU) -> free the memory on the coprocessor.
I found that the allocate/copy step for x/y alone takes longer than the whole combined run (a single offload with an inout clause for x/y).
Could anyone explain why this happens? Is there a better way to separate the offload steps? (The purpose of separating them is to time each step individually, not just for axpy but for other applications as well.)
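To make the question concrete, the separated version follows roughly the pattern below. This is only a minimal sketch, not the attached axpy.c. It assumes the Intel compiler's `offload_transfer` and `offload` pragmas with `alloc_if`/`free_if` on device `mic:0`. The names `phase_times`, `now_sec` and `axpy_separate` are hypothetical and used only for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Per-phase wall-clock times in seconds (hypothetical helper struct). */
typedef struct {
    double copy_in, kernel, copy_out, free_dev;
} phase_times;

/* Monotonic wall-clock time in seconds (hypothetical helper). */
static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* axpy with each offload phase timed separately.
 * Assumption: device mic:0; compilers without offload support
 * ignore the pragmas and run everything on the host. */
phase_times axpy_separate(float *x, float *y, int n, float a) {
    assert(x != NULL && y != NULL && n > 0);
    phase_times t;
    double t0;

    /* Phase 1: allocate device buffers, copy x and y in, keep them alive. */
    t0 = now_sec();
    #pragma offload_transfer target(mic:0) \
        in(x : length(n) alloc_if(1) free_if(0)) \
        in(y : length(n) alloc_if(1) free_if(0))
    t.copy_in = now_sec() - t0;

    /* Phase 2: run the kernel on the buffers already on the device. */
    t0 = now_sec();
    #pragma offload target(mic:0) in(a) in(n) \
        nocopy(x : length(n) alloc_if(0) free_if(0)) \
        nocopy(y : length(n) alloc_if(0) free_if(0))
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            y[i] = x[i] * a + y[i];
    }
    t.kernel = now_sec() - t0;

    /* Phase 3: copy the result back to the host, still without freeing. */
    t0 = now_sec();
    #pragma offload_transfer target(mic:0) \
        out(y : length(n) alloc_if(0) free_if(0))
    t.copy_out = now_sec() - t0;

    /* Phase 4: release the device buffers without moving any data. */
    t0 = now_sec();
    #pragma offload_transfer target(mic:0) \
        nocopy(x : length(n) alloc_if(0) free_if(1)) \
        nocopy(y : length(n) alloc_if(0) free_if(1))
    t.free_dev = now_sec() - t0;

    return t;
}
```

Calling `axpy_separate(x, y, n, a)` from `main` and printing the four fields gives one time per step. The sketch makes no attempt to exclude one-time costs that may be charged to whichever offload runs first.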
The timings are shown below, and the source (axpy.c) is attached.
Thanks,
Jiawen
[liu@fornax Test_offomp]$ ./a.out
Checking for Intel(R) Xeon Phi(TM) (Target CPU) devices...
Number of Target devices installed: 2
Offload sections will execute on: Target CPU (offload mode)
Copy back to host successfully!
PASS axpy
Copy time = 0.01594615 sec
Kernel time = 0.00443697 sec
Free time = 0.00104403 sec
Total time for separate process = 0.02142906 sec
Total time for inout combined = 0.01055193 sec