Hi everyone,
I found that when I run axpy (y[i] = a * x[i] + y[i]) on two separate sets of similar data, I get completely different execution times, as shown in the output below. The attached file is the sample code for axpy.
My assumption is that the first execution of the inout pragma has to spend time preparing (preconfiguring or warming up) the Xeon Phi coprocessor. If so, is there an official explanation for this behavior? If not, what is the reason? Is there a good way to reduce or avoid this overhead? This really matters for benchmarking, because I have never seen this happen on NVIDIA or Intel GPUs/CPUs.
[liu@fornax Test_offomp]$ ./a.out
Total time for inout1 combined = 0.39732003 sec
Total time for inout2 combined = 0.01132083 sec
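To make the question concrete, here is a minimal sketch of how the measurement could separate a suspected one-time startup cost from the steady-state time. It is not the attached code. The names wall_seconds, axpy and time_axpy_steady are made up for illustration, and the sketch assumes that my guess about first-offload initialization is right and that the Intel compiler defines __INTEL_OFFLOAD when offload is enabled (with any other compiler the loop simply runs on the host).

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime */
#include <time.h>

/* Wall-clock time in seconds, read from a monotonic clock. */
double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* y[i] = a * x[i] + y[i].
 * Assumption: the offload pragma is only seen by an Intel compiler
 * with offload enabled; otherwise this is a plain host loop. */
void axpy(int n, float a, const float *x, float *y)
{
#ifdef __INTEL_OFFLOAD
#pragma offload target(mic:0) in(x : length(n)) inout(y : length(n))
#endif
    {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }
}

/* Runs one untimed warm-up call, then times `reps` further calls.
 * Returns the mean wall-clock seconds per timed call. */
double time_axpy_steady(int n, float a, const float *x, float *y, int reps)
{
    axpy(n, a, x, y);            /* warm-up: absorbs any one-time cost */

    double t0 = wall_seconds();
    for (int r = 0; r < reps; ++r)
        axpy(n, a, x, y);
    return (wall_seconds() - t0) / reps;
}
```

Timing a single first call separately and comparing it with the value returned by time_axpy_steady would show how much of the 0.397 s is startup rather than computation. Note that every call updates y in place, so the data should be reset between measurements if the results themselves matter.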
Best wishes,
Jiawen