Hi all,
I am trying to parallelize my code to get a speedup.
As I understand it, the Xeon Phi behaves like a NUMA system, so I used first-touch placement for the data.
While the Xeon Phi no doubt performs better than the Xeon on the loop itself, the problem is that the total time (first-touch time + loop time) is greater.
How do I resolve this issue?
When this code is integrated into the main code (which I cannot post here), the state function will be called many times from various places. So even if I don't first touch the data the way I do in the code attached below, is it possible that this overhead is just a one-time cost?
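To make the timing question concrete, here is a minimal sketch (not my actual code) of how the two phases could be timed separately. The names first_touch, state_kernel and time_phases are hypothetical, the loop body is only a placeholder for the real state computation, and the sketch assumes a C11 compiler with OpenMP enabled (without it the pragmas are ignored and the loops run serially).

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Wall-clock seconds via the C11 timespec_get call. */
static double wall_seconds(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* First touch: each thread initializes the part of the array it will
   later work on, so pages are mapped where that thread runs. */
static void first_touch(double *a, long n) {
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i)
        a[i] = 0.0;
}

/* Hypothetical stand-in for the state function. It uses the same static
   schedule as first_touch so each thread revisits the data it touched. */
static void state_kernel(double *a, long n) {
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i)
        a[i] = 2.0 * a[i] + 1.0;
}

/* Runs first touch once, then calls the kernel ncalls times.
   Reports the time of each phase and a checksum of the result.
   Returns 0 on success, -1 if the allocation fails. */
int time_phases(long n, int ncalls,
                double *touch_s, double *loop_s, double *checksum) {
    assert(n > 0 && ncalls >= 0);

    double *a = malloc((size_t)n * sizeof *a);
    if (a == NULL)
        return -1;

    double t0 = wall_seconds();
    first_touch(a, n);                 /* paid once */
    double t1 = wall_seconds();

    for (int c = 0; c < ncalls; ++c)   /* paid on every call */
        state_kernel(a, n);
    double t2 = wall_seconds();

    double sum = 0.0;
    for (long i = 0; i < n; ++i)
        sum += a[i];

    *touch_s = t1 - t0;
    *loop_s = t2 - t1;
    *checksum = sum;
    free(a);
    return 0;
}
```

Calling time_phases with increasing ncalls should show whether the first-touch time stays fixed while the loop time grows with the number of calls, which is what I would expect if the overhead really is a one-time cost.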
In the code attached below, state_test_offload is for the MIC and state_test is for the Xeon host.