Hi,
I am rewriting part of LibSVM code for vectorization in order to accelerate it on Xeon Phi. Unfortunately, the performance of my rewriting code is the same as, if not slightly worse than, the original performance on Xeon Phi. To figure out the reason, I profiled two programs using vTune hot spot analysis. According to the results, the target code segment takes significantly less time in the rewriting version, however, most of its time goes to __kmp_wait_yield_4, which only occupies a tiny amount of time in the original version.
Does anyone know what it means? I tried to Google it but got very little information.
BTW, my vectorization code gets about 20% improvement in total if running in the host.
Thanks in advance!