Hello,
While porting an image processing library to the Xeon Phi, I stumbled upon some strange behaviour: the processing is about 20% faster when I set the number of threads to precisely 103 (I ran the processing multiple times with every thread count between 95 and 118).
I tried to make sense of this by comparing VTune collections (advanced-hotspots, memory bandwidth and general exploration) of a test case running on 102, 103 and 104 threads. Each of these analyses yielded similar results (except for the runtime, which was still 20% faster for 103 threads) and I wasn't able to identify what caused this unexpected speedup.
My questions are: Does this sort of behaviour ring a bell for any of you? Do you have any pointers concerning the possible origin of this effect?
Some details about the library: it uses manual offload and the MKL DFTI functions alongside computational loops parallelized with OpenMP (each of these two parts accounts for half the computation time, and both are equally affected by the 20% speedup).
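For anyone who wants to try reproducing this kind of thread-count sweep, here is a minimal sketch of how it can be timed. It is not the library code: `dummy_kernel` is a hypothetical stand-in for the real processing, the helper names are made up for illustration, and the offload and MKL DFTI parts are left out entirely. It assumes a plain OpenMP build and falls back to a serial run with `clock()` when compiled without OpenMP.

```c
#include <stdio.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical stand-in for the real processing: sum of i*i for i in [0, n). */
static double dummy_kernel(long n)
{
    double sum = 0.0;
    long i;
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; ++i)
        sum += (double)i * (double)i;
    return sum;
}

/* Timestamp in seconds: wall-clock with OpenMP, CPU time otherwise. */
static double now_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Run the kernel `repeats` times with `nthreads` threads, return the best time. */
static double time_with_threads(int nthreads, long n, int repeats)
{
    double best = -1.0;
    int r;
#ifdef _OPENMP
    omp_set_num_threads(nthreads);
#else
    (void)nthreads; /* serial build: the thread count is ignored */
#endif
    for (r = 0; r < repeats; ++r) {
        double t0 = now_seconds();
        volatile double result = dummy_kernel(n); /* volatile keeps the call alive */
        double dt = now_seconds() - t0;
        (void)result;
        if (best < 0.0 || dt < best)
            best = dt;
    }
    return best;
}
```

Calling `time_with_threads(t, n, 5)` for every `t` from 95 to 118 and printing the results gives one timing per thread count. Taking the best of several repeats reduces run-to-run noise, which helps to tell a real effect at one thread count from jitter.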
Regards
Pierre T.