Hi Intel forums,
I've had difficulty reproducing the performance reported on the following page:
https://www-ssl.intel.com/content/www/us/en/benchmarks/server/xeon-phi/xeon-phi-sgemm-dgemm.html
Using the mkl sgemm routine on my 3120 series Xeon Phi, I haven't even approached the 1.7 TFLOP/S level claimed above. The best performance I achieve is ~0.7 TFLOP/S. Presumably, this is because I don't fully understand the threading and vectorization APIs, and I'm not using them optimally. I was wondering if anyone knows where to find the source & environment details used for Intel's official benchmark. Maybe I could compare "correct" usage with my code to better understand the tools.
Thanks,
Chris