Hi,
I am trying to use MKL routines on MIC.
However, the MIC version runs slower than my CPU version, and I see only 20 threads running.
Is the thread count limited by MKL? I have set some environment variables.
__attribute__(( target (mic) )) void offload_check(void)
{
#ifdef __MIC__
    printf("Check Func: Run on MIC!\n");
#else
    printf("Check Func: Run on CPU...\n");
#endif
}

#pragma offload target(mic:0) \
    in(A:length(A_m*A_n)) \
    in(B:length(B_m*B_n)) \
    inout(C:length(C_m*C_n))
{
    offload_check();
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, A_m, B_n, B_m, \
                alpha, A, A_n, B, B_n, beta, C, C_n);
}
}
What I'm doing here is a matrix multiplication, with M, N, and K all equal to 2048.
Is it because my matrices are too small, so that data transfer takes up most of the time? I'm running on a single node of a cluster, where I cannot use VTune to profile it...
Thanks for your help