Channel: Intel® Many Integrated Core Architecture

MKL BLAS function only runs 20 threads on MIC

Hi,

I am trying to use MKL routines on the MIC.

However, I noticed that it runs slower than my CPU version, and only 20 threads are running.

Is that a limit imposed by MKL? I have already set some environment variables.

__attribute__(( target (mic) )) void offload_check(void) {
#ifdef __MIC__
        printf("Check Func: Run on MIC!\n");
#else
        printf("Check Func: Run on CPU...\n");
#endif
}


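// Fragment from inside the calling function (its opening lines are not shown).
// A and B are copied to coprocessor 0; C is copied in and back out.
#pragma offload target(mic:0)           \
        in(A:length(A_m*A_n))           \
        in(B:length(B_m*B_n))           \
        inout(C:length(C_m*C_n))
        {
                offload_check();
                // C = alpha*A*B + beta*C, row-major: M = A_m, N = B_n, K = B_m (must equal A_n)
                cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, A_m, B_n, B_m, \
                            alpha, A, A_n, B, B_n, beta, C, C_n);
        }
}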

What I'm doing here is a single-precision matrix multiplication. M, N, and K are all 2048.

Is it because my matrices are too small, so that the data transfer takes most of the time? I'm running this on a single node of a cluster, where I cannot use VTune to profile it...

Thanks for your help