While trying to characterize Phi performance across large numbers of OpenMP threads, I've noticed strange behavior: programs hang when they use more than 256 threads.
I've pared things down to the following example:
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int threads = 257;

    // omp_set_dynamic(1);

    #pragma omp parallel
    #pragma omp single
    {
        printf("single %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
    }

    #pragma omp parallel num_threads(threads)
    {
        for (int j=0; j<10; j++) {
            printf("thread %d iteration: %d\n", omp_get_thread_num(), j);
        }
    }
}
If I compile and run this on my Phi host, it works fine. If I compile and run it on the Phi itself, it prints out the 10 iterations and then hangs. If I set threads <= 256, it works fine on the Phi. If I call omp_set_dynamic(1) and set threads > 256, it also works fine.
The key seems to be having the first default parallel region followed by a second one that uses more than 256 threads. I haven't found a good description of the nuances of dynamic threads, but I can see how omp_set_dynamic might be required (though it wasn't obvious to me that having different parallel regions with different numbers of threads was all that "dynamic" ;^)). What I really don't understand is why simply hanging is the appropriate behavior for >256 threads.