
Can I see how '#pragma omp parallel for' translates internally


Hello,

I'm writing a simple lookup function using the OpenMP/C++ programming model.

Below is the code block where I offload from the host to my Xeon Phi 5110P device.

 

        #pragma offload target(mic:0) \
            in(batch_size) \
            nocopy(inputbuf:length(inputbuf_sz/sizeof(uint32_t)) \
                    free_if(0) alloc_if(0) align(CACHE_LINE_SIZE)) \
            nocopy(TBL24:length(TBL24_sz/sizeof(uint16_t)) \
                    free_if(0) alloc_if(0)) \
            nocopy(TBLlong:length(TBLlong_sz/sizeof(uint16_t)) \
                    free_if(0) alloc_if(0)) \
            nocopy(resultbuf:length(resultbuf_sz/sizeof(uint16_t)) \
                    free_if(0) alloc_if(0) align(CACHE_LINE_SIZE))
        {
            //#pragma omp parallel for private(index) num_threads(120)
            #pragma omp parallel for num_threads(120)
            for (int index = 0; index < batch_size; index += 16) {
            #ifdef __MIC__
                // Load 16 destination addresses; the top 24 bits index TBL24.
                __m512i     v_daddr =           _mm512_load_epi32 (inputbuf + index);
                __m512i     v_daddr_shift8 =    _mm512_srli_epi32 (v_daddr, 8);
                // Gather the 16-bit TBL24 entries, zero-extended to 32 bits.
                __m512i     v_temp_dest =       _mm512_i32extgather_epi32 (v_daddr_shift8, TBL24,
                                                                        _MM_UPCONV_EPI32_UINT16, sizeof(uint16_t), _MM_HINT_NT);
                __m512i     v_ignored_ip =      _mm512_set_epi32 (REPEAT_16(IGNORED_IP));
                __m512i     v_zero =            _mm512_setzero_epi32 ();
                // Lanes whose address is not the IGNORED_IP sentinel.
                __mmask16   m_is_not_ignored =  _mm512_cmp_epu32_mask (v_daddr, v_ignored_ip, _MM_CMPINT_NE);
                __m512i     v_0x8000 =          _mm512_set_epi32 (REPEAT_16(0x8000));
                __m512i     v_0x7fff =          _mm512_set_epi32 (REPEAT_16(0x7fff));
                __m512i     v_0xff =            _mm512_set_epi32 (REPEAT_16(0xff));
                // Lanes whose TBL24 entry has the top bit set, i.e. needs the TBLlong lookup.
                __mmask16   m_top_bit_set =     _mm512_cmp_epu32_mask (_mm512_and_epi32 (v_temp_dest, v_0x8000),
                                                                        v_zero, _MM_CMPINT_NE);
                __mmask16   m_both_cond_met =   _mm512_kand (m_is_not_ignored, m_top_bit_set);
                // Second-level index: (entry & 0x7fff) * 256 + low byte of the address.
                __m512i     v_index2 =          _mm512_add_epi32 (_mm512_slli_epi32 (_mm512_and_epi32 (v_temp_dest, v_0x7fff), 8),
                                                                _mm512_and_epi32 (v_daddr, v_0xff));
                // Masked gather from TBLlong for those lanes, then store the 16-bit results.
                __m512i     v_result =          _mm512_mask_i32extgather_epi32 (v_temp_dest, m_both_cond_met, v_index2,
                                                                TBLlong, _MM_UPCONV_EPI32_UINT16, sizeof(uint16_t), _MM_HINT_NT);
                                                _mm512_mask_extstore_epi32(resultbuf, m_is_not_ignored, v_result,
                                                                _MM_DOWNCONV_EPI32_UINT16, _MM_HINT_NT);
            #endif
            }
        }   

In sequential execution, the for loop in this code block runs in O(n),
where n is the total number of iterations (batch_size in this case).

However, I have used #pragma omp parallel for with num_threads(120) in order to trigger GPU-like SIMD behavior,
which should (approximately and hopefully) divide the number of sequential iterations by a factor of up to 120,
and I used explicit vectorization intrinsics to reduce it by a further factor of 16 (16 inputs per iteration instead of 1).
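
To make that arithmetic concrete, here is roughly the division of work I'm expecting (a sketch with a made-up batch_size; the real value varies per call):

        #include <stdio.h>

        int main(void)
        {
            /* Hypothetical numbers; the real batch_size varies per call. */
            const int batch_size  = 1048576;    /* assumed: ~1M addresses per batch */
            const int num_threads = 120;
            const int vec_width   = 16;         /* 16 x 32-bit lanes per 512-bit register */

            const int vector_iters     = batch_size / vec_width;       /* 65536 loop iterations */
            const int iters_per_thread = vector_iters / num_threads;   /* ~546 per thread */

            printf("vector iterations: %d, per thread: ~%d\n", vector_iters, iters_per_thread);
            return 0;
        }

So, if the loop really is being divided up this way, each thread should only be executing a few hundred iterations of the intrinsics body per offload.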

Yet I wound up with a kernel execution time of around 90us, which is 30 times what the equivalent NVIDIA GPU kernel takes.

So I started reading the vectorization report, which says the following
(FYI, line 145 is where the for statement is):

dryrun_shared.cc(145): (col. 4) remark: *MIC* loop was not vectorized: existence of vector dependence
dryrun_shared.cc(164): (col. 13) remark: *MIC* vector dependence: assumed FLOW dependence between resultbuf line 164 and inputbuf line 147
dryrun_shared.cc(147): (col. 26) remark: *MIC* vector dependence: assumed ANTI dependence between inputbuf line 147 and resultbuf line 164
dryrun_shared.cc(147): (col. 26) remark: *MIC* vector dependence: assumed ANTI dependence between inputbuf line 147 and resultbuf line 164
dryrun_shared.cc(164): (col. 13) remark: *MIC* vector dependence: assumed FLOW dependence between resultbuf line 164 and inputbuf line 147 

I wasn't really sure what the dependence was, but since I vectorize explicitly with intrinsics, I didn't worry about it too much.
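
(If I understand the report right, the compiler simply cannot prove that inputbuf and resultbuf never overlap, so it assumes a dependence between the load and the store. I believe something like the sketch below would silence that assumption; the function and its body are placeholders rather than my real code, and __restrict__ / #pragma ivdep are only safe here because I know the buffers are disjoint.)

        #include <stdint.h>

        /* Sketch only: placeholder body, not the real lookup. '__restrict__' promises
         * the compiler that the two buffers never alias, and the Intel-specific
         * '#pragma ivdep' tells it to ignore assumed (not proven) dependences. */
        void lookup_scalar(const uint32_t *__restrict__ inputbuf,
                           uint16_t       *__restrict__ resultbuf,
                           int batch_size)
        {
            #pragma ivdep
            for (int index = 0; index < batch_size; ++index)
                resultbuf[index] = (uint16_t)(inputbuf[index] >> 8);   /* placeholder */
        }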

However, when I disassembled the offloaded code block with icpc -S, I found the following lines:

..LN376:
      .loc    1  145  is_stmt 1
           addq      $16, %rcx                                     #145.44 c13

Line 145, column 44 is the third clause of the for statement, i.e. index += 16.

I'm not 100% sure, but I took it to mean that the MIC is running this loop sequentially.

Given this situation, my questions are as follows:

1) Is there a way to make sure the iterations of this for loop are run by 120 separate threads evenly distributed across the 60 cores?

2) Also, is there a way to confirm how the code is executed, core-wise and thread-wise?
(From what I know, the generated code only reflects what is executed, not where or how.)
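
(What I currently do to check is print the thread-to-core mapping from inside a parallel region, roughly as sketched below. I'm assuming sched_getcpu() works on the coprocessor and that four logical CPUs map onto one physical core, so I'm not sure how reliable this picture is.)

        #ifndef _GNU_SOURCE
        #define _GNU_SOURCE            /* for sched_getcpu() */
        #endif
        #include <sched.h>
        #include <stdio.h>
        #include <omp.h>

        /* Compiled for the coprocessor so it can be called inside the offload region. */
        __attribute__((target(mic)))
        void report_thread_placement(void)
        {
            #pragma omp parallel num_threads(120)
            {
                /* My rough assumption: 4 logical CPUs share one physical KNC core. */
                int cpu = sched_getcpu();
                printf("OpenMP thread %3d of %3d on logical CPU %3d (core ~%d)\n",
                       omp_get_thread_num(), omp_get_num_threads(), cpu, cpu / 4);
            }
        }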

3) Since each iteration is a very small piece of code, would the extra overhead make it not worthwhile to use pthreads on the device instead?
What is the difference between calling the pthreads API and OpenMP threading in terms of how they are executed internally? (A sketch of what I mean follows.)
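
(To make the comparison concrete, the hand-rolled pthreads version I have in mind is roughly the sketch below: the batch is split evenly across a fixed number of threads, and process_chunk() is just a stand-in for the intrinsics body above.)

        #include <pthread.h>
        #include <stdio.h>

        #define NUM_THREADS 120

        /* Stand-in for the 16-wide gather/compare/store body in the offload region. */
        static void process_chunk(int begin, int end)
        {
            for (int index = begin; index < end; index += 16) {
                /* ... real lookup work would go here ... */
            }
        }

        struct chunk { int begin, end; };

        static void *worker(void *arg)
        {
            struct chunk *c = (struct chunk *)arg;
            process_chunk(c->begin, c->end);
            return NULL;
        }

        static void run_with_pthreads(int batch_size)
        {
            pthread_t    tid[NUM_THREADS];
            struct chunk ck[NUM_THREADS];
            int per_thread = (batch_size / NUM_THREADS / 16) * 16;   /* keep chunks 16-aligned */

            for (int t = 0; t < NUM_THREADS; ++t) {
                ck[t].begin = t * per_thread;
                ck[t].end   = (t == NUM_THREADS - 1) ? batch_size : (t + 1) * per_thread;
                pthread_create(&tid[t], NULL, worker, &ck[t]);
            }
            for (int t = 0; t < NUM_THREADS; ++t)
                pthread_join(tid[t], NULL);
        }

        int main(void)
        {
            run_with_pthreads(1048576);   /* hypothetical batch size */
            return 0;
        }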

4) When I traced the execution with the OFFLOAD_REPORT environment variable set,
I observed that copying the results BACK from the device is WAY costlier.
Also, copying back to the host seemed to consume both host and device CPU time,
while writing to the device cost no MIC CPU time at all.
I implemented the same app in Intel OpenCL and it took 28us/5us to write to / read from the device,
whereas in OpenMP it was 8us/51us, which is strange.
Is there any way to get to 8us/5us without changing the programming model?
(FYI, I assumed that both programming models use user-level SCIF API calls underneath,
and I am currently working on a new app with separate host and MIC code,
where the MIC code acts as a SCIF server daemon and the host code acts as a client that connects to the device
and copies over the blocks of data that need to be processed; see the sketch below.)
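
(For reference, the host-side client I'm prototyping looks roughly like the sketch below. It assumes the standard user-level scif.h API, a daemon already listening on the coprocessor, and no real error handling; the node/port numbers and buffer sizes are made-up placeholders.)

        #include <scif.h>
        #include <stdint.h>
        #include <stdio.h>

        #define SERVER_NODE  1       /* mic0 is node 1 on my setup; placeholder */
        #define SERVER_PORT  2050    /* made-up port my MIC-side daemon would listen on */

        int main(void)
        {
            scif_epd_t epd = scif_open();
            if (epd == SCIF_OPEN_FAILED) { perror("scif_open"); return 1; }

            struct scif_portID dst;
            dst.node = SERVER_NODE;
            dst.port = SERVER_PORT;
            if (scif_connect(epd, &dst) < 0) { perror("scif_connect"); return 1; }

            /* Placeholder batch; in the real app these would be inputbuf/resultbuf. */
            static uint32_t input[1024];
            static uint16_t result[1024];

            /* Ship a batch to the daemon and wait for the lookup results. */
            scif_send(epd, input,  sizeof(input),  SCIF_SEND_BLOCK);
            scif_recv(epd, result, sizeof(result), SCIF_RECV_BLOCK);

            scif_close(epd);
            return 0;
        }

For larger batches I would presumably need scif_register() and the RMA calls instead of scif_send()/scif_recv(), but I haven't gotten that far yet.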

5) Why does the device show suboptimal performance with num_threads(180) and num_threads(240)?
I tried num_threads in multiples of 60, and 120 gave the best performance.
From what I have read, you should run at least 2 threads per core to make sure there are no idle cycles,
and there should be no context-switching overhead when running 2 to 4 threads per core.
The strange part is that when I monitored execution with micsmc, it showed full utilization on all cores at 240 threads,
yet that setting gave the worst execution time.
Later I realized that the core utilization percentage loosely corresponds to the number of running threads per core
(25% for 1 thread, 50% for 2 threads, and so on).
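
(For what it's worth, the kind of sweep I use to compare thread counts looks roughly like the sketch below, with a placeholder workload instead of the real lookup; the numbers above came from the actual offload version, so this only illustrates the methodology.)

        #include <omp.h>
        #include <stdio.h>

        enum { BATCH = 1 << 20 };
        static int out[BATCH];

        /* Placeholder workload standing in for the 16-wide lookup body. */
        static void dummy_work(int batch_size)
        {
            #pragma omp parallel for
            for (int index = 0; index < batch_size; index += 16)
                out[index] = index * 2;
        }

        int main(void)
        {
            const int counts[] = { 60, 120, 180, 240 };

            for (int i = 0; i < 4; ++i) {
                omp_set_num_threads(counts[i]);
                dummy_work(BATCH);                  /* untimed warm-up at this thread count */
                double t0 = omp_get_wtime();
                dummy_work(BATCH);
                double t1 = omp_get_wtime();
                printf("num_threads = %3d: %.1f us\n", counts[i], (t1 - t0) * 1e6);
            }
            return 0;
        }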

6) The VTune Amplifier XE GUI doesn't seem to support profiling of the offloaded region.
(Which is another reason I'm writing separate code for the host and the device.)
Am I doing something wrong? Or has it always been this way?
I ran a Knights Corner analysis to see how my code fares, but it didn't show the hotspots in the offloaded region.
It only showed a bunch of scif_wait() calls from the host.

Thank you for your attention, and I welcome any insights or information that might aid my situation. 

Jun

 

