Channel: Intel® Many Integrated Core Architecture

Where statement and Vectorization


I would like to know how the WHERE statement affects vectorization.

My belief is that it is bad for vectorization.

Here is a short excerpt from the original code:

 where ( LMASK )

            WORK1(:,:,kk) =  KAPPA_THIC(:,:,kbt,k,bid)  &
                           * SLX(:,:,kk,kbt,k,bid) * dz(k)
.
.
.
endwhere

The optimization report (optrpt) seems to suggest the same for the assignment inside the WHERE block:

LOOP BEGIN at /storage/home/aketh/cesm/cases/B_f45_g37/exe/ocn/source/hmix_gm.F90(3641,13)
      remark #15389: vectorization support: reference work1 has unaligned access
      remark #15389: vectorization support: reference hmix_gm_mp_kappa_thic_ has unaligned access
      remark #15389: vectorization support: reference hmix_gm_submeso_share_mp_slx_ has unaligned access
      remark #15381: vectorization support: unaligned access used inside loop body
      remark #15335: loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override
      remark #15450: unmasked unaligned unit stride loads: 1
      remark #15475: --- begin vector loop cost summary ---
      remark #15476: scalar loop cost: 23
      remark #15477: vector loop cost: 55.000
      remark #15478: estimated potential speedup: 0.420
      remark #15479: lightweight vector operations: 13
      remark #15480: medium-overhead vector operations: 2
      remark #15481: heavy-overhead vector operations: 3
      remark #15488: --- end vector loop cost summary ---

But for the loop at the WHERE statement itself, the report shows some benefit. What does this mean? That the WHERE loop runs faster?

LOOP BEGIN at /storage/home/aketh/cesm/cases/B_f45_g37/exe/ocn/source/hmix_gm.F90(3639,11)
   <Multiversioned v2>
      remark #15388: vectorization support: reference 4692 has aligned access
      remark #15388: vectorization support: reference lmask has aligned access
      remark #15300: LOOP WAS VECTORIZED
      remark #15448: unmasked aligned unit stride loads: 1
      remark #15449: unmasked aligned unit stride stores: 1
      remark #15475: --- begin vector loop cost summary ---
      remark #15476: scalar loop cost: 4
      remark #15477: vector loop cost: 0.750
      remark #15478: estimated potential speedup: 5.300
      remark #15479: lightweight vector operations: 3
      remark #15488: --- end vector loop cost summary ---
   LOOP END

I rewrote the code as 

WORK1(:,:,kk) = (1 + LMASK)*WORK1(:,:,kk) + (KAPPA_THIC(:,:,kbt,k,bid)  &
                                        * SLX(:,:,kk,kbt,k,bid) * dz(k)) * LMASK * -1

which seemed to work well according to the optrpt:

      remark #15478: estimated potential speedup: 1.930
      remark #15479: lightweight vector operations: 20
      remark #15480: medium-overhead vector operations: 1
      remark #15487: type converts: 2
      remark #15488: --- end vector loop cost summary ---
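
The rewrite above uses LMASK in integer arithmetic. That only selects the right values if .TRUE. converts to -1 and .FALSE. to 0, which is a compiler extension rather than standard Fortran. For comparison, here is a minimal, self-contained sketch of the same masked update written with the standard MERGE intrinsic. The array names, sizes and values below are made up for illustration and are not taken from hmix_gm.F90, and nothing here claims that MERGE is faster than WHERE; it only shows that the two forms give the same result, so either can be timed.

```fortran
program where_vs_merge
  implicit none
  ! Hypothetical small sizes and arrays, for illustration only.
  integer, parameter :: nx = 4, ny = 3
  real    :: work_where(nx,ny), work_merge(nx,ny)
  real    :: kappa(nx,ny), slx(nx,ny), dz
  logical :: lmask(nx,ny)
  integer :: i, j

  ! Fill the inputs with simple, reproducible values.
  do j = 1, ny
    do i = 1, nx
      kappa(i,j) = real(i)
      slx(i,j)   = real(j)
      lmask(i,j) = mod(i + j, 2) == 0
    end do
  end do
  dz = 0.5
  work_where = -1.0
  work_merge = -1.0

  ! Original form: assign only where the mask is true.
  where (lmask)
    work_where = kappa * slx * dz
  end where

  ! MERGE form: take the new value where the mask is true and
  ! keep the old value elsewhere. Both sources may be evaluated
  ! for every element, so this can do more work than WHERE.
  work_merge = merge(kappa * slx * dz, work_merge, lmask)

  ! Both forms should produce identical arrays.
  if (any(work_where /= work_merge)) then
    print *, 'MISMATCH'
    error stop 1
  end if
  print *, 'OK'
end program where_vs_merge
```

Note that the MERGE form, like the arithmetic rewrite, reads and writes every element of the array regardless of the mask, while the WHERE form only has to store the masked elements.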

However, the final timings per iteration are as follows (Xeon runs only):

Xeon unchanged code 9.1004E-004 seconds per iteration

Xeon changed code 2.0971E-003 seconds per iteration

Am I doing something wrong in this optimization?

 

 

