I would like to know how the where statement affects vectorization.
My belief is that it is bad for vectorization.
Here is a short part of the original code:
where ( LMASK )
   WORK1(:,:,kk) = KAPPA_THIC(:,:,kbt,k,bid) &
                   * SLX(:,:,kk,kbt,k,bid) * dz(k)
   . . .
endwhere
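For reference, a masked assignment like this is equivalent to explicit loops with a branch on the mask. The following is only a minimal sketch: the array sizes, the test values, and the flat 2D names a, b, w1, w2 and dzk are made up for illustration and are not the real arrays from hmix_gm.F90.

```fortran
program where_equivalent
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nx = 8, ny = 4        ! hypothetical sizes
  real(dp), parameter :: dzk = 0.5_dp         ! stands in for dz(k)
  logical  :: lmask(nx, ny)
  real(dp) :: a(nx, ny), b(nx, ny)            ! stand-ins for KAPPA_THIC and SLX slices
  real(dp) :: w1(nx, ny), w2(nx, ny)
  integer  :: i, j

  ! made-up test data
  do j = 1, ny
    do i = 1, nx
      a(i, j) = real(i, dp)
      b(i, j) = real(j, dp)
      lmask(i, j) = mod(i + j, 2) == 0
    end do
  end do
  w1 = -1.0_dp
  w2 = -1.0_dp

  ! masked array assignment, as in the original code
  where (lmask)
    w1 = a * b * dzk
  end where

  ! the same update written as explicit loops with a branch
  do j = 1, ny
    do i = 1, nx
      if (lmask(i, j)) w2(i, j) = a(i, j) * b(i, j) * dzk
    end do
  end do

  if (any(w1 /= w2)) error stop 'where and explicit loop differ'
  print *, 'where and explicit loop agree'
end program where_equivalent
```

Elements where the mask is false keep their old value in both forms, which is the part that a vectorized version has to handle with masked stores or a blend.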
Even the optimization report (optrpt) seems to suggest the same:
LOOP BEGIN at /storage/home/aketh/cesm/cases/B_f45_g37/exe/ocn/source/hmix_gm.F90(3641,13)
remark #15389: vectorization support: reference work1 has unaligned access
remark #15389: vectorization support: reference hmix_gm_mp_kappa_thic_ has unaligned access
remark #15389: vectorization support: reference hmix_gm_submeso_share_mp_slx_ has unaligned access
remark #15381: vectorization support: unaligned access used inside loop body
remark #15335: loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override
remark #15450: unmasked unaligned unit stride loads: 1
remark #15475: --- begin vector loop cost summary ---
remark #15476: scalar loop cost: 23
remark #15477: vector loop cost: 55.000
remark #15478: estimated potential speedup: 0.420
remark #15479: lightweight vector operations: 13
remark #15480: medium-overhead vector operations: 2
remark #15481: heavy-overhead vector operations: 3
remark #15488: --- end vector loop cost summary ---
But for the loop reported at the where statement itself, the optrpt shows some benefit. What does this mean? Does the where loop run faster?
LOOP BEGIN at /storage/home/aketh/cesm/cases/B_f45_g37/exe/ocn/source/hmix_gm.F90(3639,11)
<Multiversioned v2>
remark #15388: vectorization support: reference 4692 has aligned access
remark #15388: vectorization support: reference lmask has aligned access
remark #15300: LOOP WAS VECTORIZED
remark #15448: unmasked aligned unit stride loads: 1
remark #15449: unmasked aligned unit stride stores: 1
remark #15475: --- begin vector loop cost summary ---
remark #15476: scalar loop cost: 4
remark #15477: vector loop cost: 0.750
remark #15478: estimated potential speedup: 5.300
remark #15479: lightweight vector operations: 3
remark #15488: --- end vector loop cost summary ---
LOOP END
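One way to read the two cost summaries, assuming the estimated potential speedup is roughly the scalar loop cost divided by the vector loop cost (the report does not state this, but the numbers are consistent with it):

$$
\frac{23}{55} \approx 0.42, \qquad \frac{4}{0.75} \approx 5.3
$$

Under that reading, a value below 1 (the loop at line 3641) means the compiler expects the vector version to be slower than the scalar one, and a value above 1 (the loop at line 3639) means it expects a gain. Both are the compiler's own estimates, not measured run times.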
I rewrote the code as:
WORK1(:,:,kk) = (1 + LMASK)*WORK1(:,:,kk) + (KAPPA_THIC(:,:,kbt,k,bid) &
* SLX(:,:,kk,kbt,k,bid) * dz(k)) * LMASK * -1
which seemed to work well according to the optrpt:
remark #15478: estimated potential speedup: 1.930
remark #15479: lightweight vector operations: 20
remark #15480: medium-overhead vector operations: 1
remark #15487: type converts: 2
remark #15488: --- end vector loop cost summary ---
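The arithmetic form above uses LMASK as a number, which only works if the compiler lets a logical take part in arithmetic (and the factor of -1 suggests that true is represented as -1 here; that is an assumption about this compiler setup). A minimal sketch of the same blend in standard Fortran, using merge and an explicit 0/1 real mask, is shown below. The sizes, values, and the names a, b, m, w_where, w_merge and w_arith are hypothetical.

```fortran
program mask_blend
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nx = 8, ny = 4        ! hypothetical sizes
  real(dp), parameter :: dzk = 0.5_dp         ! stands in for dz(k)
  logical  :: lmask(nx, ny)
  real(dp) :: a(nx, ny), b(nx, ny), m(nx, ny)
  real(dp) :: w_where(nx, ny), w_merge(nx, ny), w_arith(nx, ny)
  integer  :: i, j

  ! made-up test data
  do j = 1, ny
    do i = 1, nx
      a(i, j) = real(i, dp)
      b(i, j) = real(j, dp)
      lmask(i, j) = mod(i + j, 2) == 0
    end do
  end do
  w_where = -1.0_dp
  w_merge = -1.0_dp
  w_arith = -1.0_dp

  ! 1) masked assignment, as in the original code
  where (lmask) w_where = a * b * dzk

  ! 2) branch-free select: new value where the mask is true, old value elsewhere
  w_merge = merge(a * b * dzk, w_merge, lmask)

  ! 3) arithmetic blend with an explicit 0/1 real mask instead of a logical
  m = merge(1.0_dp, 0.0_dp, lmask)
  w_arith = (1.0_dp - m) * w_arith + m * (a * b * dzk)

  if (any(w_where /= w_merge)) error stop 'merge differs from where'
  if (any(w_where /= w_arith)) error stop 'arithmetic blend differs from where'
  print *, 'all three forms agree on this test data'
end program mask_blend
```

Note that forms 2 and 3 read and write every element, while where only stores to the masked ones, and form 3 is not equivalent if the unmasked inputs contain NaN or Inf, since 0 times NaN is NaN.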
However, the final timings per iteration are as follows (Xeon runs only):
Xeon unchanged code 9.1004E-004 seconds per iteration
Xeon changed code 2.0971E-003 seconds per iteration
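To compare the two forms outside the full model, a small timing harness along these lines might help. Everything in it (sizes, repeat count, mask pattern, array names) is made up, and it only shows the measuring pattern with system_clock. Timings from such a toy case will not necessarily match what happens inside the real code.

```fortran
program time_where
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: i8 = selected_int_kind(18)
  ! hypothetical problem size and repeat count
  integer, parameter :: nx = 512, ny = 512, nrep = 200
  logical,  allocatable :: lmask(:,:)
  real(dp), allocatable :: a(:,:), b(:,:), w(:,:), m(:,:)
  integer(i8) :: t0, t1, rate
  integer     :: i, j, n
  real(dp)    :: chk_where, chk_arith

  allocate(lmask(nx,ny), a(nx,ny), b(nx,ny), w(nx,ny), m(nx,ny))
  do j = 1, ny
    do i = 1, nx
      a(i, j) = real(i, dp)
      b(i, j) = real(j, dp)
      lmask(i, j) = mod(i + j, 3) == 0
    end do
  end do
  m = merge(1.0_dp, 0.0_dp, lmask)          ! 0/1 real copy of the mask
  call system_clock(count_rate=rate)

  ! variant 1: masked assignment
  w = 0.0_dp
  call system_clock(t0)
  do n = 1, nrep
    ! the factor real(n) makes each repeat depend on n so it cannot be hoisted
    where (lmask) w = a * b * real(n, dp)
  end do
  call system_clock(t1)
  chk_where = sum(w)
  print '(a,es12.4)', 'where      s/iter: ', real(t1 - t0, dp) / real(rate, dp) / nrep

  ! variant 2: arithmetic blend over every element
  w = 0.0_dp
  call system_clock(t0)
  do n = 1, nrep
    w = (1.0_dp - m) * w + m * (a * b * real(n, dp))
  end do
  call system_clock(t1)
  chk_arith = sum(w)
  print '(a,es12.4)', 'arithmetic s/iter: ', real(t1 - t0, dp) / real(rate, dp) / nrep

  ! use the results so the loops are not optimized away, and check they agree
  if (abs(chk_where - chk_arith) > 1.0e-9_dp * abs(chk_where)) error stop 'variants disagree'
  print '(a,es16.8)', 'checksum: ', chk_where
end program time_where
```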
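So the changed code is about 2.3 times slower per iteration, even though the optrpt estimate looked better. Am I doing something wrong in the optimization here?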