WHERE statement and vectorization

I would like to know how a WHERE statement affects vectorization.

My belief is that it is bad for vectorization.

Here is a short part of the original code:

where ( LMASK )
   WORK1(:,:,kk) = KAPPA_THIC(:,:,kbt,k,bid)  &
                 * SLX(:,:,kk,kbt,k,bid) * dz(k)
   .
   .
   .
endwhere
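
For reference, my understanding is that a WHERE construct behaves like an element-wise loop with a conditional store, which is why the vectorizer has to use masked operations. A minimal sketch of that equivalence follows; the names a, b, mask, a_loop and the size n are made up for illustration and are not from hmix_gm.F90.

```fortran
program where_as_loop
  implicit none
  integer, parameter :: n = 8          ! hypothetical size
  real    :: a(n), b(n), a_loop(n)
  logical :: mask(n)
  integer :: i

  b      = [(real(i), i = 1, n)]
  mask   = [(mod(i, 2) == 0, i = 1, n)]  ! true for even i
  a      = -1.0
  a_loop = -1.0

  ! Masked array assignment: only elements where mask is true are stored
  where (mask)
    a = 2.0 * b
  end where

  ! Explicit loop with the same effect: one conditional store per element
  do i = 1, n
    if (mask(i)) a_loop(i) = 2.0 * b(i)
  end do

  ! Self-check: stop with an error if the two forms disagree
  if (any(a /= a_loop)) error stop "where and loop disagree"
  print *, "where and explicit loop agree"
end program where_as_loop
```

Elements where the mask is false keep their old value in both forms.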

Even the optimization report (optrpt) seems to suggest the same:

LOOP BEGIN at /storage/home/aketh/cesm/cases/B_f45_g37/exe/ocn/source/hmix_gm.F90(3641,13)
remark #15389: vectorization support: reference work1 has unaligned access
remark #15389: vectorization support: reference hmix_gm_mp_kappa_thic_ has unaligned access
remark #15389: vectorization support: reference hmix_gm_submeso_share_mp_slx_ has unaligned access
remark #15381: vectorization support: unaligned access used inside loop body
remark #15335: loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override
remark #15450: unmasked unaligned unit stride loads: 1
remark #15475: --- begin vector loop cost summary ---
remark #15476: scalar loop cost: 23
remark #15477: vector loop cost: 55.000
remark #15478: estimated potential speedup: 0.420
remark #15479: lightweight vector operations: 13
remark #15480: medium-overhead vector operations: 2
remark #15481: heavy-overhead vector operations: 3
remark #15488: --- end vector loop cost summary ---

But for the loop at the WHERE statement itself, the report seems to show some benefit. What does this mean? That the WHERE loop runs faster?

LOOP BEGIN at /storage/home/aketh/cesm/cases/B_f45_g37/exe/ocn/source/hmix_gm.F90(3639,11)
<Multiversioned v2>
remark #15388: vectorization support: reference 4692 has aligned access
remark #15388: vectorization support: reference lmask has aligned access
remark #15300: LOOP WAS VECTORIZED
remark #15448: unmasked aligned unit stride loads: 1
remark #15449: unmasked aligned unit stride stores: 1
remark #15475: --- begin vector loop cost summary ---
remark #15476: scalar loop cost: 4
remark #15477: vector loop cost: 0.750
remark #15478: estimated potential speedup: 5.300
remark #15479: lightweight vector operations: 3
remark #15488: --- end vector loop cost summary ---
LOOP END

I rewrote the code as:

WORK1(:,:,kk) = (1 + LMASK)*WORK1(:,:,kk) + (KAPPA_THIC(:,:,kbt,k,bid) &
* SLX(:,:,kk,kbt,k,bid) * dz(k)) * LMASK * -1

which seemed to work well according to the optrpt:

remark #15478: estimated potential speedup: 1.930
remark #15479: lightweight vector operations: 20
remark #15480: medium-overhead vector operations: 1
remark #15487: type converts: 2
remark #15488: --- end vector loop cost summary ---
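
The rewrite above appears to rely on LMASK being usable as a number, with true mapping to -1. As far as I know that is a compiler extension rather than standard Fortran. A standard-conforming way to write the same "compute everywhere, then select" idea is the MERGE intrinsic. The sketch below assumes LMASK is a logical array. The array names, sizes and values are stand-ins for the real WORK1, KAPPA_THIC, SLX and dz(k), and it only checks that the results match, not that it is any faster.

```fortran
program merge_blend
  implicit none
  integer, parameter :: nx = 4, ny = 3   ! hypothetical sizes
  real    :: work_where(nx, ny), work_merge(nx, ny)
  real    :: kappa(nx, ny), slx(nx, ny)  ! stand-ins for KAPPA_THIC and SLX slices
  real    :: dzk                         ! stand-in for dz(k)
  logical :: lmask(nx, ny)
  integer :: i, j

  ! Fill the inputs with simple, made-up values
  dzk = 10.0
  do j = 1, ny
    do i = 1, nx
      kappa(i, j) = real(i + j)
      slx(i, j)   = 0.5 * real(i - j)
      lmask(i, j) = mod(i + j, 2) == 0
    end do
  end do
  work_where = 1.0
  work_merge = 1.0

  ! Original form: masked assignment
  where (lmask)
    work_where = kappa * slx * dzk
  end where

  ! Blend form: evaluate the product for every element, then pick
  ! the new value where lmask is true and the old value elsewhere
  work_merge = merge(kappa * slx * dzk, work_merge, lmask)

  ! Self-check: stop with an error if the two forms disagree
  if (any(work_where /= work_merge)) error stop "results differ"
  print *, "where and merge agree"
end program merge_blend
```

Note that MERGE evaluates both source expressions for all elements, so this form does extra arithmetic wherever the mask is false.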

However, the final timings per iteration are as follows (Xeon runs only):

Xeon unchanged code 9.1004E-004 seconds per iteration

Xeon changed code 2.0971E-003 seconds per iteration

Am I doing something wrong in my optimization here?