double precision vs. single precision and vector alignment on MIC

Hello everyone, I have a subroutine that serves as the force calculation in an n-body simulation shown below. There a couple of arrays that I will describe which will hopefully make the subroutine a little more clear.

1. array position. it contains positions of particles and atom types. it is of length number of particles np

type aos
    real*4 :: x,y,z
    integer :: type
end type aos

type(aos) :: position(np)

2. array neigh. it contains a list of pointers to particles that are within a certain distance of a given particle. it is of type integer.

3. array nnstart and nnfinish. nnstart(i) and nnfinish(i) gives the starting and ending locations for information for particle i in array neigh. If I am particle i, my neighbors are contained in the indices between nnstart(i) and nnfinish(i) in neigh. these are also integers

4. array q. array of real*4s. charges of given atom

5. lj1,lj2,lj3,lj4. these are all arrays of real*4s. they provide parameters for the atomic interactions.

6. array ff. this is the array that stores the forces for each particle. it is of the following type

type ff_data
     real*4 :: x,y,z
end type ff_data

all these arrays I have as globals and are defined for the mic as follows

!dir$ attributes offload:mic::position

My code was compiled using the -align array64byte flag. However, the inclusion of the !dir$ vector aligned statement on line 43 to assist vectorization of the inner loop produces a seg fault. I am very curious as to why, despite being compiled using the align array64byte, my arrays are not aligned. Any suggestions?

My second question is that I have ran this subroutine (shown below using single precision) using both full single precision and full double precision. In the single precision case, all array components are real*4 vs double precision. However, I only see about a 20% speed up. Now given that my bandwidth should double for SP vs. DP, and that my vector registers should be doing twice as many calculations, I expect a factor of two speed up. Is there something wrong with this reasoning, or is there something going on with this code?

Finally, does anyone have any suggestions for my current nearest integer function on line 54? I was given it as a suggestion on this forum, but don't know if there are faster alternatives.

subroutine lj_cut_coul_dsf_nonewton_SP(step,printflag)
    implicit none
    real*4 :: force,forcelj,forcecoul
    real*4 :: x1,y1,z1,x2,y2,z2
    real*4 :: dx,dy,dz,dr,dr2,dr2i,dr6i,dr12i
    double precision ::  ffx,ffy,ffz
    real*4 :: fudge_factor
    real*4 :: qtmp,r,prefactor,erfcc,erfcd,t
    real*4 :: boxdx,boxdy,boxdz
    real*4 :: scalelj, scalecoul
    integer :: i,j,l,step
    integer :: itype,jtype,neigh
    integer :: tid,num
    integer :: offset
    integer :: printflag
    integer :: T1,T2,clock_rate,clock_max

    call system_clock(T1,clock_rate,clock_max)

    !dir$ offload begin target(mic:0) in(position: alloc_if(.false.),free_if(.false.)),&
    !dir$ inout(ff: alloc_if(.false.), free_if(.false.)),&
    !dir$ in(nlist: alloc_if(.false.) free_if(.false.)),&
    !dir$ in(nnfirst: alloc_if(.false.) free_if(.false.)),&
    !dir$ in(nnlast: alloc_if(.false.) free_if(.false.)),&
    !dir$ in(q: alloc_if(.false.) free_if(.false.)),&
    !dir$ nocopy(lj1: alloc_if(.false.) free_if(.false.)),&
    !dir$ nocopy(lj2: alloc_if(.false.) free_if(.false.)),&
    !dir$ nocopy(lj3: alloc_if(.false.) free_if(.false.)),&
    !dir$ nocopy(lj4: alloc_if(.false.) free_if(.false.))

    !$omp parallel do schedule(dynamic) reduction(+:potential,e_coul,ffx,ffy,ffz) default(private),&
    !$omp& shared(box,ibox,rcut2,cut_coulsq,alpha,EWALD_P,MY_PIS,np,numAtomType),&
    !$omp& shared(A1,A2,A3,A4,A5),&
    !$omp& shared(e_shift,f_shift),&
    !$omp& shared(position,ff,nlist,nnfirst,nnlast,q,lj1,lj2,lj3,lj4)
    do i = 1 ,np
       !---load data for particle i: positions, type, and charge
       x1 = position(i)%x; y1 = position(i)%y; z1 = position(i)%z; itype = position(i)%type
       qtmp = q(i)

       ffx = 0.0d0; ffy = 0.0d0; ffz = 0.0d0

       !dir$ vector aligned
       !dir$ simd reduction(+:potential,e_coul,ffx,ffy,ffz)
       !... loop over neighbors for particle i
       do j= nnfirst(i),nnlast(i)

          !---load neighboring particle index
          neigh = nlist(j)

          !---calculate distance between particles using a vectorizable nearest integer function
          dx = x1-position(neigh)%x; dy = y1-position(neigh)%y; dz = z1-position(neigh)%z
          boxdx = dx*ibox; boxdy = dy*ibox; boxdz = dz*ibox
          boxdx = (boxdx+sign(1/(epsilon(boxdx)),boxdx)) -sign(1/epsilon(boxdx),dx)
          boxdy = (boxdy+sign(1/(epsilon(boxdy)),boxdy)) -sign(1/epsilon(boxdy),dy)
          boxdz = (boxdz+sign(1/(epsilon(boxdz)),boxdz)) -sign(1/epsilon(boxdz),dz)
          dx = dx-box*boxdx; dy = dy-box*boxdy; dz = dz-box*boxdz
          dr2 = dx*dx + dy*dy + dz*dz

          scalelj = 0.0d0
          if(dr2.lt.rcut2)scalelj = 1.0d0

          dr2i = 1.0d0/dr2
          dr6i = dr2i*dr2i*dr2i

          offset  = numAtomType*(itype-1)+jtype
          forcelj = scalelj*dr6i*(lj1(offset)*dr6i-lj2(offset))
          potential = potential + scalelj*dr6i*(dr6i*lj3(offset)-lj4(offset))


          scalecoul = 0.0d0
          if(dr2.lt.cut_coulsq)scalecoul = 1.0d0

          r = sqrt(dr2)
          prefactor =  scalecoul*qtmp*q(neigh)/r
          erfcd = exp(-alpha*alpha*r*r)
          t = 1.0 / (1.0 + EWALD_P*alpha*r)
          erfcc = t * (A1+t*(A2+t*(A3+t*(A4+t*A5)))) * erfcd
          forcecoul = prefactor * (erfcc/r + 2.0*alpha/MY_PIS * erfcd +&
               r*f_shift) * r

          e_coul = e_coul + prefactor*(erfcc-r*e_shift-dr2*f_shift)


          force   = (forcecoul+forcelj)*dr2i


          ffx = ffx + dx*force
          ffy = ffy + dy*force
          ffz = ffz + dz*force

       enddo
       ff(i)%x = ffx; ff(i)%y = ffy; ff(i)%z = ffz
    enddo
    !$omp end parallel do
    !dir$ end offload

    potential = 0.50d0*potential
    e_coul = 0.50d0*e_coul

    call system_clock(T2,clock_rate,clock_max)

    if(printflag.eq.1)then
       print*,'elapsed time w/o newton SP',real(T2-T1)/real(clock_rate)
       print*,'calculated potentials:',potential*reps, e_coul*reps
    endif

  end subroutine lj_cut_coul_dsf_nonewton_SP