Hello everyone, I have a subroutine, shown below, that serves as the force calculation in an n-body simulation. There are a couple of arrays involved that I will describe first, which will hopefully make the subroutine a little clearer.
1. array position. It contains the positions of the particles along with their atom types, and has length np, the number of particles. It is of the derived type:
type aos
   real*4  :: x, y, z
   integer :: type
end type aos
type(aos) :: position(np)
2. array nlist. An integer array containing, for each particle, the indices of the particles that lie within a certain distance of it (the neighbor list).
3. arrays nnfirst and nnlast. Integer arrays giving the starting and ending locations of a particle's entries in nlist: if I am particle i, my neighbors are stored at indices nnfirst(i) through nnlast(i) of nlist (see the sketch after this list).
4. array q. An array of real*4s holding the charge of each atom.
5. arrays lj1, lj2, lj3, lj4. Arrays of real*4s providing the parameters for the atomic interactions.
6. array ff. The array that stores the force on each particle. It is of the following type:
type ff_data
   real*4 :: x, y, z
end type ff_data
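Putting the pieces together, the layout looks like this (np, maxneigh, and numAtomType are placeholder sizes here, and the neighbor-list build itself is omitted):

! Sketch of the data layout described above (placeholder sizes, no setup).
type(aos)     :: position(np)            ! per-particle position + atom type
type(ff_data) :: ff(np)                  ! per-particle accumulated force
real*4        :: q(np)                   ! per-particle charge
real*4        :: lj1(numAtomType**2)     ! flattened type-pair table, indexed
                                         ! by numAtomType*(itype-1)+jtype
integer       :: nlist(maxneigh)         ! all neighbor indices, concatenated
integer       :: nnfirst(np), nnlast(np) ! particle i's slice of nlist
integer       :: i, j, neigh

! the neighbors of particle i occupy nlist(nnfirst(i):nnlast(i))
do j = nnfirst(i), nnlast(i)
   neigh = nlist(j)                      ! global index of one neighbor of i
enddo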
All of these arrays are globals, and they are declared for the MIC as follows:
!dir$ attributes offload:mic::position
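with the same attribute line repeated for each of the other arrays, e.g.:

!dir$ attributes offload:mic :: ff, nlist, nnfirst, nnlast, q
!dir$ attributes offload:mic :: lj1, lj2, lj3, lj4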
My code was compiled with the -align array64byte flag. However, including the !dir$ vector aligned directive just before the inner neighbor loop, to assist vectorization, produces a segfault. I am very curious why my arrays are apparently not aligned despite compiling with -align array64byte. Any suggestions?
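One way I know to spot-check the base alignment is shown below (loc() is an Intel Fortran extension that returns an address as an integer; an address is 64-byte aligned when it is divisible by 64):

! Sketch: spot-check that the base addresses really are 64-byte aligned.
print *, 'position aligned?', mod(loc(position), 64_8) == 0
print *, 'q        aligned?', mod(loc(q),        64_8) == 0
print *, 'nlist    aligned?', mod(loc(nlist),    64_8) == 0
print *, 'ff       aligned?', mod(loc(ff),       64_8) == 0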
My second question: I have run this subroutine (shown below in single precision) in both full single precision and full double precision; in the single-precision case every array component is real*4 rather than real*8. However, I only see about a 20% speedup. Given that the data volume halves (so my effective bandwidth should double) and that my vector registers should be doing twice as many calculations per instruction, I expected a factor-of-two speedup. Is there something wrong with this reasoning, or is there something going on with this code?
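For comparison, I would expect a trivial bandwidth-bound kernel, such as the sketch below, to show close to the full factor of two between SP and DP; if it does on this card, the gap would seem to be specific to the force kernel (the size n is an arbitrary placeholder):

! Sketch: stream-like triads to compare achievable SP vs DP throughput.
program triad_sp_dp
   implicit none
   integer, parameter :: n = 10000000
   real*4,           allocatable :: a4(:), b4(:), c4(:)
   double precision, allocatable :: a8(:), b8(:), c8(:)
   integer :: i, t1, t2, rate
   allocate(a4(n), b4(n), c4(n), a8(n), b8(n), c8(n))
   b4 = 1.0; c4 = 2.0; b8 = 1.0d0; c8 = 2.0d0
   call system_clock(t1, rate)
   !$omp parallel do
   do i = 1, n
      a4(i) = b4(i) + 1.5*c4(i)        ! SP: 12 bytes moved per iteration
   enddo
   call system_clock(t2)
   print *, 'SP triad time:', real(t2-t1)/real(rate)
   call system_clock(t1)
   !$omp parallel do
   do i = 1, n
      a8(i) = b8(i) + 1.5d0*c8(i)      ! DP: 24 bytes moved per iteration
   enddo
   call system_clock(t2)
   print *, 'DP triad time:', real(t2-t1)/real(rate)
   print *, a4(1), a8(1)               ! keep the results live
end program triad_sp_dp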
Finally, does anyone have suggestions for the nearest-integer function used for the minimum-image convention in the inner loop below? It was given to me as a suggestion on this forum, but I don't know whether there are faster alternatives.
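For reference, the trick rounds to the nearest integer by adding and subtracting sign(1/epsilon(x), x): epsilon() for real*4 is 2**(-23), so the constant is 2**23, and the add/subtract snaps the value to a whole number. One alternative I have not timed is the anint intrinsic, assuming it vectorizes well on the MIC:

! Possible alternative (untimed): the round-to-nearest intrinsic.
boxdx = anint(dx*ibox)
boxdy = anint(dy*ibox)
boxdz = anint(dz*ibox)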
subroutine lj_cut_coul_dsf_nonewton_SP(step,printflag)
implicit none
real*4 :: force,forcelj,forcecoul
real*4 :: x1,y1,z1,x2,y2,z2
real*4 :: dx,dy,dz,dr,dr2,dr2i,dr6i,dr12i
double precision :: ffx,ffy,ffz
real*4 :: fudge_factor
real*4 :: qtmp,r,prefactor,erfcc,erfcd,t
real*4 :: boxdx,boxdy,boxdz
real*4 :: scalelj, scalecoul
integer :: i,j,l,step
integer :: itype,jtype,neigh
integer :: tid,num
integer :: offset
integer :: printflag
integer :: T1,T2,clock_rate,clock_max

call system_clock(T1,clock_rate,clock_max)

!dir$ offload begin target(mic:0) in(position: alloc_if(.false.), free_if(.false.)),&
!dir$ inout(ff: alloc_if(.false.), free_if(.false.)),&
!dir$ in(nlist: alloc_if(.false.), free_if(.false.)),&
!dir$ in(nnfirst: alloc_if(.false.), free_if(.false.)),&
!dir$ in(nnlast: alloc_if(.false.), free_if(.false.)),&
!dir$ in(q: alloc_if(.false.), free_if(.false.)),&
!dir$ nocopy(lj1: alloc_if(.false.), free_if(.false.)),&
!dir$ nocopy(lj2: alloc_if(.false.), free_if(.false.)),&
!dir$ nocopy(lj3: alloc_if(.false.), free_if(.false.)),&
!dir$ nocopy(lj4: alloc_if(.false.), free_if(.false.))
!$omp parallel do schedule(dynamic) reduction(+:potential,e_coul,ffx,ffy,ffz) default(private),&
!$omp& shared(box,ibox,rcut2,cut_coulsq,alpha,EWALD_P,MY_PIS,np,numAtomType),&
!$omp& shared(A1,A2,A3,A4,A5),&
!$omp& shared(e_shift,f_shift),&
!$omp& shared(position,ff,nlist,nnfirst,nnlast,q,lj1,lj2,lj3,lj4)
do i = 1, np
   !---load data for particle i: position, type, and charge
   x1 = position(i)%x; y1 = position(i)%y; z1 = position(i)%z
   itype = position(i)%type
   qtmp  = q(i)
   ffx = 0.0d0; ffy = 0.0d0; ffz = 0.0d0
   !---loop over the neighbors of particle i
   !dir$ vector aligned
   !dir$ simd reduction(+:potential,e_coul,ffx,ffy,ffz)
   do j = nnfirst(i), nnlast(i)
      !---load neighboring particle index and type
      neigh = nlist(j)
      jtype = position(neigh)%type
      !---minimum-image separation, via a vectorizable nearest-integer trick
      dx = x1 - position(neigh)%x
      dy = y1 - position(neigh)%y
      dz = z1 - position(neigh)%z
      boxdx = dx*ibox; boxdy = dy*ibox; boxdz = dz*ibox
      boxdx = (boxdx + sign(1/epsilon(boxdx),boxdx)) - sign(1/epsilon(boxdx),dx)
      boxdy = (boxdy + sign(1/epsilon(boxdy),boxdy)) - sign(1/epsilon(boxdy),dy)
      boxdz = (boxdz + sign(1/epsilon(boxdz),boxdz)) - sign(1/epsilon(boxdz),dz)
      dx = dx - box*boxdx; dy = dy - box*boxdy; dz = dz - box*boxdz
      dr2 = dx*dx + dy*dy + dz*dz
      !---Lennard-Jones contribution, masked by the LJ cutoff
      scalelj = 0.0d0
      if (dr2 .lt. rcut2) scalelj = 1.0d0
      dr2i = 1.0d0/dr2
      dr6i = dr2i*dr2i*dr2i
      offset  = numAtomType*(itype-1) + jtype
      forcelj = scalelj*dr6i*(lj1(offset)*dr6i - lj2(offset))
      potential = potential + scalelj*dr6i*(dr6i*lj3(offset) - lj4(offset))
      !---damped-shifted-force Coulomb contribution, masked by its own cutoff
      scalecoul = 0.0d0
      if (dr2 .lt. cut_coulsq) scalecoul = 1.0d0
      r = sqrt(dr2)
      prefactor = scalecoul*qtmp*q(neigh)/r
      erfcd = exp(-alpha*alpha*r*r)
      t = 1.0/(1.0 + EWALD_P*alpha*r)
      erfcc = t*(A1 + t*(A2 + t*(A3 + t*(A4 + t*A5))))*erfcd
      forcecoul = prefactor*(erfcc/r + 2.0*alpha/MY_PIS*erfcd + r*f_shift)*r
      e_coul = e_coul + prefactor*(erfcc - r*e_shift - dr2*f_shift)
      !---accumulate the total force on particle i
      force = (forcecoul + forcelj)*dr2i
      ffx = ffx + dx*force
      ffy = ffy + dy*force
      ffz = ffz + dz*force
   enddo
   ff(i)%x = ffx; ff(i)%y = ffy; ff(i)%z = ffz
enddo
!$omp end parallel do
!dir$ end offload

!---each pair is visited twice (no Newton's third law), so halve the sums
potential = 0.50d0*potential
e_coul    = 0.50d0*e_coul
call system_clock(T2,clock_rate,clock_max)
if (printflag .eq. 1) then
   print*,'elapsed time w/o newton SP', real(T2-T1)/real(clock_rate)
   print*,'calculated potentials:', potential*reps, e_coul*reps
endif
end subroutine lj_cut_coul_dsf_nonewton_SP