Hey everyone, I have a loop structure that looks like the following.
do all atoms i
    numneigh(i) = 0
    do all potential neighbors k
        j = potential_neighbor(k)
        delx = x(j)-x(i); dely = y(j)-y(i); delz = z(j)-z(i)
        dr2 = delx**2 + dely**2 + delz**2
        if(dr2.lt.rcut)
            numneigh(i) = numneigh(i)+1
            neighbor(i)(numneigh(i)) = j
        endif
    end do
enddo
I am now reading a paper that discusses how to implement this code efficiently on a MIC. Clearly the above loop will not auto-vectorize because of the loop-carried dependence in the two statements inside the if block: the increment of numneigh(i) and the store to neighbor(i)(numneigh(i)) both depend on the count left behind by earlier iterations. The authors mention that an efficient way to vectorize appending to lists in this manner is to use a packed store. They further state: "For a SIMD register packed with req values, the result of a comparison with Rc+Rs is a W bit mask, and a packed store writes a subset of indices to contiguous memory based upon the mask."

However, as a Chemical Engineering PhD, I don't really know what's going on here. I ran a decent Google search, but the information was a little above my head. Could anyone explain this concept further? And, if possible, modify the above code to use this procedure so that I have an example to look at?
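For reference, here is my best guess so far at what the quoted sentence describes, written as a minimal sketch in portable C that emulates the SIMD lanes with short loops. Everything named here is my own assumption and not from the paper: the lane count W = 8, the helper names compare_lt_mask, packed_store and build_neighbors, and the 0-based flat arrays. The cutoff is compared against the squared distance exactly as in the loop above.

```c
#define W 8  /* assumed SIMD width in lanes (hypothetical value) */

/* Step 1: compare W squared distances with the cutoff in one go.
   Bit l of the returned mask is set when lane l passes the test. */
unsigned compare_lt_mask(const float *dr2, float rcut)
{
    unsigned mask = 0;
    for (int l = 0; l < W; ++l)
        if (dr2[l] < rcut)
            mask |= 1u << l;
    return mask;
}

/* Step 2: the packed store. Only lanes whose mask bit is set are
   written, back to back, starting at dst. Returns how many were written. */
int packed_store(int *dst, const int *src, unsigned mask)
{
    int n = 0;
    for (int l = 0; l < W; ++l)
        if (mask & (1u << l))
            dst[n++] = src[l];
    return n;
}

/* Neighbor list for one atom i, taking candidates W at a time.
   neigh must have room for ncand entries. Returns numneigh(i). */
int build_neighbors(int i, const float *x, const float *y, const float *z,
                    const int *cand, int ncand, float rcut, int *neigh)
{
    int count = 0;
    for (int k = 0; k < ncand; k += W) {
        float dr2[W];
        int idx[W];
        for (int l = 0; l < W; ++l) {
            if (k + l < ncand) {
                int j = cand[k + l];
                float delx = x[j] - x[i];
                float dely = y[j] - y[i];
                float delz = z[j] - z[i];
                dr2[l] = delx * delx + dely * dely + delz * delz;
                idx[l] = j;
            } else {
                dr2[l] = rcut;  /* padding lane: fails the strict < test */
                idx[l] = -1;
            }
        }
        /* The append is done once per group of W candidates,
           instead of once per candidate inside an if block. */
        unsigned mask = compare_lt_mask(dr2, rcut);
        count += packed_store(neigh + count, idx, mask);
    }
    return count;
}
```

If I understand correctly, the point is that the if block disappears from the inner loop: the comparison produces a mask for all W lanes at once, and the count only advances once per group. On real hardware each of the two helper loops would presumably be a single instruction; as far as I can tell, the AVX-512 intrinsic _mm512_mask_compressstoreu_epi32 performs this kind of masked contiguous store. Corrections welcome.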