Hello, for my thesis I ran a simple code that studies a Lennard-Jones system on a Xeon Phi coprocessor, then tried to vectorize it and measure the change in execution time.
The machine I used has 61 cores, each with 32 kB of L1 cache and 512 kB of L2 cache, and its vector registers are 512 bits wide.
I implemented the code both with and without the cell-list method and ran it with different numbers of particles, from 512 to 16384, doubling the number each time.
Positions and forces are stored in three separate arrays each (rx, ry, rz and fx, fy, fz).
The results without the cell list look good, but with the cell list I get some strange results.
With the cell-list method the execution time should scale linearly with the number of particles. Indeed, plotting the time against the number of particles gives a straight line, except that for N=8192 and N=16384 the execution time is much higher.
I also ran some values of N close to these two, and the scaling is correct for all of them; only those two values show the problem.
To make this clear, here are some values:
    N        Time
    512      6.14995
    1024     11.1381
    2048     23.1964
    4096     51.9393
    6144     78.1251
    8192     389.724
    10240    144.173
    12288    167.772
    14336    209.669
    16384    822.131
I think it is a technical problem, but I really don't know why this happens.
I also observed a much smaller gain from vectorization: without the cell list the speedup is roughly 4x, but with the cell list it is only around 1.5x.
Questions:
Does anybody have an idea of what the problem could be? Why are those particular values strange, and why is the vectorization gain so low?
My professor told me that it can happen that some values give strange execution results. Has anybody observed something like this?
Thank you very much.
Below is the main loop in which the forces are evaluated, i.e. the main part of the execution, implemented with the cell list.
    for(vcy=0; vcy<ncell; vcy++){
        for(vcx=0; vcx<ncell; vcx++){
            previouspartc=0;
            // Central cell index
            c=vcx*ncell+vcy;
            // Define previouspart
            for(p=1; p<=c; p++) previouspartc=previouspartc+npart[p-1];
            // Loop over central cell's particles
            for(i=0; i<npart[c]-1; i++){
                for(j=i+1; j<npart[c]; j++){
                    ftempx=0.; ftempy=0.;
                    dx =rx1[previouspartc+i]-rx1[previouspartc+j];
                    dy =ry1[previouspartc+i]-ry1[previouspartc+j];
                    dx = (dx + 0.5*dy)*L;
                    dy = dy*halfsq3*L;
                    r2 = dx*dx + dy*dy;
                    if(r2<r2cut) {
                        rr2 = 1./r2;
                        rr6 = rr2*rr2*rr2;
                        enk+=(c12*rr6 -c6)*rr6 -ecut;
                        vir=(cf12*rr6-cf6)*rr6*rr2;
                        ftempx=vir*dx;
                        ftempy=vir*dy;
                    }
                    fx1[previouspartc+i]+=ftempx;
                    fy1[previouspartc+i]+=ftempy;
                    fx1[previouspartc+j]-=ftempx;
                    fy1[previouspartc+j]-=ftempy;
                }
            }
            // Create the two indexes vcx1, vcy1 of the neighbour cells (the one on the right and the three under)
            vcx1[0]=vcx+1; vcy1[0]=vcy;
            for(k=1; k<4; k++){
                vcx1[k]=vcx-1+(k-1);
                vcy1[k]=vcy-1;
            }
            // Loop over near cells
            for(k=0; k<4; k++){
                previouspartc1=0;
                // PBC
                shiftx=0.; shifty=0.;
                if(vcx1[k] <0){ shiftx= -1; vcx1[k]=ncell-1;}
                else if(vcx1[k] >=ncell){ shiftx= 1; vcx1[k]=0;}
                if(vcy1[k] <0){ shifty= -1; vcy1[k]=ncell-1;}
                else if(vcy1[k] >=ncell){ shifty= 1; vcy1[k]=0;}
                // Scalar cell index of neighbour cell
                c1=vcx1[k]*ncell+vcy1[k];
                // Define previouspart
                for(p=1; p<=c1; p++) previouspartc1=previouspartc1+npart[p-1];
                for(i=0; i<npart[c]; i++){
                    for(j=0; j<npart[c1]; j++){
                        ftempx=0.; ftempy=0.;
                        dx =rx1[previouspartc+i]-(rx1[previouspartc1+j]+shiftx);
                        dy =ry1[previouspartc+i]-(ry1[previouspartc1+j]+shifty);
                        dx = (dx + 0.5*dy)*L;
                        dy = dy*halfsq3*L;
                        r2 = dx*dx + dy*dy;
                        if(r2<r2cut) {
                            rr2 = 1./r2;
                            rr6 = rr2*rr2*rr2;
                            enk+=(c12*rr6 -c6)*rr6 -ecut;
                            vir=(cf12*rr6-cf6)*rr6*rr2;
                            ftempx=vir*dx;
                            ftempy=vir*dy;
                        }
                        fx1[previouspartc+i]+=ftempx;
                        fy1[previouspartc+i]+=ftempy;
                        fx1[previouspartc1+j]-=ftempx;
                        fy1[previouspartc1+j]-=ftempy;
                    }
                }
            }
        }
    }