Poor performance gain from vectorization (compared with -no-vec -no-sim) // VTune and libiomp5.so

Hello everybody,

After three long days, I come here in search of help.

Context:
I am running an N-body code written in Fortran that uses OpenMP directives (the only one used is a $omp parallel directive). I am running the code natively on the Xeon Phi, and it executes without problems. However, after much reading, I still cannot bring down the execution time. It is fast, but not as fast as it should be. What do I mean by that?

I have run a few tests for two particle counts, 960 and 9600, with different flags/directives. The results shown below correspond to calculating the Coulomb repulsion of N particles 101 times. I averaged over the last 100 runs (I threw away the first run to make sure there are no initialization issues) and also computed the standard deviation of those 100 execution times. The results are:

1st case: vectorization and SIMD disabled
flags = -mmic -w -O3 -opt-matmul -no-prec-div -ip -ipp -fpp -openmp -par-num-threads=240 -align array64byte -vec-report0 -no-vec -no-sim

Nparticles / average execution time / standard deviation
960 1.155452728271484E-003 5.689634014904931E-005
9600 6.604535579681396E-002 1.237208427569816E-004

2nd case: vectorization and SIMD enabled
flags = -mmic -w -O3 -opt-matmul -no-prec-div -ip -ipp -fpp -openmp -par-num-threads=240 -align array64byte -vec-report0

Nparticles / average execution time / standard deviation
960 3.155016899108887E-004 7.613400033109344E-006
9600 2.110156297683716E-002 9.613492695550194E-005

9600 → time(-no-sim -no-vec) / time() = 3.13
960 → time(-no-sim -no-vec) / time() = 3.66

If I understand correctly, I should get a theoretical speed-up of 8 (not 3) on an Intel Xeon Phi (512 bits = 64 bytes at a time, i.e. 8 doubles at a time).
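To make the arithmetic explicit, here is a minimal sketch that recomputes the measured speed-ups from the averaged timings quoted above and compares them with the vector-width bound. It assumes the kernel works entirely in double precision (64-bit values); the variable names are only illustrative and do not come from the Fortran code.

```python
# Vector-width bound and measured speed-ups, using only the timings quoted in this post.
VECTOR_BITS = 512  # width of a Xeon Phi vector register
DOUBLE_BITS = 64   # assumption: all arithmetic is double precision

# Number of doubles processed per vector instruction (the ideal speed-up).
lanes = VECTOR_BITS // DOUBLE_BITS

# Average execution times as reported: N -> (non-vectorized build, vectorized build)
timings = {
    960: (1.155452728271484e-03, 3.155016899108887e-04),
    9600: (6.604535579681396e-02, 2.110156297683716e-02),
}

# Measured speed-up and the fraction of the ideal bound that it represents.
speedup = {n: t_novec / t_vec for n, (t_novec, t_vec) in timings.items()}
efficiency = {n: s / lanes for n, s in speedup.items()}

for n in sorted(timings):
    print(f"N={n}: speed-up {speedup[n]:.2f}, {efficiency[n]:.0%} of the {lanes}x bound")
```

The bound of 8 only counts vector lanes. It ignores memory traffic and any part of the loop that does not vectorize, so by itself it says nothing about why the measured values are lower.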

As you can see, I am using -align array64byte when compiling and using:

!dir$ attributes align:64

for all the long arrays (not for simple scalars)

I have read the various reports that the compiler can generate, but they are somewhat confusing at my level.

So first explicit question:
Could somebody comment on these results? Are they normal? Does the speed-up factor of about 3 described above seem reasonable?

Second question:
I was wondering whether I was being too picky, so I wanted to measure the "flops" of my executable somehow. I therefore turned to VTune (my first time using it).

I found several guides on the web about using it with the Xeon Phi and native applications. However, none of the examples used OpenMP. My problem is the libiomp5.so file. For example, when I execute:
/opt/intel/vtune_amplifier_xe/bin64/amplxe-cl -collect knc-lightweight-hotspots --search-dir all:/home/jofre/mic0fs -- ssh mic0 /home/jofre/mic0fs/a.out

I get an error: /home/jofre/mic0fs/a.out: error while loading shared libraries: libiomp5.so: cannot open shared object file: No such file or directory

The file is definitely there. When I run the executable normally, I first have to set

export LD_LIBRARY_PATH=/home/jofre/mic0fs

otherwise I get the same error. Does anyone know how to solve this?

If you have read this far, you deserve a big "thank you"!
Jofre