Hello,
The question comes from the following code:
float fa[128] __attribute__((aligned(64)));
float fb[128] __attribute__((aligned(64)));

for (j = 0; j < 100000000; j++) {
    for (k = 0; k < 128; k++) {
        fa[k] = a * fa[k] + fb[k];
    }
}
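For completeness, here is the self-contained version I use to reproduce the timings; the main() wrapper, the initial values of a, fa and fb, and the clock() timing are my own additions, not part of the original snippet:

#include <stdio.h>
#include <time.h>

/* 64-byte alignment so the compiler can use aligned vector loads/stores */
float fa[128] __attribute__((aligned(64)));
float fb[128] __attribute__((aligned(64)));

int main(void)
{
    float a = 0.5f;   /* a < 1 keeps fa[k] bounded over many iterations */
    int j, k;

    for (k = 0; k < 128; k++) {
        fa[k] = 1.0f;
        fb[k] = 2.0f;
    }

    clock_t start = clock();
    for (j = 0; j < 100000000; j++) {
        for (k = 0; k < 128; k++) {
            fa[k] = a * fa[k] + fb[k];
        }
    }
    clock_t end = clock();

    /* Printing a result keeps the compiler from removing the loop entirely */
    printf("time: %.2f s, fa[0] = %f\n",
           (double)(end - start) / CLOCKS_PER_SEC, fa[0]);
    return 0;
}

I build it once with -no-vec and once with auto-vectorization enabled and compare the reported times.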
When I compile it with icc and the -no-vec option on the Xeon Phi it takes about 124 s to complete, and with auto-vectorization it only needs 1.5 s. That is a speedup of about 80x, even though the vector units can only process 16 floats at once.
Doing the same on an Intel Xeon E5-1620 v2 @ 3.70 GHz results in 5.6 s with -no-vec and 1.5 s with auto-vectorization.
All tests were done using only one core.
Why does the Xeon Phi speed up so much with vector instructions while the Xeon doesn't? Shouldn't the Xeon speed up by a factor of 8, since its vector registers are 256 bits wide (8 floats)?