Quantcast
Channel: Intel® Many Integrated Core Architecture
Viewing all articles
Browse latest Browse all 1347

question on intrinsics

$
0
0

Dear forum,

Here is a simple pseudo-code I'm trying to implement
1. load 8 float64 numbers from addr to create vec_a
2. load next, adjacent 8 float64 numbers from (addr+64) to create vec_b
3. perform operation on the first element of vec_a and first element of vec_b and return a scalar f=f(vec_a[0], vec_b[0]) (f is several subtraction, division operations)
4. perform element by element operation vec_a[i] + f * vec_b[i], i=1...7

My implementation of the above is
1._mm512_load_pd
2. another _mm512_load_pd
3. use _mm512_store_pd to copy both vectors to two statically allocated arrays, and perform the scalar arithmetic f=f(vec_a_on_stack[0], vec_b_on_stack[0]). then use _mm512_set1_pd to broadcast the scalar float64 to form a vector
4. _mm512_fmadd_pd

The performance gain seems marginal. Could you please advise on step 3? Is there any elegant way to avoid copying the registers to local arrays but instead directly work on the register? I notice from the compiler manual that swizzle can be used for broadcast but it seems not able to broadcast a single element to all the 8 float64 slots.
I appreciate any advice.


Viewing all articles
Browse latest Browse all 1347

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>