Dear all,
I want to implement prefetching for sparse complex double-precision data using intrinsics.
A linear array contains the indexes of the sparse complex double elements, like so: {1,2,3,4,150,151,7000,7001,10000,10001}
As each of these elements occupies 16 contiguous bytes in memory, how should I correctly use the prefetch intrinsic, which is meant for single-precision floats?
Should I use _mm512_mask_prefetch_i32gather_ps() and explicitly prefetch each 4-byte piece of the 16 bytes?
Or can I expect that each element in the index register will cause a full 64-byte cache line to be prefetched? In that case I could perform some modular arithmetic on the index values so that each unique cache line is prefetched only once. (I have actually tried this approach, with disappointing results.)
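To make the second option concrete, here is a minimal sketch of the cache-line deduplication idea. It is not the code that was actually tried, and it makes several assumptions: the index list is sorted and holds 0-based element offsets, the data array starts on a 64-byte boundary, and a cache line is 64 bytes (so four 16-byte complex doubles share a line). It also swaps in the plain scalar _mm_prefetch() (or __builtin_prefetch() on non-SSE targets with GCC/Clang) instead of _mm512_mask_prefetch_i32gather_ps(), so that it builds on any machine. The names unique_cache_lines() and prefetch_sparse() are hypothetical helpers made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE__)
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */
#define PREFETCH_LINE(p) _mm_prefetch((const char *)(p), _MM_HINT_T0)
#else
/* Fallback for non-SSE targets; assumes GCC or Clang. */
#define PREFETCH_LINE(p) __builtin_prefetch((p))
#endif

enum {
    CACHE_LINE_BYTES = 64,                            /* assumed line size */
    ELEM_BYTES       = 16,                            /* one complex double */
    ELEMS_PER_LINE   = CACHE_LINE_BYTES / ELEM_BYTES  /* = 4 */
};

/* Hypothetical helper: map each element index to its cache-line number
 * (index / 4) and keep each line once. Assumes idx[] is sorted ascending
 * and non-negative, so duplicates are always adjacent.
 * Writes at most n entries to lines[] and returns how many were written. */
static size_t unique_cache_lines(const int32_t *idx, size_t n, int32_t *lines)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        int32_t line = idx[i] / ELEMS_PER_LINE;
        if (count == 0 || lines[count - 1] != line)
            lines[count++] = line;
    }
    return count;
}

/* Hypothetical helper: issue one prefetch per unique cache line.
 * base is assumed to be 64-byte aligned; scratch must hold n entries.
 * Returns the number of prefetches issued. */
static size_t prefetch_sparse(const void *base, const int32_t *idx,
                              size_t n, int32_t *scratch)
{
    const char *bytes = (const char *)base;
    size_t count = unique_cache_lines(idx, n, scratch);
    for (size_t i = 0; i < count; ++i)
        PREFETCH_LINE(bytes + (size_t)scratch[i] * CACHE_LINE_BYTES);
    return count;
}
```

For the example index list above, this reduces the ten element indexes to five cache lines (0, 1, 37, 1750, 2500), so five prefetches are issued instead of ten. Whether that actually helps performance is a separate matter that this sketch does not settle.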
Best regards,
Alastair