Hi all,
While reading the "Intel Xeon Phi Coprocessor System Software Developers Guide" (link: https://software.intel.com/sites/default/files/article/334766/intel-xeon...), I came across a statement about why supporting arbitrary shuffle operations is impractical, and I don't fully understand it. On page 157, it says: "To support fully arbitrary control sequences across all of the element muxes, however, would require 32 bits of immediate encoding on the shuffle instruction." I understand that this is impractical because the immediate field would be too long. What I don't understand is why the figure is 32 bits. As I understand it, if the 512-bit vector holds 16 elements and each output element can select any of the 16 inputs (duplicates allowed), there are 16^16 = 2^64 possible selections, which would require 64 bits of immediate encoding. How is the 32-bit figure derived?
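
To make the counting above concrete, here is a small Python sketch. The helper name `selector_bits` is made up for illustration and does not come from the guide. It assumes each output element independently picks one source element, so the control field is just (number of outputs) x (bits per selector). The second call only shows which parameters would give 32 bits arithmetically (16 outputs with 2-bit selectors, i.e. 4 candidate sources each); whether that is what the guide means is an assumption, not something I can confirm from the text.

```python
def selector_bits(num_outputs: int, num_sources: int) -> int:
    """Control bits needed when each of num_outputs elements
    independently selects one of num_sources inputs (duplicates allowed).

    Hypothetical helper for illustration only.
    """
    # Bits needed to index num_sources choices: ceil(log2(num_sources)),
    # computed with integer arithmetic to avoid floating-point rounding.
    bits_per_output = (num_sources - 1).bit_length()
    return num_outputs * bits_per_output


# My calculation: 16 outputs, each choosing among all 16 inputs.
full_shuffle_bits = selector_bits(16, 16)

# Assumption: 16 outputs, each choosing among only 4 candidate inputs.
restricted_shuffle_bits = selector_bits(16, 4)

print(full_shuffle_bits, restricted_shuffle_bits)
```

The first value matches the 16^16 = 2^64 count in my question; the second is just the arithmetic that happens to land on the guide's number.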