I have a host with three Phi cards and a large matrix that is too big to be copied directly to a single card. The matrix needs to be divided into three parts, each part offloaded to one Phi card for some processing, and then each processed part transferred back to the host.
How could I implement this in C++?
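
To make the partitioning concrete, here is a minimal host-only sketch of the scheme described above. It assumes a row-major matrix of `double` split by rows. `process_block` and `process_in_blocks` are hypothetical names, and doubling each element is only a placeholder for the real processing. The loop runs the blocks one after another on the host. With the Intel compiler, the call inside the loop is where an offload region for card `dev` would presumably go, and since the blocks are independent they could be dispatched concurrently.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the real per-block processing (doubles each element).
inline void process_block(const double* in, double* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = 2.0 * in[i];
}

// Splits a row-major rows x cols matrix into num_devices contiguous row blocks,
// processes each block separately, and writes the results back in place.
std::vector<double> process_in_blocks(const std::vector<double>& m,
                                      std::size_t rows, std::size_t cols,
                                      std::size_t num_devices = 3) {
    std::vector<double> result(m.size());
    const std::size_t base = rows / num_devices;   // rows every block gets
    const std::size_t extra = rows % num_devices;  // leftover rows, one per early block
    std::size_t first_row = 0;

    for (std::size_t dev = 0; dev < num_devices; ++dev) {
        const std::size_t block_rows = base + (dev < extra ? 1 : 0);
        const double* in = m.data() + first_row * cols;   // start of this block
        double* out = result.data() + first_row * cols;   // where its result lands
        const std::size_t n = block_rows * cols;          // elements in this block

        // Assumption: an offload pragma targeting card `dev`, with `in` copied
        // to the card and `out` copied back, would wrap this call.
        process_block(in, out, n);

        first_row += block_rows;
    }
    return result;
}
```

For example, `process_in_blocks(m, 7, 4)` splits a 7 x 4 matrix into blocks of 3, 2, and 2 rows. If a single block is still too large for one card's memory, the same row arithmetic can be applied again inside each block to stream it in smaller chunks.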