Hello everyone,
In the application I am developing there is a large matrix M of size n x m (n >> m) and a vector x of appropriate size (I will explain below what I mean by appropriate). I have to perform two matrix-vector multiplications many times, but not with the whole matrix M. I have included a picture to help the discussion.
The application creates a binary tree, with the root node representing the whole matrix M. In the picture I assume the matrix has n=20 rows, which are represented by the indices contained in the node. At each step the matrix is divided into two disjoint sets of rows (the two sets need not be of equal size). To keep track of this, each node contains a vector 'indices' and an integer field 'num_of_indices', which record which rows of the original matrix belong to the node and how many they are. Obviously, the rows represented by a node are not contiguous in memory, and there is no special structure to which indices belong to which node (with the exception of the root node, of course).
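For reference, the node layout I described could be sketched like this (the field names match my description; everything else, including the helper, is illustrative rather than my actual code):

```c
#include <assert.h>
#include <stdlib.h>

/* Each node stores which rows of the original matrix M belong to it,
 * not the rows themselves; the rows are generally non-contiguous. */
typedef struct node {
    int *indices;        /* row indices of M owned by this node */
    int  num_of_indices; /* how many rows this node owns        */
    struct node *left, *right;
} node_t;

/* Illustrative helper: build a node from a given set of row indices.
 * In the application the two child index sets are disjoint and not
 * necessarily of equal size. */
static node_t *make_node(const int *idx, int count)
{
    node_t *n = malloc(sizeof *n);
    n->indices = malloc(count * sizeof *n->indices);
    for (int i = 0; i < count; i++)
        n->indices[i] = idx[i];
    n->num_of_indices = count;
    n->left = n->right = NULL;
    return n;
}
```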
The operation I have to perform involves the rows of M that belong to a node. More specifically, I have to compute Mi*MiT*x, i.e., the submatrix Mi of node i, as selected by 'indices', is multiplied by its transpose and by the vector x (which has the appropriate size, i.e., 'num_of_indices' elements). To reduce the total number of operations I compute Mi*(MiT*x) instead, which results in two matrix-vector multiplications. This operation is executed until the method I use converges (up to 400 times with my current data set).
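To make the indexing concrete, here is a minimal serial sketch of the two-step product z = Mi*(MiT*x); the function and variable names are illustrative, not my actual code:

```c
#include <assert.h>
#include <stddef.h>

/* Compute z = Mi * (MiT * x), where Mi is the set of rows of the
 * row-major n x m matrix M selected by 'indices'.  x has
 * num_of_indices elements; tmp must hold m elements and z must hold
 * num_of_indices elements. */
void node_matvec(const double *M, size_t m,
                 const int *indices, int num_of_indices,
                 const double *x, double *tmp, double *z)
{
    for (size_t j = 0; j < m; j++)
        tmp[j] = 0.0;

    /* Step 1: tmp = MiT * x (accumulate each selected row scaled by x[i]). */
    for (int i = 0; i < num_of_indices; i++) {
        const double *row = &M[indices[i] * m];
        for (size_t j = 0; j < m; j++)
            tmp[j] += row[j] * x[i];
    }

    /* Step 2: z = Mi * tmp (dot product of each selected row with tmp). */
    for (int i = 0; i < num_of_indices; i++) {
        const double *row = &M[indices[i] * m];
        double acc = 0.0;
        for (size_t j = 0; j < m; j++)
            acc += row[j] * tmp[j];
        z[i] = acc;
    }
}
```

Doing the multiplication in this order costs two matrix-vector products (O(num_of_indices * m) each) instead of forming the num_of_indices x num_of_indices matrix Mi*MiT explicitly.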
All this is performed in parallel. Each new node is created as an OpenMP task and in order to perform the above operation, each node uses an appropriate number of threads using OpenMP parallel and for directives. For example, nodes 1 and 2 would use 8/20 and 12/20 of the available cores of a Xeon Phi to perform the matrix-vector multiplications. After they finish, nodes 3, 4, 5 and 6 would use 5/20, 3/20, 5/20 and 7/20 of the available cores, etc.
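Schematically, the tasking looks like this (a simplified sketch, not my actual code: the per-node kernel and the thread-count handling are placeholders, and compiled without -fopenmp the pragmas are simply ignored and the traversal runs serially):

```c
#include <assert.h>
#include <stddef.h>

typedef struct node {
    int num_of_indices;
    struct node *left, *right;
} node_t;

static int nodes_processed = 0; /* for illustration only */

static void work_on_node(node_t *n)
{
    /* Placeholder for the Mi*(MiT*x) kernel of this node; in the real
     * code this uses an OpenMP parallel for over the node's share of
     * the available cores. */
    nodes_processed++;
}

/* Each child node is created as an OpenMP task. */
static void process_tree(node_t *n)
{
    if (!n)
        return;
    work_on_node(n);
    #pragma omp task
    { process_tree(n->left); }
    #pragma omp task
    { process_tree(n->right); }
    #pragma omp taskwait
}
```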
Since the matrix in each node does not follow a well-known storage format (dense, CRS, etc.), I cannot use MKL to perform the operation I need. Therefore, I have written my own code to do it, which is as follows (for the MiT*x part of the calculation):
#pragma omp for
for (i = 0; i < node->num_of_indices; i++) {
    row = &M[node->indices[i] * m]; // M is allocated with malloc() as a 1D array, so 2D coordinates must be translated
    #pragma vector aligned nontemporal
    #pragma ivdep
    for (j = 0; j < m; j++) {
        x_2[j] += row[j] * x[i];
    }
}
The code for Mi*x_2 is similar. The problem is that the operation Mi*(MiT*x) takes over 85% of the total execution time, and I cannot seem to optimize it further. I profiled with VTune, and the statistics suggest that most of the time is spent in vmovapd instructions, i.e., loading data from memory. I understand that a matrix-vector multiplication performs very few operations per data item fetched from memory, but I was wondering whether there are any ideas on how to further optimize this operation on the Xeon Phi, given the specific representation of matrix M in each node.
I also tried to add some kind of prefetching to the code, but failed miserably. More specifically, I tried blocking the i loop: I touch a few elements in each row (stepping 64 bytes each time to move to the next cache line), touch enough rows to fill the L2 cache, and then perform the required operations. This repeats as many times as necessary to cover all rows of a node. I believe the implementation is correct, since I get the right results and verified the addresses being touched, but execution time almost doubled.
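For clarity, the blocking experiment looked roughly like this (a serial sketch with an assumed block size; not my exact code):

```c
#include <assert.h>
#include <stddef.h>

#define ROWS_PER_BLOCK 8 /* assumed: chosen so a block roughly fits in L2 */

/* Touch-then-compute version of the MiT*x loop: first load one element
 * per 64-byte cache line (8 doubles) for a block of rows, then run the
 * actual accumulation for that block. */
void blocked_mtx(const double *M, size_t m,
                 const int *indices, int num_of_indices,
                 const double *x, double *x_2)
{
    volatile double sink; /* keeps the touch loads from being optimized out */
    for (int ib = 0; ib < num_of_indices; ib += ROWS_PER_BLOCK) {
        int iend = ib + ROWS_PER_BLOCK;
        if (iend > num_of_indices)
            iend = num_of_indices;

        /* Touch phase: one load per cache line of each row in the block. */
        for (int i = ib; i < iend; i++) {
            const double *row = &M[indices[i] * m];
            for (size_t j = 0; j < m; j += 8)
                sink = row[j];
        }

        /* Compute phase: the same accumulation as the original loop. */
        for (int i = ib; i < iend; i++) {
            const double *row = &M[indices[i] * m];
            for (size_t j = 0; j < m; j++)
                x_2[j] += row[j] * x[i];
        }
    }
    (void)sink;
}
```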
I use icc 16.0.0 and compile with -O3 -fopenmp -Wall -opt-report=5 (the optimization report indeed verifies that all inner loops are vectorized and all memory accesses are aligned).
If anyone has any ideas about this, I would appreciate the help.
Ioannis