This question is, in a way, a continuation of a question I asked earlier (https://software.intel.com/en-us/forums/topic/509845). In my code, I need to use offload in a slightly more convoluted fashion. The main body of my code is a sequential for loop with a variable workload distribution over its iterations; hence, only when the workload is large enough does it offload part of the computation to the MIC. The output of the computation is not fetched back to the host CPU immediately, but postponed until the iteration in which a certain portion of the output is absolutely needed on the CPU. While the whole code is very long, I am providing a small program that simulates what I'm trying to achieve.
#include <stdio.h>
#include <stdlib.h>   /* for rand() and RAND_MAX (was missing) */
#include <assert.h>
#include <math.h>
#include <omp.h>

#define ONCE alloc_if(1) free_if(1)
#define N 100
#define M 1000

int main_works()
{
    printf("Offload testing 102\n");
    double A[N], B[N], C[N], D[N];
    for (int i = 0; i < N; ++i) {
        A[i] = i;
        B[i] = i*i;
        C[i] = 0.0;
    }

    int last_offload = -1;
    for (int k = 0; k < M; ++k) {
        /* offload for even numbers: just a random condition for offload */
        int offload_condition = (k % 2 == 0);
        if (offload_condition) {
            /* wait for any previous offload to finish before starting a new one */
            if (last_offload != -1) {
                #pragma offload_wait target(mic:1) wait(last_offload)
                last_offload = -1;
            }
            printf("Offloading in \t%d th iteration \n", k);
            last_offload = k;
            #pragma offload target(mic:1) signal(last_offload) in(k) in(A[0:N] : ONCE) in(B[0:N] : ONCE) inout(C[0:N] : ONCE)
            {
                for (int i = 0; i < N; ++i) {
                    C[i] += (A[i] + B[i]);
                }
            }
        }

        /* on some random iterations, get C back */
        int fetch_condition = ((double) rand() / (double) RAND_MAX > 0.5);
        if (fetch_condition) {
            printf("Fetching in \t%d th iteration\n", k);
            if (last_offload != -1) {
                #pragma offload_wait target(mic:1) wait(last_offload)
            }
            last_offload = -1;
        }

        #pragma omp parallel default(shared)
        {
            /* adding some dummy OpenMP work here */
            #pragma omp for schedule(dynamic) nowait
            for (int i = 0; i < N; ++i) {
                D[i] -= A[i]*A[i] - 1;
            }
            #pragma omp for schedule(dynamic)
            for (int i = 0; i < N; ++i) {
                D[i] -= A[i]*A[i] - 1;
            }
        }
    }

    if (last_offload != -1) {
        #pragma offload_wait target(mic:1) wait(last_offload)
        last_offload = -1;
    }

    for (int i = 0; i < N; ++i) {
        if (fabs(C[i] - (A[i]*(A[i]+1) * (double)(M/2))) > 0.001) {
            printf(" i %d C[i] %lf A[i] %lf A[i]*(A[i]+1) %lf, M/2 %d \n",
                   i, C[i], A[i], A[i]*(A[i]+1), M/2);
            // exit(0);
        }
    }
    printf("Returned successfully\n");
    return 0;
}

int main()
{
    printf("Offload testing 102\n");
    double A[N], B[N], C[N], D[N];
    for (int i = 0; i < N; ++i) {
        A[i] = i;
        B[i] = i*i;
        C[i] = 0.0;
    }

    int s1 = 1;
    int last_offload = -1;
    for (int k = 0; k < M; ++k) {
        /* offload for even numbers: just a random condition to simulate my code */
        int offload_condition = (k % 2 == 0);
        if (offload_condition) {
            /* wait for any previous offload to finish before starting a new one */
            if (last_offload != -1) {
                #pragma omp parallel default(shared)
                {
                    #pragma omp master
                    {
                        #pragma offload_wait target(mic:1) wait(last_offload)
                    }
                }
                last_offload = -1;
            }
            printf("Offloading in \t%d th iteration \n", k);
            last_offload = k;
            #pragma omp parallel default(shared)
            {
                #pragma omp master
                {
                    #pragma offload target(mic:1) signal(last_offload) in(k) in(A[0:N] : ONCE) in(B[0:N] : ONCE) inout(C[0:N] : ONCE)
                    {
                        for (int i = 0; i < N; ++i) {
                            C[i] += (A[i] + B[i]);
                        }
                    }
                }
            }
        }

        #pragma omp parallel default(shared)
        {
            #pragma omp master
            {
                /* on some random iterations, get C back */
                int fetch_condition = ((double) rand() / (double) RAND_MAX > 0.5);
                if (fetch_condition) {
                    printf("Fetching in \t%d th iteration\n", k);
                    if (last_offload != -1) {
                        #pragma offload_wait target(mic:1) wait(last_offload)
                    }
                    last_offload = -1;
                }
            }

            /* adding some dummy work */
            #pragma omp for schedule(dynamic) nowait
            for (int i = 0; i < N; ++i) {
                D[i] -= A[i]*A[i] - 1;
            }
            #pragma omp for schedule(dynamic)
            for (int i = 0; i < N; ++i) {
                D[i] -= A[i]*A[i] - 1;
            }
        }
    }

    if (last_offload != -1) {
        #pragma offload_wait target(mic:1) wait(last_offload)
        last_offload = -1;
    }

    for (int i = 0; i < N; ++i) {
        if (fabs(C[i] - (A[i]*(A[i]+1) * (double)(M/2))) > 0.001) {
            printf(" i %d C[i] %lf A[i] %lf A[i]*(A[i]+1) %lf, M/2 %d \n",
                   i, C[i], A[i], A[i]*(A[i]+1), M/2);
            // exit(0);
        }
    }
    printf("Returned successfully\n");
    return 0;
}
There are two functions, main() and main_works(). The second one, main_works(), is a working reference that demonstrates what I'm trying to achieve.
Let's first look at the main_works() function.
for (int k = 0; k < M; ++k)
is the main sequential for loop in my code. On certain iterations it offloads the computation:
int offload_condition = (k%2==0);
When offload_condition is 1, it first waits for any previous offload to finish, then sets the last_offload variable to the current iteration and offloads the computation to the MIC. The last_offload variable keeps track of when a computation was offloaded to the MIC; the intention behind using it is to maintain the maximum amount of asynchronicity between the CPU and the MIC.
On some iterations, it fetches data back from the MIC to the host CPU, simulated here as follows:
int fetch_condition = ( (double) rand()/(double) RAND_MAX >0.5);
The above code works as expected. However, I intend to make all the offload calls from inside an OpenMP parallel region.
The problem arises in the main() function, where the offloads are issued from OpenMP parallel regions. While the code given here doesn't illustrate why it is important to offload from a parallel region, the idea is that while the master thread is busy offloading, sending data, and waiting for output from the MIC, the rest of the threads are busy in a work-sharing construct such as an OpenMP for loop, or doing other useful work. In this case I get the following error:
offload error: device 1 does not have a pending signal for wait((nil))
Aborted (core dumped)
In the previous question, the suggested solution was to use "#pragma omp master" for the MIC offloads. However, for some reason that trick doesn't work here. I would like to understand why this happens and how I can resolve the issue.
It is also quite possible that there is a better way to do what I am trying to do here with "last_offload". If so, I would like to know how it can be done differently while still achieving this asynchronicity between the CPU and the MIC.
Sincerely,
Piyush