Parallel offload and memory retention in OpenMP 4.0

Hi,

I have two questions:

1) Is there any way to have memory retention with #pragma omp target device(0), similar to #pragma offload target(mic:0) in(p: length(size) align(align) alloc_if(i==0) free_if(i==reps-1))?

2) Why is parallel offload without memory retention serialized?
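
To make question 1 concrete, what I have in mind is something like a target data region that keeps the device buffer allocated across repeated transfers. A minimal sketch for a single device (untested on my setup; this is how I read the OpenMP 4.0 spec):

// Sketch: map(alloc:) creates the buffer on device 0 once for the whole
// region, and target update re-sends the host contents on every repetition,
// so only the transfer itself is paid inside the loop.
#pragma omp target data device(0) map(alloc: p[0:size])
{
    for (int i = 0; i < reps; i++) {
        // ... refresh p on the host ...
#pragma omp target update device(0) to(p[0:size])
    }
}   // the device buffer is released at the end of the target data region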

Let's take a look at the following code:

#include <stdio.h>
#include <stdlib.h>     // atoi
#include <malloc.h>     // _mm_malloc / _mm_free
#include <omp.h>

// Select the offload variant at compile time with -DTEST=1|2|3.
#if TEST==1
// Intel LEO offload_transfer with memory retention: allocate on the first rep, free on the last.
#define OFFLOAD offload_transfer target(mic:m) in(p: length(size) align(align) alloc_if(i==0) free_if(i==reps-1))
#elif TEST==2
// Intel LEO offload_transfer without retention: allocate and free on every call.
#define OFFLOAD offload_transfer target(mic:m) in(p: length(size) align(align))
#elif TEST==3
// OpenMP 4.0 target offload.
#define OFFLOAD omp target device(m) map(to:p[0:size])
#endif

int main(int argc, char** argv){

    int mics = 1;                        // number of coprocessors
    if (argc>1) mics = atoi(argv[1]);

    int reps = 3;                        // repetitions per buffer size
    if (argc>2) reps = atoi(argv[2]);

    int align = 64;                      // host buffer alignment in bytes
    if (argc>3) align = atoi(argv[3]);

    for (size_t size = 1L; size < 1L<<34; size *= 2){
        char * data[mics];
        for(int m=0; m<mics; m++)
            data[m] = (char*) _mm_malloc(size, align);
        for (int i = 0; i<reps; i++){
            // Touch the host buffers before each transfer (Cilk Plus array notation).
            for(int m=0; m<mics; m++)
                data[m][0:size] = i;
            double time = 0.0;
            double bw = 0.0;
            // One host thread per coprocessor: sum the per-device bandwidths,
            // keep the slowest transfer time.
#pragma omp parallel for reduction(+:bw) reduction(max:time)
            for(int m = 0; m < mics; m++){
                char * p = data[m];
                const double t1 = omp_get_wtime();
#pragma OFFLOAD
                { }
                time = omp_get_wtime() - t1;
                bw = (size / time) / (1L<<20);   // MB/s per device
            }
            printf("out: %6d\t%6d\t%12zu\t%9.6f\t%9.3f\n",
                    mics, i, size, time, bw);
        }
        for(int m = 0; m < mics; m++)
            _mm_free(data[m]);
    }
    return 0;
}
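
For reference, I build the three variants with the Intel compiler roughly like this (the exact flag spelling may differ between compiler versions):

icpc -openmp -DTEST=1 offload_1.cpp -o offload_retain
icpc -openmp -DTEST=2 offload_1.cpp -o offload_noretain
icpc -openmp -DTEST=3 offload_1.cpp -o offload_omp4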

I'm getting the following results for serial offload to one Intel Xeon Phi coprocessor:

The first offload with both #pragma offload and #pragma omp target is very slow, with a maximum bandwidth of only about 0.5 GB/s. This can be improved by using huge 2 MB pages (export MIC_USE_2MB_BUFFERS=0), which raises the bandwidth to 1.2 GB/s (see the next plot).

The regular offload pragmas allocate and deallocate memory on the Xeon Phi coprocessor on every offload call. The maximum bandwidth for the second and all later offload calls is 2.2 GB/s.

By using memory retention we can reach the PCIe v2 bandwidth limit, which is ~6.3 GB/s.
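
To spell out what I mean by memory retention: the TEST==1 variant allocates the device buffer only on the first repetition and frees it only on the last one, so the offloads in between pay for the transfer alone:

// Retention pattern from the TEST==1 macro: allocate on the first rep,
// reuse the device buffer in between, free it on the last rep.
for (int i = 0; i < reps; i++) {
#pragma offload_transfer target(mic:0) \
    in(p : length(size) align(align) alloc_if(i == 0) free_if(i == reps - 1))
}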

Using 2 MB pages improves the initial offload of new data for both #pragma offload and #pragma omp target:

But I think the most interesting part is parallel offload. Let's look at how the offload bandwidth scales when we use 1, 2, 3, and 4 Intel Xeon Phi coprocessors. The data transfers should run in parallel, because we use up to 4 OpenMP threads on the host system.

This result is a little confusing. Regular parallel #pragma offload and #pragma omp target without memory retention show almost the same bandwidth as serial offload to one device. Only offload with memory retention scales linearly with the number of devices.

Using a modified version of the Intel PCM tool (https://software.intel.com/en-us/articles/intel-performance-counter-monitor) and monitoring the memory of the Intel Xeon Phi coprocessors, I was able to plot a timeline of memory utilization and of the data transfer over the PCIe buses on Socket 0 and Socket 1 of the host system. 8 GB of data is transferred to each of the 4 Intel Xeon Phi coprocessors 5 times. We compare regular parallel offload with #pragma offload and #pragma omp target (top) against parallel offload with memory retention (bottom):

It is obvious from this timeline plot that with the regular parallel #pragma offload and #pragma omp target the data transfer to each Xeon Phi coprocessor is serialized. Only when we use memory retention with the alloc_if(0)/free_if(0) clauses do we observe simultaneous data transfer to all devices. Therefore, I'm wondering whether it is possible to have memory retention with the OpenMP 4.0 target pragma, and why we observe this serialization at all. Maybe there is some implementation limit in the offload runtime library? Or maybe it's a bug? Can anyone explain what is going on behind the scenes? Any help will be appreciated. Thank you!
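
In other words, what I would like to be able to write in pure OpenMP 4.0 is something along these lines (a sketch only, with one host thread per coprocessor and each buffer kept mapped on its device for the whole run):

// Sketch: one host OpenMP thread per coprocessor; each thread keeps its
// buffer mapped on its own device and only re-sends the data in the loop.
#pragma omp parallel for
for (int m = 0; m < mics; m++) {
    char *p = data[m];
#pragma omp target data device(m) map(alloc: p[0:size])
    {
        for (int i = 0; i < reps; i++) {
#pragma omp target update device(m) to(p[0:size])
        }
    }
}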

 

Attachments:
offload_1.cpp (3.17 KB)
fig1.png (116.65 KB)
fig2_3.png (119.27 KB)
fig3a.png (181.37 KB)
timeline2MB.png (135.56 KB)
